This article describes a significant enhancement of how InterSystems IRIS deals with table statistics, a crucial element for IRIS SQL processing, in the 2025.2 release. We'll start with a brief refresher on what table statistics are, how they are used, and why we needed this enhancement. Then, we'll dive into the details of the new infrastructure for collecting and saving table statistics, after which we'll zoom in onto what the change means in practice for your applications. We'll end with a few additional notes on patterns enabled by the new model, and look forward to the follow-on phases of this initial delivery.
Table Statistics 101
SQL stands for Structured Query Language. SQL is a language in which you express what you want (the query), rather than how you want it (the code to fetch the query result). It is up to a database's query engine to consider all options and pick the presumably optimal plan to retrieve the result you're asking for. In order to figure out the optimal plan, the query optimizer needs essentially two things: knowledge about your table structure, such as where your data is stored and what indices are available, and knowledge about the table data itself: table statistics. These table statistics cover information such as the estimated size of the table, and the selectivity of every field, which expresses how large a fraction of that table corresponds, on average, to one particular field value. These are critical for making good decisions about query plans. Say you need to find the number of female customers in one particular zip code, and can choose between starting from an index on gender or one on zip code. The selectivity of the gender index will be around 50%, and the one on zip code may be as small as 1% or less, meaning you'll filter the set of rows to look up much more quickly by starting from the latter.
While the table structure is understandably part of your application and can be packaged as part of the application code, that isn't necessarily the case for table statistics. If the previous example is taken from a small bookstore application, it's likely the field selectivity values hold for every bookstore the application gets deployed to, but if it is an electronic health record system, the selectivity of the gender field will differ between a maternity clinic and a general practice, and the number of zip codes (and therefore their selectivity) will depend on the size of the area the hospital serves. Clearly, you want your statistics and therefore query plans to be based on the actual data in the environment the application is deployed to.
Table statistics on IRIS - before 2025.2
On IRIS, table statistics historically have been stored as part of a class' definition. This has a number of drawbacks
- As described, data distributions may vary significantly between the environments in which you're deploying your application. This is not just true for estimated table size and selectivity, but even more so for other, more advanced types of statistics such as outlier information and histograms.
- When deploying an updated version of your application, you would usually want to keep any existing statistics – assuming they are based on that environment's actual data. This is trickier when they are stored as part of the code, and obviously you don't want to accidentally deploy the table stats from your dev or test environment to a customer system.
- When application code is deployed in the form of a read-only database, or when it is part of an IRIS system library (like Interoperability messages), there is no good way to update those tables' statistics whatsoever.
See also this earlier article that goes into a little more detail on the above. These are some of the reasons why we set out to redesign how we manage table statistics on IRIS, and the new infrastructure is now included with InterSystems IRIS 2025.2.
What's changing?
Collected statistics vs Fixed statistics
As described in the examples before, storing statistics as part of application code only makes sense in very specific, controlled environments. In most cases, it is preferable to keep statistics with the data they were collected from, rather than fix them as part of the code, so that's what we're doing in the new model. We distinguish between collected statistics that are always based on actual data, and stored in the extent index (a global with other registry-style details about your table data, also referred to as extents) and fixed statistics that are hardcoded into the class definition by the application developer, and may be based on actual data or some educated assumptions. As you can guess, fixed statistics correspond to the pre-2025.2 model for managing table statistics.
Use the new COLLECT STATISTICS FOR TABLE t
command (which is 100% synonymous with the existing TUNE TABLE
one, and just us standardizing terminology) to populate a new set of statistics based on the current data in your table. Table statistics are tiny, so we actually won't overwrite the previous stats, but rather store a new set and keep the older statistics around for informational purposes until they meet default purge criteria, or are purged explicitly. If you'd like to switch to the fixed model, you can save your collected stats to the class definition using ALTER TABLE t FIX STATISTICS
. This may be part of your application packaging procedure, if you're confident those stats will match your target environment and you prefer the static model.
By default, fixed statistics take precedence over collected statistics. This ensures backwards compatibility, and means as a developer you still have the final say. In fact, having any fixed statistic, even for just one field, means collected statistics will be ignored, and we'll fall back to safe estimates based on the field's datatype for any fields that don't have fixed statistics. If you want to test with collected stats without dropping your fixed stats, you can include the %NOFIXEDSTATS
hint in the FROM clause of your query text (per table) to obtain the query plan solely based on collected stats, or use ALTER TABLE t SET RUNTIME IGNOREFIXEDSTATS = TRUE
to set this behaviour at the table level.
Automatic collection of table statistics
The introduction of collected statistics already addresses many of the packaging and deployment inconveniences pointed out in the introduction. However, it does not really do anything about one of the most common problems we see with customers: the lack of up-to-date statistics. As described in the introduction, accurate statistics are key to getting the best query plans for your specific data distribution. If you never bothered to gather any, the optimizer will use crude defaults based on your fields' datatype, but these are unlikely to properly account for the specifics of your application in all cases. Starting with 2021.2, IRIS would automatically collect stats for tables at query time that don't have any and are eligible for a fast sampling technique, but that would only happen once, and might be at a time when your table has not yet been populated with representative data. But even when statistics were collected at a reasonable moment, data distributions shift over time (notably histograms) and we very often see customers getting suboptimal query plans simply because they are based on stale statistics.
With 2025.2, we're introducing a new system task that runs during a nightly maintenance window, and for each namespace, considers all tables that have no statistics at all, for which the most recent stats have been invalidated (for example after making changes to the table's storage definition and recompiling), or for which at least 25% of the table data has changed since stats were last collected (approximately, this is measured based on the combined ROWCOUNT for DML operations against that table). Statistics are then collected for these tables one by one, or until the maximum duration for the maintenance task (configurable) is reached. Other configuration options include whether to consider tables with fixed statistics (ON by default), tables that map to remote storage (OFF by default), and tables that are not eligible for a fast sampling technique (OFF by default). If needed, you can mark individual tables using a SKIPAUTOMATICSTATSCOLLECTION
class parameter (or equivalent runtime flag) such that they are skipped by this system task (note that this new flag also affects the pre-existing AutoTune behaviour).
In 2025.2, this system task is OFF by default, as we want to assess with customers the impact this has on IO patterns for real-world systems. We encourage customers to enable the system task and get in touch to share their experiences with respect to system load so we can make any necessary adjustments before switching it ON by default in a future release. At that time, we intend to retire the earlier AutoTune behaviour that might kick in during query processing.
Manual collection of table statistics is of course still supported, and running COLLECT STATISTICS
after major data loads or purges is still a best practice.
What does this mean for your application?
We've taken care to design the new model such that for SQL applications that upgrade to IRIS 2025.2, nothing really changes. Pre-existing statistics will now be treated as fixed statistics, which take precedence over any statistics that would be collected automatically in the background. We advise that customers upgrading to the new model consider the following three recommendations:
Drop fixed statistics after upgrading
If your application or in fact any table had statisticsbefore upgrading (which would be the case for every table if you followed best practices), consider dropping them outright. Post-upgrade, they will be considered fixed statistics and take precedence over any stats you collect explicitly (maybe through some custom scripts calling TUNE TABLE
) or implicitly. Fixed table statistics can be dropped using a new ALTER TABLE t DROP FIXED STATISTICS
command, or the equivalent ALTER SCHEMA s DROP FIXED STATISTICS
. You can use the %NOFIXEDSTATS
hint to verify the impact on individual queries upfront, as you prefer.
Only when you believe the predefined stats must remain in place to maintain the current query plans, and no changes to data distributions are expected, would one stick with fixed statistics.
Consider statistics for system classes
It's worth noting that the new collected stats model is applicable to tables whose definition resides in a read-only database, including IRIS system tables such as Ens.MessageHeader
. For such tables, table stats were impractical or even impossible to maintain before, but now they will become relevant and, when you have the collection system task enabled, will be maintained automatically. When using Interoperability productions or other pieces of IRIS infrastructure that may involve SQL under the hood, we recommend keeping an eye out for performance of search and similar operations after upgrading, and collect statistics manually or enable the system task.
When using Message Search, we noted significant improvements in query performance after adopting the new model, as it could now use much more realistic selectivity and other stats, where previously it relied on crude defaults. As such, we have dropped the fixed statistics we shipped with these system tables.
Take care when importing and exporting across versions
When exporting table definitions with the /exportselectivity=1
flag (the default), statistics are included using a new format that caters to both fixed and the most recent collected statistics, but is incompatible with earlier versions. To support importing into IRIS instances running earlier releases, use /exportversion=2025.1
or /exportselectivity=0
as appropriate. Please note that collected statistics that are exported using this format will still be imported as collected statistics, and not silently become fixed because they were included in the class definition export file. Given these nuances, you may want to review your source control strategy to ensure the right information is tracked for your deployment model. Note also the symmetrical /importselectivity
flag that can be used when importing class definitions.
The Import()
and Export()
methods in the $SYSTEM.SQL.Stats.Table
class have also been extended with an additional type argument to properly differentiate between both types of stats. See the class reference for details.
Future work
This release includes all the relevant infrastructure to take advantage of the new model. Apart from any minor convenience enhancements, we plan the following two updates in a next release (presumably 2025.3)
Enabling the automatic collection task
As noted earlier, we introduced a new system task for automatic collection of table statistics during a nightly batch window but kept it disabled in order to avoid unexpected IO load post-upgrade. In a next release, we will switch this to on by default, if not already enabled by the customer at that point.
Smarter sample size
We noted that our fast block sampling algorithm may in some cases overestimate the selectivity of highly selective fields in large tables, meaning the optimizer might not use corresponding indices when that would have been optimal. We're implementing an algorithm that will dynamically refine the sample size if we detect too much variability between samplings. This change is mostly independent of the overall infrastructure change described earlier in this article.
Share your experiences
We've been looking forward to rolling out this change for a long time, and are really glad it's now being included with IRIS 2025.2. We've taken care to test this change in many different setups and are confident the new infrastructure is robust, but realize the changes may touch highly customized deployment practices, including namespace mappings and code packaging. Therefore, we welcome any feedback you'd have, and look forward to making the new infrastructure make your life easier.