Quarterly SingingSQL Newsletter #6
If you wish to unsubscribe at any time, just drop me a line at email@example.com.
I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter06. I’ll also have links scattered throughout for supplementary materials.
Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.
If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at firstname.lastname@example.org – I’m always happy to hear from you!
I got a little tied up, so this newsletter will suffice for second and third quarter of 2007.
To see a cumulative listing of all news, including old news from past newsletters, see all news.
To see a cumulative listing of all links, including links already listed in past newsletters, see all links.
A Taxonomy of Problematic SQL: In this seminar, I present a formal way to classify most of the SQL that proves problematic from both a performance perspective and a functional perspective. Armed with this classification scheme, it is much easier to recognize problems and their solutions. Here is the whitepaper that goes with the presentation.
A surprising amount of tuning work goes into overriding optimizers’ tendency to join tables with hash joins and sort-merge joins to full table scans. It’s not obvious why optimizers, which have been refined for years, should still seem to under-favor the nested-loops alternative, but I have some thoughts on the subject:
Most data in a business application consists of bundles of information detailing a business event, an entity that can be located on a timeline in the history of the business. The most obvious prototype for the business event is an order, but information related to that event will be tracked in several tables that would join directly or indirectly to the Orders table, including tables tracking order details, shipments, invoices, payments, commissions, et cetera. The entire cluster of information related to the highest-level master table in the hierarchy will likely be created within a short time window, as one business event triggers rapid follow-up events, and many rows in master-detail relationships typically will be created in a single transaction, such as orders and order details, which would each be meaningless without the other.
When we query business data for most purposes, we typically need to see data related to quite recent business events. These recent events may be important to query because we need to monitor the current health of the business or because the events are still unfinished, with tasks required to fully complete the business process related to the order or other business event. Business applications boil down to tools to trigger the business to make the right actions and decisions, today. “Ancient history,” which is to say an event older than about a year, is rarely relevant to business actions and decisions that need to be made today, and even events a month or two old are not often relevant to day-to-day business operations.
In a typical heap table, the data for these recent events resides together at the top of the table, or occasionally together in older blocks that were purged of data so old that it is longer needed. Because the rows representing the most recent data reside together in a small subset of the table blocks, an execution plan for a query of these rows tends to find multiple rows needed by the query in each of these blocks. This self-caching effect is very useful, where blocks needed by the query are cached early in the query, then reused from the cache multiple times during the query. Even early in the query, the first time these blocks are needed, they are likely already to be cached by other recent queries, which also tend to need blocks from that small subset of the table that holds the recent rows.
When we query a really tiny subset of an events-data table, and the database can see that the subset is tiny, then the optimizer has no trouble figuring out that the joins to related-events tables will also read just a tiny subset of those tables, and nested loops plans to those tables are an easy choice. There are two cases, though, where the choice stymies the typical optimizer:
Consider each of these cases, in turn. As savvy human tuners, we should know that a typical business report provides a small enough set of return data that a human would find the report useful. Ten or twenty pages is about the outside limit of report length a human is likely to read from end to end, corresponding to the result of a maximum of about 1000 rows returned from a query. Even in a data-warehousing context, really huge query results should be the exception, rather than the rule. Therefore, if the filters on a query appear unselective, the savvy human tuner should suspect that either the report is poorly designed, from the perspective of the user-interface, or the filters, in combination, are actually more selective than they appear. This happens surprisingly often in business queries, which tend to look for exceptions that could be described abstractly as:
“If X is true, then Y should not also be true.”
The query for exceptions to the rule, which would call for business action (either action to fix the specific exception, or, better still, action to analyze and prevent future occurrences of the exception), would look like
where not <X> and not <Y> and <joins to other tables with supporting data>;
My favorite example of this sort of query is a query looking for orders that are neither closed nor recent, since orders should not stay in the “open” state long. Often (assuming well-designed business processes), exceptions are very rare, but each of the conditions (not <X> and not <Y>) by itself may be quite common. Since optimizers almost invariably assume statistical independence for pairs of conditions like this (a very wrong assumption in these cases!), they frequently grossly overestimate how many rows they’ll reach at the point where both conditions can be applied, and this causes a gross overestimate in the rows to be joined to subsequent tables, case #1, above. In these cases, if the optimizer understood what the savvy human tuner understands, that the query probably won’t return over a thousand rows, then the optimizer would tend to favor nested loops following the join key, rather than an incorrect choice to hash join or sort-merge join to a full table scan of the later table. Unfortunately, the optimizer has no preconception that rowcounts from reasonable queries tend to be small, and it makes the choice to favor a plan that is only justified if that result turns out to be unreasonably large, as it simplistically appears it will be.
Now consider case #2 above, the case that the rowcount really is quite large, or at least it is large before some group-by sums up a large result set. (Reports of over 1000 rows should be rare, but short reports that sum over 1000 rows make perfect sense in the business context!) As our example, let’s say we want to look at the most recent month of data out of a 4-year history. Let’s further assume that there are 96 rows of data per database block in the driving table. Now, consider the join to a related events table, which we’ll assume has a three-deep index tree on the join key, and twice as many blocks as the driving table. We won’t count the root block in the index tree in our calculations, because we assume that root blocks are perfectly cached.
If we follow nested loops to the second table, for each row from the driving table, we’ll do 2 logical I/Os (not counting the root block) to the join-key index, and a logical I/O to the joined-to table block, or 3 logical I/Os in all per joined-from row. Since we are reading 1/48th of the joined-from table, but have 96 rows/block in that table, we’re reading twice as many rows as we have blocks in that table, which is the same number of rows as the number of blocks in the joined-to table. Since we counted 3 logical I/Os (not counting the root index blocks) per joined-from row, we count 3 times as many logical I/Os to reach the joined-to table by nested loops as the total block count in that same table! All the optimizer has to do to favor a hash join or a sort-merge join to a full table scan of that joined-to table is to decide that reading the table once in a single pass is better than reading three-times as many blocks one block at a time, with logical I/Os driving through a join-key index, in nested loops! Unquestionably, we should expect a higher logical I/O count with the nested-loops plan here, so does that mean the nested-loops plan is inferior?
Assume, first, that all blocks are cached. There is a CPU overhead simply to perform a logical I/O, and the nested-loops plan will surely cost more CPU for its logical I/Os. Once we have performed the logical I/O, however, there is also overhead for what we do with the block. In the nested-loops plan, we read a small fraction of the block. The runtime for both the logical I/O and reading this small fraction is typically under 10 microseconds. Unless the rowcounts reached are truly huge, or the join-order is inefficient, reaching far more rows than the rowcounts that satisfy all the filters, the CPU costs for these nested loops are insignificant! (With bad join orders, hash joins to full table scans are much more likely to be a “win” than with correct join orders!) Consider the blocks we read for the join alternative that reaches the joined-to table with a full table scan, however: For these blocks, the database must view the entire block, every row in the block, and this typically takes much longer than the 10 milliseconds per block we need for the nested-loops alternative. In all, CPU efficiency shows a trade-off – more CPU for just for the individual logical IOs for the nested-loops alternative, here, but less CPU for the work done inside each block, and the nested-loops plan is better than it appears, from just a CPU-consumption perspective.
Now assume that the blocks are not all so well cached. This is much more realistic, if these tables are big enough that we really need to worry about the tuning problem. The full table scan encounters blocks (or, in the case of Oracle, multi-block groups of blocks, typically 8-block groups of 8K blocks) from the entire table, including the majority of the table that qualifies as “ancient history.” These old blocks will be extremely poorly cached, so physical I/O will be high for the non-nested-loops alternative. Consider the nested-loops alternative, though: The joined-to key values and joined-to related-events rows will tend to be roughly one month old, or less, just like the rows in the driving table. There are likely no more than a few hundred index blocks, at the most, holding the necessary recent join-key data. These are likely well-cached before the query even starts, but even if they aren’t they will be read in and cached (and re-used many times) for the remainder of the query during the first couple of seconds, likely. The table blocks for the most-recent month of the related-events table will be slightly less-well cached, at first, but even these blocks will tend to be read in roughly the same order they are stored, at most a physical-I/O count equal to about 1/48th of the blocks in the table. Even this physical I/O count exaggerates the work on disk, because typically read-ahead performed in the disk subsystem will act like a multi-block read, reaching just the blocks needed next before they are needed, and caching those blocks in the disk subsystem memory so that most of what the database thinks is a physical I/O request turns out actually to be a super-fast read from the disk-subsystem cache. In all, the expected time for the physical I/Os to follow the nested-loops alternative is vastly better than the non-nested-loops alternatives, in this example, and the time-savings is overwhelmingly more important than any potential cost for extra logical I/Os.
Time-related data tends to cluster together in tables, and time-related data tends to join to other time-related data in well-clustered ways, as well. I refer to this as automatic co-clustering. From the physical I/O perspective, two tables that have joined rows generally created at about (or exactly) the same time act almost as if their rows were stored together in the same blocks – the number of physical I/Os and physical I/O time (which tends to dominate in well-tuned cases like this) necessary to read joined rows grouped well in creation time is roughly the same as it would be if the tables were reorganized into Oracle multi-table clusters.
Consider a join of the most recent 100 blocks of Orders rows with the most recent 200 blocks of Order_Details: However high the logical I/O for a nested-loops plan might be, the query will hit just 300 distinct table blocks and perhaps a dozen of so join-key blocks (which were likely cached, anyway), and even a pessimist shouldn’t expect more than 300 physical I/Os. If we reasoned that Orders and Order_Details are invariably joined to one another, and belong in a two-table Oracle cluster, the very same rows would fit in roughly the last 300 blocks of such a cluster, for no significant net savings in physical I/O! In precisely the scenarios where clusters look attractive (rows of related tables are predictably created together), the savings in physical I/O turn out to be trivial or non-existent! There is a large savings in logical I/O, but in well-designed queries like this, this matters little. The biggest difference between the clustered and non-clustered example is in the optimizer’s (mistaken) estimate, in cases like this – the optimizer will correctly estimate a low cost for the clustered case, but it will (incorrectly) estimate a high cost for the case of the nested-loops join between single tables.
Is there any case where the optimizer is right to be
pessimistic? It turns out that there is such a case – if the driving condition
is not correlated with row-creation time (or if the table rows have become
scrambled with respect to row-creation time, see
To see a cumulative listing of all research problems (the one below, and the earlier ones from the earlier newsletters), see all research problems. There has been no solution submitted yet for the earlier-published research problems.
What data would optimizers need to calculate the tendency for two specific tables queried on a specific filter to co-cluster, so that the optimizer could correctly estimate self-caching in the joined-to table? What would be the formula to apply that data to a correct estimate of the performance benefits of co-clustering?