Quarterly SingingSQL Newsletter #6

September 28, 2007

Introduction

If you wish to unsubscribe at any time, just drop me a line at dantow@singingsql.com.

I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter06. I’ll also have links scattered throughout for supplementary materials.

Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.

If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at dantow@singingsql.com – I’m always happy to hear from you!

I got a little tied up, so this newsletter will suffice for second and third quarter of 2007.

SingingSQL News of Public Events

Old Old News

To see a cumulative listing of all news, including old news from past newsletters, see all news.

Old news

I presented my old seminar, Fixing Broken SQL at the Northern California Oracle Users Group (NoCOUG) conference in Redwood Shores, California, in February 2007.
I presented my new seminar, Getting SQL Performance Right the First Try (Most of the Time) at the HOTSOS conference in Dallas in March 2007. To my knowledge, this was the first-ever sing-along HOTSOS presentation! (There is a recurring “chorus” in this presentation: “If you don’t know enough to… (something you ought to know before starting to write the SQL)… You don’t know enough to write the SQL!” The HOTSOS crowd, good sports all, sang the chorus along with me, each time this chorus came up! I was, however, merciful enough to my very tolerant audience to speak the rest of the presentation. Here is the whitepaper that goes with it.

I presented my new seminar, A Taxonomy of Problematic SQL at the Collaborate07 OAUG conference in Las Vegas April, 2007. Here is the whitepaper that goes with it.
I plan to expand my class offerings outside of the San Francisco Bay Area, but I’d like to do this mainly through partners who already know prospective students in their area, likely partners who receive this newsletter. If you think you can find 6 or more paying students in your area, and you’d be interested in a chance to share in the profit from the class, drop me a line, and we can work out a deal for a class in your area. (If you work for a company, and your company wants a class just for its own people, I already have an offering for classes presented on demand for company-internal audiences.)

I’ve had a few people ask about coming class offerings, and it occurred to me that I don’t have a good way to announce them if they come up suddenly, because I want to keep my promise that this newsletter only comes just four times a year. Therefore, I’m starting a new list for anyone interested in announcements of classes and other offerings (such as presentations to local OAUGs) that might come up without enough warning to make it into this newsletter. If you are interested (and there is certainly no commitment in this, on your part) just drop me a line at dantow@singingsql.com, and I’ll add you to this new list. Please be sure to include your location and some range of area within which you’d consider attending something if I happened by. The area can be as broad as a continent or as narrow as a town – “Let me know if anything comes up in Asia.” to “Let me know if you have a class coming to Darwin, Australia.” Even if you’re off the beaten path, we might get lucky. I won’t send notes for this second list except to the people in the area where I’m offering something, and I doubt you’ll see more than a note or two from this new list per year, even if your area is broad.

New news

I have nothing scheduled at this time.

Links

Old Links

To see a cumulative listing of all links, including links already listed in past newsletters, see all links.

New Links

A Taxonomy of Problematic SQL: In this seminar, I present a formal way to classify most of the SQL that proves problematic from both a performance perspective and a functional perspective. Armed with this classification scheme, it is much easier to recognize problems and their solutions. Here is the whitepaper that goes with the presentation.

Why Nested Loops Win So Often

A surprising amount of tuning work goes into overriding optimizers’ tendency to join tables with hash joins and sort-merge joins to full table scans. It’s not obvious why optimizers, which have been refined for years, should still seem to under-favor the nested-loops alternative, but I have some thoughts on the subject:

Most data in a business application consists of bundles of information detailing a business event, an entity that can be located on a timeline in the history of the business. The most obvious prototype for the business event is an order, but information related to that event will be tracked in several tables that would join directly or indirectly to the Orders table, including tables tracking order details, shipments, invoices, payments, commissions, et cetera. The entire cluster of information related to the highest-level master table in the hierarchy will likely be created within a short time window, as one business event triggers rapid follow-up events, and many rows in master-detail relationships typically will be created in a single transaction, such as orders and order details, which would each be meaningless without the other.

When we query business data for most purposes, we typically need to see data related to quite recent business events. These recent events may be important to query because we need to monitor the current health of the business or because the events are still unfinished, with tasks required to fully complete the business process related to the order or other business event. Business applications boil down to tools to trigger the business to make the right actions and decisions, today. “Ancient history,” which is to say an event older than about a year, is rarely relevant to business actions and decisions that need to be made today, and even events a month or two old are not often relevant to day-to-day business operations.

In a typical heap table, the data for these recent events resides together at the top of the table, or occasionally together in older blocks that were purged of data so old that it is longer needed. Because the rows representing the most recent data reside together in a small subset of the table blocks, an execution plan for a query of these rows tends to find multiple rows needed by the query in each of these blocks. This self-caching effect is very useful, where blocks needed by the query are cached early in the query, then reused from the cache multiple times during the query. Even early in the query, the first time these blocks are needed, they are likely already to be cached by other recent queries, which also tend to need blocks from that small subset of the table that holds the recent rows.

When we query a really tiny subset of an events-data table, and the database can see that the subset is tiny, then the optimizer has no trouble figuring out that the joins to related-events tables will also read just a tiny subset of those tables, and nested loops plans to those tables are an easy choice. There are two cases, though, where the choice stymies the typical optimizer:

The filter on the driving table reaches a tiny fraction of the events, but the optimizer misestimates this fraction, estimating a much larger rowcount passing the driving filter than is the actual case.
The filter on the driving table reads a moderately large fraction of the data, for example a whole month of data out of a 4-year history.

Consider each of these cases, in turn. As savvy human tuners, we should know that a typical business report provides a small enough set of return data that a human would find the report useful. Ten or twenty pages is about the outside limit of report length a human is likely to read from end to end, corresponding to the result of a maximum of about 1000 rows returned from a query. Even in a data-warehousing context, really huge query results should be the exception, rather than the rule. Therefore, if the filters on a query appear unselective, the savvy human tuner should suspect that either the report is poorly designed, from the perspective of the user-interface, or the filters, in combination, are actually more selective than they appear. This happens surprisingly often in business queries, which tend to look for exceptions that could be described abstractly as:

“If X is true, then Y should not also be true.”

The query for exceptions to the rule, which would call for business action (either action to fix the specific exception, or, better still, action to analyze and prevent future occurrences of the exception), would look like

Select …

where not <X> and not <Y> and <joins to other tables with supporting data>;

My favorite example of this sort of query is a query looking for orders that are neither closed nor recent, since orders should not stay in the “open” state long. Often (assuming well-designed business processes), exceptions are very rare, but each of the conditions (not <X> and not <Y>) by itself may be quite common. Since optimizers almost invariably assume statistical independence for pairs of conditions like this (a very wrong assumption in these cases!), they frequently grossly overestimate how many rows they’ll reach at the point where both conditions can be applied, and this causes a gross overestimate in the rows to be joined to subsequent tables, case #1, above. In these cases, if the optimizer understood what the savvy human tuner understands, that the query probably won’t return over a thousand rows, then the optimizer would tend to favor nested loops following the join key, rather than an incorrect choice to hash join or sort-merge join to a full table scan of the later table. Unfortunately, the optimizer has no preconception that rowcounts from reasonable queries tend to be small, and it makes the choice to favor a plan that is only justified if that result turns out to be unreasonably large, as it simplistically appears it will be.

Now consider case #2 above, the case that the rowcount really is quite large, or at least it is large before some group-by sums up a large result set. (Reports of over 1000 rows should be rare, but short reports that sum over 1000 rows make perfect sense in the business context!) As our example, let’s say we want to look at the most recent month of data out of a 4-year history. Let’s further assume that there are 96 rows of data per database block in the driving table. Now, consider the join to a related events table, which we’ll assume has a three-deep index tree on the join key, and twice as many blocks as the driving table. We won’t count the root block in the index tree in our calculations, because we assume that root blocks are perfectly cached.

If we follow nested loops to the second table, for each row from the driving table, we’ll do 2 logical I/Os (not counting the root block) to the join-key index, and a logical I/O to the joined-to table block, or 3 logical I/Os in all per joined-from row. Since we are reading 1/48^th of the joined-from table, but have 96 rows/block in that table, we’re reading twice as many rows as we have blocks in that table, which is the same number of rows as the number of blocks in the joined-to table. Since we counted 3 logical I/Os (not counting the root index blocks) per joined-from row, we count 3 times as many logical I/Os to reach the joined-to table by nested loops as the total block count in that same table! All the optimizer has to do to favor a hash join or a sort-merge join to a full table scan of that joined-to table is to decide that reading the table once in a single pass is better than reading three-times as many blocks one block at a time, with logical I/Os driving through a join-key index, in nested loops! Unquestionably, we should expect a higher logical I/O count with the nested-loops plan here, so does that mean the nested-loops plan is inferior?

Assume, first, that all blocks are cached. There is a CPU overhead simply to perform a logical I/O, and the nested-loops plan will surely cost more CPU for its logical I/Os. Once we have performed the logical I/O, however, there is also overhead for what we do with the block. In the nested-loops plan, we read a small fraction of the block. The runtime for both the logical I/O and reading this small fraction is typically under 10 microseconds. Unless the rowcounts reached are truly huge, or the join-order is inefficient, reaching far more rows than the rowcounts that satisfy all the filters, the CPU costs for these nested loops are insignificant! (With bad join orders, hash joins to full table scans are much more likely to be a “win” than with correct join orders!) Consider the blocks we read for the join alternative that reaches the joined-to table with a full table scan, however: For these blocks, the database must view the entire block, every row in the block, and this typically takes much longer than the 10 milliseconds per block we need for the nested-loops alternative. In all, CPU efficiency shows a trade-off – more CPU for just for the individual logical IOs for the nested-loops alternative, here, but less CPU for the work done inside each block, and the nested-loops plan is better than it appears, from just a CPU-consumption perspective.

Now assume that the blocks are not all so well cached. This is much more realistic, if these tables are big enough that we really need to worry about the tuning problem. The full table scan encounters blocks (or, in the case of Oracle, multi-block groups of blocks, typically 8-block groups of 8K blocks) from the entire table, including the majority of the table that qualifies as “ancient history.” These old blocks will be extremely poorly cached, so physical I/O will be high for the non-nested-loops alternative. Consider the nested-loops alternative, though: The joined-to key values and joined-to related-events rows will tend to be roughly one month old, or less, just like the rows in the driving table. There are likely no more than a few hundred index blocks, at the most, holding the necessary recent join-key data. These are likely well-cached before the query even starts, but even if they aren’t they will be read in and cached (and re-used many times) for the remainder of the query during the first couple of seconds, likely. The table blocks for the most-recent month of the related-events table will be slightly less-well cached, at first, but even these blocks will tend to be read in roughly the same order they are stored, at most a physical-I/O count equal to about 1/48^th of the blocks in the table. Even this physical I/O count exaggerates the work on disk, because typically read-ahead performed in the disk subsystem will act like a multi-block read, reaching just the blocks needed next before they are needed, and caching those blocks in the disk subsystem memory so that most of what the database thinks is a physical I/O request turns out actually to be a super-fast read from the disk-subsystem cache. In all, the expected time for the physical I/Os to follow the nested-loops alternative is vastly better than the non-nested-loops alternatives, in this example, and the time-savings is overwhelmingly more important than any potential cost for extra logical I/Os.

Time-related data tends to cluster together in tables, and time-related data tends to join to other time-related data in well-clustered ways, as well. I refer to this as automatic co-clustering. From the physical I/O perspective, two tables that have joined rows generally created at about (or exactly) the same time act almost as if their rows were stored together in the same blocks – the number of physical I/Os and physical I/O time (which tends to dominate in well-tuned cases like this) necessary to read joined rows grouped well in creation time is roughly the same as it would be if the tables were reorganized into Oracle multi-table clusters.

Consider a join of the most recent 100 blocks of Orders rows with the most recent 200 blocks of Order_Details: However high the logical I/O for a nested-loops plan might be, the query will hit just 300 distinct table blocks and perhaps a dozen of so join-key blocks (which were likely cached, anyway), and even a pessimist shouldn’t expect more than 300 physical I/Os. If we reasoned that Orders and Order_Details are invariably joined to one another, and belong in a two-table Oracle cluster, the very same rows would fit in roughly the last 300 blocks of such a cluster, for no significant net savings in physical I/O! In precisely the scenarios where clusters look attractive (rows of related tables are predictably created together), the savings in physical I/O turn out to be trivial or non-existent! There is a large savings in logical I/O, but in well-designed queries like this, this matters little. The biggest difference between the clustered and non-clustered example is in the optimizer’s (mistaken) estimate, in cases like this – the optimizer will correctly estimate a low cost for the clustered case, but it will (incorrectly) estimate a high cost for the case of the nested-loops join between single tables.

Is there any case where the optimizer is right to be pessimistic? It turns out that there is such a case – if the driving condition is not correlated with row-creation time (or if the table rows have become scrambled with respect to row-creation time, see LINK), then nested loops may need nearly as many logical I/Os as physical I/Os. For example, if we query all the orders for a particular customer over all time, the driving-table rows will be scattered, and so will the joined-to Order_Details. Consider, though, that well-designed application queries should rarely look like this – such a query will get progressively slower as the tables accumulate more and more history, and why would we want to see “ancient” records of a customer’s orders, anyway? Most queries should reach some time-correlated condition (often combined with a non-time-correlated condition, such as customer ID) in the first mutli-row table read (which may be preceded by one or more single-row reads, such as a read of a single customer record), which then drives to related event-type records created around the same time, using nested loops.

Research Problems

Old Research Problems

To see a cumulative listing of all research problems (the one below, and the earlier ones from the earlier newsletters), see all research problems. There has been no solution submitted yet for the earlier-published research problems.

New problem: Estimating Self-Caching for a Join to a Co-Clustered Table

What data would optimizers need to calculate the tendency for two specific tables queried on a specific filter to co-cluster, so that the optimizer could correctly estimate self-caching in the joined-to table? What would be the formula to apply that data to a correct estimate of the performance benefits of co-clustering?

Home