Quarterly SingingSQL Newsletter #2
If you wish to unsubscribe at any time, just drop me a line at firstname.lastname@example.org. I’ll likely eventually get around to setting up a proper email group, through some third party, so that people can subscribe and unsubscribe without communicating with me, but this is simplest for a start, I think.
I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter02. I’ll also have links scattered throughout for supplementary materials.
Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.
If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at email@example.com – I’m always happy to hear from you!
To see a cumulative listing of all news, including old news from past newsletters, see all news.
To see a cumulative listing of all links, including links already listed in past newsletters, see all links.
Fixing Broken SQL: In this presentation to Hotsos, I presented material with the following abstract: Most developers spend most of their time maintaining code they did not originate. Frequently, developers must fix SQL without knowing its business purpose; this is especially true of performance specialists, who often work on SQL for applications, or parts of applications, that are unfamiliar. Surprisingly, there are patterns in most incorrect SQL that you can learn to recognize and repair even without knowing the purpose of the SQL. This presentation reviews a breakdown of the most common recognizable patterns in broken SQL, and teaches how to find and fix what’s broken with minimal information apart from the bare SQL. The resulting repairs frequently enable better performance, usually fix corner-case functional defects, and invariably make the SQL clearer and easier to maintain. For tuning specialists, an understanding of these patterns enables an unexpected bonus service—while tuning someone else’s SQL, you also find and fix a subtle functional problem they did not even know existed.
I recently ran into this problem on a major production system, and I suspect it is hurting many other systems, so I offer this as a short object lesson. It may be primarily an issue on Oracle, since Oracle makes it so easy and tempting to fall into this trap. (Question to my readers who use databases other than Oracle: Does your database support table-rebuild syntax that would make this trap all-too-easy to fall into, too? I’ll publish your answer and cite your name (if you wish).) However, even on non-Oracle databases that do not make it so easy to fall into this trap, you could still have this problem if you engineered a parallel table rebuild manually.
Here’s the scenario: Users reported a serious drop in performance for a module. I’d already been running automated collection of “V$” data from Oracle to see the breakdown of runtime for all modules according to SQL statement, and to capture the actual execution plans seen at the time by those SQL statements. Rapid analysis of this already-available data demonstrated that one particular statement was responsible for the performance degradation (creating much more load after the reported problem than before), but that statement was using the same plan as before the problem occurred, and reading the same number of rows as before. The extra runtime was accounted for (again, according to my already-collected data) by a much poorer hit ratio for the query (by a greater number and cumulative duration of physical I/Os, that is, while seeing the same number of logical I/Os as seen before the problem). By “poorer hit ratio” (if you aren’t familiar with the term “hit ratio”) I mean that a larger fraction (compared to before the problem) of the block reads that the query needed were going straight to physical disk, because they were not in the block buffer cache maintained for recently-read blocks in Oracle. (All databases cache most recently read blocks, so this is not anything special to Oracle.) I shortly learned that a large table involved in the query had been rebuilt by database administration, in order to help performance, yet it turned out that the rebuild was precisely what had killed performance, of this query and, as it turned out, of many other queries!
Here is the all-too-tempting trap that Oracle offers anyone who is rebuilding a large table:
A common way to rebuild a table is to execute a statement such as
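For example, a serial copy of the table might look like this (NEW_TABLE is an illustrative name for the new copy; OLD_TABLE is the current table being rebuilt):

```sql
-- Illustrative names: copy every row of OLD_TABLE, serially,
-- into a freshly allocated table.
CREATE TABLE NEW_TABLE AS
SELECT * FROM OLD_TABLE;
```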
(This would be followed, normally, by dropping OLD_TABLE (or maybe renaming it to something like OLD_TABLE_BACKUP, if you’re being conservative), then renaming the new table to the original table’s name.)
Where the table being rebuilt is really large (and really large tables are usually the only ones that need to be rebuilt), Oracle has helpfully provided a super-simple way to speed this up, however:
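On Oracle, the parallel version can be as simple as (again with illustrative names):

```sql
-- Same copy, but Oracle may now use up to 10 parallel engines
-- for the scan of OLD_TABLE and the build of NEW_TABLE.
CREATE TABLE NEW_TABLE PARALLEL 10 AS
SELECT * FROM OLD_TABLE;
```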
Wow, just by adding “PARALLEL 10”, you get a roughly tenfold speedup of the table-copying statement! (The PARALLEL clause specifies that table-creation happens with that degree of parallelism, if possible – multiple engines work on the rebuild simultaneously. “10” is just used as an example, here – you can choose any degree of parallelism that you want, within reason.) Now, a smart DBA will also read the fine print, and learn that the PARALLEL clause also specifies that future queries against this table will try to run with the same specified degree of parallelism, if it would help. Since that is frequently not a good idea, and since the smart DBA is, after all, just looking to speed up the rebuild, and is not looking to destabilize future execution plans, the smart DBA (assuming that the former PARALLEL setting for the table was “1”) will follow up the rebuild with:
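That follow-up statement would be something like:

```sql
-- Reset the table-level default so future queries against the
-- rebuilt table do not silently run with degree-10 parallelism.
ALTER TABLE NEW_TABLE NOPARALLEL;
```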
This speedup of the rebuild very understandably seemed like an obviously good thing, so the DBAs took full advantage of it. Since my client’s DBAs are excellent, they had no trouble getting this right, and correctly reset the table parallelism to NOPARALLEL after the table rebuild, since we have no need for default parallelism on the database (all SQL being so fast without it that parallelism isn’t needed at all).
Normally, the result of a table rebuild will be a more-compact table and indexes, and elimination of chaining (rows that don’t fit in the same block they started in, as a result of updates making rows longer after they are created), and this would lead everyone to expect better caching after the rebuild than before, so what happened?!? (Think about it for a minute before reading on, if you want an exercise.)
Here’s what happened, under the covers: Oracle has no more-efficient way to read rows quickly in an engine (given that it needs to eventually read all the rows) than a full table scan. If it is going to spread the load over 10 engines, however, there is no point in every engine reading the whole table. The trick, then, is to assign sub-parts of the full table scan to each engine. If the table is 100000 blocks long, Oracle will assign the first engine a sort of sub-table-scan of the first 10000 blocks (a sort of full table scan that stops at the 10000th block, that is), the second engine a sub-table-scan of the second 10000 blocks (a scan that starts at the 10001st block and ends at the 20000th block), and so forth. All of these engines, however, insert the rows that they read, as they read them, at the same time, into the top block of the new table being built! If you used “PARALLEL 2”, this would look (treating each row as a “card”) like shuffling a deck of cards once (an analogy particularly for my poker-playing readers). A ten-way shuffle is harder to visualize, but that’s what you get with “PARALLEL 10”.
(Use of freelists might help, here. Freelists could mean that each of the parallel engines would insert into different blocks, which would at least avoid shuffling rows within any given block, a major improvement. However, there would still be a shuffling effect, because blocks that used to be close together would end up far apart. Most disk subsystems do “read ahead” that pre-loads nearby disk blocks into a disk-subsystem cache (which is distinct from the database cache). Where nearby blocks are somehow related to each other (for example, representing rows that originated at almost the same time), “physical” I/Os (from the perspective of the database) against those nearby blocks tend to be satisfied from the disk-subsystem cache (meaning that they are not really physical, at the disk-subsystem level), so the average time to perform a “physical” I/O is lower when this disk-subsystem cache is effective. The disk-subsystem cache of read-ahead blocks will be much less effective, however, if blocks end up shuffled willy-nilly.)
Now, there should never be a functional issue with shuffling rows in a table – one of the fundamental tenets of relational database theory is that the database is not obligated to store rows in any particular order, physically, and is only obligated to return rows in a particular order if you explicitly call for that with an ORDER BY clause. Unfortunately, another fundamental tenet of relational database theory is that the database makes no promise of how long it will take to return the rows, so once we consider performance, we have to step outside the theoretical box.
In the real world, then, almost all large tables have some rows that are needed frequently, and other rows that are rarely needed. There is a strong tendency for the most-often-needed rows to be created at close to the same time. For example, in most tables that track business events (orders, payments, account entries, customer events, et cetera), the most-recent rows are much more frequently queried than older rows. In a few cases, such as a customers table, the oldest rows tend to represent the most-important entities, but few large tables are accessed in such a way that users read old rows with the same frequency as new rows. Normal, heap-type tables tend to place new rows all together, usually at the top of the range of blocks storing each table, so there is a natural tendency to cluster “hot” rows, the rows that we need most frequently. This is very useful when the database caches blocks, because once some query (possibly our own, but potentially someone else’s) reads a row, the cached block holding that row is very likely to hold other useful rows that will be accessed before the block leaves the cache. The more the hot rows clump together, the fewer blocks we need to cache in order to hold all (or nearly all) of the hot rows.
If all the hot rows are stored together at the top of the table structure, though, and we then shuffle the table by copying it in a parallel rebuild, the new copy will be much more poorly clustered, for example, scattering the former top-tenth of the table throughout the new copy, where the copy is done with “PARALLEL 10”! With poorer clustering, a given-sized cache will then deliver a much poorer hit ratio, and this is just what happened in my scenario – the table rebuild had destroyed the natural clustering of the hot rows, and the hit ratio for that table had dropped through the floor, as a result.
As a result, I can see few scenarios where a parallel table rebuild of an ordinary heap table would be a good idea in the long run, and I recommend extreme caution if you contemplate doing this. (If you are rebuilding a single table, you can probably take the time to do it serially. If you are rebuilding many tables together, you can probably achieve the parallelism that you need by using multiple parallel scripts that each rebuild different tables at the same time, but that rebuild each individual table serially. If you have such a huge table that a serial rebuild would be a problem for just the one table, then you probably should make that table a partitioned table, so that each individual partition can be rebuilt serially, but the several partitions can be rebuilt in parallel, if necessary.)
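On Oracle, for instance, the per-partition rebuilds could each be a simple serial statement (names illustrative), with several such statements running concurrently in separate sessions:

```sql
-- Rebuild one partition, serially; run one such statement per
-- partition, in concurrent sessions, if you need parallelism.
ALTER TABLE BIG_TABLE MOVE PARTITION P_2006_Q1;

-- Moving a partition leaves its local index partitions unusable,
-- so rebuild those afterward, too.
ALTER INDEX BIG_TABLE_IX1 REBUILD PARTITION P_2006_Q1;
```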
I’d like to describe a class of SQL tuning problems that I glossed over, somewhat, in SQL Tuning. A fraction of the problems in most business applications, and a somewhat larger fraction of the problems encountered in the course of database administration, can be described roughly as follows:
There are two very large sets of result rows, which we’ll call “Set-A” and “Set-B”, which are relevant to the query we want to build. Both these result sets (rowsets) are so large that even the most efficient possible query of either set, alone, would be too slow. As is the case for most reasonable queries, however, the result set that we really want is a small one (sometimes even an empty one – we may just be querying to confirm that there is no problem, where a non-empty end result would point to a problem!). This small desired end-result set is either the intersection of, or the difference between, the two very large rowsets Set-A and Set-B.
If you visualize the query diagram for the hypothetical query of Set-A, you should imagine a normal-looking join tree having no highly-selective filter ratios attached, since a highly-selective filter ratio (a filter ratio close to zero, that is) would prevent Set-A from being so large. The same goes for the diagram for the hypothetical query of Set-B. For example, the query diagram for Set-A might look like:
Meanwhile, the query diagram for Set-B might look like:
Finally, then, the query diagram for the end-result query might look like
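As a rough, hypothetical sketch (in the query-diagram notation of SQL Tuning, with filter ratios beside the nodes and detail tables above their masters, assuming, illustratively, that A has a foreign key to B, and B to C), the end-result diagram would look something like:

```
A  0.4      (filter discards 60% of A's rows)
|
B  0.5      (filter discards half the remaining rows)
|
C  0.5      (filter discards half again)
```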
Naively, anyone who understands query diagrams, looking at this last query diagram, would expect the query to return roughly 10% of the number of rows in A: the filter on A discards 60% of A’s rows, then the join to B reaches a filter that discards another half of the rows that are left, then the join to C reaches another filter that discards another half of the remaining rows (0.4 × 0.5 × 0.5 = 0.10). Any cost-based optimizer would come to the same conclusion, and would be likely to just call for hash joins or sort-merge joins of full table scans of each of these tables, discarding rows failing the filter conditions during the full table scans. Assuming that A is a very large table, this would not be a small result set. When I put together a query diagram that looks like the one just above, with large tables and poor filters (or even when I just notice that a query appears to have no very selective filter), I suspect that I am likely dealing with an overlap query, a query looking for the small intersection of two large sets, because most reasonable business-application queries just don’t need to return so many rows. (In Chapter 10 of SQL Tuning, I discuss exceptions to this rule, where you find a query that does return a very large rowset, and what to do about them.) In the example, however, the truth value of the filter condition on B is not statistically independent of the truth value of the filter condition on C (the truth values are anti-correlated), and, in fact, the two filter conditions are almost never both true for a B-entity that joins to a C-entity. Therefore, if we join B and C, we end up with not 25% of the rows in B, but, instead, with almost no rows at all (or even no rows), and then a nested-loops join to A following an index on A’s foreign key pointing to B is fast and easy.
If it happens that B and C are both (unlike A) small enough to allow rapid full table scans, then, a hash join of B and C, followed by nested loops to A, will solve the problem very nicely. Even if B (and maybe C, too) is too big for this to be a really fast option, it still may be your best option, short of denormalization.
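On Oracle, the resulting plan could be forced with hints, roughly as in this hypothetical sketch (assuming B has a foreign key C_ID to C, A has an indexed foreign key B_ID to B, and FLAG_COL and STATUS stand in for the two nearly-mutually-exclusive filter conditions):

```sql
-- Hypothetical sketch: hash-join the filtered rowsets from B and C
-- first, then reach A by nested loops on its indexed foreign key.
SELECT /*+ LEADING(B C) USE_HASH(C) USE_NL(A) */
       A.A_ID, B.B_ID, C.C_ID
FROM   B, C, A
WHERE  B.C_ID     = C.C_ID
AND    A.B_ID     = B.B_ID
AND    B.FLAG_COL = 'Y'      -- filter on B (illustrative)
AND    C.STATUS   = 'OPEN';  -- filter on C (illustrative)
```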
Describing this problem in plain English, the conditions on B and C are roughly mutually exclusive, and we want a query to find the rare exceptions to the mutual-exclusivity rule. This is just the sort of thing that we often need to do in business applications, when solving glitches in the business processes, for example, looking for non-shippable orders (license renewals, for example) where the company billed (incorrectly) for shipping. (The good news is that if such exceptions are analyzed correctly, and the root-cause problems are solved, you should end up with an empty result, and you may not need to run the query any longer, or anyway only often enough to confirm that the problem has not returned.) This is also just the sort of thing DBAs often need to do, when looking for corrupt data, orphaned rows, or rows where denormalized data is out of sync with correct, normalized values. As for the business-process glitches, the need for these DBA exception-condition reports goes away or becomes less urgent once the root causes for the glitches are fixed.
Occasionally, the mutually-exclusive conditions are on the same table. In this happy case, the problem is fairly easy to solve – just create (if it is not there, already) a multi-column index, or a functional index, that enables an index range scan that goes directly to the rare exceptions. The only thing to watch for, here, is that it is probable that the optimizer won’t like the desired range scan, because the optimizer will assume, incorrectly, that the conditions are statistically independent. Therefore, it is likely that you’ll need an index hint to force use of the index.
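For example, using the shipping glitch described above (all names hypothetical), if both conditions land on a single ORDERS table:

```sql
-- Hypothetical: non-shippable orders should never carry a shipping
-- charge, so this index range scan reaches only the rare exceptions.
CREATE INDEX ORDERS_EXC_IX ON ORDERS (SHIPPABLE_FLAG, SHIPPING_CHARGE);

SELECT /*+ INDEX(ORDERS ORDERS_EXC_IX) */ ORDER_ID
FROM   ORDERS
WHERE  SHIPPABLE_FLAG = 'N'
AND    SHIPPING_CHARGE > 0;
```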
Where you find mutually-exclusive conditions on different tables, put together the minimal query that joins the tables containing the mutually exclusive conditions, and work on optimizing just this query. (This simplified query may eliminate many joins in the original query that were there not to confirm the violation of mutual exclusivity, but, instead, just to provide supporting detail for understanding the problem cases.) Try to tune this simplified query following the normal query-tuning process, keeping in mind that in the absence of any good filter ratios, you are likely to prefer hash joins (or sort-merge joins, if your RDBMS doesn’t do hash joins) of full table scans. If the simplified query is then fast enough (hopefully the simplified query doesn’t involve any really large tables), then add back the rest of the original query, making sure that the execution plan of the complete query starts with the same steps as the optimized, simplified query, then reaches the rest of the tables through nested loops to the join keys (which you may have to force with hints, since the optimizer likely won’t realize how few rows will be left at that point in the execution plan).
Where the best you can do, after routinely tuning the minimal query that joins the tables containing the mutually exclusive conditions, is still too slow, apply a standard set of tricks that work pretty well for all queries that appear to be stuck with long runtimes:
If the combination of all those tricks still does not resolve the problem, another trick is often possible in these cases, even though this trick does not apply to other long-runtime queries, in general:
Formulate a new, more-selective query that just looks for recent violations of the mutual-exclusivity rule. If (as is often the case) the purpose of this query is to find examples of business-process glitches (or data glitches, for the DBA problems), so that the root-cause problems behind the examples can be discovered, then you’re probably better off focusing on recent examples, anyway, since old examples may have come from problems that have long-since been solved. By focusing on recent examples, you can add a filter condition on some table that returns only the most recent rows of one of the tables involved in the mutually-exclusive conditions, a filter condition that is likely selective enough to solve the long-runtime problem, especially if it uses an index.
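For example (all names hypothetical), an indexed creation-date column on one of the two tables might supply the needed selectivity:

```sql
-- Hypothetical sketch: only look for exceptions among rows of B
-- created in the last 7 days, via an index on B.CREATION_DATE.
SELECT B.B_ID, C.C_ID
FROM   B, C
WHERE  B.C_ID = C.C_ID
AND    B.CREATION_DATE > SYSDATE - 7
AND    B.FLAG_COL = 'Y'      -- condition on B (illustrative)
AND    C.STATUS   = 'OPEN';  -- condition on C (illustrative)
```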
If the query is still too slow, even after all the tricks so far described (and this is quite rare), then you likely will need to denormalize the data. Pages 265 and 266 of SQL Tuning describe how to promote filters upward in a join diagram until they belong to a single table, using denormalization. This can (and should) be done, generally, with database triggers that ensure that the denormalization is rigorously consistent, even if the application developers aren’t aware of the need to maintain it. Once the mutually-exclusive conditions are promoted into a single table, then a well-chosen multi-column index, or a functional index, on that table will reach the rare exception rows with a rapid index range scan.
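As a hypothetical sketch of that approach (all names illustrative), a trigger could maintain a denormalized copy of C’s filter column on B, so that both conditions land on B and a single index reaches the rare exceptions:

```sql
-- Denormalized copy of C.STATUS, stored on B, maintained by trigger.
ALTER TABLE B ADD (C_STATUS_DENORM VARCHAR2(10));

-- One-time backfill of the new column from the normalized value.
UPDATE B SET C_STATUS_DENORM =
  (SELECT C.STATUS FROM C WHERE C.C_ID = B.C_ID);

-- Keep the copy rigorously in sync with future changes to C,
-- even if application developers are unaware of it.
CREATE OR REPLACE TRIGGER C_STATUS_SYNC
AFTER INSERT OR UPDATE OF STATUS ON C
FOR EACH ROW
BEGIN
  UPDATE B SET C_STATUS_DENORM = :NEW.STATUS
  WHERE  B.C_ID = :NEW.C_ID;
END;
/

-- Now both conditions sit on B, and one index finds the exceptions.
CREATE INDEX B_EXC_IX ON B (FLAG_COL, C_STATUS_DENORM);
```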
To see a cumulative listing of all research problems (those below, and the earlier one from the first newsletter), see all research problems. There has been no solution submitted yet for the earlier-published research problem.
If you look at stored outlines in Oracle, it appears that Oracle has some undocumented hints that it uses, at least internally. What undocumented hints are available on each RDBMS (Oracle, SQL Server, et cetera), what do they do, and can they be used in user-created SQL, or just internally? (This is an open-ended problem, clearly, so I am not so much looking for a complete answer all at once as any information out there that represents progress toward an answer.)
On most systems, vast system resources are simply wasted, in terms of disk space, memory, and CPU. This likely represents an opportunity, since these resources could be used in conservative ways that would never significantly affect ordinary load, while performing valuable background services. For example, during times of low CPU and I/O consumption, index blocks, in B*-tree indexes, could be continually, incrementally rebuilt, getting rid of enormous amounts of unused space (especially in indexes with common deletes and updates), resulting in more-compact, better-cached indexes without DBA intervention. As another example (which Oracle 10g already does, to a limited degree), post-execution optimization could be performed on high-resource-consumption SQL, automatically, in the background, to find better execution plans for that SQL the next time it executes.
©2006 Dan Tow, All rights reserved