Quarterly SingingSQL Newsletter #3 

August 12, 2006

 

Introduction

If you wish to unsubscribe at any time, just drop me a line at dantow@singingsql.com. I'll likely get around to setting up a proper email group eventually, through some third party, so that people can subscribe and unsubscribe without communicating with me, but this is simplest for a start, I think.

 

I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter03. I’ll also have links scattered throughout for supplementary materials.

 

Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.

 

If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at dantow@singingsql.com – I’m always happy to hear from you!

SingingSQL News of Public Events

Old Old News

To see a cumulative listing of all news, including old news from past newsletters, see all news.

Old news

  • I presented my 2-day Top-Down SQL Tuning course based on my book in Oakland, California, June 1-2.

New news

  • I'm planning a weekend 2-day Top-Down SQL Tuning course, one weekend in September, in the Chicago area - email dantow@singingsql.com if you are interested. Cost: $1000 per person. (I have a request from several developers in that area who want the class on their own time and money, and they prefer a weekend class so that they do not have to miss work.) Please respond to this as soon as possible if you think you might like to come, so I can plan for the correct class size and choose the weekend that works best for everyone. (It will be a cozy little class, I can almost guarantee, so if you want a lot of individual attention, this one is for you!) If you think you might like to come to the class but you aren't sure, drop me a line anyway, while you're thinking of it, and I'll send you a reminder a while before the class so you can decide before it's too late. Here's the course outline.
  • I’m considering a class in the Phoenix area soon – if you would be interested in such a class, drop me a line.
  • I plan to expand my class offerings outside of the San Francisco Bay Area, but I’d like to do this mainly through partners who already know prospective students in their area, likely partners who receive this newsletter. If you think you can find 6 or more paying students in your area, and you’d be interested in a chance to share in the profit from the class, drop me a line, and we can work out a deal for a class in your area. (If you work for a company, and your company wants a class just for its own people, I already have an offering for classes presented on demand for company-internal audiences.)

Links

Old Links

To see a cumulative listing of all links, including links already listed in past newsletters, see all links.

New Link

I took part in a group discussion on the topic of “Does the Optimizer Need a Hint” as part of the new “Ask the Oracles” series in the latest Northern California Oracle Users Group Journal. Here’s an expanded version of my take on the question.

Featured Article

Shortcuts for Gathering Tuning Stats Quickly

Some of the longest-running queries you are likely to encounter while tuning application SQL are not application SQL at all, but instead are the queries you put together to learn more about the tuning problem, as you must learn to beat the optimizer at its own game and find a better execution plan than the optimizer found. I'll lay out some tricks I use regularly in the Oracle environment to greatly speed my own tuning. In some cases, these generalize easily to other environments; in other cases, it is not so simple. If you'd like to write up equivalent tricks for your own non-Oracle environment, I'll be happy to post them on singingsql.com, attributed to you (with a link to your site, if you have one), and I'll mention the addendum(s) to this article in my next newsletter. Anyway, here is my take on the recursive problem of how to make the process of making application SQL faster, faster, without wasting too much time on tuning the non-application test SQL.

 

The most important shortcut to avoid spending too much time gathering tuning stats is:

 

Don’t waste time tuning the wrong SQL!

 

In Chapter 5 of my book, I discuss at some length the inputs needed to complete a full query diagram. Once you find the right SQL to tune, which I discuss in my first newsletter, completing the query diagram is the start of any solution following my method. The by-the-book stats you need to complete the full query diagram are:

 

  • The rowcount of each table (not counting tables on the optional side of any outer join).
  • The rowcount of each pair of inner-joined tables, joined with the same inner join as in the query.
  • The rowcount from each table after applying all single-table filters in the query on that table.
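
To make these counts concrete, here is a minimal sketch of the by-the-book queries for a simple two-table query that joins Orders to Details and filters on Orders.Customer_ID (the table, column, and bind-variable names are just placeholders borrowed from the examples later in this article):

-- Table rowcounts (skip tables on the optional side of an outer join):
Select count(*) from Orders;
Select count(*) from Details;

-- Rowcount of the inner-joined pair, using the same join as the query:
Select count(*) from Orders O, Details D where O.Order_ID = D.Order_ID;

-- Rowcount after applying the query's single-table filter on Orders
-- (substitute a representative literal for the bind variable):
Select count(*) from Orders where Customer_ID = :1;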

 

Getting all this precise data could slow the tuning process considerably when the tables are large, so it is nice to have some shortcuts. A great starting point is to know the table sizes at least approximately:

 

select num_rows from all_tables where table_name = upper('&&1');

 

This one-liner script, which I put in a file called qcntr.sql (quick-counter) and use constantly, will give a quick picture of the approximate table size as of the last time the table was analyzed, which is usually sufficient. If that script tells me that the table is small (under a million rows is small enough for these purposes), or if stats are missing (which should be rectified), I'll sometimes take the time to get a perfect, up-to-the-minute rowcount with cntr.sql:

 

set heading off

set feedback off

set verify off

set timing off

spool tmpcntr.sql

select 'select count(*) from '||'&&1'||';' from dual;

spool off

set timing on

@tmpcntr

 

This is typically surprisingly fast even for a million-row table, because the cost-based optimizer will usually find a path that reads a whole index, rather than the whole table, to get a perfect rowcount. Even a full-table scan of a million-row table is faster than most people think, if nothing else is required. Often, if I must run a moderately long-running query, I'll leave it running in one window while working out other stats in another window, so the runtime needn't translate to a waste of your own time.
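
If you want to confirm that the optimizer really will read an index, rather than scan the whole table, before you wait on a big count, here is one quick way to check. This is just a sketch; it assumes SQL*Plus autotrace is available to you (which requires a plan table and the PLUSTRACE role):

set autotrace traceonly explain
select count(*) from Details;   -- should show the plan without actually running the count
set autotrace off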

 

The next statistic, the rowcount of each inner-joined pair of tables, is almost never necessary. If it were necessary, you'd need an often-long-running query something like:

 

Select count(*) from Details D, Masters M where D.Fkey_ID = M.PKey_ID;

 

This will usually be much slower than just a full table scan of the largest table, but it is fortunately almost never necessary. Instead, it is usually an excellent, sufficiently accurate guess that the above query will return the same count as the rowcount of the Details table. You can guarantee that the count of the join will equal the rowcount of the detail table if the foreign key pointing to the master table is guaranteed not-null and referential integrity is enforced for that foreign key. You can check whether the foreign key is guaranteed not-null with a quick DESCRIBE on the table, and referential integrity is usually perfect, or near enough to perfect for our purposes even if it is not enforced, so where you find a not-null foreign key, you don't usually need to find a join count. Looking at the worst case, let's say we find that FKey_ID is allowed to be null, but we don't know how often it actually is null, and the table is large enough that it is inconvenient to get a full count of the NULLs. A quick-and-dirty check just looks for NULLs near the start of the table:

 

Select rownum from Details where FKey_ID is null;

 

I'd run this for 10 or 15 seconds (and hit control-C to bail out) and see whether it scrolls rows onto my screen quickly, slowly, or not at all. If I see no rows at all, NULLs are rare or non-existent, even if they are allowed, at least for the oldest rows in the table. (If I had some reason to suspect that NULLs might be more common for new rows, I'd look for some indexed path to sample the newer rows: for example, Order_Date > sysdate-1, if this were an ORDERS table and Order_Date were indexed, or FKey_ID > {Maximum FKey_ID} - 10000, if FKey_ID were indexed and I suspected that FKey_ID grew with time (as most primary and foreign keys do), after first running a fast

 

Select max(FKey_ID) from Details;

 

to find that maximum.) Usually, being assured that NULLs aren't too common, I'll just assume referential integrity and go with the assumption that the join count equals the Details rowcount. If I'm feeling super-cautious, though, I might take a quick look for missing referential integrity as follows:

 

Select rownum from Details D where D.FKey_ID is not null

and not exists (select null from Masters M where D.Fkey_ID = M.PKey_ID);

 

By letting this run for just a few seconds (and then hitting control-C to bail out), you can tell, from how quickly it scrolls (or fails to scroll) numbers past the screen, whether failed referential integrity is common, rare, or essentially non-existent in the part of the table you're sampling, without examining the whole table. (If it makes more sense to sample recent rows, add a clause to the outer query, as above.)
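
Of course, if the data dictionary already shows that the foreign key is declared not-null and that an enabled foreign-key constraint points to the master table, you can skip the sampling entirely. Here is a sketch of that check against Oracle's dictionary views, using the DETAILS/FKEY_ID/MASTERS names from the example above (not a real schema):

-- Is the foreign-key column declared NOT NULL?
select column_name, nullable from all_tab_columns
where table_name = 'DETAILS' and column_name = 'FKEY_ID';

-- Is there an enabled foreign-key constraint from DETAILS to MASTERS?
select c.constraint_name, c.status
from all_constraints c, all_constraints r
where c.table_name = 'DETAILS'
and c.constraint_type = 'R'
and c.r_owner = r.owner
and c.r_constraint_name = r.constraint_name
and r.table_name = 'MASTERS';

If the dictionary shows neither guarantee, the quick sampling checks above are the fallback.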

 

If, instead, NULL foreign-key values are common, I’ll usually still assume referential integrity for the non-null values, and I just need to know the frequency of NULLs. If

 

Select rownum from Details where FKey_ID is null;

 

returned rows slowly or not at all, I’d just treat NULLs as too rare to change anything. If NULLs looked common, though, I might need a real NULL count. If the table was not too large, I’d usually run

 

Select count(*) from Details where FKey_ID > 0;

 

And I'd usually just assume the join count equaled that count of positive FKey_IDs. (Note that this assumes that the key values are numeric and positive, which is a good assumption for most applications.) If FKey_ID is indexed, I could first run:

 

Select min(Fkey_ID) from Details;

 

Then

 

Select count(*) from Details where FKey_ID >= {minimum FKey_ID};

 

By expressing the queries this way, I can find not-null rowcounts just by hitting the index on the foreign key (which the optimizer will usually choose), which is normally surprisingly fast even for very large tables.
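
If the optimizer doesn't pick the foreign-key index on its own for one of these counts, a hint will usually force an index fast full scan. This is just a sketch, and the index name Details_FKey_IDX is hypothetical:

Select /*+ index_ffs(D Details_FKey_IDX) */ count(*)
from Details D where FKey_ID >= {minimum FKey_ID};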

 

The last statistic we need is the selectivity of each filter condition, at least roughly. The very best trick here is an informed guess! Even if you're working on an application you've never seen before, as long as the table and column names aren't based on some language you don't know, you can usually make a pretty fair guess at what the tables represent and what sort of property is stored in each filtered column. (Hint: schema designs that actively hide that sort of information are generally poor and result in hard-to-maintain SQL!) Given such guesses, in useful real-world queries (which do not normally return excessive rowcounts), you can usually also spot a filter that is highly selective, usually the most selective or at worst the second-most-selective filter in the query. If that filter is also indexed, or if it can be rephrased so that it can follow an index, you're almost home free! From here, there are a couple of fast possible paths to an answer:

 

  1. Just guess the rest of what you need to complete the query diagram. Enable the best execution plan that diagram points you to (if necessary, with new indexes and/or changes to the SQL), add hints to get that plan if necessary (and check that you really have that plan), and see how long the query runs. (I often replace the select list with "rownum" so that, if the query runs longer than expected, I can bail out part-way through with some idea of the progress made by the time I bailed out.) If the query turns out to be so fast that further improvement is irrelevant, I'm done! If it is slower than ideal, but still decently fast, I'll move on to path #2, below. If it is really slower than expected, then I'll go back and reexamine the assumptions and guesses that led me to the choice. (A minimal sketch of this path follows this list.)
  2. Build the plan step-by-step by adding joins one at a time to the table you think has the best filter, and at each step learn about the filter on that next table.
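
Here is a minimal sketch of that first path, using the 4-way join introduced just below as the example. The hints and the guess that Customer_ID carries the best filter are illustrative; check (with EXPLAIN PLAN or autotrace) that you really got the plan you intended before timing it:

Select /*+ ordered use_nl(D X Y) */ rownum   -- rownum in place of the real select list
from Orders O, Details D, TableX X, TableY Y
where O.Customer_ID = :1                     -- substitute a representative value
and O.Order_ID = D.Order_ID
and D.ColD = 'A'
and O.X_ID = X.X_ID
and X.ColX = 'B'
and O.Y_ID = Y.Y_ID
and Y.ColY = 'C';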

 

Here’s an illustration of the second approach, using a 4-way join:

 

Select …

From Orders O, Details D, TableX X, TableY Y

Where O.Customer_ID = :1

And O.Order_ID = D.Order_ID

And D.ColD = 'A'

And O.X_ID = X.X_ID

And X.ColX = 'B'

And O.Y_ID = Y.Y_ID

And Y.ColY = 'C';

 

We might not guess what ColD, ColX, or ColY represent, or how selective the filters on those columns are, but we'd likely guess that we have lots of customers, so the condition on O.Customer_ID is highly selective, and that is our starting point. If we're being conservative, we might first get stats supporting our guess that a condition on Customer_ID is highly selective. Here's a script, qdistr.sql, that I use regularly for that sort of purpose:

 

select num_distinct from all_tab_columns where column_name =

upper('&&1') and table_name = upper('&&2');

 

SQL> @qdistr customer_id orders

 

This would quickly give me the estimated number of distinct Customer_IDs, based on the last table analysis, even if the table is huge. If the table is not large, I might even get more-precise information with distr.sql:

 

set heading off

set lines 160

set feedback off

set verify off

set timing off

spool tmpdistr.sql

select 'select count(*), count('||'&&1'||'), count(distinct '||'&&1'||

') from '||'&&2'||';' from dual;

spool off

@tmpdistr

set timing on

 

This also gives me information about how often that column has any non-null value at all, which can be useful to know for columns that are extra selective just because they are so often null, and as a bonus I get a complete rowcount on the table at the same time.
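
For example, to profile the foreign key from the earlier Details example (names illustrative), you'd run:

SQL> @distr fkey_id details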

 

As a next step, it’s nice to choose a real value for the bind variable :1 to assume for purposes of the rest of the tuning exercise. The best choice is generally one that is less selective than average, but not usually the very worst case. Let’s say that we found 50,000 distinct Customer_IDs, verified an index on that column (or created one, if needed) and ran

 

Select min(Customer_ID) from Orders;

 

to find the lowest one, 1001, with a quick index hit. A fast way to choose a good semi-conservative value to tune with would be

 

Select Customer_ID, count(*) from Orders where Customer_ID < 1001+1000

Group by Customer_ID order by 2;

 

This should run fast, off just the index on Customer_ID, and the last few rows it returns would be the worst-case customers out of (up to) the first thousand customers. (The oldest customers tend to be the biggest; for most other columns, you'd look at the highest values instead.) You might go with roughly the fifth-worst-case customer you see, for a fairly conservative guess.
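
If you'd rather have SQL pick the Nth-worst value directly, here is a sketch of the same index-only aggregate wrapped in an analytic row_number() (available in Oracle 8i and later):

Select Customer_ID, cnt from
(Select Customer_ID, count(*) cnt,
row_number() over (order by count(*) desc) rn
from Orders where Customer_ID < 1001+1000
group by Customer_ID)
where rn = 5;   -- the fifth-worst-case customer in the sampled range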

 

Let's say you found from the estimation approaches above that Details had 30,000,000 rows, Orders had 10,000,000 rows, TableX had 2,000,000 rows, and TableY had 1,000,000 rows. Let's further stipulate that all foreign keys are not-null and have perfect referential integrity, and that 1547, the fifth-worst value of Customer_ID that you found below 2001, returned 20,000 rows. In this case, your query diagram, so far, would be:

 

D ?
| 3
|
| 1
V
O 0.002
|\
| \
|5 \ 10
|   \
|1   \ 1
V     V
X ?   Y ?

 

 

(The numbers in the diagram come straight from the stats we gathered: the filter ratio on O is 20,000/10,000,000 = 0.002, and the join ratios are 30,000,000/10,000,000 = 3 details per order, 10,000,000/2,000,000 = 5 orders per TableX row, and 10,000,000/1,000,000 = 10 orders per TableY row, with master join ratios of 1, since the foreign keys are not-null with perfect referential integrity.) The selectivities of the filters on D, X, and Y still matter in our effort to find the best join order, but we want to find those selectivities without full table scans of these large tables, and since we suspect that the filters are not very selective, they likely don't have indexes and shouldn't. We already know that

 

Select count(*) from Orders where Customer_ID=1547;

 

will return 20,000. Next, we can use this selective path to sample just 0.2% of the rows of Orders (and correspondingly small fractions of the rows of D, X, and Y), as follows:

 

Select rn from

(Select /*+ ordered use_nl(O X) */ rownum rn from Orders O, TableX X

where O.Customer_ID = 1547

and O.X_ID = X.X_ID

and X.ColX = 'B')

where mod(rn,10)=0;

 

The objective above is just to find out how much more the filter on ColX reduces the rowcount below the 20,000 rows coming from O, but note a couple of tricks I've used. First, I'm looking at rownum values, instead of a single count, so I can get results as they come, to give an idea of my rate of progress before getting the final result; for really long-running checks this can be a big help. Second, I've arranged an inline view that gives me only every tenth row from the inner query, so I don't have to wait for potentially as many as 20,000 rows to scroll by, but I still get a result within 10 of the real answer. (I'd choose an even bigger number inside the mod() expression if I expected even more rows.) Third, I've gone ahead and added possibly-unneeded hints just so I don't have to worry about whether I'm going to get a fast plan to this statistical result.

Let's say that the above query returned 500 rows, with the number 5,000 as the biggest value of rn returned. In this case, we know that the filter X.ColX = 'B' must have discarded 75% of the 20,000 rows coming from Orders, so the filter ratio on X is 0.25. Note that we found this out by touching only the (at most) 20,000 TableX rows that join to our sampled Orders rows, out of the 2,000,000 rows in TableX, because we ran a query driving from what we think is the best driving table. (It is possible that, across all 2,000,000 rows of TableX, the filter has a different selectivity than it has among the rows we're sampling, but if the filters are statistically correlated, the filter fraction we get this way is really the right one to use, since what we really want to know is how selective the next filter is *after* the earlier filters have already been applied!) An even faster sample would go as follows:

 

Select rownum from

(Select S.X_ID from

(Select rownum rn, O.X_ID from Orders O

where O.Customer_ID = 1547) S

where mod(S.rn,10)=0) O, TableX X

where O.X_ID = X.X_ID

and X.ColX = 'B';

 

Note that this approach filters out 90% of the 20,000 rows meeting the Customer_ID condition before doing the work of the join to TableX, so this could be almost as fast as just the single-table query against Orders. The drawback would come if the rowcount left to join to the second table were too small, resulting in a poor sample size.

 

Next, let’s find the filter ratio on Y, just sampling all the rows that survive the first two joins:

 

Select rn from

(Select /*+ ordered use_nl(O X Y) */ rownum rn from Orders O, TableX X, TableY Y

where O.Customer_ID = 1547

and O.X_ID = X.X_ID

and X.ColX = 'B'

and O.Y_ID = Y.Y_ID

and Y.ColY = 'C')

where mod(rn,10)=0;

 

Let's say this returns 50 rows, with the highest value of rn returned equal to 500. In this case, we know that the filter on ColY discarded 90% of the 5,000 rows coming from the join of O and X, so the filter ratio on Y is 0.1. Note that we learned this by sampling just 0.5% of the million rows in that table. Now that we know that the filter on Y is better than the filter on X, we know to go with the better join order and run the final query:

 

Select /*+ ordered use_nl(O X Y D) */ rownum from Orders O, TableY Y, TableX X, Details D

Where O.Customer_ID = 1547

and O.X_ID = X.X_ID

and X.ColX = 'B'

and O.Y_ID = Y.Y_ID

and Y.ColY = 'C'

and O.Order_ID = D.Order_ID

and D.ColD = 'A';

 

If this came back with, for example, 50 rows, we'd know that the filter ratio on D was roughly 0.03 (the 500 rows surviving the joins and filters on O, Y, and X join to roughly 1,500 Details rows, at 3 details per order, and 50/1,500 is about 0.03), and that the current join order (O, Y, X, D) was correct. In fact, the filter ratio on D is irrelevant to the choice of plan unless it turns out to be significantly better than the filter ratio on O, so unless the final query returned just a row or two, we'd stick with the current join order.
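
If the filter on D had looked surprisingly strong and we wanted to double-check it before trusting that conclusion, a quick isolated sample from the same driving path, reusing the mod() trick from above, would be (a sketch):

Select rn from
(Select /*+ ordered use_nl(O D) */ rownum rn from Orders O, Details D
where O.Customer_ID = 1547
and O.Order_ID = D.Order_ID
and D.ColD = 'A')
where mod(rn,10)=0;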

 

If we couldn't even guess which filter might be selective, or suspected we had no really selective filter at all, we could still employ a sampling strategy such as the above to progressively improve our guesses. For example, we'd start by checking the number of distinct values for any equality conditions, or, for range conditions, the size of each range compared to the column's minimum and maximum values. Then, we might formulate an artificial query for tuning purposes, adding an extra selective filter just to make the tests run faster. For example, if the query were

 

Select …

From Orders O, Details D, TableX X, TableY Y

Where O.ColO = 'D'

And O.Order_ID = D.Order_ID

And D.ColD = 'A'

And O.X_ID = X.X_ID

And X.ColX = 'B'

And O.Y_ID = Y.Y_ID

And Y.ColY = 'C';

 

We might just choose some small random range of Customer_IDs as an artificial driver, to more or less randomly sample roughly a thousandth of the rows:

 

Select …

From Orders O, Details D, TableX X, TableY Y

Where O.ColO = 'D'

And O.Order_ID = D.Order_ID

And D.ColD = 'A'

And O.X_ID = X.X_ID

And X.ColX = 'B'

And O.Y_ID = Y.Y_ID

And Y.ColY = 'C'

And O.Customer_ID between 1001 and 1050;

 

We’d start by learning how many orders fall in that customer range, then add the filter on ColO, then add one table and its filter at a time to find the sampled filter ratios for each other table, until we had a picture of the whole query and its estimated filter ratios, then strip off the condition on Customer_ID to test the final choice of best plan for the real application query. At each stage, we’d just query for rownum (or every 10th, 100th,… rownum) so we could see early results as they appear, without waiting for the whole query to run.
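
Here is a sketch of the first few of those steps; the later steps just repeat the pattern of adding one table and its filter at a time:

-- Step 1: how many Orders fall in the artificial Customer_ID range?
Select count(*) from Orders where Customer_ID between 1001 and 1050;

-- Step 2: how many of those survive the filter on ColO?
Select rownum from Orders
where Customer_ID between 1001 and 1050
and ColO = 'D';

-- Step 3: add one joined table and its filter, watching rownum scroll by:
Select rownum from Orders O, TableX X
where O.Customer_ID between 1001 and 1050
and O.ColO = 'D'
and O.X_ID = X.X_ID
and X.ColX = 'B';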

Research Problems

Old Research Problems

To see a cumulative listing of all research problems (the one below, and the earlier ones from the earlier newsletters), see all research problems. There has been no solution submitted yet for the earlier-published research problems.

New problem: Bailing out of Bad Plans

If a human being were to execute SQL manually, against a series of tables and indexes that the human could see, he or she would likely choose a good-looking path to the data and start to follow that path. If, however, the human found that the results of following that path violated the assumptions he or she made when choosing that path, and appeared likely to take much longer than originally expected, he or she would probably revisit the question of which execution plan was best and potentially would start over with a better plan before wasting too much time on the bad plan. Does any database do this? Propose a good algorithm for a database that would.

 

©2006 Dan Tow, All rights reserved

 
