Semi-Annual SingingSQL Newsletter #7 

July 10, 2008

 

Introduction

I hope you won’t object to a bit of personal business unrelated to SQL tuning: many of my friends in this field enjoy it because they find SQL tuning, as I do, an interesting puzzle. If that means you like puzzles in general, as I do, then you might like my new invention, Suduro, a variant on Sudoku for those who want a bit of extra challenge beyond the usual puzzle. If you might be interested, check out my new online Suduro site, and years from now, when Suduro is published in thousands of newspapers daily ;-), you can say you were one of the very first to solve one!

 

You may have noticed that these “quarterly” newsletters aren’t exactly making it out 4 times/year – I’ve been swamped, so I’m going to try to get these out at least twice a year, returning to a quarterly schedule if time allows – better to have good material twice a year than something hacked together more often, I hope.

 

If you wish to unsubscribe at any time, just drop me a line at dantow@singingsql.com.

 

I’m taking a lowest-common-denominator approach to the format of this newsletter, sending it as a straight-ASCII note so anyone at all can access it securely, but if you’d rather read something more formatted, go to OnlineNewsletter07. I’ll also have links scattered throughout for supplementary materials.

 

Although this material is copyrighted, feel free to forward this newsletter, as is, to any friends who you think might be interested.

 

If you have any comments, ideas for future newsletter content, or suggestions, feel free to drop me a line, at dantow@singingsql.com – I’m always happy to hear from you!

SingingSQL News of Public Events

Old Old News

To see a cumulative listing of all news, including old news from past newsletters, see all news.

Old news

New news

  • I am offering my class, Top-Down SQL Tuning, soon: July 24-25, 2008, in Palo Alto (near San Francisco), California. Slots are available at $1000/head – let me know if you want to attend, or if you want more information.
  • I’ve had a few people ask about coming class offerings, and it occurred to me that I don’t have a good way to announce them if they come up suddenly, because I want to keep my promise that this newsletter itself comes out only a few times a year. Therefore, I’m starting a new list for anyone interested in announcements of classes and other offerings (such as presentations to local OAUGs) that might come up without enough warning to make it into this newsletter. If you are interested (and there is certainly no commitment in this, on your part), just drop me a line at dantow@singingsql.com, and I’ll add you to this new list. Please be sure to include your location and the range of area within which you’d consider attending something if I happened by. The area can be as broad as a continent or as narrow as a town – anything from “Let me know if anything comes up in Asia” to “Let me know if you have a class coming to Darwin, Australia.” Even if you’re off the beaten path, we might get lucky. I won’t send notes to this second list except to the people in the area where I’m offering something, and I doubt you’ll see more than a note or two from this new list per year, even if your area is broad.

Links

Old Links

To see a cumulative listing of all links, including links already listed in past newsletters, see all links.

New Links

See the links above under Old News.

Featured Article

Views, from Forty-Thousand Feet

 

From my perspective as a SQL tuning consultant, I have to give three cheers for views. I’d estimate that 75% of the time I spend on really tricky, non-routine SQL tuning problems goes to complex SQL built on complex views – these are the problems that can consume hours, even days, of the best experts in the business, assuring continued, lucrative employment for professionals like myself. Of course, if you are on the employer’s side, looking for efficiency in a development organization, or if you are an overworked developer on a fixed, non-hourly salary, your perspective may be a little different!

 

SQL is a fabulously flexible query language, but this flexibility is often a curse: it spools out plenty of rope to hang yourself with, in the form of a plausible-looking query that returns pure nonsense, or, worse, one that appears to return the desired results in a simple test but fails in a dozen or more corner cases that the test never exercises. Such nonsense SQL is easy enough to construct when dealing with a simple, flat join of tables, but it becomes almost a certainty when dealing casually with complex SQL built on top of complex views.

 

A business application database stores data about all the business entities relevant to that application. Except in the rarest of cases, a query should return rows that represent a single type of entity, mapping rigorously one-to-one to instances of that entity (or to aggregations of that entity, in the case of group-by queries). If the entity is important enough to query, it is almost certainly represented by a physical table, so the query result, however many tables appear in the FROM clause, should map rigorously one-to-one to a subset of rows from a single physical table (which I’ll call the “primary” table here, but which I call the “root detail table” in my book). All the other tables or views mentioned in the query should simply provide lookup data (data pertaining to the primary table – master data under the primary detail data, stored in master tables since the database is normalized) or data necessary to filter out primary-table rows that don’t belong in the desired, queried subset. The actual best path to the data might be complex, but one very simple path (which might not be the fastest) should almost always be possible: reach the rows of the primary table first (all of them, if necessary), then reach every single other table (with the possible exception of tables in subquery conditions such as EXISTS and NOT EXISTS) through its primary key, by one or more joins beginning with each row of the primary table. This can be shown as a tree, with the primary table at the root, as I elaborate in my book, SQL Tuning, and it can be expressed easily with a simple, flat query against simple tables.
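To make that shape concrete, here is a minimal sketch, with invented table and column names, of a query whose result rows map one-to-one to a subset of rows of a single primary (root detail) table, Order_Details, with every other table reached through its complete primary key:

SELECT OD.Order_Detail_ID, OD.Quantity, O.Order_Date, C.Customer_Name, P.Product_Name
FROM Order_Details OD, Orders O, Customers C, Products P   -- Order_Details is the primary (root detail) table
WHERE O.Order_Date > <last Sunday>     -- the only filter; it discards primary-table rows whose parent order is too old
AND O.Order_ID = OD.Order_ID           -- master data reached through Orders’ complete primary key
AND C.Customer_ID = O.Customer_ID      -- master data reached through Customers’ complete primary key
AND P.Product_ID = OD.Product_ID       -- master data reached through Products’ complete primary key

Drawn as a tree, Order_Details sits at the root, and every other table hangs below it, reached through a full primary key, so each Order_Details row that survives the filter appears exactly once in the result.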

 

Now, here is the point that gets lost when predefined and inline views come into play: use of views doesn’t significantly expand the universe of queries that make sense – queries that can already be expressed simply as a flat set of joins to simple tables. Complex views just make it easier to write nonsense queries that fall outside that set; they make it simpler to obscure the fact that a query is nonsense; they make it much harder to fix the query when it must finally be fixed; and they make it much more likely that the query will be slow as molasses and will require the services of a top SQL tuner!
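A hypothetical illustration (again with invented names): suppose someone defines a group-by view of order totals, then casually joins it to another detail table of orders:

CREATE VIEW Order_Totals AS            -- primary table: Orders; one view row per order
SELECT O.Order_ID Order_ID, O.Customer_ID Customer_ID,
       SUM(OD.Quantity * OD.Unit_Price) Order_Total
FROM Orders O, Order_Details OD
WHERE OD.Order_ID = O.Order_ID
GROUP BY O.Order_ID, O.Customer_ID

SELECT C.Customer_Name, SUM(OT.Order_Total) Recent_Total
FROM Customers C, Order_Totals OT, Shipments S
WHERE OT.Customer_ID = C.Customer_ID
AND S.Order_ID = OT.Order_ID           -- joins to another detail of the view’s primary table: fans out
AND S.Ship_Date > <last Sunday>
GROUP BY C.Customer_Name

The intent was presumably the total value of each customer’s recently shipped orders, and the query looks plausible and passes a test where every order has exactly one shipment, but any order with two or more recent shipments has its total silently counted once per shipment. Written against the simple tables, the mistake would be much harder to miss, because the grouping and both detail tables would be in plain view.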

 

Here are some rules for using views relatively safely (a short sketch following these rules appears after the list):

1. The view-defining query must, itself, be a sensible query, with results rigorously mapping one-to-one to a single physical table, the “primary table”, or to aggregations of rows of that table, in the case of a group-by view. It should be obvious what that primary table is, if necessary through explicit comments built into the view-defining query.
2. If you are going to join to the view from another, more-detailed view or a more-detailed table, then the view should include, as one (or more) of its columns, the primary key of the view, normally the primary key of the primary table, and the join should be from a foreign key to that complete primary key.
3. The view-defining query should alias each selected expression to the name of the view column, unless the expression is a simple column that already has the same name as the corresponding view column. This saves immense confusion in figuring out how to map a view-based query to an equivalent query against simple tables.
4. Where there are inner joins in the view-defining query, these should be commented so that it is clear whether the inner join serves a necessary filtering function, deliberately discarding rows that fail the join, or is inner simply because it is always expected to succeed. This makes it clear whether the join can be removed, when writing the equivalent query against simple tables, if it only serves to reach reference data in the view that is not needed in the final query.
5. There are two ways to use views safely in view-using queries (and these include view-defining queries built on top of other views!):
  a. View-using queries may safely use a “primary” view in place of the primary table. Rows from these view-using queries (or aggregations of these rows, in a group-by query) must map precisely one-to-one to a subset of rows returned by the unfiltered primary-view-defining query, which in turn returns rows mapping one-to-one to rows of its own primary table. Joins from the primary view to joined-to tables or views must be from a foreign key returned by the view-defining query to a primary key in the joined-to table or view. If the primary view itself is not joined to through its own primary key, it must be the only table or view in the FROM clause not reached through its primary key.
  b. View-using queries may use views to return lookup data, as joined-to views that are reached through a join to the joined-to view’s “view primary key.”
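Here is the promised sketch, a minimal example (with invented names) of a view and a view-using query that follow these rules:

CREATE VIEW Recent_Order_Details AS    -- primary table: Order_Details; one view row per recent Order_Details row
SELECT OD.Order_Detail_ID,             -- the view primary key (rule 2), exposed for joins to the view
       OD.Product_ID,                  -- foreign key, exposed so view-using queries can reach lookup data
       OD.Quantity,
       OD.Quantity * OD.Unit_Price Line_Amount,   -- expression aliased to the view column name (rule 3)
       O.Order_Date
FROM Order_Details OD, Orders O
WHERE O.Order_ID = OD.Order_ID         -- inner join expected always to succeed, not intended to filter (rule 4)
AND O.Order_Date > <last Sunday>

SELECT ROD.Order_Detail_ID, ROD.Line_Amount, P.Product_Name
FROM Recent_Order_Details ROD, Products P   -- the view stands in for its primary table (rule 5a)
WHERE ROD.Quantity > 100
AND P.Product_ID = ROD.Product_ID           -- lookup data reached through Products’ complete primary key

Each result row of the second query still maps one-to-one to a row of Order_Details, and rewriting it against the simple tables, should that ever become necessary, is a mechanical exercise.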

 

When you can’t follow these rules and still get the necessary result, this is usually a clue that the schema design is broken, and should be fixed, if possible, before it does more damage and becomes even harder to disentangle.

 

When you don’t follow these rules, but you could, this is surely a clue that you are building a query that will be hard to maintain, and that is all too likely functionally wrong today, or will become functionally wrong as the application and the data evolve. Queries that fail to follow these rules are very likely to provide many hours of profitable employment for SQL tuning specialists like myself, as they will likely perform very poorly, and they are horribly difficult to repair safely. While this is a benefit for SQL tuning consultants paid by the hour, it is not so good for anyone else.

 

The rules for using views safely and correctly are complicated and hard to use, without a doubt, particularly considering that most people using views are doing so in search of simplicity in coding, not in search of complexity! Casual use of complex views, without regard for the above rules, may seem simple, but it almost invariably creates horribly complex problems in the future. If the rules to use complex views correctly are prohibitively difficult, this is likely a sign that the query, however difficult, will be easier to write and maintain correctly against base tables (or at most against simple views) than against complex views!

 

If the only objective is to pass a simple functionality test against a small, artificial development database, especially if the data schema is flawed, then a query hacked together with complex views, without regard to the above rules, is often the simplest solution. Most likely, however, such a hacked-together query will fail in current or future production corner cases unanticipated by the simple test cases against the simple, artificial data. These functional failures may be noticed soon and repaired later at high cost. Worse still, they may not be noticed for a long time, and the company will simply make ultimately expensive business mistakes for months or years based on the unnoticed functional errors! Frequently, such a hacked-together query will also perform badly against production data volumes, and it will be horribly hard to fix.

By the time such a query needs fixing, to resolve performance problems, to fix corner-case errors, to cope with schema changes, or to modify functionality as the application evolves, the person fixing it faces a horrendously difficult problem: what information, exactly, did the person who wrote the original query intend for it to return? Does the intended result actually match the actual result, which may be extremely hard to decipher under all those layers of indirection provided by the views? And how may the query be modified to deliver the correct result, which may or may not match either the intended result (not clear from the actual query) or the actual result (also hard to decipher)? Frequently, by the time such a query needs to be fixed, it is very hard even to find anyone who understands what the query needs to return to meet the business need – the query itself, however flawed and likely incorrect, is the only readily available “specification” for the rows that the application apparently requires at that point in the application flow of control, and a major project is necessary to work out the correct, undocumented requirement from scratch!

A well-written query, on the other hand, may be much more work to write at first, but it delivers a result that is much more transparent to understand and modify, that will likely perform much better, and that is more likely to match the results actually needed, even in obscure corner cases that might never be tested.

Research Problems

Old Research Problems

To see a cumulative listing of all research problems (the one below, and the earlier ones from the earlier newsletters), see all research problems. There has been no solution submitted yet for the earlier-published research problems.

New problem: Using Declared Denormalizations to Refine Cost Estimates

Most databases allow definition of intricate constraints on tables. Where a database design has denormalizations, these ought, ideally, to be formally declared, preferably with constraints that enforce that the denormalizations are followed rigorously. If this is done, does any database actually use these declared constraints to refine the cost and cardinality estimates of its cost-based optimizer? Consider an example: we have a multi-org implementation of an order-entry system, where each transaction table includes an org_id column. Users belonging to a given organization can only see data having the appropriate org_id, through single-org views on top of each table that automatically add a condition “and tab.org_id = <myorg>” to every query against every transaction table. Thus, a query:

 

SELECT …
FROM Orders O, Order_Details OD, …
WHERE O.Order_Date > <last Sunday>
AND O.Order_ID = OD.Order_ID
AND …

 

uses views Orders and Order_Details, built on top of tables All_Orders and All_Order_Details, making the original query equivalent to

 

SELECT …
FROM All_Orders O, All_Order_Details OD, …
WHERE O.Order_Date > <last Sunday>
AND O.Order_ID = OD.Order_ID
AND O.Org_ID = <myorg>
AND OD.Org_ID = <myorg>
AND …
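For concreteness, the single-org views in this example might be defined roughly as follows (a sketch only, reusing the <myorg> placeholder; a real multi-org implementation would typically pick up the org value from the session context):

CREATE VIEW Orders AS
SELECT * FROM All_Orders WHERE Org_ID = <myorg>

CREATE VIEW Order_Details AS
SELECT * FROM All_Order_Details WHERE Org_ID = <myorg>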

 

Let’s say that there are a hundred organizations. This query ought to return, on average, 1% of the orders (and their details) made since last Sunday, based on the two statistically independent filters on the Orders table. Assuming that the query drives from the Orders table, it will immediately pick up the condition filtering Orders rows to just the desired organization. These orders, naturally, will join to Order_Details belonging to the same organization, so the view-imposed condition on OD.Org_ID is wholly redundant and discards no further rows. If a constraint declares OD.Org_ID as a denormalization, inheriting its value from the parent order reached through the join on Order_ID, then the optimizer ought to know that all Order_Details reached through that join will already satisfy the filter on OD.Org_ID, so that filter will have no effect on the estimated cardinality. On the other hand, if the cost-based optimizer fails to take account of declared denormalizations such as this, it will fall back on its usual assumptions, consider the condition on OD.Org_ID to be statistically independent of the other conditions, and calculate that the joins shown, with the filter conditions, will return just 0.01% of the order details of orders placed since last Sunday – a hundred-fold error in the resulting cardinality and cost calculations, with corresponding effects on the chances that the optimizer will find the true lowest-cost alternative.

 

Potentially, there would be valuable uses for two sorts of constraint declarations here: constraints that are actively enforced by the database, and declared denormalizations that are (hopefully) enforced by the application but not by the database, yet that the optimizer is still instructed to accept on faith for purposes of calculating and comparing cardinalities and costs.
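As a sketch of how such declarations might look, assuming a composite unique key on All_Orders to declare the denormalization formally (the syntax shown is Oracle’s, one or the other of the two foreign-key alternatives would be chosen, and whether any optimizer actually exploits the declaration for the cardinality estimate above is exactly the open question):

ALTER TABLE All_Orders
  ADD CONSTRAINT All_Orders_Org_UK UNIQUE (Order_ID, Org_ID);

-- First sort: an actively enforced declaration that each detail row’s Org_ID
-- matches its parent order’s Org_ID
ALTER TABLE All_Order_Details
  ADD CONSTRAINT All_Ord_Det_Org_FK
  FOREIGN KEY (Order_ID, Org_ID) REFERENCES All_Orders (Order_ID, Org_ID);

-- Second sort: the same declaration left unenforced by the database (the
-- application is trusted to maintain it), but marked RELY so the optimizer
-- is permitted to accept it on faith
ALTER TABLE All_Order_Details
  ADD CONSTRAINT All_Ord_Det_Org_FK
  FOREIGN KEY (Order_ID, Org_ID) REFERENCES All_Orders (Order_ID, Org_ID)
  RELY DISABLE NOVALIDATE;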

 

 

 

©2008 Dan Tow, All rights reserved

 

Home