A Taxonomy of Problematic SQL

Introduction

There are recurring patterns in most bad SQL, and systematic ways to recognize these patterns and avoid or fix the problems that map to these patterns. Just as a zoologist needs taxonomy to classify and recognize the species in an ecosystem, and to understand their relationships, a developer needs a systematic method to understand the ways that SQL can be broken, and to repair those problems, especially if the developer is maintaining someone else’s SQL, often SQL from a source that cannot be reached to explain the purpose behind the SQL and how it was written. Even without knowing the purpose behind the SQL, certain patterns can point to recognizable, common problems that need repair. This presentation, borrowing from a chapter in the author’s book, SQL Tuning, describes a taxonomy of abnormal SQL, SQL that often requires repair or points to opportunities for database design improvements. (O’Reilly retains the copyright on the material directly from the book.)

Join trees, a crash introduction

Many of the rules for classifying SQL are vastly easier to express and understand in terms of join trees, abstract representations of the SQL that I explain in much more detail in the book, SQL Tuning. In case you are not already familiar with these, and lack a copy of the book, I’ll introduce join trees briefly, here.

 

Consider a query:

 

SELECT

FROM Orders O, Order_Details OD, Products P, Customers C,

        Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT

WHERE UPPER(C.Last_Name) LIKE :Last_Name||'%'

  AND UPPER(C.First_Name) LIKE :First_Name||'%'

  AND OD.Order_ID = O.Order_ID

  AND O.Customer_ID = C.Customer_ID

  AND OD.Product_ID = P.Product_ID(+)

  AND OD.Shipment_ID = S.Shipment_ID(+)

  AND S.Address_ID = A.Address_ID(+)

  AND O.Status_Code = OT.Code

  AND OT.Code_Type = 'ORDER_STATUS'

  AND OD.Status_Code = ODT.Code

  AND ODT.Code_Type = 'ORDER_DETAIL_STATUS'

  AND O.Order_Date > :Now - 366

ORDER BY …;

 

The join tree that abstractly represents this query reflects the following:

 

  • What tables are being joined, represented as aliases, since the same table can appear multiple times in the FROM clause under different aliases.
  • What is the nature of each join – which side or sides of the join are unique, and what joins to what.
  • Which joins are outer joins, and which side of each outer join is the optional side (the side with the “(+)” in old-style Oracle join notation).

 

The join tree that represents the above query, then, is as follows:

 

Figure 1, a join tree representing the above query

 

In this diagram:

 

          Each table is a node, represented by its alias.

          Each join is a link, with (usually downward-pointing) arrows pointing toward any side of the join that is unique.

          Midpoint arrows point to optional side of any outer join.

 

For example, the table Shipments is represented by the node labeled “S”, and the outer join “S.Address_ID = A.Address_ID(+)” is represented by the downward-pointing arrow from S to A, with the midpoint arrow pointing toward the “(+)” side of that join and the end-point arrow pointing toward A because ADDRESS_ID is unique for the table Addresses.

 

Following these rules to create join diagrams, we find a number of regularities among most well-written SQL:

 

          The query maps to one tree.

          The tree has one root, exactly one table with no join to its primary key.

          All joins have downward-pointing arrows (joins unique on one end).

          Outer joins point down (toward the primary key, that is), with only outer joins below outer joins.

          The question that the query answers is basically a question about the entity represented at the top (root) of the tree (or about aggregations of that entity).

          The other tables just provide reference data stored elsewhere for normalization.

 

If we know in advance that good SQL tends to follow these patterns, the main taxonomic grouping of SQL should be simply normal queries that precisely meet the above conditions. The rest of our taxonomy will consist of query classes that fail to meet precisely these conditions in specific ways. I will describe each of these special query classes in terms of the atypical features as seen in the query diagrams, themselves. These join trees provide a convenient shorthand for generalizing the type of abnormality in a form far easier to recognize (once you really understand join trees) than any description that focuses on the SQL, itself. In each case, I will use a convention where I show the essential abnormal feature of the abnormal diagram, but I hide (behind gray “clouds”) the parts of the diagram that have no bearing on the abnormality. Each atypical feature will tend to point to problems or opportunities for changes to the SQL or the database design.

Cyclic join graphs (not trees), case 1

 

 

Figure 2, a case-1 cyclic join graph

 

 

In this query type, you should find somewhere in the WHERE clause joins such as:

 

T1.fkey1=T2.pkey2 and T1.fkey1=T3.pkey3,

 

with T2.pkey2=T3.pkey3, explicitly also in the WHERE clause, or just implied by transitivity. This type of join tree is an opportunity for optimization, not a problem, allowing all possible join orders between these three tables. Depending on the inclusion or exclusion of conditions implied by transitivity, the direct representation of the SQL as written might look like any of the following three subcases:

 

 

Figure 3, alternative join trees where the cyclic join graph may just be implied

 

Subcases A, B, and C show where the missing link in the graphs may only be implied by transitivity. In subcases A and B, it is easy to see from the diagram alone that the foreign key pointing to T3 or T2 must equally point to the primary key of the other table, since those two tables, joining one-to-one, presumably share primary key values. In subcase C, the diagram alone does not imply a hidden cycle – you must notice in the SQL itself that the foreign key pointing from T1 to T2 is the same foreign key that points from T1 to T3.

 

Note that although Case 1 of cyclic join trees does not automatically create a problem, this case does involve a one-to-one join, which we discuss more, later.

Cyclic join graphs (not trees), case 2

Figure 4, a case-2 cyclic join graph

 

In this query type, you should find somewhere in the WHERE clause joins such as:

 

T1.fkey1=T2.pkey2 and T1.fkey2=T3.pkey3 and T2.fkey2=T3.pkey3.

 

Note that you will almost certainly find that T1.fkey2=T2.fkey2, systematically, whenever T2 is a master row to the T1 detail, in other words, wherever T1.fkey1=T2.pkey2. This is because fkey2 in T1 is actually a denormalization, here, for example, a Customer_ID column in Order_Lines, when the normalized value of Customer_ID belongs in Orders. This denormalization, like most denormalizations, is likely not needed (that is, the fkey2 column likely should be dropped from T1), and it certainly isn’t likely that both the join to the normalized foreign key and the denormalized copy of that key is needed from a functional perspective. The extra denormalized join, here is likely to fool the optimizer into expecting that almost no rows will satisfy all three joins, since the optimizer views these joins as statistically independent conditions (which is the opposite of the truth). This false expectation is likely to lead the optimizer to make bad choices, since it will likely drastically underestimate how many rows will remain after performing all three joins.

Cyclic join graphs (not trees), case 3

Figure 5, a case-3 cyclic join graph

 

In this query type, you should find somewhere in the WHERE clause joins such as:

 

T1.fkey1=T2.pkey1 and T1.fkey2=T3.pkey2, and T2.col3=T3.col4.

 

Though this last condition is technically a join, it is best to treat the last join as a constrained filter condition that we cannot evaluate without both T2 and T3. The constrained filter condition requires that you join to both tables to pick up the filter selectivity. In choosing a join order, I generally ignore these constrained filters until one of the tables in the combined filter is reached without regard to the constrained filter, then treat the filter as a single-table filter on the other table.

Cyclic join graphs (not trees), Case 4

 

Figure 6, a case-4 cyclic join tree

 

In this query type, you should find somewhere in the WHERE clause joins such as:

 

T1.fkey1=T2.pkey1 and T1.fkey2=T3.pkey_col1, and T2.col=T3.pkey_col2.

 

Note that T3 has a two-part primary key, here, and that the joins to the two-part key come from two different tables that are themselves joined in a master-detail relationship. This is uncommon, but happens sometimes, especially where the database tries to take advantage of “natural” keys, keys built from columns having some external meaning (as opposed to just being arbitrary numbers generated from some sequence). I don’t generally recommend this practice of using non-arbitrary primary key columns, but SQL developers often have no control over these design decisions, so you may find yourself with cases like this. Treat both joins to T3 as a constrained join – don’t join to T3 until you have reached both T1 and T2 in the join order. For purposes of diagramming, treat the join to T3 as a single arrow having three end points:

 

 

Figure 7, a case-4 cyclic join tree, transformed to show the 3-ended unique join

 

To make this case concrete, here is an example of a query against a hypothetical data dictionary that would see this case:

 

SELECT TC.Column_Name FROM

Indexes Ind, Index_Columns IC, Table_Columns TC

WHERE Ind.Index_Name=’EMPLOYEES_X1’

AND Ind.Index_ID=IC.Index_ID

AND Ind.Table_ID=TC.Table_ID

AND IC.Column_Number=TC.Column_Number

ORDER BY IC.Index_Position ASC

 

Note that the database designer has chosen to use the non-arbitrary value of Column_Number (which indicates where in the column order for the table the column lies, a meaningful property) to serve as the second part of a two-part key for the Table_Columns table. Since it would be a denormalization to store the Table_ID (which is an arbitrary key, to the Tables table) in the Index_Columns table, the full unique join to Table_Columns must combine Table_ID, from the Indexes table and the (non-arbitrary) Column_Number from the Index_Columns table. If, on the other hand, the design simply used a sequence-generated one-part Column_ID for the primary key of Table_Columns, then we would have a simple many-to-one join pointing from Index_Columns to Table_Columns, and there would be no need for a join between Indexes and Table_Columns (although each might share a matching foreign key pointing to Table_ID, still).

Queries with multiple, disconnected trees

 

Figure 8, multiple trees: sub-graphs with no join between them

 

Here, treat each tree (or single, isolated node) as an independent query, and find how many rows each of these independent queries would return. (In other words, produce a query that strips out all references (including references in the FROM clause) to T2 and any table that joins, directly or indirectly, to T2, and a separate query that strips out all references to T1 any table that joins, directly or indirectly, to T1.) Considering these separate queries, there are three cases to consider:

  1. Both queries return at most one row.
  2. One query returns at most one row, and the other at least sometimes returns multiple rows.
  3. Both queries can return multiple rows.

 

In case 1, the full query can be seen as two single-row queries cleverly combined to return a longer single row. From a performance perspective, this is harmless, or even very slightly helpful (since it avoids a round trip to the database) versus the alternative of separate queries. It may be somewhat confusing from a code-maintenance perspective though.

 

In case two, the original query returns a Cartesian product between a single-row set and a multi-row set, which has exactly the same number of result rows as the multi-row set. This saves a round trip to the database versus the result if we used the separate queries, and it should not require extra work inside the database, if the optimizer makes the right choices, but it also means that the expressions queried for the single-row set (in the SELECT list) are passed across the network multiple times, once for each row in the multi-row set, using somewhat more bandwidth. Neither choice, neither the original query nor the separated queries, is likely to be significantly better than the other choice from a performance perspective, as long as the optimizer makes the right choices. Separated queries may be clearer from a code-maintenance perspective, though.

 

In case 3, the original query will at least sometimes return a Cartesian product of two multi-row sets, all possible combinations of the members of those two sets, that is. If the multi-row sets are large, this can be disastrously expensive and slow, compared to the alternative of returning each set in a separate, normal query. It is also likely that the code to correctly handle the Cartesian product is more complex than the code that would handle the separated queries. Three solutions are likely:

  1. Separate the queries, and modify the applications code (simplify it, likely) to handle the sets separately.
  2. Examine the original query looking for missing joins. The most common reason for these Cartesian-product queries is that the developer has simply left out a join condition that would convert the two separate join trees into a single normal tree, some join from a foreign key in a node of one tree into the primary key of the top node in the other tree. (If the join is not into the top node of the other tree, then we’ll be left with two roots!)
  3. You may find that none of the columns of one of the trees is even referenced in the query SELECT list, and the SELECT list begins with the DISTINCT keyword. In this case, almost certainly, you can simple remove all references to the unneeded tree from both the FROM clause and the WHERE clause, and eliminate the DISTINCT operator (which was likely added just to deal with all those pesky duplicates that resulted from the unintended Cartesian product with the unreferenced set). This case sometimes happens as a query evolves and some of the master-table lookups are no longer found to be needed, but the developer fails to remove these master tables from the WHERE clause.

Trees with multiple roots

 

Figure 9, a tree with multiple roots

 

Consider the query for Figure 9, above, and imagine that we begin the execution plan by reading a row of the table Master. Here I’ve introduced a new convention, the join ratios, shown as numbers next to the join arrows. In each case, the number compares the rowcounts before and after performing that join. Thus, for example, if we found 100 rows from Master, we would expect 3000 rows after joining next to Root2, or 500 rows if we joined 100 rows from Master to Root1. (On the unique end of a join, the join ratio is at most 1, less if the join doesn’t always succeed.) Consider what happens to each row from Master, on average – when it joins to Root1, the rowcount is multiplied by 5, and when it joins to Root2, the rowcount is multiplied by 30. The result of joining Master to both Root1 and Root2, we find an average of 150 (5*30) combinations of the 5 rows from Root1 and the 30 rows from Root2 that match each row from Master – a Cartesian product matched to each row we read from Master. These query result rows do not map one-to-one to any single table or business entity, but rather to combinations of Root1 entities and Root2 entities, and it is unlikely that this is a useful business result. Even if it is in some way a useful result, it contains no more data than you would find by separately reading the 5 matches from Root1 for each Master row, and the 30  matches from Root2 for each Master row, which is surely less work than reading the 150 combinations. There are two likely solutions: