A Taxonomy of Problematic SQL
There are recurring patterns in most bad SQL, and systematic
ways to recognize these patterns and avoid or fix the problems that map to
these patterns. Just as a zoologist needs taxonomy to classify and recognize
the species in an ecosystem, and to understand their relationships, a developer
needs a systematic method to understand the ways that SQL can be broken, and to
repair those problems, especially if the developer is maintaining someone
else’s SQL, often SQL from a source that cannot be reached to explain the
purpose behind the SQL and how it was written. Even without knowing the purpose
behind the SQL, certain patterns can point to recognizable, common problems
that need repair. This presentation, borrowing from a chapter in the author’s
book, SQL Tuning, describes a taxonomy of abnormal
SQL, SQL that often requires repair or points to opportunities for database
design improvements. (O’Reilly retains the copyright on the material directly
from the book.)
Many of the rules for classifying SQL are vastly easier to
express and understand in terms of join trees, abstract representations of the
SQL that I explain in much more detail in the book, SQL Tuning. In case you are
not already familiar with these, and lack a copy of the book, I’ll introduce
join trees briefly, here.
Consider a
query:
SELECT …
FROM Orders O, Order_Details OD, Products P,
Customers C,
Shipments S, Addresses A, Code_Translations ODT, Code_Translations OT
WHERE UPPER(C.Last_Name)
LIKE :Last_Name||'%'
ORDER BY …;
The join tree that abstractly represents this query reflects the
following:
The join tree that represents the above query, then, is as follows:

Figure 1, a join tree
representing the above query
In this diagram:
•
Each table is a node, represented by its alias.
•
Each join is a link, with (usually
downward-pointing) arrows pointing toward any side of the join that is unique.
•
Midpoint arrows point to optional side of any outer
join.
For example, the table Shipments
is represented by the node labeled “S”, and the outer join “S.Address_ID
= A.Address_ID(+)” is represented by the downward-pointing arrow from S to
A, with the midpoint arrow pointing toward the “(+)” side of that join and the
end-point arrow pointing toward A because ADDRESS_ID is unique for the table Addresses.
Following these rules to create
join diagrams, we find a number of regularities among most well-written SQL:
•
The query maps to one tree.
•
The tree has one root, exactly one table with no
join to its primary key.
•
All joins have downward-pointing arrows (joins
unique on one end).
•
Outer joins point down (toward the primary key, that is), with only outer joins below outer joins.
•
The question that the query answers is basically a
question about the entity represented at the top (root) of the tree (or about
aggregations of that entity).
•
The other tables just provide reference data stored
elsewhere for normalization.
If we know in advance that good
SQL tends to follow these patterns, the main taxonomic grouping of SQL should
be simply normal queries that precisely meet the above conditions. The rest of
our taxonomy will consist of query classes that fail to meet precisely these
conditions in specific ways. I will describe each of these special query
classes in terms of the atypical features as seen in the query diagrams,
themselves. These join trees provide a convenient shorthand for generalizing
the type of abnormality in a form far easier to recognize (once you really
understand join trees) than any description that focuses on the SQL, itself. In
each case, I will use a convention where I show the essential abnormal feature
of the abnormal diagram, but I hide (behind gray “clouds”) the parts of the
diagram that have no bearing on the abnormality. Each atypical feature will
tend to point to problems or opportunities for changes to the SQL or the
database design.

Figure 2, a case-1 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey2
and T1.fkey1=T3.pkey3,
with T2.pkey2=T3.pkey3, explicitly also
in the WHERE clause, or just implied by transitivity. This type of join tree is
an opportunity for optimization, not a problem, allowing all possible join
orders between these three tables. Depending on the inclusion or exclusion of
conditions implied by transitivity, the direct representation of the SQL as
written might look like any of the following three subcases:

Figure 3, alternative join
trees where the cyclic join graph may just be implied
Subcases A, B, and C show where the missing
link in the graphs may only be implied by transitivity. In subcases
A and B, it is easy to see from the diagram alone that the foreign key pointing
to T3 or T2 must equally point to the primary key of the other table, since
those two tables, joining one-to-one, presumably share primary key values. In subcase C, the diagram alone does not imply a hidden cycle
– you must notice in the SQL itself that the foreign key pointing from T1 to T2
is the same foreign key that points
from T1 to T3.
Note that although Case 1 of cyclic join trees does not
automatically create a problem, this case does involve a one-to-one join, which
we discuss more, later.

Figure 4, a case-2 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey2 and T1.fkey2=T3.pkey3 and
T2.fkey2=T3.pkey3.
Note that you will almost certainly find that
T1.fkey2=T2.fkey2, systematically, whenever T2 is a master row to the T1
detail, in other words, wherever T1.fkey1=T2.pkey2. This is because fkey2 in T1
is actually a denormalization, here, for example, a Customer_ID column in Order_Lines,
when the normalized value of Customer_ID belongs in
Orders. This denormalization, like most denormalizations, is likely not needed (that is, the fkey2
column likely should be dropped from T1), and it certainly isn’t likely that
both the join to the normalized foreign key and the denormalized
copy of that key is needed from a functional perspective. The extra denormalized join, here is likely to fool the optimizer
into expecting that almost no rows will satisfy all three joins, since the
optimizer views these joins as statistically independent conditions (which is
the opposite of the truth). This false expectation is likely to lead the
optimizer to make bad choices, since it will likely drastically underestimate
how many rows will remain after performing all three joins.

Figure 5, a case-3 cyclic
join graph
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey1 and
T1.fkey2=T3.pkey2, and T2.col3=T3.col4.
Though this last condition is
technically a join, it is best to treat the last join as a constrained filter
condition that we cannot evaluate without both T2 and T3. The constrained
filter condition requires that you join to both tables to pick up the filter
selectivity. In choosing a join order, I generally ignore these constrained
filters until one of the tables in the combined filter is reached without
regard to the constrained filter, then treat the
filter as a single-table filter on the other table.

Figure 6, a case-4 cyclic
join tree
In this
query type, you should find somewhere in the WHERE clause joins such as:
T1.fkey1=T2.pkey1 and T1.fkey2=T3.pkey_col1, and
T2.col=T3.pkey_col2.
Note that T3 has a two-part primary key, here, and that the
joins to the two-part key come from two different tables that are themselves
joined in a master-detail relationship. This is uncommon, but happens sometimes,
especially where the database tries to take advantage of “natural” keys, keys
built from columns having some external meaning (as opposed to just being
arbitrary numbers generated from some sequence). I don’t generally recommend
this practice of using non-arbitrary primary key columns, but SQL developers
often have no control over these design decisions, so you may find yourself
with cases like this. Treat both joins to T3 as a constrained join – don’t join
to T3 until you have reached both T1
and T2 in the join order. For purposes of diagramming, treat the join to T3 as
a single arrow having three end points:

Figure 7, a case-4 cyclic
join tree, transformed to show the 3-ended unique join
To make this case concrete, here
is an example of a query against a hypothetical data dictionary that would see
this case:
SELECT TC.Column_Name
FROM
Indexes
WHERE Ind.Index_Name=’EMPLOYEES_X1’
ORDER BY IC.Index_Position ASC
Note that the database designer
has chosen to use the non-arbitrary value of Column_Number
(which indicates where in the column order for the table the column lies, a
meaningful property) to serve as the second part of a two-part key for the Table_Columns table. Since it would be a denormalization to store the Table_ID
(which is an arbitrary key, to the
Tables table) in the Index_Columns table, the full
unique join to Table_Columns must combine Table_ID, from the Indexes table and the (non-arbitrary) Column_Number from the Index_Columns
table. If, on the other hand, the design simply used a sequence-generated
one-part Column_ID for the primary key of Table_Columns, then we would have a simple many-to-one join
pointing from Index_Columns to Table_Columns,
and there would be no need for a join between Indexes and Table_Columns
(although each might share a matching foreign key pointing to Table_ID, still).

Figure 8, multiple trees:
sub-graphs with no join between them
Here, treat each tree (or single, isolated node) as an
independent query, and find how many rows each of these independent queries
would return. (In other words, produce a query that strips out all references
(including references in the FROM clause) to T2 and any table that joins,
directly or indirectly, to T2, and a separate query that strips out all
references to T1 any table that joins, directly or indirectly, to T1.)
Considering these separate queries, there are three cases to consider:
In case 1, the full query can be seen as two single-row
queries cleverly combined to return a longer single row. From a performance
perspective, this is harmless, or even very slightly helpful (since it avoids a
round trip to the database) versus the alternative of separate queries. It may
be somewhat confusing from a code-maintenance perspective though.
In case two, the original query returns a Cartesian product
between a single-row set and a multi-row set, which has exactly the same number
of result rows as the multi-row set. This saves a round trip to the database
versus the result if we used the separate queries, and it should not require
extra work inside the database, if the optimizer makes the right choices, but
it also means that the expressions queried for the single-row set (in the
SELECT list) are passed across the network multiple times, once for each row in
the multi-row set, using somewhat more bandwidth. Neither choice, neither the
original query nor the separated queries, is likely to be significantly better
than the other choice from a performance perspective, as long as the optimizer
makes the right choices. Separated queries may be clearer from a
code-maintenance perspective, though.
In case 3, the original query will at least sometimes return
a Cartesian product of two multi-row sets, all possible combinations of the
members of those two sets, that is. If the multi-row sets are large, this can
be disastrously expensive and slow, compared to the alternative of returning
each set in a separate, normal query. It is also likely that the code to
correctly handle the Cartesian product is more complex than the code that would
handle the separated queries. Three solutions are likely:

Figure 9, a tree with
multiple roots
Consider the query for Figure 9,
above, and imagine that we begin the execution plan by reading a row of the
table Master. Here I’ve introduced a new convention, the join ratios, shown as
numbers next to the join arrows. In each case, the number compares the rowcounts before and after performing that join. Thus, for
example, if we found 100 rows from Master, we would expect 3000 rows after
joining next to Root2, or 500 rows if we joined 100 rows from Master to Root1.
(On the unique end of a join, the join ratio is at most 1, less if the join
doesn’t always succeed.) Consider what happens to each row from Master, on
average – when it joins to Root1, the rowcount is
multiplied by 5, and when it joins to Root2, the rowcount
is multiplied by 30. The result of joining Master to both Root1 and Root2, we find an average of 150
(5*30) combinations of the 5 rows
from Root1 and the 30 rows from Root2 that match each row from Master – a
Cartesian product matched to each row we read from Master. These
query result rows do not map one-to-one to any single table or business entity,
but rather to combinations of Root1 entities and Root2 entities, and it is
unlikely that this is a useful business result. Even if it is in some way a
useful result, it contains no more data than you would find by separately
reading the 5 matches from Root1 for each Master row, and the 30 matches from Root2
for each Master row, which is surely less work than reading the 150
combinations. There are two likely solutions: