Research Problems

2006-Q1: Stored Outlines Sensitive to Schema?

Is it possible to store two different outlines for the same SQL under the DEFAULT category, where that SQL is interpreted differently, depending on which SQL it is interpreted from, and to have each schema use the appropriate outline? For example, in one schema, "Orders," which is referenced in a query, might be a simple, local table, while in another schema, "Orders" is a UNION-ALL view of the local table and a remote-linked archive table, and the execution plans of the query must necessarily depend on the schema from which the query runs. If it is impossible to store two outlines under the DEFAULT category, what happens when the user attempts to execute the SQL, with user_stored_outlines=TRUE, from the wrong schema, the schema to which the outline does not apply? Does Oracle simply ignore the outline, or does it generate a plan that is wrong for that schema, even a plan that results in a wrong-rows bug? Categories seem to be a mechanism to let the process point to the appropriate set of outlines, depending on which schema the application attaches to, but this is inconvenient if the application has not been specifically written to know that it needs to ALTER SESSION to set user_stored_outlines=<The Correct Category for That Schema> when it connects to a particular schema. Is there some way to automate an ALTER SESSION command so that it happens automatically (some sort of trigger on connect?) when the schema is set or changed?

2006-Q2: Undocumented Hints

If you look at stored outlines in Oracle, it appears that Oracle has some undocumented hints that it uses, at least internally. What undocumented hints are available on each RDBMS (Oracle, SQL Server, et cetera), what do they do, and can they be used in user-created SQL, or just internally? (This is an open-ended problem, clearly, so I am not so much looking for a complete answer all at once as any information out there that represents progress toward an answer.)

2006-Q2: Ways to Use Wasted Resources

On most systems, vast system resources are simply wasted, in terms of disk space, memory, and CPU. This likely represents an opportunity, since these resources could be used in conservative ways that would never significantly affect ordinary load, while performing valuable background services. For example, during times of low CPU and I/O consumption, index blocks, in B*-tree indexes, could be continually, incrementally rebuilt, getting rid of enormous amounts of unused space (especially in indexes with common deletes and updates), resulting in more-compact, better-cached indexes without DBA intervention. As another example (which Oracle 10g already does, to a limited degree), post-execution optimization could be performed on high-resource-consumption SQL, automatically, in the background, to find better execution plans for that SQL the next time it executes.

2006-Q3: Bailing out of Bad Plans

If a human being were to execute SQL manually, against a series of tables and indexes that the human could see, he or she would likely choose a good-looking path to the data and start to follow that path. If, however, the human found that the results of following that path violated the assumptions he or she made when choosing that path, and appeared likely to take much longer than originally expected, he or she would probably revisit the question of which execution plan was best and potentially would start over with a better plan before wasting too much time on the bad plan. Does any database do this? Propose a good algorithm for a database that would.

2006-Q4: Opportunistic, Iterative Tuning

Here’s an extension of last quarter’s research problem, which was to consider cases when it might be worthwhile to halt and retune a query in the middle of its execution. Even if it never changed plans “midstream,” the optimizer could still gather information relevant to whether it should choose a new plan the next time that SQL executes. What sort of data should the database gather during execution pertaining to finding a better plan for a later execution. When should it bother? What sort of data should the database gather offline? When should it bother? How and when should it use that data to find a better plan for a later execution? (I have a number of ideas regarding possible answers, but I’d first like to open up the discussion.) Does any database besides Oracle10g (which has some capability along these lines) do something like this? Propose a good algorithm for a database that would. How well can we apply the answers to these questions to finding more effective methods to choose which SQL to tune manually and to improving the actual process of manual tuning?

2007-Q1: Handling Optimization with Links

Does Oracle in its very latest incarnations “look across the link” to see statistics from the other end of the link when optimizing the execution plan of a query that combines data on more than one database? Does any non-Oracle RDBMS do this?

Will any version of Oracle (or any other RDBMS) parse and optimize a raw (PL/SQL-function-free) query remotely, then pass the raw rows to the local database to perform any function calls and any GROUP BY operation that might depend on those function calls?

2007-Q2&Q3: Estimating Self-Caching for a Join to a Co-Clustered Table

What data would optimizers need to calculate the tendency for two specific tables queried on a specific filter to co-cluster, so that the optimizer could correctly estimate self-caching in the joined-to table? What would be the formula to apply that data to a correct estimate of the performance benefits of co-clustering??

2008-H1: Using Declared Denormalizations to Refine Cost Estimates

Most databases allow definition of intricate constraints on tables. Where a database design has denormalizations, these ought to be formally declared, ideally, with constraints that preferably enforce that the denormalizations are followed rigorously. If this is done, is there any database that actually uses these defined constraints to refine cost and cardinality estimates by the cost based optimizer? For example, consider an example: We have a mutli-org implementation of an order entry system, where each transaction table includes an org_id column. Users belonging to a given organization can only see data having the appropriate org_id, using single-org views on top of each table that automatically add a condition “and tab.org_id = <myorg>” to every query for every transaction table. Thus, a query:

SELECT …

FROM Orders O, Order_Details OD, …

WHERE O.Order_Date > <last Sunday>

AND O.Order_ID = OD.ORDER_ID

AND …

uses views Orders and Order_Details built on top of tables All_Orders and All_Order_Details making the original query equivalent to

SELECT …

FROM All_Orders O, All_Order_Details OD, …

WHERE O.Order_Date > <last Sunday>

AND O.Order_ID = OD.ORDER_ID

AND O.Org_ID = <myorg>

AND OD.Org_ID = <myorg>

AND …

Let’s say that there are a hundred organizations. This query ought to return, on average, 1% of the orders (and their details) made since last Sunday, based on the two statistically independent filters on the Orders table. Assuming that the query drives from the Orders table, it will immediately pick up the condition filtering Order rows for only the desired organization. These orders, naturally, will join to Order_Details belonging to the same organization, so the view-imposed condition on OD.Org_ID is wholly redundant, and discards no further rows. If a constraint declares OD.Org_ID as a denormalization, inheriting it’s value from the parent order reached through the join on Order_ID, then the optimizer ought to know that all Order_Details reached though that join will already satisfy the filter on OD.Org_ID, so that filter will have no effect on the estimated cardinality. On the other hand, if the cost-based optimizer fails to take account of declared denormalizations such as this, it will fall back on its usual assumptions, and consider the condition on OD.Org_ID to be statistically independent of the other conditions, and it will then calculate that the joins shown, with the filter conditions, will return just 0.01% of the order details of orders placed since last Sunday, a hundred-fold error in the resulting cost calculation, with corresponding effects on the chances that the optimizer will find the true lowest-cost alternative.

Potentially, there would be valuable use for two sorts of constraint declarations, here: Constraints that are actively enforced by the database, and declared denormalizations that are (hopefully) enforced by the application, but not by the database, but that the optimizer is still instructed to accept on faith for purposes of calculating and comparing cardinalities and costs.

2008-H2: Avoiding Non-Robust Cartesian Products

Occasionally, cost-based optimizers conclude that the most-efficient path to the data in a SQL query is to read two un-joined rowsets, independently, and to find the Cartesian product, that is, to find all possible combinations of the rows in the two sets, where the product, in general, has m*n rows, where m is the number of rows in the first set, and n is the number of rows in the second set. (After the Cartesian product is performed, there may be more joins, usually conventional joins done as nested loops or hash joins, and it is often the case that after these additional joins are performed, there will eventually be a link-up between the two sets originally joined by Cartesian product, where the later-joined tables ultimately join to each other and to both Cartesian-joined sets. The result of these later joins that join the two Cartesian sets to each other can discard many, or even most, of the rows joined in the intermediate results prior to completing the execution plan.

In general, I have found that these Cartesian joins look attractive to the optimizer when the optimizer estimates that one or both of the rowsets being joined has only a single row (or even less), on average. When at least one of the Cartesian joined sets does have just one row (or none, even better!), this works out nicely, in general. The problem is that when the estimate is wrong, even by a little bit, disaster can result. Consider the following hypothetical case:

SELECT …

FROM Order_Details OD, Orders O, Order_Sources OS

WHERE OD.Ship_Date>:yesterday

AND OD.Order_ID=O.Order_ID

AND O.Order_Source_ID=OS.Order_Source_ID

AND OS.Source_Country=’JAPAN’

AND OS.Source_Type=’RESELLER’

Let’s say that there are 10-million order details, with 1-thousandth of those shipped since yesterday (and the optimizer knows this), 1-million orders, and 500 order sources, with histograms reflecting the facts that 20 order sources have Source_Country=’JAPAN’ and 20 order sources have Source_Type=’RESELLER’. The optimizer makes its usual assumption of statistical independence between these two conditions, calculating that most likely there is at most one row meeting both conditions. Under this assumption, the best possible path to the data is to go first to the Order_Sources table, discarding all but one row (or discarding all rows, and halting execution right there!), then to read the 10000 recently-shipped order details with an index on Ship_Date, forming a Cartesian product between those details and the one Order_Source, then reach the orders through its primary key, with nested loops from the 10000 rows in the Cartesian product. If all goes well, that really is the best plan! What happens if just one assumption is a bit off, though? What if it happens that all the Japanese order sources are resellers? (Perhaps the business model in Japan is rather different than for the rest of the company’s operations!) Well, in this case, the Cartesian product will find 200-thousand combinations of Order_Detail and Order_Source, and the optimizer will perform 200-thousand nested loops to Orders. (These will reach any given order at least 20 times, possibly with pretty good caching, but surely with very wasteful logical I/O!) The plan is now a terrible idea, nowhere near as good as a simple plan following nested loops from Order_Details, to Orders, to Order_Sources, and worse still compared to a plan that pre-cached the rows from Order_Sources, then did a hash join of those rows to the result of the nested-loops join between the other two tables. Even if the correlation between these two conditions was only very weak, and there were just two Order_Sources meeting both conditions, the Cartesian product would be a dramatically bad idea.

So, simply, the research problem is how to avoid all the cases of harmful, non-robust Cartesian products, without losing their occasional benefits?

Home