Paul White

Incorrect Results with Merge Join

SentryOne Newsletters

The SQLPerformance.com bi-weekly newsletter keeps you up to speed on the most recent blog posts and forum discussions in the SQL Server community.

eNews is a bi-monthly newsletter with fun information about SentryOne, tips to help improve your productivity, and much more.

Subscribe

Featured Author

Jonathan Kehayias is a Principal Consultant with SQLskills and the youngest MCM ever.

Jonathan’s Posts

Every product has bugs, and SQL Server is no exception. Using product features in a slightly unusual way (or combining relatively new features together) is a great way to find them. Bugs can be interesting, and even educational, but perhaps some of the joys are lost when the discovery results in your pager going off at 4am, perhaps after a particularly social night out with friends…

The bug that is the subject of this post is probably reasonably rare in the wild, but it is not a classic edge case. I know of at least one consultant that has encountered it in a production system. On a completely unrelated subject, I should take this opportunity to say "hello" to the Grumpy Old DBA (blog).

I will start with some relevant background on merge joins. If you are certain you already know everything there is to know about merge join, or just want to cut to the chase, feel free to scroll down to the section titled, "The Bug."

Merge Join

Merge join is not a terribly complicated thing, and can be very efficient in the right circumstances. It requires that its inputs are sorted on the join keys, and performs best in one-to-many mode (where at least of its inputs is unique on the join keys). For moderately-sized one-to-many joins, serial merge join is not a bad choice at all, provided the input sorting requirements can be met without performing an explicit sort.

Avoiding a sort is most commonly achieved by exploiting the ordering provided by an index. Merge join can also take advantage of preserved sort order from an earlier, unavoidable sort. A cool thing about merge join is that it can stop processing input rows as soon as either input runs out of rows. One last thing: merge join does not care whether the input sort order is ascending or descending (though both inputs must be the same). The following example uses a standard Numbers table to illustrate most of the points above:

CREATE TABLE #T1 (col1 integer CONSTRAINT PK1 PRIMARY KEY (col1 DESC));
CREATE TABLE #T2 (col1 integer CONSTRAINT PK2 PRIMARY KEY (col1 DESC));

INSERT #T1 SELECT n FROM dbo.Numbers WHERE n BETWEEN 10000 AND 19999;
INSERT #T2 SELECT n FROM dbo.Numbers WHERE n BETWEEN 18000 AND 21999;

Notice that the indexes enforcing the primary keys on those two tables are defined as descending. The query plan for the INSERT has a number of interesting features:

INSERT execution plan

Reading left to right (as is only sensible!) the Clustered Index Insert has the "DML Request Sort" property set. This means the operator requires rows in Clustered Index key order. The clustered index (enforcing the primary key in this case) is defined as DESC, so rows with higher values are required to arrive first. The clustered index on my Numbers table is ASC, so the query optimizer avoids an explicit sort by seeking to the highest match in the Numbers table (21,999) first, then scanning toward the lowest match (18,000) in reverse index order. The "Plan Tree" view in SQL Sentry Plan Explorer shows the reverse (backward) scan clearly:

INSERT Plan Tree

Backward scanning reverses the natural order of the index. A backward scan of an ASC index key returns rows in descending key order; a backward scan of a DESC index key returns rows in ascending key order. The "scan direction" does not indicate returned key order by itself – you have to know whether the index is ASC or DESC to make that determination.

Using these test tables and data (T1 has 10,000 rows numbered from 10,000 to 19,999 inclusive; T2 has 4,000 rows numbered from 18,000 to 21,999) the following query joins the two tables together and returns results in descending order of both keys:

SELECT
    T1.col1,
    T2.col1
FROM #T1 AS T1 
JOIN #T2 AS T2 
    ON T2.col1 = T1.col1 
ORDER BY 
    T1.col1 DESC, 
    T2.col1 DESC;

The query returns the correct matching 2,000 rows as you would expect. The post-execution plan is as follows:

Basic Query Analysis

The Merge Join is not running in many-to-many mode (the top input is unique on the join keys) and the 2,000 row cardinality estimate is exactly correct. The Clustered Index Scan of table T2 is ordered (though we have to wait a moment to discover whether that order is forward or backward) and the cardinality estimate of 4,000 rows is also exactly right. The Clustered Index Scan of table T1 is also ordered, but only 2,001 rows were read whereas 10,000 were estimated. The plan tree view shows both Clustered Index Scans are ordered forward:

Plan Tree View

Recall that reading a DESC index FORWARD will produce rows in reverse key order. This is exactly what is required by the ORDER BY T1.col DESC, T2.col1 DESC clause, so no explicit sort is necessary. Pseudo-code for one-to-many Merge Join (reproduced from Craig Freedman's Merge Join blog) is:

Merge Join Algorithm

The descending order scan of T1 returns rows starting at 19,999 and working down towards 10,000. The descending order scan of T2 returns rows starting at 21,999 and working down towards 18,000. All 4,000 rows in T2 are eventually read, but the iterative merge process stops when key value 17,999 is read from T1, because T2 runs out of rows. Merge processing therefore completes without fully reading T1. It reads rows from 19,999 down to 17,999 inclusive; a total of 2,001 rows as shown in the execution plan above.

Feel free to re-run the test with ASC indexes instead, also changing the ORDER BY clause from DESC to ASC. The execution plan produced will be very similar, and no sorts will be needed.

To summarize the points that will be important in a moment, Merge Join requires join-key sorted inputs, but it does not matter whether the keys are sorted ascending or descending.

The Bug 

To reproduce the bug, at least one of our tables needs to be partitioned. To keep the results manageable, this example will use just a small number of rows, so the partitioning function needs small boundaries as well:

CREATE PARTITION FUNCTION PF (integer)
AS RANGE RIGHT
FOR VALUES (5, 10, 15);

CREATE PARTITION SCHEME PS
AS PARTITION PF
ALL TO ([PRIMARY]);


The first table contains two columns, and is partitioned on the PRIMARY KEY:

CREATE TABLE dbo.T1
(
    T1ID    integer IDENTITY (1,1) NOT NULL,
    SomeID  integer NOT NULL,

    CONSTRAINT [PK dbo.T1 T1ID]
        PRIMARY KEY CLUSTERED (T1ID)
        ON PS (T1ID)
);


The second table is not partitioned. It contains a primary key and a column that will join to the first table:

CREATE TABLE dbo.T2
(
    T2ID    integer IDENTITY (1,1) NOT NULL,
    T1ID    integer NOT NULL,

    CONSTRAINT [PK dbo.T2 T2ID]
        PRIMARY KEY CLUSTERED (T2ID)
        ON [PRIMARY]
);

The Sample Data

The first table has 14 rows, all with the same value in the SomeID column. SQL Server assigns the IDENTITY column values, numbered 1 to 14.

INSERT dbo.T1
    (SomeID)
VALUES
    (123), (123), (123),
    (123), (123), (123),
    (123), (123), (123),
    (123), (123), (123),
    (123), (123);


The second table is simply populated with the IDENTITY values from table one:

INSERT dbo.T2 (T1ID)
SELECT T1ID
FROM dbo.T1;

The data in the two tables looks like this:

Sample Data

The Test Query

The first query simply joins both tables, applying a single WHERE clause predicate (which happens to match all rows in this greatly simplified example):

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123;

The result contains all 14 rows, as expected:

Correct Output

Due to the small number of rows, the optimizer chooses a nested loops join plan for this query:

Nested Loops Plan

The results are the same (and still correct) if we force a hash or merge join:

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (HASH JOIN);

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (MERGE JOIN);

Hash and Merge Plans

The Merge Join there is one-to-many, with an explicit sort on T1ID required for table T2.

The Descending Index Problem

All is well until one day (for good reasons that need not concern us here) another administrator adds a descending index on the SomeID column of table 1:

CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC);


Our query continues to produce correct results when the optimizer chooses a Nested Loops or Hash Join, but it is a different story when a Merge Join is used. The following still uses a query hint to force the Merge Join, but this is just a consequence of the low row counts in the example. The optimizer would naturally choose the same Merge Join plan with different table data.

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (MERGE JOIN);

The execution plan is:

Merge Join Incorrect Results

The optimizer has chosen to use the new index, but the query now produces only five rows of output:

Incorrect Results Output

What happened to the other 9 rows? To be clear, this result is incorrect. The data has not changed, so all 14 rows should be returned (as they still are with a Nested Loops or Hash Join plan).

Cause and Explanation

The new nonclustered index on SomeID is not declared as unique, so the clustered index key is silently added to all nonclustered index levels. SQL Server adds the T1ID column (the clustered key) to the nonclustered index just as if we had created the index like this:

CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID);


Notice the lack of a DESC qualifier on the silently-added T1ID key. Index keys are ASC by default. This is not a problem in itself (though it contributes).  The second thing that happens to our index automatically is that it is partitioned in the same way the base table is. So, the full index specification, if we were to write it out explicitly, would be:

CREATE NONCLUSTERED INDEX [dbo.T1 SomeID]
ON dbo.T1 (SomeID DESC, T1ID ASC)
ON PS (T1ID);


This is now quite a complex structure, with keys in all sorts of different orders. It is complex enough for the query optimizer to get it wrong when reasoning about the sort order provided by the index. To illustrate, consider the following simple query:

SELECT 
    T1ID,
    PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    T1ID ASC;

The extra column will just show us which partition the current row belongs in. Otherwise, it is just a simple query that returns T1ID values in ascending order, WHERE SomeID = 123. Unfortunately, the results are not what is specified by the query:

Incorrect Results

The query requires that T1ID values should be returned in ascending order, but that is not what we get. We do get values in ascending order per partition, but the partitions themselves are returned in reverse order! If the partitions were returned in ascending order (and the T1ID values remained sorted within each partition as shown) the result would be correct.

The query plan shows that the optimizer was confused by the leading DESC key of the index, and thought it needed to read the partitions in reverse order for correct results:

Reversed Partitions

The partition seek starts at the right-most partition (4) and proceeds backwards to partition 1. You might think we could fix the issue by explicitly sorting on partition number ASC in the ORDER BY clause:

SELECT 
    T1ID,
    PartitionID = $PARTITION.PF(T1ID)
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    PartitionID ASC, -- New!
    T1ID ASC;

This query returns the same results (this is not a misprint or a copy/paste error):

Incorrect Results

The partition id is still in descending order (not ascending, as specified) and T1ID is only sorted ascending within each partition. Such is the optimizer's confusion, it really does think (take a deep breath now) that scanning the partitioned leading-descending-key index in a forward direction, but with partitions reversed, will result in the order specified by the query.

I don't blame it to be frank, the various sort order considerations make my head hurt too.

As a final example, consider:

SELECT 
    T1ID
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    T1ID DESC;

The results are:

Incorrect Results

Again, the T1ID sort order within each partition is correctly descending, but the partitions themselves are listed backward (they go from 1 to 3 down the rows). If the partitions were returned in reverse order, the results would correctly be 14, 13, 12, 11, 10, 9, … 5, 4, 3, 2, 1.

Back to the Merge Join

The cause of the incorrect results with the Merge Join query is now apparent:

SELECT
    T2.T2ID
FROM dbo.T1 AS T1
JOIN dbo.T2 AS T2
    ON T2.T1ID = T1.T1ID
WHERE
    T1.SomeID = 123
OPTION (MERGE JOIN);

Merge Revisited

The Merge Join requires sorted inputs. The input from T2 is explicitly sorted by T1TD so that is ok. The optimizer incorrectly reasons that the index on T1 can provide rows in T1ID order. As we have seen, this is not the case. The Index Seek produces the same output as a query we have already seen:

SELECT 
    T1ID
FROM dbo.T1
WHERE
    SomeID = 123
ORDER BY
    T1ID ASC;

Query Results

Only the first 5 rows are in T1ID order. The next value (5) is certainly not in ascending order, and the Merge Join interprets this as end-of-stream rather than producing an error (personally I expected a retail assertion here). Anyway, the effect is that the Merge Join incorrectly finishes processing early. As a reminder, the (incomplete) results are:

Incorrect Query Results

Conclusion

This is a very serious bug in my view. A simple index seek can return results that do not respect the ORDER BY clause. More to the point, the optimizer's internal reasoning is completely broken for partitioned non-unique nonclustered indexes with a descending leading key.

Yes, this is a slightly unusual arrangement. But, as we have seen, correct results can be suddenly replaced by incorrect results just because someone added a descending index. Remember the added index looked innocent enough: no explicit ASC/DESC key mismatch, and no explicit partitioning.

The bug is not limited to Merge Joins. Potentially any query that involves a partitioned table and which relies on index sort order (explicit or implicit) could fall victim. This bug exists in all versions of SQL Server from 2008 to 2014 CTP 1 inclusive. Windows SQL Azure Database does not support partitioning, so the issue does not arise. SQL Server 2005 used a different implementation model for partitioning (based on APPLY) and does not suffer from this issue either.

If you have a moment, please consider voting on my Connect item for this bug.

Resolution

The fix for this issue is now available and documented in a Knowledge Base article. Please note the fix requires a code update and trace flag 4199, which enables a range of other query processor changes. It is unusual for an incorrect-results bug to be fixed under 4199. I asked for clarification on that and the response was:

Even though this problem involves incorrect results like other hotfixes involving the Query Processor we have only enabled this fix under trace flag 4199 for SQL Server 2008, 2008 R2, and 2012. However, this fix is “on” by default without the trace flag in SQL Server 2014 RTM.