Mar 202013
 

Introduction

Since their introduction in SQL Server 2005, window functions like ROW_NUMBER and RANK have proven to be extremely useful in solving a wide variety of common T-SQL problems. In an attempt to generalize such solutions, database designers often look to incorporate them into views to promote code encapsulation and reuse. Unfortunately, a limitation in the SQL Server query optimizer often means that views* containing window functions do not perform as well as expected. This post works through an illustrative example of the problem, details the reasons, and provides a number of workarounds.

*This problem can also occur in derived tables, common table expressions and in-line functions, but I see it most often with views because they are intentionally written to be more generic.

Window functions

Window functions are distinguished by the presence of an OVER() clause and come in three varieties:

  • Ranking window functions
    • ROW_NUMBER
    • RANK
    • DENSE_RANK
    • NTILE
  • Aggregate window functions
    • MIN, MAX, AVG, SUM
    • COUNT, COUNT_BIG
    • CHECKSUM_AGG
    • STDEV, STDEVP, VAR, VARP
  • Analytic window functions
    • LAG, LEAD
    • FIRST_VALUE, LAST_VALUE
    • PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC, CUME_DIST

The ranking and aggregate window functions were introduced in SQL Server 2005, and considerably extended in SQL Server 2012. The analytic window functions are new for SQL Server 2012.

All of the window functions listed above are susceptible to the optimizer limitation detailed in this article.

Example

Using the AdventureWorks sample database, the task at hand is to write a query that returns all product #878 transactions that occurred on the most recent date available. There are all sorts of ways to express this requirement in T-SQL, but we will choose to write a query that uses a windowing function. The first step is to find transaction records for product #878 and rank them in date order descending:

SELECT
    th.TransactionID,
    th.ReferenceOrderID,
    th.TransactionDate,
    th.Quantity,
    rnk = RANK() OVER (
        ORDER BY th.TransactionDate DESC)
FROM Production.TransactionHistory AS th
WHERE
    th.ProductID = 878
ORDER BY
    rnk;

Sample output for product 878

Original execution plan

The results of the query are as expected, with six transactions occurring on the most recent date available. The execution plan contains a warning triangle, alerting us to a missing index:

Missing index

As usual for missing index suggestions, we need to remember that the recommendation is not the result of a through analysis of the query – it is more of an indication that we need to think a bit about how this query accesses the data it needs.

The suggested index would certainly be more efficient than scanning the table completely, since it would allow an index seek to the particular product we are interested in. The index would also cover all the columns needed, but it would not avoid the sort (by TransactionDate descending). The ideal index for this query would allow a seek on ProductID, return the selected records in reverse TransactionDate order, and cover the other returned columns:

CREATE NONCLUSTERED INDEX ix
ON Production.TransactionHistory
    (ProductID, TransactionDate DESC)
INCLUDE 
    (ReferenceOrderID, Quantity);

With that index in place, the execution plan is much more efficient. The clustered index scan has been replaced by a range seek, and an explicit sort is no longer necessary:

Optimal execution plan

The final step for this query is to limit the results to just those rows that rank #1. We cannot filter directly in the WHERE clause of our query because window functions may only appear in the SELECT and ORDER BY clauses.

We can workaround this restriction using a derived table, common table expression, function, or view. On this occasion, we will use a common table expression (aka an in-line view):

WITH RankedTransactions AS
(
    SELECT
        th.TransactionID,
        th.ReferenceOrderID,
        th.TransactionDate,
        th.Quantity,
        rnk = RANK() OVER (
            ORDER BY th.TransactionDate DESC)
    FROM Production.TransactionHistory AS th
    WHERE
        th.ProductID = 878
)
SELECT
    TransactionID,
    ReferenceOrderID,
    TransactionDate,
    Quantity
FROM RankedTransactions
WHERE
    rnk = 1;

The execution plan is the same as before, with an extra Filter to return only rows ranked #1:

Rank 1 query plan

Filter properties

The query returns the six equally ranked rows we expect:

Six row query result

Generalizing the query

It turns out that our query is very useful, so the decision is taken to generalize it and store the definition in a view. For this to work for any product, we need to do two things: return the ProductID from the view, and partition the ranking function by product:

CREATE VIEW dbo.MostRecentTransactionsPerProduct
WITH SCHEMABINDING
AS
SELECT
    sq1.ProductID,
    sq1.TransactionID,
    sq1.ReferenceOrderID,
    sq1.TransactionDate,
    sq1.Quantity
FROM 
(
    SELECT
        th.ProductID,
        th.TransactionID,
        th.ReferenceOrderID,
        th.TransactionDate,
        th.Quantity,
        rnk = RANK() OVER (
            PARTITION BY th.ProductID
            ORDER BY th.TransactionDate DESC)
    FROM Production.TransactionHistory AS th
) AS sq1
WHERE
    sq1.rnk = 1;

Selecting all the rows from the view results in the following execution plan and correct results:

Full view query plan

View sample results

We can now find the most recent transactions for product 878 with a much simpler query on the view:

SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS mrt 
WHERE
    mrt.ProductID = 878;

Our expectation is that the execution plan for this new query will be exactly the same as before we created the view. The query optimizer should be able to push the filter specified in the WHERE clause down into the view, resulting in an index seek.

We need to stop and think a bit at this point, however. The query optimizer can only produce execution plans that are guaranteed to produce the same results as the logical query specification – is it safe to push our WHERE clause into the view?<

The answer is yes, so long as the column we are filtering on appears in the PARTITION BY clause of the window function in the view. The reasoning is that eliminating complete groups (partitions) from the window function will not affect the ranking of rows returned by the query. The question is, does the SQL Server query optimizer know this? The answer depends on which version of SQL Server we are running.

SQL Server 2005 execution plan

2005 execution plan

A look at the Filter properties in this plan shows it applying two predicates:

Filter properties

The ProductID = 878 predicate has not been pushed down into the view, resulting in a plan that scans our index, ranking every row in the table before filtering for product #878 and rows ranked #1.

The SQL Server 2005 query optimizer cannot push suitable predicates past a window function in a lower query scope (view, common table expression, in-line function or derived table). This limitation applies to all SQL Server 2005 builds.

SQL Server 2008+ execution plan

This is the execution plan for the same query on SQL Server 2008 or later:

2008 execution plan

The ProductID predicate has been successfully pushed past the ranking operators, replacing the index scan with the efficient index seek.

The 2008 query optimizer includes a new simplification rule SelOnSeqPrj (select on sequence project) that is able to push safe outer-scope predicates past window functions. To produce the less efficient plan for this query in SQL Server 2008 or later, we have to temporarily disable this query optimizer feature:

SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS mrt 
WHERE
    mrt.ProductID = 878
OPTION (QUERYRULEOFF SelOnSeqPrj);

Plan with rule disabled

Unfortunately, the SelOnSeqPrj simplification rule only works when the predicate performs a comparison with a constant. For that reason, the following query produces the sub-optimal plan on SQL Server 2008 and later:

DECLARE @ProductID INT = 878;
 
SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS mrt 
WHERE
    mrt.ProductID = @ProductID;

Execution plan with variable reference

The problem can still occur even where the predicate uses a constant value. SQL Server may decide to auto-parameterize trivial queries (one for which an obvious best plan exists). If auto-parameterization is successful, the optimizer sees a parameter instead of a constant, and the SelOnSeqPrj rule is not applied.

For queries where auto-parameterization is not attempted (or where it is determined to be unsafe), the optimization may still fail, if the database option for FORCED PARAMETERIZATION is on. Our test query (with the constant value 878) is not safe for auto-parameterization, but the forced parameterization setting overrides this, resulting in the inefficient plan:

ALTER DATABASE AdventureWorks
SET PARAMETERIZATION FORCED;
GO
SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS mrt 
WHERE
    mrt.ProductID = 878;
GO
ALTER DATABASE AdventureWorks
SET PARAMETERIZATION SIMPLE;

Forced parameterization plan

SQL Server 2008+ Workaround

To allow the optimizer to ‘see’ a constant value for query that references a local variable or parameter we can add an OPTION (RECOMPILE) query hint:

DECLARE @ProductID INT = 878;
 
SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS mrt 
WHERE
    mrt.ProductID = @ProductID
OPTION (RECOMPILE);

Note: The pre-execution (‘estimated’) execution plan still shows an index scan because the value of the variable is not actually set yet. When the query is executed, however, the execution plan shows the desired index seek plan:

Post-execution plan

The SelOnSeqPrj rule does not exist in SQL Server 2005, so OPTION (RECOMPILE) cannot help there. In case you are wondering, the OPTION (RECOMPILE) workaround results in a seek even if the database option for forced parameterization is on.

All versions workaround #1

In some cases, it is possible to replace the problematic view, common table expression, or derived table with a parameterized in-line table-valued function:

CREATE FUNCTION dbo.MostRecentTransactionsForProduct
(
    @ProductID integer
)  
RETURNS TABLE
WITH SCHEMABINDING AS
RETURN
    SELECT
        sq1.ProductID,
        sq1.TransactionID,
        sq1.ReferenceOrderID,
        sq1.TransactionDate,
        sq1.Quantity
    FROM 
    (
        SELECT
            th.ProductID,
            th.TransactionID,
            th.ReferenceOrderID,
            th.TransactionDate,
            th.Quantity,
            rnk = RANK() OVER (
                PARTITION BY th.ProductID
                ORDER BY th.TransactionDate DESC)
        FROM Production.TransactionHistory AS th
        WHERE
            th.ProductID = @ProductID
    ) AS sq1
    WHERE
        sq1.rnk = 1;

This function explicitly places the ProductID predicate in the same scope as the window function, avoiding the optimizer limitation. Written to use the in-line function, our example query becomes:

SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsForProduct(878) AS mrt;

This produces the desired index seek plan on all versions of SQL Server that support window functions. This workaround produces a seek even where the predicate references a parameter or local variable – OPTION (RECOMPILE) is not required.<

The function body could of course be simplified to remove the now-redundant PARTITION BY clause, and to no longer return the ProductID column. I left the definition the same as the view it replaced to more clearly illustrate the cause of the execution plan differences.

All versions workaround #2

The second workaround only applies to ranking window functions that are filtered to return rows numbered or ranked #1 (using ROW_NUMBER, RANK, or DENSE_RANK). This is a very common usage however, so it is worth mentioning.

An additional benefit is that this workaround can produce plans that are even more efficient than the index seek plans seen previously. As a reminder, the previous best plan looked like this:

Previous best execution plan

That execution plan ranks 1,918 rows even though it ultimately returns only 6. We can improve this execution plan by using the window function in an ORDER BY clause instead of ranking rows and then filtering for rank #1:

SELECT TOP (1) WITH TIES
    th.TransactionID,
    th.ReferenceOrderID,
    th.TransactionDate,
    th.Quantity
FROM Production.TransactionHistory AS th
WHERE
    th.ProductID = 878
ORDER BY
    RANK() OVER (
        ORDER BY th.TransactionDate DESC);

Improved execution plan

That query nicely illustrates the use of a window function in the ORDER BY clause, but we can do even better, eliminating the window function completely:

SELECT TOP (1) WITH TIES
    th.TransactionID,
    th.ReferenceOrderID,
    th.TransactionDate,
    th.Quantity
FROM Production.TransactionHistory AS th
WHERE
    th.ProductID = 878
ORDER BY
    th.TransactionDate DESC;

Top Seek Plan

This plan reads only 7 rows from the table to return the same 6-row result set. Why 7 rows? The Top operator is running in WITH TIES mode:

Top properties

It continues to request one row at a time from its subtree until the TransactionDate changes. The seventh row is required for the Top to be sure that no more tied-value rows will qualify.

We can extend the logic of the query above to replace the problematic view definition:

ALTER VIEW dbo.MostRecentTransactionsPerProduct
WITH SCHEMABINDING
AS
SELECT
    p.ProductID,
    Ranked1.TransactionID,
    Ranked1.ReferenceOrderID,
    Ranked1.TransactionDate,
    Ranked1.Quantity
FROM
    -- List of product IDs
    (SELECT ProductID FROM Production.Product) AS p
CROSS APPLY
(
    -- Returns rank #1 results for each product ID
    SELECT TOP (1) WITH TIES
        th.TransactionID,
        th.ReferenceOrderID,
        th.TransactionDate,
        th.Quantity
    FROM Production.TransactionHistory AS th
    WHERE
        th.ProductID = p.ProductID
    ORDER BY
        th.TransactionDate DESC
) AS Ranked1;

The view now uses a CROSS APPLY to combine the results of our optimized ORDER BY query for each product. Our test query is unchanged:

DECLARE @ProductID integer;
SET @ProductID = 878;
 
SELECT
    mrt.ProductID,
    mrt.TransactionID,
    mrt.ReferenceOrderID,
    mrt.TransactionDate,
    mrt.Quantity
FROM dbo.MostRecentTransactionsPerProduct AS mrt 
WHERE
    mrt.ProductID = @ProductID;

Both pre- and post-execution plans show an index seek without needing an OPTION (RECOMPILE) query hint. The following is a post-execution (‘actual’) plan:

APPLY execution plan

If the view had used ROW_NUMBER instead of RANK, the replacement view would simply have omitted the WITH TIES clause on the TOP (1). The new view could also be written as a parameterized in-line table-valued function of course.

One could argue that the original index seek plan with the rnk = 1 predicate could also be optimized to only test 7 rows. After all, the optimizer should know that rankings are produced by the Sequence Project operator in strict ascending order, so execution could end as soon as a row with a rank greater than one is seen. The optimizer does not contain this logic today, however.

Final Thoughts

People are often disappointed by the performance of views that incorporate window functions. The reason can often be traced back to the optimizer limitation described in this post (or perhaps because the view designer did not appreciate that predicates applied to the view must appear in the PARTITION BY clause to be safely pushed down).

I do want to emphasise that this limitation does not just apply to views, and neither is it limited to ROW_NUMBER, RANK, and DENSE_RANK. You should be aware of this limitation when using any function with an OVER clause in a view, common table expression, derived table, or in-line table-valued function.

SQL Server 2005 users that encounter this issue are faced with the choice of rewriting the view as a parameterized in-line table-valued function, or using the APPLY technique (where applicable).

SQL Server 2008 users have the extra option of using an OPTION (RECOMPILE) query hint if the issue can be solved by allowing the optimizer to see a constant instead of a variable or parameter reference. Remember to check post-execution plans when using this hint though: the pre-execution plan cannot generally show the optimal plan.

  7 Responses to “The Problem with Window Functions and Views”

  1. Hi Paul,

    Great post, heres another option for the rank=1 check, you need to flip your index round for this to work.

    You would guess that the query optimiser could figure this out from what you are asking but with a bit of prompting, it finally gets what you want it to do.

    CREATE NONCLUSTERED INDEX ix
    ON Production.TransactionHistory
        (ProductID, TransactionDate)
    INCLUDE 
        (ReferenceOrderID, Quantity);
     
    SELECT th.ProductID,
    	th.TransactionID,
    	th.ReferenceOrderID,
    	th.TransactionDate,
    	th.Quantity
    FROM Production.TransactionHistory AS th
    JOIN (
    	SELECT	ProductID,
    		max(TransactionDate) as mn
    	FROM	Production.TransactionHistory
    	GROUP BY	ProductID
    ) as mins
    ON	mins.ProductID = th.ProductID
    AND	mins.mn = th.TransactionDate
  2. Hi Lee,

    Yes, that was one of the reasons I wrote, "There are all sorts of ways to express this requirement in T-SQL, but we will choose to write a query that uses a windowing function."

    The particular index and rewrite you use there was the subject of my first ever blog post for SQLblog.com:

    The Segment Top Query Optimization

    The particular query I used was just an example to allow me to talk about window functions, but thanks for the feedback anyway.

    Paul

  3. "CREATE VIEW dbo.MostRecentTransactionsPerProduct
    WITH SCHEMABINDING"

    I have noticed you regularly use the 'WITH SCHEMABINDING' option on your VIEWs. Do you have post or something already written for the pros and cons of this? None of my views/UDFs are schemabound now and i am comtemplating the value of it because it is the requirement for putting an index on a view. However, introducing something like this into a well established process and sizable schema seems scary. Thoughts?

  4. Hi Dan,

    There are important reasons to use schema binding with functions (see my recent posts on the Halloween Problem on this site for details). As far as views are concerned, this is partly habit, but I do find it useful when cleaning up test scripts – I can't drop the tables I create and leave the views hanging. More practically, I would often forget to run sp_refreshview or sp_refreshsqlmodule when changing an object that some view depends on, without which unexpected results could occur.

    Paul

  5. Paul, I learned a LOT from this article.. thanks so much for writing it and documenting everything so well!

  6. Great article Paul. Also it would be great if you also update it with Sql server 2012 enhancement (if any you think as mentioned in http://technet.microsoft.com/en-us/library/ms189461)
    Thanks

 Leave a Reply

(required)

(required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="" cssfile="">