提高多个日期范围谓词的性能

Eri*_*ing 10 sql-server sql-server-2019 query-performance

比方说

\n

您有一个接受日期时间数组的存储过程,这些数组被加载到临时表中,并用于过滤表中的日期时间列。

\n
    \n
  • 可以插入任意数量的值作为开始日期和结束日期。
  • \n
  • 日期范围有时可能会重叠,但这不是我经常依赖的情况。
  • \n
  • 也可以提供带有时间的日期。
  • \n
\n

编写查询来执行过滤的最有效方法是什么?

\n

设置

\n
USE StackOverflow2013;\n\nCREATE TABLE\n    #d\n(\n    dfrom datetime,\n    dto datetime,\n    PRIMARY KEY (dfrom, dto)\n)\nINSERT\n    #d\n(\n    dfrom,\n    dto\n)\nSELECT\n    dfrom = \'2013-11-20\',\n    dto =   \'2013-12-05\'\nUNION ALL\nSELECT\n    dfrom = \'2013-11-27\',\n    dto =   \'2013-12-12\'; \n\nCREATE INDEX\n    p\nON dbo.Posts\n    (CreationDate)\nWITH\n    (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);\n
Run Code Online (Sandbox Code Playgroud)\n

询问

\n

我能得到的最好的就是EXISTS像这样使用:

\n
SELECT\n    c = COUNT_BIG(*)\nFROM dbo.Posts AS p\nWHERE EXISTS\n(\n    SELECT\n        1/0\n    FROM #d AS d\n    WHERE p.CreationDate BETWEEN d.dfrom\n                             AND d.dto\n);\n
Run Code Online (Sandbox Code Playgroud)\n

这导致了一个看起来相当悲伤的执行计划:

\n

坚果

\n

嵌套循环是唯一可用的连接运算符,因为我们没有相等谓词。

\n

我\xe2\x80\x99m 寻找的是产生不同类型连接的替代语法。

\n

谢谢!

\n

Pau*_*ite 9

加入

您可能会发现,尽管仍然使用嵌套循环,但联接仍能提供足够的性能。这是大卫·布朗更新答案的变体:

SELECT 
    NumRows = COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
JOIN dbo.Posts AS P
    ON P.CreationDate BETWEEN D.dfrom AND D.dto;
Run Code Online (Sandbox Code Playgroud)

加盟计划

对我来说,运行时间约为 150 毫秒。

删除重叠并连接

如果您有大量重叠的行,则可能值得将它们减少到不同的、不重叠的范围,如 Martin Smith 的回答中所述。我的类似想法的实现是:

WITH 
    Intervals AS
    (
    SELECT 
        IntervalStart =
            ISNULL
            (
                LAG(Q1.NextStart) OVER (ORDER BY Q1.ThisFrom),
                Q1.FirstFrom
            ),
        IntervalEnd = 
            IIF 
            (
                Q1.NextStart IS NOT NULL, 
                Q1.ThisEnd,
                Q1.LastEnd
            )
    FROM 
    (
        SELECT
            ThisFrom = D.dfrom, 
            ThisEnd = D.dto,
            -- Remember the start of the next row because that row
            -- may get filtered out by the outer WHERE clase if it
            -- is not also the end of an interval.
            NextStart = LEAD(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto),
            -- Start point of the first interval
            FirstFrom = MIN(D.dfrom) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
            -- End point of the last interval
            LastEnd = MAX(D.dto) OVER (
                ORDER BY D.dfrom, D.dto 
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
        FROM #d AS D 
        WHERE 
            -- Valid intervals only
            D.dto >= D.dfrom
    ) AS Q1
    WHERE
        -- Interval ends only
        Q1.NextStart > Q1.ThisEnd
        OR Q1.NextStart IS NULL
)
SELECT COUNT_BIG(*) 
FROM dbo.Posts AS P
JOIN Intervals AS N
    ON P.CreationDate BETWEEN N.IntervalStart AND N.IntervalEnd;
Run Code Online (Sandbox Code Playgroud)

合并间隔

对我来说大约需要 40 毫秒。

无连接的动态查找

另一种方法是动态生成文字范围:

DECLARE @SQL nvarchar(max) =
N'
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
';

SELECT @SQL +=
    STRING_AGG
    (
        CONCAT
        (
            CONVERT(nvarchar(max), SPACE(0)),
            N'OR P.CreationDate BETWEEN ',
            N'CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dfrom, 121),
            NCHAR(39),
            N', 121)',
            N' AND CONVERT(datetime, ',
            NCHAR(39),
            CONVERT(nchar(23), D.dto, 121),
            NCHAR(39),
            N', 121)',
            NCHAR(13), NCHAR(10)
        ),
        N''
    )
FROM #d AS D;

SET @SQL += N'OPTION (RECOMPILE);' -- Plan reuse is unlikely

EXECUTE (@SQL);
Run Code Online (Sandbox Code Playgroud)

使用样本数据,会产生:

SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-20 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-05 00:00:00.000', 121)
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-27 00:00:00.000', 121) 
    AND CONVERT(datetime, '2013-12-12 00:00:00.000', 121)
OPTION (RECOMPILE);
Run Code Online (Sandbox Code Playgroud)

哪个 SQL Server 简化为单个范围查找:

寻找

对我来说大约需要 35 毫秒。

一般来说,优化器将尽可能地简化范围(就像合并间隔所做的那样)。单个索引查找运算符中单独范围查找的数量似乎没有限制。1440次搜索后我感到无聊。

索引视图

在索引视图中存储和维护分桶计数可能会有所帮助,而不是一遍又一遍地计算行数。以下实现使用小时粒度:

CREATE OR ALTER VIEW dbo.PostsTimeBucket
WITH SCHEMABINDING
AS
SELECT
    HourBucket =
        DATEADD(HOUR, 
            DATEDIFF(HOUR, 
                CONVERT(datetime, '20000101', 112), 
                P.CreationDate),
            CONVERT(datetime, '20000101', 112)),
    NumRows = COUNT_BIG(*)
FROM dbo.Posts AS P
GROUP BY
    DATEADD(HOUR, 
        DATEDIFF(HOUR, 
            CONVERT(datetime, '20000101', 112), 
            P.CreationDate),
        CONVERT(datetime, '20000101', 112));
GO
CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.PostsTimeBucket HourBucket] 
ON dbo.PostsTimeBucket (HourBucket);
Run Code Online (Sandbox Code Playgroud)

在我的笔记本电脑上创建该视图大约需要 2 秒。在小时粒度上,索引视图为我的 StackOverflow2013 数据库Posts表(包含 17,142,169 行)副本保存了 47,469 行。

大部分工作可以从视图中完成。仅需要单独处理未包含在任何其他范围内的任何范围的开始和结束处的部分小时时段。例如:

-- Helper inline function
CREATE OR ALTER FUNCTION dbo.RoundToHour (@d datetime)
RETURNS table
AS
RETURN
    SELECT
        HourBucket =
            DATEADD(HOUR, 
                DATEDIFF(HOUR, 
                    CONVERT(datetime, '20000101', 112), @d),
                CONVERT(datetime, '20000101', 112));
Run Code Online (Sandbox Code Playgroud)
-- Whole hours covered by any range in the table
SELECT 
    SUM(PTB.NumRows) 
FROM dbo.PostsTimeBucket AS PTB 
    WITH (NOEXPAND)
WHERE 
    PTB.HourBucket >= (SELECT MIN(D.dfrom) FROM #d AS D)
    AND PTB.HourBucket <= (SELECT MAX(D.dto) FROM #d AS D)
    AND EXISTS
    (
        SELECT * 
        FROM #d AS D
        WHERE
            D.dfrom <= PTB.HourBucket
            AND D.dto >= DATEADD(HOUR, 1, PTB.HourBucket)
    )
Run Code Online (Sandbox Code Playgroud)

索引视图

-- Extra rows at start before hour boundary
SELECT 
    COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
CROSS APPLY dbo.RoundToHour(D.dfrom) AS HF
JOIN dbo.Posts AS P
    ON P.CreationDate >= D.dfrom
    AND P.CreationDate < DATEADD(HOUR, 1, HF.HourBucket)
WHERE
    D.dfrom <> HF.HourBucket
    AND NOT EXISTS
    (
        -- Not covered by any hour-long period
        SELECT * 
        FROM #d AS D2
        WHERE
            D2.dfrom <= HF.HourBucket
            AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
    );
Run Code Online (Sandbox Code Playgroud)

额外开始时间

-- Extra rows at end
SELECT 
    COUNT_BIG(DISTINCT P.Id) 
FROM #d AS D
CROSS APPLY dbo.RoundToHour(D.dto) AS HF
JOIN dbo.Posts AS P
    ON P.CreationDate >= HF.HourBucket
    AND P.CreationDate < D.dto
WHERE
    D.dto <> HF.HourBucket
    AND NOT EXISTS
    (
        -- Not covered by any hour-long period
        SELECT * 
        FROM #d AS D2
        WHERE
            D2.dfrom <= HF.HourBucket
            AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
    );
Run Code Online (Sandbox Code Playgroud)

末尾有额外的行

上述三个查询总共在不到 1 毫秒的时间内完成。样本数据中没有非小时时段。随着部分周期越来越多,或者使用不太精确的索引视图粒度,性能会有所下降。