Eri*_*ing 10 sql-server sql-server-2019 query-performance
您有一个接受日期时间数组的存储过程,这些数组被加载到临时表中,并用于过滤表中的日期时间列。
\n编写查询来执行过滤的最有效方法是什么?
\nUSE StackOverflow2013;\n\nCREATE TABLE\n #d\n(\n dfrom datetime,\n dto datetime,\n PRIMARY KEY (dfrom, dto)\n)\nINSERT\n #d\n(\n dfrom,\n dto\n)\nSELECT\n dfrom = \'2013-11-20\',\n dto = \'2013-12-05\'\nUNION ALL\nSELECT\n dfrom = \'2013-11-27\',\n dto = \'2013-12-12\'; \n\nCREATE INDEX\n p\nON dbo.Posts\n (CreationDate)\nWITH\n (SORT_IN_TEMPDB = ON, DATA_COMPRESSION = PAGE);\nRun Code Online (Sandbox Code Playgroud)\n我能得到的最好的就是EXISTS像这样使用:
SELECT\n c = COUNT_BIG(*)\nFROM dbo.Posts AS p\nWHERE EXISTS\n(\n SELECT\n 1/0\n FROM #d AS d\n WHERE p.CreationDate BETWEEN d.dfrom\n AND d.dto\n);\nRun Code Online (Sandbox Code Playgroud)\n这导致了一个看起来相当悲伤的执行计划:
\n\n嵌套循环是唯一可用的连接运算符,因为我们没有相等谓词。
\n我\xe2\x80\x99m 寻找的是产生不同类型连接的替代语法。
\n谢谢!
\n您可能会发现,尽管仍然使用嵌套循环,但联接仍能提供足够的性能。这是大卫·布朗更新答案的变体:
SELECT
NumRows = COUNT_BIG(DISTINCT P.Id)
FROM #d AS D
JOIN dbo.Posts AS P
ON P.CreationDate BETWEEN D.dfrom AND D.dto;
Run Code Online (Sandbox Code Playgroud)
对我来说,运行时间约为 150 毫秒。
如果您有大量重叠的行,则可能值得将它们减少到不同的、不重叠的范围,如 Martin Smith 的回答中所述。我的类似想法的实现是:
WITH
Intervals AS
(
SELECT
IntervalStart =
ISNULL
(
LAG(Q1.NextStart) OVER (ORDER BY Q1.ThisFrom),
Q1.FirstFrom
),
IntervalEnd =
IIF
(
Q1.NextStart IS NOT NULL,
Q1.ThisEnd,
Q1.LastEnd
)
FROM
(
SELECT
ThisFrom = D.dfrom,
ThisEnd = D.dto,
-- Remember the start of the next row because that row
-- may get filtered out by the outer WHERE clase if it
-- is not also the end of an interval.
NextStart = LEAD(D.dfrom) OVER (
ORDER BY D.dfrom, D.dto),
-- Start point of the first interval
FirstFrom = MIN(D.dfrom) OVER (
ORDER BY D.dfrom, D.dto
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
-- End point of the last interval
LastEnd = MAX(D.dto) OVER (
ORDER BY D.dfrom, D.dto
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM #d AS D
WHERE
-- Valid intervals only
D.dto >= D.dfrom
) AS Q1
WHERE
-- Interval ends only
Q1.NextStart > Q1.ThisEnd
OR Q1.NextStart IS NULL
)
SELECT COUNT_BIG(*)
FROM dbo.Posts AS P
JOIN Intervals AS N
ON P.CreationDate BETWEEN N.IntervalStart AND N.IntervalEnd;
Run Code Online (Sandbox Code Playgroud)
对我来说大约需要 40 毫秒。
另一种方法是动态生成文字范围:
DECLARE @SQL nvarchar(max) =
N'
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
';
SELECT @SQL +=
STRING_AGG
(
CONCAT
(
CONVERT(nvarchar(max), SPACE(0)),
N'OR P.CreationDate BETWEEN ',
N'CONVERT(datetime, ',
NCHAR(39),
CONVERT(nchar(23), D.dfrom, 121),
NCHAR(39),
N', 121)',
N' AND CONVERT(datetime, ',
NCHAR(39),
CONVERT(nchar(23), D.dto, 121),
NCHAR(39),
N', 121)',
NCHAR(13), NCHAR(10)
),
N''
)
FROM #d AS D;
SET @SQL += N'OPTION (RECOMPILE);' -- Plan reuse is unlikely
EXECUTE (@SQL);
Run Code Online (Sandbox Code Playgroud)
使用样本数据,会产生:
SELECT Rows = COUNT_BIG(*)
FROM dbo.Posts AS P
WHERE 0 = 1
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-20 00:00:00.000', 121)
AND CONVERT(datetime, '2013-12-05 00:00:00.000', 121)
OR P.CreationDate BETWEEN CONVERT(datetime, '2013-11-27 00:00:00.000', 121)
AND CONVERT(datetime, '2013-12-12 00:00:00.000', 121)
OPTION (RECOMPILE);
Run Code Online (Sandbox Code Playgroud)
哪个 SQL Server 简化为单个范围查找:
对我来说大约需要 35 毫秒。
一般来说,优化器将尽可能地简化范围(就像合并间隔所做的那样)。单个索引查找运算符中单独范围查找的数量似乎没有限制。1440次搜索后我感到无聊。
在索引视图中存储和维护分桶计数可能会有所帮助,而不是一遍又一遍地计算行数。以下实现使用小时粒度:
CREATE OR ALTER VIEW dbo.PostsTimeBucket
WITH SCHEMABINDING
AS
SELECT
HourBucket =
DATEADD(HOUR,
DATEDIFF(HOUR,
CONVERT(datetime, '20000101', 112),
P.CreationDate),
CONVERT(datetime, '20000101', 112)),
NumRows = COUNT_BIG(*)
FROM dbo.Posts AS P
GROUP BY
DATEADD(HOUR,
DATEDIFF(HOUR,
CONVERT(datetime, '20000101', 112),
P.CreationDate),
CONVERT(datetime, '20000101', 112));
GO
CREATE UNIQUE CLUSTERED INDEX [CUQ dbo.PostsTimeBucket HourBucket]
ON dbo.PostsTimeBucket (HourBucket);
Run Code Online (Sandbox Code Playgroud)
在我的笔记本电脑上创建该视图大约需要 2 秒。在小时粒度上,索引视图为我的 StackOverflow2013 数据库Posts表(包含 17,142,169 行)副本保存了 47,469 行。
大部分工作可以从视图中完成。仅需要单独处理未包含在任何其他范围内的任何范围的开始和结束处的部分小时时段。例如:
-- Helper inline function
CREATE OR ALTER FUNCTION dbo.RoundToHour (@d datetime)
RETURNS table
AS
RETURN
SELECT
HourBucket =
DATEADD(HOUR,
DATEDIFF(HOUR,
CONVERT(datetime, '20000101', 112), @d),
CONVERT(datetime, '20000101', 112));
Run Code Online (Sandbox Code Playgroud)
-- Whole hours covered by any range in the table
SELECT
SUM(PTB.NumRows)
FROM dbo.PostsTimeBucket AS PTB
WITH (NOEXPAND)
WHERE
PTB.HourBucket >= (SELECT MIN(D.dfrom) FROM #d AS D)
AND PTB.HourBucket <= (SELECT MAX(D.dto) FROM #d AS D)
AND EXISTS
(
SELECT *
FROM #d AS D
WHERE
D.dfrom <= PTB.HourBucket
AND D.dto >= DATEADD(HOUR, 1, PTB.HourBucket)
)
Run Code Online (Sandbox Code Playgroud)
-- Extra rows at start before hour boundary
SELECT
COUNT_BIG(DISTINCT P.Id)
FROM #d AS D
CROSS APPLY dbo.RoundToHour(D.dfrom) AS HF
JOIN dbo.Posts AS P
ON P.CreationDate >= D.dfrom
AND P.CreationDate < DATEADD(HOUR, 1, HF.HourBucket)
WHERE
D.dfrom <> HF.HourBucket
AND NOT EXISTS
(
-- Not covered by any hour-long period
SELECT *
FROM #d AS D2
WHERE
D2.dfrom <= HF.HourBucket
AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
);
Run Code Online (Sandbox Code Playgroud)
-- Extra rows at end
SELECT
COUNT_BIG(DISTINCT P.Id)
FROM #d AS D
CROSS APPLY dbo.RoundToHour(D.dto) AS HF
JOIN dbo.Posts AS P
ON P.CreationDate >= HF.HourBucket
AND P.CreationDate < D.dto
WHERE
D.dto <> HF.HourBucket
AND NOT EXISTS
(
-- Not covered by any hour-long period
SELECT *
FROM #d AS D2
WHERE
D2.dfrom <= HF.HourBucket
AND D2.dto > DATEADD(HOUR, 1, HF.HourBucket)
);
Run Code Online (Sandbox Code Playgroud)
上述三个查询总共在不到 1 毫秒的时间内完成。样本数据中没有非小时时段。随着部分周期越来越多,或者使用不太精确的索引视图粒度,性能会有所下降。