Pau*_*ite 59 performance sql-server t-sql sql-server-2012 window-functions query-performance
我需要计算一个日期范围内的滚动总和。为了说明,使用AdventureWorks 示例数据库,以下假设语法将完全满足我的需要:
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 = SUM(TH.ActualCost) OVER (
PARTITION BY TH.ProductID
ORDER BY TH.TransactionDate
RANGE BETWEEN
INTERVAL 45 DAY PRECEDING
AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;
Run Code Online (Sandbox Code Playgroud)
遗憾的是,RANGE窗口框架范围目前在 SQL Server 中不允许间隔。
我知道我可以使用子查询和常规(非窗口)聚合编写解决方案:
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 =
(
SELECT SUM(TH2.ActualCost)
FROM Production.TransactionHistory AS TH2
WHERE
TH2.ProductID = TH.ProductID
AND TH2.TransactionDate <= TH.TransactionDate
AND TH2.TransactionDate >= DATEADD(DAY, -45, TH.TransactionDate)
)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;
Run Code Online (Sandbox Code Playgroud)
鉴于以下索引:
CREATE UNIQUE INDEX i
ON Production.TransactionHistory
(ProductID, TransactionDate, ReferenceOrderID)
INCLUDE
(ActualCost);
Run Code Online (Sandbox Code Playgroud)
执行计划是:

虽然效率不是非常低,但似乎应该可以仅使用 SQL Server 2012、2014 或 2016(到目前为止)支持的窗口聚合和分析函数来表达此查询。
为清楚起见,我正在寻找一种对数据执行单次传递的解决方案。
在T-SQL这很可能意味着,该OVER条款将做的工作,并执行计划将采用窗口线轴和窗口聚集。使用该OVER子句的所有语言元素都是公平游戏。SQLCLR 解决方案是可以接受的,前提是它可以保证产生正确的结果。
对于 T-SQL 解决方案,执行计划中的哈希、排序和窗口假脱机/聚合越少越好。随意添加索引,但不允许使用单独的结构(例如,没有预先计算的表与触发器保持同步)。允许参考表(数字表、日期表等)
理想情况下,解决方案将以与上述子查询版本相同的顺序产生完全相同的结果,但任何可以说是正确的也是可以接受的。性能始终是一个考虑因素,因此解决方案至少应该是合理有效的。
专用聊天室:我创建了一个公共聊天室,用于与此问题及其答案相关的讨论。任何拥有至少 20 个声望点的用户都可以直接参与。如果您的代表少于 20 人并想参加,请在下面的评论中告诉我。
Geo*_*son 44
好问题,保罗!我使用了几种不同的方法,一种在 T-SQL 中,一种在 CLR 中。
T-SQL 快速总结
T-SQL 方法可以概括为以下步骤:
使用SET STATISTICS IO ON,此方法报告Table 'TransactionHistory'. Scan count 1, logical reads 484,这确认了表格上的“单程”。供参考,原始循环搜索查询报告Table 'TransactionHistory'. Scan count 113444, logical reads 438366。
据报道SET STATISTICS TIME ON,CPU 时间为514ms. 这与2231ms原始查询相比更有利。
CLR 快速总结
CLR总结可以概括为以下步骤:
使用SET STATISTICS IO ON,这种方法报告没有发生逻辑 I/O!哇,完美的解决方案!(其实好像SET STATISTICS IO并没有报告CLR内部发生的I/O。但是从代码中不难看出,正好对表进行了一次扫描,并按照Paul建议的索引顺序检索数据。
据报道SET STATISTICS TIME ON,CPU时间是现在187ms。所以这是对 T-SQL 方法的相当大的改进。不幸的是,这两种方法的总体运行时间非常相似,每个都大约为半秒。但是,基于 CLR 的方法确实必须向控制台输出 113K 行(而按产品/日期分组的 T-SQL 方法仅为 52K),所以这就是我关注 CPU 时间的原因。
这种方法的另一个大优点是它产生与原始循环/搜索方法完全相同的结果,即使在同一天多次销售产品的情况下,每笔交易也包括一行。(在 AdventureWorks 上,我专门比较了逐行结果,并确认它们与 Paul 的原始查询相符。)
这种方法的一个缺点,至少在目前的形式中,是它读取内存中的所有数据。然而,设计的算法在任何给定时间只严格需要内存中的当前窗口框架,并且可以更新以适用于超出内存的数据集。保罗在他的回答中通过生成该算法的实现来说明这一点,该实现仅在内存中存储滑动窗口。这是以授予 CLR 程序集更高权限为代价的,但绝对值得将此解决方案扩展到任意大的数据集。
最初设定
USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
INSERT INTO dbo.calendar (d) VALUES (@d)
SELECT @d = DATEADD(DAY, 1, @d)
END
GO
Run Code Online (Sandbox Code Playgroud)
查询
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
SELECT ProductID, TransactionDate, NumOrders, ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates, combined with actual cost information for that product/date
SELECT p.ProductID, c.d AS TransactionDate,
COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.d
GROUP BY P.ProductID, c.d
) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1)
Run Code Online (Sandbox Code Playgroud)
执行计划
从执行计划中,我们看到 Paul 提出的原始索引足以让我们对 执行单次有序扫描Production.TransactionHistory,使用合并连接将交易历史与每个可能的产品/日期组合结合起来。
假设
这种方法有一些重要的假设。我想这将由保罗来决定是否可以接受:)
Production.Product桌子。该表可在 上免费使用,AdventureWorks2012并且该关系由来自 的外键强制执行Production.TransactionHistory,因此我将其解释为公平游戏。AdventureWorks2012;如果他们这样做了,则在不首先通过交易历史记录的情况下,将不再可能生成完整的产品/日期组合集。NumOrders列来指示发生了多少销售。如果产品在同一日期多次销售(例如319/ 2007-09-05 00:00:00.000),请参阅以下屏幕截图,以比较原始查询与建议查询的结果主要功能体
这里没什么可看的;函数主体声明输入(必须与相应的 SQL 函数匹配),建立 SQL 连接,并打开 SQLReader。
USE AdventureWorks2012
GO
-- Create Paul's index
CREATE UNIQUE INDEX i
ON Production.TransactionHistory (ProductID, TransactionDate, ReferenceOrderID)
INCLUDE (ActualCost);
GO
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
INSERT INTO dbo.calendar (d) VALUES (@d)
SELECT @d = DATEADD(DAY, 1, @d)
END
GO
Run Code Online (Sandbox Code Playgroud)
核心逻辑
我已经分离了主要逻辑,以便更容易关注:
DECLARE @minAnalysisDate DATE = '2007-09-01', -- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2008-09-03' -- Customizable end date depending on business needs
SELECT ProductID, TransactionDate, ActualCost, RollingSum45, NumOrders
FROM (
SELECT ProductID, TransactionDate, NumOrders, ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates, combined with actual cost information for that product/date
SELECT p.ProductID, c.d AS TransactionDate,
COUNT(TH.ProductId) AS NumOrders, SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.d BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.d
GROUP BY P.ProductID, c.d
) aggsByDay
) rollingSums
WHERE NumOrders > 0
ORDER BY ProductID, TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1)
Run Code Online (Sandbox Code Playgroud)
帮手
以下逻辑可以内联编写,但将它们拆分为自己的方法时会更容易阅读。
// SQL CLR function for rolling SUMs on AdventureWorks2012.Production.TransactionHistory
[SqlFunction(DataAccess = DataAccessKind.Read,
FillRowMethodName = "RollingSum_Fill",
TableDefinition = "ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT," +
"ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT")]
public static IEnumerable RollingSumTvf(SqlInt32 rollingPeriodDays) {
using (var connection = new SqlConnection("context connection=true;")) {
connection.Open();
List<TrxnRollingSum> trxns;
using (var cmd = connection.CreateCommand()) {
//Read the transaction history (note: the order is important!)
cmd.CommandText = @"SELECT ProductId, TransactionDate, ReferenceOrderID,
CAST(ActualCost AS FLOAT) AS ActualCost
FROM Production.TransactionHistory
ORDER BY ProductId, TransactionDate";
using (var reader = cmd.ExecuteReader()) {
trxns = ComputeRollingSums(reader, rollingPeriodDays.Value);
}
}
return trxns;
}
}
Run Code Online (Sandbox Code Playgroud)
在 SQL 中将它们组合在一起
到目前为止,一切都在 C# 中,所以让我们看看实际涉及的 SQL。(或者,您可以使用此部署脚本直接从我的程序集部分创建程序集,而不是自己编译。)
USE AdventureWorks2012; /* GPATTERSON2\SQL2014DEVELOPER */
GO
-- Enable CLR
EXEC sp_configure 'clr enabled', 1;
GO
RECONFIGURE;
GO
-- Create the assembly based on the dll generated by compiling the CLR project
-- I've also included the "assembly bits" version that can be run without compiling
CREATE ASSEMBLY ClrPlayground
-- See http://pastebin.com/dfbv1w3z for a "from assembly bits" version
FROM 'C:\FullPathGoesHere\ClrPlayground\bin\Debug\ClrPlayground.dll'
WITH PERMISSION_SET = safe;
GO
--Create a function from the assembly
CREATE FUNCTION dbo.RollingSumTvf (@rollingPeriodDays INT)
RETURNS TABLE ( ProductId INT, TransactionDate DATETIME, ReferenceOrderID INT,
ActualCost FLOAT, PrevCumulativeSum FLOAT, RollingSum FLOAT)
-- The function yields rows in order, so let SQL Server know to avoid an extra sort
ORDER (ProductID, TransactionDate, ReferenceOrderID)
AS EXTERNAL NAME ClrPlayground.UserDefinedFunctions.RollingSumTvf;
GO
-- Now we can actually use the TVF!
SELECT *
FROM dbo.RollingSumTvf(45)
ORDER BY ProductId, TransactionDate, ReferenceOrderId
GO
Run Code Online (Sandbox Code Playgroud)
注意事项
CLR 方法为优化算法提供了更大的灵活性,并且可能由 C# 专家进一步调整。但是,CLR 策略也有缺点。需要牢记以下几点:
TRUSTWORTHY并授予EXTERNAL_ACCESSCLR 程序集来获得更高级别的权限。因此存在一些麻烦和潜在的安全隐患,但回报是一种流方法,它可以比 AdventureWorks 上的数据集更好地扩展到更大的数据集。在尝试创造性地思考了一段时间后,我想我也会发布一个相当简单实用的方法,如果它出现在我的日常工作中,我可能会选择解决这个问题。它确实利用了 SQL 2012+ 窗口功能,但不是问题所希望的开创性方式:
-- Compute all running costs into a #temp table; Note that this query could simply read
-- from Production.TransactionHistory, but a CROSS APPLY by product allows the window
-- function to be computed independently per product, supporting a parallel query plan
SELECT t.*
INTO #runningCosts
FROM Production.Product p
CROSS APPLY (
SELECT t.ProductId, t.TransactionDate, t.ReferenceOrderId, t.ActualCost,
-- Running sum of the cost for this product, including all ties on TransactionDate
SUM(t.ActualCost) OVER (
ORDER BY t.TransactionDate
RANGE UNBOUNDED PRECEDING) AS RunningCost
FROM Production.TransactionHistory t
WHERE t.ProductId = p.ProductId
) t
GO
-- Key the table in our output order
ALTER TABLE #runningCosts
ADD PRIMARY KEY (ProductId, TransactionDate, ReferenceOrderId)
GO
SELECT r.ProductId, r.TransactionDate, r.ReferenceOrderId, r.ActualCost,
-- Cumulative running cost - running cost prior to the sliding window
r.RunningCost - ISNULL(w.RunningCost,0) AS RollingSum45
FROM #runningCosts r
OUTER APPLY (
-- For each transaction, find the running cost just before the sliding window begins
SELECT TOP 1 b.RunningCost
FROM #runningCosts b
WHERE b.ProductId = r.ProductId
AND b.TransactionDate < DATEADD(DAY, -45, r.TransactionDate)
ORDER BY b.TransactionDate DESC
) w
ORDER BY r.ProductId, r.TransactionDate, r.ReferenceOrderId
GO
Run Code Online (Sandbox Code Playgroud)
这实际上产生了一个相当简单的整体查询计划,即使同时查看两个相关的查询计划:
我喜欢这种方法的几个原因:
900ms提供的数据集上运行,而不是2700ms原始循环搜索的几个潜在的警告:
Vla*_*nov 26
这是一个很长的答案,所以我决定在这里添加一个摘要。
ProductIDs每个产品的日期范围列表,总结每天的成本(因为有多个具有相同日期的交易),将结果与原始行连接。ProductIDs.我将使用AdventureWorks2014数据库和 SQL Server Express 2014。
对原始数据库的更改:
[Production].[TransactionHistory].[TransactionDate]从更改datetime为date。无论如何,时间分量为零。[dbo].[Calendar][Production].[TransactionHistory].
CREATE TABLE [dbo].[Calendar]
(
[dt] [date] NOT NULL,
CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED
(
[dt] ASC
))
CREATE UNIQUE NONCLUSTERED INDEX [i] ON [Production].[TransactionHistory]
(
[ProductID] ASC,
[TransactionDate] ASC,
[ReferenceOrderID] ASC
)
INCLUDE ([ActualCost])
-- Init calendar table
INSERT INTO dbo.Calendar (dt)
SELECT TOP (50000)
DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '2000-01-01') AS dt
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);
Run Code Online (Sandbox Code Playgroud)
关于OVER子句的MSDN 文章链接到Itzik Ben-Gan撰写的关于窗口函数的优秀博客文章。在那篇文章中,他解释了OVER工作原理、选项ROWS和RANGE选项之间的区别,并提到了在日期范围内计算滚动总和的问题。他提到当前版本的 SQL Server 没有完全实现RANGE,也没有实现时间间隔数据类型。他的区别的解释ROWS,并RANGE给了我一个想法。
没有间隔和重复的日期
如果TransactionHistory表中包含没有间隔且没有重复的日期,则以下查询将产生正确的结果:
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45 = SUM(TH.ActualCost) OVER (
PARTITION BY TH.ProductID
ORDER BY TH.TransactionDate
ROWS BETWEEN
45 PRECEDING
AND CURRENT ROW)
FROM Production.TransactionHistory AS TH
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;
Run Code Online (Sandbox Code Playgroud)
事实上,一个 45 行的窗口正好可以覆盖 45 天。
没有重复的有间隔的日期
不幸的是,我们的数据在日期上存在差距。为了解决这个问题,我们可以使用一个Calendar表来生成一组没有间隙的日期,然后将LEFT JOIN原始数据添加到该集合中,并使用与ROWS BETWEEN 45 PRECEDING AND CURRENT ROW. 仅当日期不重复(在同一 内ProductID)时,才会产生正确的结果。
有重复间隔的日期
不幸的是,我们的数据在日期上存在差距,并且日期可以在同一个ProductID. 为了解决这个问题,我们可以通过GROUP原始数据ProductID, TransactionDate生成一组没有重复的日期。然后使用Calendartable 生成一组没有间隙的日期。然后我们可以使用查询 withROWS BETWEEN 45 PRECEDING AND CURRENT ROW来计算滚动SUM。这将产生正确的结果。请参阅下面查询中的评论。
WITH
-- calculate Start/End dates for each product
CTE_Products
AS
(
SELECT TH.ProductID
,MIN(TH.TransactionDate) AS MinDate
,MAX(TH.TransactionDate) AS MaxDate
FROM [Production].[TransactionHistory] AS TH
GROUP BY TH.ProductID
)
-- generate set of dates without gaps for each product
,CTE_ProductsWithDates
AS
(
SELECT CTE_Products.ProductID, C.dt
FROM
CTE_Products
INNER JOIN dbo.Calendar AS C ON
C.dt >= CTE_Products.MinDate AND
C.dt <= CTE_Products.MaxDate
)
-- generate set of dates without duplicates for each product
-- calculate daily cost as well
,CTE_DailyCosts
AS
(
SELECT TH.ProductID, TH.TransactionDate, SUM(ActualCost) AS DailyActualCost
FROM [Production].[TransactionHistory] AS TH
GROUP BY TH.ProductID, TH.TransactionDate
)
-- calculate rolling sum over 45 days
,CTE_Sum
AS
(
SELECT
CTE_ProductsWithDates.ProductID
,CTE_ProductsWithDates.dt
,CTE_DailyCosts.DailyActualCost
,SUM(CTE_DailyCosts.DailyActualCost) OVER (
PARTITION BY CTE_ProductsWithDates.ProductID
ORDER BY CTE_ProductsWithDates.dt
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
FROM
CTE_ProductsWithDates
LEFT JOIN CTE_DailyCosts ON
CTE_DailyCosts.ProductID = CTE_ProductsWithDates.ProductID AND
CTE_DailyCosts.TransactionDate = CTE_ProductsWithDates.dt
)
-- remove rows that were added by Calendar, which fill the gaps in dates
-- add back duplicate dates that were removed by GROUP BY
SELECT
TH.ProductID
,TH.TransactionDate
,TH.ActualCost
,CTE_Sum.RollingSum45
FROM
[Production].[TransactionHistory] AS TH
INNER JOIN CTE_Sum ON
CTE_Sum.ProductID = TH.ProductID AND
CTE_Sum.dt = TH.TransactionDate
ORDER BY
TH.ProductID
,TH.TransactionDate
,TH.ReferenceOrderID
;
Run Code Online (Sandbox Code Playgroud)
我确认此查询产生与使用子查询的问题的方法相同的结果。
执行计划
第一个查询使用子查询,第二个 - 这种方法。您可以看到这种方法的持续时间和读取次数要少得多。这种方法中的大部分估计成本是最终的ORDER BY,见下文。
子查询方法有一个带有嵌套循环和O(n*n)复杂性的简单计划。
计划此方法扫描TransactionHistory多次,但没有循环。如您所见,估计成本的 70% 以上是Sort最终ORDER BY.
顶部结果 - subquery,底部 - OVER。
上面的计划中最后的Index Scan、Merge Join和Sort是由final INNER JOINwith原表引起的,使最终结果与slow approach with subquery完全一样。返回的行数与TransactionHistory表中的相同。TransactionHistory同一产品在同一天发生多笔交易时,会出现行。如果可以在结果中只显示每日摘要,那么这个 finalJOIN可以被删除,查询变得更简单、更快。上一个计划中的最后一个 Index Scan、Merge Join 和 Sort 被替换为 Filter,它删除了添加的行Calendar。
WITH
-- two scans
-- calculate Start/End dates for each product
CTE_Products
AS
(
SELECT TH.ProductID
,MIN(TH.TransactionDate) AS MinDate
,MAX(TH.TransactionDate) AS MaxDate
FROM [Production].[TransactionHistory] AS TH
GROUP BY TH.ProductID
)
-- generate set of dates without gaps for each product
,CTE_ProductsWithDates
AS
(
SELECT CTE_Products.ProductID, C.dt
FROM
CTE_Products
INNER JOIN dbo.Calendar AS C ON
C.dt >= CTE_Products.MinDate AND
C.dt <= CTE_Products.MaxDate
)
-- generate set of dates without duplicates for each product
-- calculate daily cost as well
,CTE_DailyCosts
AS
(
SELECT TH.ProductID, TH.TransactionDate, SUM(ActualCost) AS DailyActualCost
FROM [Production].[TransactionHistory] AS TH
GROUP BY TH.ProductID, TH.TransactionDate
)
-- calculate rolling sum over 45 days
,CTE_Sum
AS
(
SELECT
CTE_ProductsWithDates.ProductID
,CTE_ProductsWithDates.dt
,CTE_DailyCosts.DailyActualCost
,SUM(CTE_DailyCosts.DailyActualCost) OVER (
PARTITION BY CTE_ProductsWithDates.ProductID
ORDER BY CTE_ProductsWithDates.dt
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
FROM
CTE_ProductsWithDates
LEFT JOIN CTE_DailyCosts ON
CTE_DailyCosts.ProductID = CTE_ProductsWithDates.ProductID AND
CTE_DailyCosts.TransactionDate = CTE_ProductsWithDates.dt
)
-- remove rows that were added by Calendar, which fill the gaps in dates
SELECT
CTE_Sum.ProductID
,CTE_Sum.dt AS TransactionDate
,CTE_Sum.DailyActualCost
,CTE_Sum.RollingSum45
FROM CTE_Sum
WHERE CTE_Sum.DailyActualCost IS NOT NULL
ORDER BY
CTE_Sum.ProductID
,CTE_Sum.dt
;
Run Code Online (Sandbox Code Playgroud)
尽管如此,TransactionHistory还是被扫描了两次。需要额外扫描一次才能获得每个产品的日期范围。我很感兴趣,看看它如何与另一种方法,在这里我们使用对全球范围内的日期外部知识进行比较TransactionHistory,再加上额外的表Product具有所有ProductIDs以避免额外的扫描。我从此查询中删除了每天交易数量的计算以使比较有效。它可以添加到两个查询中,但我想保持简单以进行比较。我还不得不使用其他日期,因为我使用的是 2014 版本的数据库。
DECLARE @minAnalysisDate DATE = '2013-07-31',
-- Customizable start date depending on business needs
@maxAnalysisDate DATE = '2014-08-03'
-- Customizable end date depending on business needs
SELECT
-- one scan
ProductID, TransactionDate, ActualCost, RollingSum45
--, NumOrders
FROM (
SELECT ProductID, TransactionDate,
--NumOrders,
ActualCost,
SUM(ActualCost) OVER (
PARTITION BY ProductId ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW
) AS RollingSum45
FROM (
-- The full cross-product of products and dates,
-- combined with actual cost information for that product/date
SELECT p.ProductID, c.dt AS TransactionDate,
--COUNT(TH.ProductId) AS NumOrders,
SUM(TH.ActualCost) AS ActualCost
FROM Production.Product p
JOIN dbo.calendar c
ON c.dt BETWEEN @minAnalysisDate AND @maxAnalysisDate
LEFT OUTER JOIN Production.TransactionHistory TH
ON TH.ProductId = p.productId
AND TH.TransactionDate = c.dt
GROUP BY P.ProductID, c.dt
) aggsByDay
) rollingSums
--WHERE NumOrders > 0
WHERE ActualCost IS NOT NULL
ORDER BY ProductID, TransactionDate
-- MAXDOP 1 to avoid parallel scan inflating the scan count
OPTION (MAXDOP 1);
Run Code Online (Sandbox Code Playgroud)
两个查询以相同的顺序返回相同的结果。
比较
这是时间和 IO 统计信息。
双扫描变体速度更快,读取次数更少,因为单扫描变体必须大量使用 Worktable。此外,正如您在计划中看到的那样,单次扫描变体生成的行比需要的多。它为表ProductID中的每个生成日期Product,即使 aProductID没有任何事务。Product表中有 504 行,但只有 441 个产品在TransactionHistory. 此外,它为每个产品生成相同的日期范围,这超出了需要。如果TransactionHistory整体历史较长,每个产品的历史相对较短,多余的不需要的行数会更高。
另一方面,可以通过仅在 上创建另一个更窄的索引来进一步优化双扫描变量(ProductID, TransactionDate)。此索引将用于计算每个产品 ( CTE_Products) 的开始/结束日期,它的页数将少于覆盖索引,因此会导致读取次数减少。
所以,我们可以选择,要么有一个额外的显式简单扫描,要么有一个隐式工作表。
顺便说一句,如果只包含每日摘要的结果是可以的,那么最好创建一个不包含ReferenceOrderID. 它将使用更少的页面 => 更少的 IO。
CREATE NONCLUSTERED INDEX [i2] ON [Production].[TransactionHistory]
(
[ProductID] ASC,
[TransactionDate] ASC
)
INCLUDE ([ActualCost])
Run Code Online (Sandbox Code Playgroud)
它变成了一个很长的答案,但这里还有一个变体,它再次只返回每日摘要,但它只对数据进行一次扫描,并且不需要有关日期范围或 ProductID 列表的外部知识。它也不做中间排序。整体性能与之前的变体相似,但似乎更差一些。
主要思想是使用一个数字表来生成可以填补日期空白的行。对于每个现有日期,用于LEAD计算以天为单位的差距大小,然后用于CROSS APPLY将所需的行数添加到结果集中。起初,我用一个永久的数字表尝试了它。该计划在该表中显示了大量读取,尽管实际持续时间与我使用CTE.
WITH
e1(n) AS
(
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
) -- 10
,e2(n) AS (SELECT 1 FROM e1 CROSS JOIN e1 AS b) -- 10*10
,e3(n) AS (SELECT 1 FROM e1 CROSS JOIN e2) -- 10*100
,CTE_Numbers
AS
(
SELECT ROW_NUMBER() OVER (ORDER BY n) AS Number
FROM e3
)
,CTE_DailyCosts
AS
(
SELECT
TH.ProductID
,TH.TransactionDate
,SUM(ActualCost) AS DailyActualCost
,ISNULL(DATEDIFF(day,
TH.TransactionDate,
LEAD(TH.TransactionDate)
OVER(PARTITION BY TH.ProductID ORDER BY TH.TransactionDate)), 1) AS DiffDays
FROM [Production].[TransactionHistory] AS TH
GROUP BY TH.ProductID, TH.TransactionDate
)
,CTE_NoGaps
AS
(
SELECT
CTE_DailyCosts.ProductID
,CTE_DailyCosts.TransactionDate
,CASE WHEN CA.Number = 1
THEN CTE_DailyCosts.DailyActualCost
ELSE NULL END AS DailyCost
FROM
CTE_DailyCosts
CROSS APPLY
(
SELECT TOP(CTE_DailyCosts.DiffDays) CTE_Numbers.Number
FROM CTE_Numbers
ORDER BY CTE_Numbers.Number
) AS CA
)
,CTE_Sum
AS
(
SELECT
ProductID
,TransactionDate
,DailyCost
,SUM(DailyCost) OVER (
PARTITION BY ProductID
ORDER BY TransactionDate
ROWS BETWEEN 45 PRECEDING AND CURRENT ROW) AS RollingSum45
FROM CTE_NoGaps
)
SELECT
ProductID
,TransactionDate
,DailyCost
,RollingSum45
FROM CTE_Sum
WHERE DailyCost IS NOT NULL
ORDER BY
ProductID
,TransactionDate
;
Run Code Online (Sandbox Code Playgroud)
这个计划“更长”,因为查询使用了两个窗口函数(LEAD和SUM)。
Pau*_*ite 24
执行速度更快且需要更少内存的替代 SQLCLR 解决方案:
这需要EXTERNAL_ACCESS权限集,因为它使用到目标服务器和数据库的环回连接而不是(慢)上下文连接。这是调用函数的方法:
SELECT
RS.ProductID,
RS.TransactionDate,
RS.ActualCost,
RS.RollingSum45
FROM dbo.RollingSum
(
N'.\SQL2014', -- Instance name
N'AdventureWorks2012' -- Database name
) AS RS
ORDER BY
RS.ProductID,
RS.TransactionDate,
RS.ReferenceOrderID;
Run Code Online (Sandbox Code Playgroud)
以相同的顺序产生与问题完全相同的结果。
执行计划:
Profiler 逻辑读取:481
这种实现的主要优点是它比使用上下文连接更快,并且使用更少的内存。它在任何时候只在内存中保存两件事:
这种最小的缓存应该确保这个方法可以很好地扩展;肯定比试图将整个输入集保存在 CLR 内存中要好。
Mik*_*son 17
如果您使用的是 SQL Server 2014 的 64 位企业版、开发者版或评估版,则可以使用内存中 OLTP。该解决方案将不是单次扫描,并且几乎不会使用任何窗口函数,但它可能会为这个问题增加一些价值,并且所使用的算法可能会被用作其他解决方案的灵感。
首先,您需要在 AdventureWorks 数据库上启用内存中 OLTP。
alter database AdventureWorks2014
add filegroup InMem contains memory_optimized_data;
alter database AdventureWorks2014
add file (name='AW2014_InMem',
filename='D:\SQL Server\MSSQL12.MSSQLSERVER\MSSQL\DATA\AW2014')
to filegroup InMem;
alter database AdventureWorks2014
set memory_optimized_elevate_to_snapshot = on;
Run Code Online (Sandbox Code Playgroud)
该过程的参数是一个 In-Memory 表变量,必须定义为一个类型。
create type dbo.TransHistory as table
(
ID int not null,
ProductID int not null,
TransactionDate datetime not null,
ReferenceOrderID int not null,
ActualCost money not null,
RunningTotal money not null,
RollingSum45 money not null,
-- Index used in while loop
index IX_T1 nonclustered hash (ID) with (bucket_count = 1000000),
-- Used to lookup the running total as it was 45 days ago (or more)
index IX_T2 nonclustered (ProductID, TransactionDate desc)
) with (memory_optimized = on);
Run Code Online (Sandbox Code Playgroud)
ID不在该表中唯一的,它是用于的每个组合独特ProductID和TransactionDate。
过程中有一些注释会告诉您它的作用,但总体而言它是在循环中计算运行总数,并且对于每次迭代,它都会查找 45 天前(或更长时间)的运行总数。
当前运行总计减去 45 天前的运行总计是我们正在寻找的滚动 45 天总和。
create procedure dbo.GetRolling45
@TransHistory dbo.TransHistory readonly
with native_compilation, schemabinding, execute as owner as
begin atomic with(transaction isolation level = snapshot, language = N'us_english')
-- Table to hold the result
declare @TransRes dbo.TransHistory;
-- Loop variable
declare @ID int = 0;
-- Current ProductID
declare @ProductID int = -1;
-- Previous ProductID used to restart the running total
declare @PrevProductID int;
-- Current transaction date used to get the running total 45 days ago (or more)
declare @TransactionDate datetime;
-- Sum of actual cost for the group ProductID and TransactionDate
declare @ActualCost money;
-- Running total so far
declare @RunningTotal money = 0;
-- Running total as it was 45 days ago (or more)
declare @RunningTotal45 money = 0;
-- While loop for each unique occurence of the combination of ProductID, TransactionDate
while @ProductID <> 0
begin
set @ID += 1;
set @PrevProductID = @ProductID;
-- Get the current values
select @ProductID = min(ProductID),
@TransactionDate = min(TransactionDate),
@ActualCost = sum(ActualCost)
from @TransHistory
where ID = @ID;
if @ProductID <> 0
begin
set @RunningTotal45 = 0;
if @ProductID <> @PrevProductID
begin
-- New product, reset running total
set @RunningTotal = @ActualCost;
end
else
begin
-- Same product as last row, aggregate running total
set @RunningTotal += @ActualCost;
-- Get the running total as it was 45 days ago (or more)
select top(1) @RunningTotal45 = TR.RunningTotal
from @TransRes as TR
where TR.ProductID = @ProductID and
TR.TransactionDate < dateadd(day, -45, @TransactionDate)
order by TR.TransactionDate desc;
end;
-- Add all rows that match ID to the result table
-- RollingSum45 is calculated by using the current running total and the running total as it was 45 days ago (or more)
insert into @TransRes(ID, ProductID, TransactionDate, ReferenceOrderID, ActualCost, RunningTotal, RollingSum45)
select @ID,
@ProductID,
@TransactionDate,
TH.ReferenceOrderID,
TH.ActualCost,
@RunningTotal,
@RunningTotal - @RunningTotal45
from @TransHistory as TH
where ID = @ID;
end
end;
-- Return the result table to caller
select TR.ProductID, TR.TransactionDate, TR.ReferenceOrderID, TR.ActualCost, TR.RollingSum45
from @TransRes as TR
order by TR.ProductID, TR.TransactionDate, TR.ReferenceOrderID;
end;
Run Code Online (Sandbox Code Playgroud)
像这样调用程序。
-- Parameter to stored procedure GetRollingSum
declare @T dbo.TransHistory;
-- Load data to in-mem table
-- ID is unique for each combination of ProductID, TransactionDate
insert into @T(ID, ProductID, TransactionDate, ReferenceOrderID, ActualCost, RunningTotal, RollingSum45)
select dense_rank() over(order by TH.ProductID, TH.TransactionDate),
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID,
TH.ActualCost,
0,
0
from Production.TransactionHistory as TH;
-- Get the rolling 45 days sum
exec dbo.GetRolling45 @T;
Run Code Online (Sandbox Code Playgroud)
在我的计算机上对此进行测试 Client Statistics 报告总执行时间约为 750 毫秒。对于比较,子查询版本需要 3.5 秒。
额外的闲聊:
这个算法也可以被常规的 T-SQL 使用。计算运行总数,range不使用行,并将结果存储在临时表中。然后,您可以使用自联接查询该表到 45 天前的运行总计,并计算滚动总和。但是,由于需要以不同的方式处理 order by 子句的重复项,因此rangecompare to的实现rows非常缓慢,因此我没有使用这种方法获得所有良好的性能。解决方法可能是使用另一个窗口函数,例如last_value()在计算的运行总数上使用rows来模拟range运行总数。另一种方法是使用max() over(). 两者都有一些问题。使用合适的索引来避免排序和避免假脱机max() over()版本。我放弃了优化这些东西,但如果你对我到目前为止的代码感兴趣,请告诉我。
Ken*_*her 13
嗯,这很有趣 :) 我的解决方案比 @GeoffPatterson 的解决方案慢一点,但部分原因是我将回溯到原始表以消除 Geoff 的假设之一(即每个产品/日期对一行) . 我假设这是最终查询的简化版本,可能需要原始表中的其他信息。
注意:我借用了 Geoff 的日历表,实际上最终得到了一个非常相似的解决方案:
-- Build calendar table for 2000 ~ 2020
CREATE TABLE dbo.calendar (d DATETIME NOT NULL CONSTRAINT PK_calendar PRIMARY KEY)
GO
DECLARE @d DATETIME = '1/1/2000'
WHILE (@d < '1/1/2021')
BEGIN
INSERT INTO dbo.calendar (d) VALUES (@d)
SELECT @d = DATEADD(DAY, 1, @d)
END
Run Code Online (Sandbox Code Playgroud)
这是查询本身:
WITH myCTE AS (SELECT PP.ProductID, calendar.d AS TransactionDate,
SUM(ActualCost) AS CostPerDate
FROM Production.Product PP
CROSS JOIN calendar
LEFT OUTER JOIN Production.TransactionHistory PTH
ON PP.ProductID = PTH.ProductID
AND calendar.d = PTH.TransactionDate
CROSS APPLY (SELECT MAX(TransactionDate) AS EndDate,
MIN(TransactionDate) AS StartDate
FROM Production.TransactionHistory) AS Boundaries
WHERE calendar.d BETWEEN Boundaries.StartDate AND Boundaries.EndDate
GROUP BY PP.ProductID, calendar.d),
RunningTotal AS (
SELECT ProductId, TransactionDate, CostPerDate AS TBE,
SUM(myCTE.CostPerDate) OVER (
PARTITION BY myCTE.ProductID
ORDER BY myCTE.TransactionDate
ROWS BETWEEN
45 PRECEDING
AND CURRENT ROW) AS RollingSum45
FROM myCTE)
SELECT
TH.ProductID,
TH.TransactionDate,
TH.ActualCost,
RollingSum45
FROM Production.TransactionHistory AS TH
JOIN RunningTotal
ON TH.ProductID = RunningTotal.ProductID
AND TH.TransactionDate = RunningTotal.TransactionDate
WHERE RunningTotal.TBE IS NOT NULL
ORDER BY
TH.ProductID,
TH.TransactionDate,
TH.ReferenceOrderID;
Run Code Online (Sandbox Code Playgroud)
基本上我决定处理它的最简单方法是使用 ROWS 子句的选项。但这要求我每个ProductID,TransactionDate组合只有一行,而不仅仅是这样,但我必须每个ProductID和有一行possible date。我在 CTE 中结合了 Product、calendar 和 TransactionHistory 表。然后我不得不创建另一个 CTE 来生成滚动信息。我必须这样做,因为如果我将它直接加入到原始表中,我会得到行消除,这会影响我的结果。之后,将我的第二个 CTE 加入到原始表中是一件简单的事情。我确实添加了TBE列(要消除)以摆脱在 CTE 中创建的空白行。我CROSS APPLY还在初始 CTE 中使用了 a来为我的日历表生成边界。
然后我添加了推荐的索引:
CREATE NONCLUSTERED INDEX [TransactionHistory_IX1]
ON [Production].[TransactionHistory] ([TransactionDate])
INCLUDE ([ProductID],[ReferenceOrderID],[ActualCost])
Run Code Online (Sandbox Code Playgroud)
并得到了最终的执行计划:
编辑:最后,我在日历表上添加了一个索引,以合理的幅度提高了性能。
CREATE INDEX ix_calendar ON calendar(d)
Run Code Online (Sandbox Code Playgroud)