34 sql-server t-sql
As the title suggests, I need some help getting a running total in T-SQL. The problem is that the sum I need is the sum of a count:
sum(count (distinct (customers)))
If I run the count on its own, the result is:
Day | CountCustomers
----------------------
5/1 | 1
5/2 | 0
5/3 | 5
I need the output with the running sum to be:
Day | RunningTotalCustomers
----------------------
5/1 | 1
5/2 | 1
5/3 | 6
I've done running totals before using the coalesce method, but never with a count. I'm not sure how to do it now.
Aar*_*and 53
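Here are a few methods you can compare. First, let's set up a table with some dummy data. I'm populating it with a bunch of random data from sys.all_columns. Well, it's kind of random - I'm making sure the dates are contiguous (which really matters for one of the approaches).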
CREATE TABLE dbo.Hits(Day SMALLDATETIME, CustomerID INT);
CREATE CLUSTERED INDEX x ON dbo.Hits([Day]);
INSERT dbo.Hits SELECT TOP (5000) DATEADD(DAY, r, '20120501'),
COALESCE(ASCII(SUBSTRING(name, s, 1)), 86)
FROM (SELECT name, r = ROW_NUMBER() OVER (ORDER BY name)/10,
s = CONVERT(INT, RIGHT(CONVERT(VARCHAR(20), [object_id]), 1))
FROM sys.all_columns) AS x;
SELECT
Earliest_Day = MIN([Day]),
Latest_Day = MAX([Day]),
Unique_Days = DATEDIFF(DAY, MIN([Day]), MAX([Day])) + 1,
Total_Rows = COUNT(*)
FROM dbo.Hits;
Results:
Earliest_Day         Latest_Day           Unique_Days  Total_Rows
-------------------  -------------------  -----------  ----------
2012-05-01 00:00:00  2013-09-13 00:00:00  501          5000
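Since one of the approaches depends on the dates being contiguous, it can be worth confirming that before running it. The following is a minimal sketch of such a check against the dbo.Hits table created above; the column alias Missing_Days is just a name chosen for illustration.

```sql
-- Compare the number of calendar days spanned by the data
-- with the number of distinct days actually present.
-- A result of 0 means every day in the range has at least one row.
SELECT
  Missing_Days = DATEDIFF(DAY, MIN([Day]), MAX([Day])) + 1
               - COUNT(DISTINCT [Day])
FROM dbo.Hits;
```

Any value above 0 means there are gaps, and the date-based recursive CTE would stop at the first one.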
The data looks like this (5000 rows), though it will look slightly different on your system depending on version and build #:
Day                  CustomerID
-------------------  ----------
2012-05-01 00:00:00  95
2012-05-01 00:00:00  97
2012-05-01 00:00:00  97
2012-05-01 00:00:00  117
2012-05-01 00:00:00  100
...
2012-05-02 00:00:00  110
2012-05-02 00:00:00  110
2012-05-02 00:00:00  95
...
The running total results should look like this (501 rows):
Day                  c   rt
-------------------  --  --
2012-05-01 00:00:00  6   6
2012-05-02 00:00:00  5   11
2012-05-03 00:00:00  4   15
2012-05-04 00:00:00  7   22
2012-05-05 00:00:00  6   28
...
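So the methods I am going to compare are:
self-join
recursive cte with dates
recursive cte with row_number
recursive cte with #temp table
quirky update
cursor
SQL Server 2012
self-join
This is the way people will tell you to do it when they warn you to stay away from cursors, because "set-based is always faster." In some recent experiments I've found that the cursor outpaces this solution.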
;WITH g AS
(
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
)
SELECT g.[Day], g.c, rt = SUM(g2.c)
FROM g INNER JOIN g AS g2
ON g.[Day] >= g2.[Day]
GROUP BY g.[Day], g.c
ORDER BY g.[Day];
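recursive cte with dates
Reminder - this relies on contiguous dates (no gaps), supports up to 10000 levels of recursion, and assumes you know the start date of the range you're interested in (to set the anchor). You could set the anchor dynamically using a subquery, of course, but I wanted to keep things simple.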
;WITH g AS
(
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
), x AS
(
SELECT [Day], c, rt = c
FROM g
WHERE [Day] = '20120501'
UNION ALL
SELECT g.[Day], g.c, x.rt + g.c
FROM x INNER JOIN g
ON g.[Day] = DATEADD(DAY, 1, x.[Day])
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);
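The answer mentions that the anchor could be set dynamically with a subquery instead of a hard-coded date. As a rough sketch of what that might look like (this variant is not part of the timed tests, so its performance relative to the others is unknown), the anchor member can look up the earliest day in dbo.Hits:

```sql
;WITH g AS
(
  SELECT [Day], c = COUNT(DISTINCT CustomerID)
  FROM dbo.Hits
  GROUP BY [Day]
), x AS
(
  -- Anchor: start from the earliest day present in the table
  -- rather than from a literal date.
  SELECT [Day], c, rt = c
  FROM g
  WHERE [Day] = (SELECT MIN([Day]) FROM dbo.Hits)
  UNION ALL
  -- Recursive step: add the next calendar day's count.
  -- This still stops at the first gap in the dates.
  SELECT g.[Day], g.c, x.rt + g.c
  FROM x INNER JOIN g
  ON g.[Day] = DATEADD(DAY, 1, x.[Day])
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);
```

The same caveats apply as for the hard-coded version: the dates must be contiguous and the range must fit within the recursion limit.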
recursive cte with row_number
The ROW_NUMBER() calculation is slightly expensive here. Again this supports a maximum recursion level of 10000, but you don't need to assign the anchor.
;WITH g AS
(
SELECT [Day], rn = ROW_NUMBER() OVER (ORDER BY [Day]),
c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
), x AS
(
SELECT [Day], rn, c, rt = c
FROM g
WHERE rn = 1
UNION ALL
SELECT g.[Day], g.rn, g.c, x.rt + g.c
FROM x INNER JOIN g
ON g.rn = x.rn + 1
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);
recursive cte with #temp table
Stealing from Mikael's answer, as suggested, to include this in the tests.
CREATE TABLE #Hits
(
rn INT PRIMARY KEY,
c INT,
[Day] SMALLDATETIME
);
INSERT INTO #Hits (rn, c, [Day])
SELECT ROW_NUMBER() OVER (ORDER BY [Day]),
COUNT(DISTINCT CustomerID),
[Day]
FROM dbo.Hits
GROUP BY [Day];
WITH x AS
(
SELECT [Day], rn, c, rt = c
FROM #Hits as c
WHERE rn = 1
UNION ALL
SELECT g.[Day], g.rn, g.c, x.rt + g.c
FROM x INNER JOIN #Hits as g
ON g.rn = x.rn + 1
)
SELECT [Day], c, rt
FROM x
ORDER BY [Day]
OPTION (MAXRECURSION 10000);
DROP TABLE #Hits;
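quirky update
Again, I am only including this for completeness; I personally wouldn't rely on this solution since, as I mentioned on another answer, this method is not guaranteed to work at all, and it may break completely in a future version of SQL Server. (I'm doing my best to coerce SQL Server into obeying the order I want, using a hint for the index choice.)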
CREATE TABLE #x([Day] SMALLDATETIME, c INT, rt INT);
CREATE UNIQUE CLUSTERED INDEX x ON #x([Day]);
INSERT #x([Day], c)
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
ORDER BY [Day];
DECLARE @rt1 INT;
SET @rt1 = 0;
UPDATE #x
SET @rt1 = rt = @rt1 + c
FROM #x WITH (INDEX = x);
SELECT [Day], c, rt FROM #x ORDER BY [Day];
DROP TABLE #x;
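cursor
"Beware, there be cursors here! Cursors are evil! You should avoid cursors at all costs!" No, that's not me talking, it's just stuff I hear a lot. Contrary to popular opinion, there are some cases where cursors are appropriate.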
CREATE TABLE #x2([Day] SMALLDATETIME, c INT, rt INT);
CREATE UNIQUE CLUSTERED INDEX x ON #x2([Day]);
INSERT #x2([Day], c)
SELECT [Day], COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
ORDER BY [Day];
DECLARE @rt2 INT, @d SMALLDATETIME, @c INT;
SET @rt2 = 0;
DECLARE c CURSOR LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR SELECT [Day], c FROM #x2 ORDER BY [Day];
OPEN c;
FETCH NEXT FROM c INTO @d, @c;
WHILE @@FETCH_STATUS = 0
BEGIN
SET @rt2 = @rt2 + @c;
UPDATE #x2 SET rt = @rt2 WHERE [Day] = @d;
FETCH NEXT FROM c INTO @d, @c;
END
CLOSE c;
DEALLOCATE c;
SELECT [Day], c, rt FROM #x2 ORDER BY [Day];
DROP TABLE #x2;
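SQL Server 2012
If you are on SQL Server 2012 or later, the enhancements to windowing functionality let us calculate running totals easily without the rapidly growing cost of the self-join (the SUM is calculated in one pass), the complexity of the CTEs (including the requirement of contiguous rows for the better-performing CTE), the unsupported quirky update, or the forbidden cursor. Just be wary of the difference between using RANGE and ROWS, or not specifying either at all - only ROWS avoids an on-disk spool, which will otherwise hamper performance significantly.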
;WITH g AS
(
SELECT [Day], c = COUNT(DISTINCT CustomerID)
FROM dbo.Hits
GROUP BY [Day]
)
SELECT g.[Day], c,
rt = SUM(c) OVER (ORDER BY [Day] ROWS UNBOUNDED PRECEDING)
FROM g
ORDER BY g.[Day];
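To make the RANGE versus ROWS point concrete, here is a small sketch that computes the same running total three ways over the grouped data. The aliases rt_rows, rt_range and rt_default are illustrative names only. When ORDER BY is given without a frame, SQL Server defaults to RANGE UNBOUNDED PRECEDING AND CURRENT ROW, so the third column behaves like the second.

```sql
;WITH g AS
(
  SELECT [Day], c = COUNT(DISTINCT CustomerID)
  FROM dbo.Hits
  GROUP BY [Day]
)
SELECT g.[Day], c,
  -- Explicit ROWS frame: the variant used in the test above.
  rt_rows    = SUM(c) OVER (ORDER BY [Day] ROWS UNBOUNDED PRECEDING),
  -- Explicit RANGE frame.
  rt_range   = SUM(c) OVER (ORDER BY [Day] RANGE UNBOUNDED PRECEDING),
  -- No frame specified: defaults to the RANGE behavior.
  rt_default = SUM(c) OVER (ORDER BY [Day])
FROM g
ORDER BY g.[Day];
```

Because [Day] is unique in g, all three columns should return the same values here; the difference to look for is in the execution plan and I/O, not in the output. If the ORDER BY column had ties, RANGE would give every tied row the same total, while ROWS would accumulate row by row.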
I took each approach and wrapped it in a batch using the following:
SELECT SYSUTCDATETIME();
GO
DBCC DROPCLEANBUFFERS;DBCC FREEPROCCACHE;
-- query here
GO 10
SELECT SYSUTCDATETIME();
Here are the results for total duration, in milliseconds (remember that this includes the DBCC commands each time as well):
method                          run 1      run 2
------------------------------  ---------  ---------
self-join                         1296 ms    1357 ms  -- "supported" non-SQL 2012 winner
recursive cte with dates          1655 ms    1516 ms
recursive cte with row_number    19747 ms   19630 ms
recursive cte with #temp table    1624 ms    1329 ms
quirky update                      880 ms    1030 ms  -- non-SQL 2012 winner
cursor                            1962 ms    1850 ms
SQL Server 2012                    847 ms     917 ms  -- winner if SQL 2012 available
And I did it again without the DBCC commands:
method                          run 1      run 2
------------------------------  ---------  ---------
self-join                         1272 ms    1309 ms  -- "supported" non-SQL 2012 winner
recursive cte with dates          1247 ms    1593 ms
recursive cte with row_number    18646 ms   18803 ms
recursive cte with #temp table    1340 ms    1564 ms
quirky update                     1024 ms    1116 ms  -- non-SQL 2012 winner
cursor                            1969 ms    1835 ms
SQL Server 2012                    600 ms     569 ms  -- winner if SQL 2012 available
Removing both the DBCC commands and the loop, and measuring just one raw iteration:
method                          run 1      run 2
------------------------------  ---------  ---------
self-join                          313 ms     242 ms
recursive cte with dates           217 ms     217 ms
recursive cte with row_number     2114 ms    1976 ms
recursive cte with #temp table      83 ms     116 ms  -- "supported" non-SQL 2012 winner
quirky update                       86 ms      85 ms  -- non-SQL 2012 winner
cursor                            1060 ms     983 ms
SQL Server 2012                     68 ms      40 ms  -- winner if SQL 2012 available
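Finally, I multiplied the row count in the source table by 10 (changing TOP to 50000 and adding another table as a cross join). The results, for a single iteration with no DBCC commands (simply in the interest of time):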
method                          run 1      run 2
------------------------------  ---------  ---------
self-join                         2401 ms    2520 ms
recursive cte with dates           442 ms     473 ms
recursive cte with row_number   144548 ms  147716 ms
recursive cte with #temp table     245 ms     236 ms  -- "supported" non-SQL 2012 winner
quirky update                      150 ms     148 ms  -- non-SQL 2012 winner
cursor                            1453 ms    1395 ms
SQL Server 2012                    131 ms     133 ms  -- winner
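The answer does not show the statement used to build the larger data set, only that TOP was changed to 50000 and another table was cross joined. One possible reconstruction, purely as a sketch, is below. The choice of sys.all_objects as the second table and the ORDER BY inside ROW_NUMBER() are assumptions, so the exact rows (and timings) will differ from the ones reported.

```sql
-- Start from an empty table so the row count is exactly 50000.
TRUNCATE TABLE dbo.Hits;

INSERT dbo.Hits SELECT TOP (50000) DATEADD(DAY, r, '20120501'),
  COALESCE(ASCII(SUBSTRING(name, s, 1)), 86)
FROM (SELECT c.name,
  -- Ten rows per day, as in the original setup.
  r = ROW_NUMBER() OVER (ORDER BY o.[object_id], c.name)/10,
  s = CONVERT(INT, RIGHT(CONVERT(VARCHAR(20), c.[object_id]), 1))
  FROM sys.all_columns AS c
  -- Assumption: any table with enough rows would do here.
  CROSS JOIN sys.all_objects AS o) AS x;
```

This yields roughly 5000 contiguous days, which still fits within the MAXRECURSION 10000 limit used by the recursive CTE approaches.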
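I only measured duration - I'll leave it as an exercise for the reader to compare these approaches on their own data, looking at other metrics that may be important (or may vary with their schema/data). Before drawing any conclusions from this answer, you'll need to test against your own data and schema... these results will almost certainly change as the row counts get higher.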
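As a starting point for comparing metrics other than duration, one option is to turn on SET STATISTICS IO and SET STATISTICS TIME around each query. The sketch below wraps the windowed version; any of the other approaches could be substituted in its place.

```sql
-- Report logical/physical reads per table and CPU/elapsed time
-- for each statement (shown on the Messages tab in Management Studio).
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

;WITH g AS
(
  SELECT [Day], c = COUNT(DISTINCT CustomerID)
  FROM dbo.Hits
  GROUP BY [Day]
)
SELECT g.[Day], c,
  rt = SUM(c) OVER (ORDER BY [Day] ROWS UNBOUNDED PRECEDING)
FROM g
ORDER BY g.[Day];

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Comparing logical reads and CPU time alongside elapsed time gives a fuller picture than duration alone, especially when the data is already cached.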
I've added a sqlfiddle. Results:

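So in my tests, the choice would be:
But again, you should test these against your schema and data. Since this was a contrived test with relatively low row counts, it may as well be a fart in the wind. I've done other tests with different schemas and row counts, and the performance heuristics were quite different... which is why I asked so many follow-up questions on your original question.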
Update
I've blogged more about this here:
Best approaches for running totals - updated for SQL Server 2012
Anonymous 1
Apparently, this is the best solution:
DECLARE @dailyCustomers TABLE (day smalldatetime, CountCustomers int, RunningTotal int)
DECLARE @RunningTotal int
SET @RunningTotal = 0
INSERT INTO @dailyCustomers
SELECT day, CountCustomers, null
FROM Sales
ORDER BY day
UPDATE @dailyCustomers
SET @RunningTotal = RunningTotal = @RunningTotal + CountCustomers
FROM @dailyCustomers
SELECT * FROM @dailyCustomers