Tags: sql-server, sql-server-2008-r2, gaps-and-islands
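I am trying to write a query that counts a customer's visit days while handling overlapping days. Suppose itemID 20009 has a start date of the 23rd and an end date of the 26th. Item 20010 falls within those days, so its purchase date is not added to the total count.
Example scenario: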
Item ID   Start Date   End Date     Number of days   Number of days candidate for visit count
20009     2015-01-23   2015-01-26   4                4
20010     2015-01-24   2015-01-24   1                0
20011     2015-01-23   2015-01-26   4                0
20012     2015-01-23   2015-01-27   5                1
20013     2015-01-23   2015-01-27   5                0
20014     2015-01-29   2015-01-30   2                2
The output should be 7 VisitDays.
Input table:
CREATE TABLE #Items
(
CustID INT,
ItemID INT,
StartDate DATETIME,
EndDate DATETIME
)
INSERT INTO #Items
SELECT 11205, 20009, '2015-01-23', '2015-01-26'
UNION ALL
SELECT 11205, 20010, '2015-01-24', '2015-01-24'
UNION ALL
SELECT 11205, 20011, '2015-01-23', '2015-01-26'
UNION ALL
SELECT 11205, 20012, '2015-01-23', '2015-01-27'
UNION ALL
SELECT 11205, 20012, '2015-01-23', '2015-01-27'
UNION ALL
SELECT 11205, 20012, '2015-01-28', '2015-01-29'
What I have tried so far:
CREATE TABLE #VisitsTable
(
StartDate DATETIME,
EndDate DATETIME
)
INSERT INTO #VisitsTable
SELECT DISTINCT
StartDate,
EndDate
FROM #Items items
WHERE CustID = 11205
ORDER BY StartDate ASC
IF EXISTS (SELECT TOP 1 1 FROM #VisitsTable)
BEGIN
SELECT ISNULL(SUM(VisitDays),1)
FROM ( SELECT DISTINCT
abc.StartDate,
abc.EndDate,
DATEDIFF(DD, abc.StartDate, abc.EndDate) + 1 VisitDays
FROM #VisitsTable abc
INNER JOIN #VisitsTable bc ON bc.StartDate NOT BETWEEN abc.StartDate AND abc.EndDate
) Visits
END
--DROP TABLE #Items
--DROP TABLE #VisitsTable
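There are many questions and articles about packing time intervals, for example Itzik Ben-Gan's Packing Intervals.
You can pack the intervals for a given user. Once packed, there are no overlaps, so you can simply sum the durations of the packed intervals.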
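As a minimal sketch of that last step, assume the intervals have already been packed into a table variable. `@Packed` and its two rows are hypothetical: they were worked out by hand from the question's `#Items` data. Because the dates are inclusive, each interval contributes `DATEDIFF(day, StartDate, EndDate) + 1` days.

```sql
-- Hypothetical, hand-packed intervals for the question's #Items data:
-- the overlapping rows between the 23rd and the 27th collapse into one interval.
DECLARE @Packed TABLE
(
    CustID    INT,
    StartDate DATE,
    EndDate   DATE
);

INSERT INTO @Packed (CustID, StartDate, EndDate)
VALUES (11205, '2015-01-23', '2015-01-27'),
       (11205, '2015-01-28', '2015-01-29');

-- Packed intervals never overlap, so their inclusive lengths can simply be summed.
SELECT CustID,
       SUM(DATEDIFF(day, StartDate, EndDate) + 1) AS VisitDays
FROM @Packed
GROUP BY CustID;
-- 5 + 2 = 7 visit days for CustID 11205
```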
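If your intervals are dates without a time component, I would use a Calendar table. This table simply lists dates for several decades. If you don't have a Calendar table, just create one: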
CREATE TABLE [dbo].[Calendar](
[dt] [date] NOT NULL,
CONSTRAINT [PK_Calendar] PRIMARY KEY CLUSTERED
(
[dt] ASC
));
Populate it, for example, with 100K rows (~270 years) starting from 1900-01-01:
INSERT INTO dbo.Calendar (dt)
SELECT TOP (100000)
DATEADD(day, ROW_NUMBER() OVER (ORDER BY s1.[object_id])-1, '19000101') AS dt
FROM sys.all_objects AS s1 CROSS JOIN sys.all_objects AS s2
OPTION (MAXDOP 1);
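A quick sanity check after populating the table might look like the following. It assumes only the `dbo.Calendar` table defined above; the column aliases are illustrative. If the cross join of `sys.all_objects` with itself yields at least 100,000 rows, which is typically the case, `TotalRows` should be 100000 and `FirstDate` should be 1900-01-01.

```sql
-- Confirm the calendar covers the expected range with one row per day.
SELECT MIN(dt)  AS FirstDate,
       MAX(dt)  AS LastDate,
       COUNT(*) AS TotalRows
FROM dbo.Calendar;
```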
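See also Why are numbers tables "invaluable"?
Once you have a Calendar table, here is how to use it.
Each original row is joined with the Calendar table to return as many rows as there are dates between StartDate and EndDate.
Then we count distinct dates, which removes the overlapping dates.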
SELECT COUNT(DISTINCT CA.dt) AS TotalCount
FROM
#Items AS T
CROSS APPLY
(
SELECT dbo.Calendar.dt
FROM dbo.Calendar
WHERE
dbo.Calendar.dt >= T.StartDate
AND dbo.Calendar.dt <= T.EndDate
) AS CA
WHERE T.CustID = 11205
;
Result
TotalCount
7
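If the packed intervals themselves are of interest, and not only the total, the same Calendar join can feed a gaps-and-islands grouping. This is a sketch rather than part of the answer above: the CTE names and column aliases are made up, and it treats adjacent days (for example the 27th and the 28th) as one continuous interval. It uses only `ROW_NUMBER`, so it should run on SQL Server 2008 R2.

```sql
WITH DistinctDays AS
(
    -- One row per customer and visited day, overlaps removed.
    SELECT DISTINCT T.CustID, C.dt
    FROM #Items AS T
    JOIN dbo.Calendar AS C
      ON C.dt >= T.StartDate
     AND C.dt <= T.EndDate
),
Grouped AS
(
    -- Consecutive days share the same (dt - row number) value.
    SELECT CustID,
           dt,
           grp = DATEADD(day,
                         -1 * ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt),
                         dt)
    FROM DistinctDays
)
SELECT CustID,
       MIN(dt)  AS IslandStart,
       MAX(dt)  AS IslandEnd,
       COUNT(*) AS VisitDays
FROM Grouped
GROUP BY CustID, grp
ORDER BY CustID, IslandStart;
-- For the question's #Items rows: one island, 2015-01-23 to 2015-01-29, 7 days.
```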
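I strongly agree that a Numbers table and a Calendar table are very useful, and that this problem can be simplified a lot with a Calendar table.
I will suggest another solution, though (one that needs neither a Calendar table nor windowed aggregates, as some of the answers in Itzik's linked post do). It may not be the most efficient in all cases (or it may be the worst in all cases!) but I don't think it does any harm to test.
It works by first finding the start dates and end dates that do not overlap with other intervals, then putting them in two separate row sets (one for start dates, one for end dates) so that row numbers can be assigned, and finally matching the 1st start date with the 1st end date, the 2nd with the 2nd, and so on: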
WITH
start_dates AS
( SELECT CustID, StartDate,
Rn = ROW_NUMBER() OVER (PARTITION BY CustID
ORDER BY StartDate)
FROM items AS i
WHERE NOT EXISTS
( SELECT *
FROM Items AS j
WHERE j.CustID = i.CustID
AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate
)
GROUP BY CustID, StartDate
),
end_dates AS
( SELECT CustID, EndDate,
Rn = ROW_NUMBER() OVER (PARTITION BY CustID
ORDER BY EndDate)
FROM items AS i
WHERE NOT EXISTS
( SELECT *
FROM Items AS j
WHERE j.CustID = i.CustID
AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate
)
GROUP BY CustID, EndDate
)
SELECT s.CustID,
Result = SUM( DATEDIFF(day, s.StartDate, e.EndDate) + 1 )
FROM start_dates AS s
JOIN end_dates AS e
ON s.CustID = e.CustID
AND s.Rn = e.Rn
GROUP BY s.CustID ;
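To see which intervals get paired, the final SELECT can list the matched start and end dates instead of summing them. The variant below is a sketch that assumes the data lives in the question's `#Items` temp table (the query above reads from a table named `Items`); the `Days` alias is illustrative.

```sql
WITH
  start_dates AS
    ( SELECT CustID, StartDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID
                                     ORDER BY StartDate)
      FROM #Items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM #Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate < i.StartDate AND i.StartDate <= j.EndDate
            )
      GROUP BY CustID, StartDate
    ),
  end_dates AS
    ( SELECT CustID, EndDate,
             Rn = ROW_NUMBER() OVER (PARTITION BY CustID
                                     ORDER BY EndDate)
      FROM #Items AS i
      WHERE NOT EXISTS
            ( SELECT *
              FROM #Items AS j
              WHERE j.CustID = i.CustID
                AND j.StartDate <= i.EndDate AND i.EndDate < j.EndDate
            )
      GROUP BY CustID, EndDate
    )
-- List each packed interval with its inclusive length.
SELECT s.CustID,
       s.StartDate,
       e.EndDate,
       Days = DATEDIFF(day, s.StartDate, e.EndDate) + 1
FROM start_dates AS s
  JOIN end_dates AS e
    ON  s.CustID = e.CustID
    AND s.Rn = e.Rn
ORDER BY s.CustID, s.StartDate;
-- For the question's #Items rows:
--   2015-01-23 to 2015-01-27 -> 5 days
--   2015-01-28 to 2015-01-29 -> 2 days
```

Only start dates that fall inside no other interval survive the first `NOT EXISTS`, and only end dates that fall inside no other interval survive the second, so the n-th surviving start and the n-th surviving end delimit the n-th packed interval.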
Two indexes, one on (CustID, StartDate, EndDate) and one on (CustID, EndDate, StartDate), would help the performance of this query.
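Those two indexes could be created roughly as follows. The table name `dbo.Items` and the index names are assumptions; adjust them to the actual table.

```sql
-- Supports the start_dates lookup (seek on CustID, range on StartDate).
CREATE INDEX IX_Items_CustID_StartDate_EndDate
    ON dbo.Items (CustID, StartDate, EndDate);

-- Supports the end_dates lookup (seek on CustID, range on EndDate).
CREATE INDEX IX_Items_CustID_EndDate_StartDate
    ON dbo.Items (CustID, EndDate, StartDate);
```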
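One advantage over the Calendar approach (perhaps the only one) is that it can easily be adapted to work with datetime values and to compute the length of the "packed intervals" at a different precision, coarser (weeks, years) or finer (hours, minutes, seconds, milliseconds, etc.), not only in whole dates. A Calendar table of minute or second precision would be quite large, and (cross) joining it to a big table would be a rather interesting experience, though possibly not the most efficient one.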
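As an illustration of the finer-precision case, here is a small sketch that sums already-packed datetime intervals in minutes. The table variable `@PackedTimes` and its rows are hypothetical. For continuous time there is no inclusive end day, so the `+ 1` used for whole dates is dropped.

```sql
-- Hypothetical packed (non-overlapping) datetime intervals.
DECLARE @PackedTimes TABLE
(
    CustID    INT,
    StartDate DATETIME,
    EndDate   DATETIME
);

INSERT INTO @PackedTimes (CustID, StartDate, EndDate)
VALUES (1, '2015-01-23T08:00:00', '2015-01-23T09:30:00'),
       (1, '2015-01-23T13:00:00', '2015-01-23T13:45:00');

-- Only the DATEDIFF datepart changes compared with the day-based query.
SELECT CustID,
       TotalMinutes = SUM(DATEDIFF(minute, StartDate, EndDate))
FROM @PackedTimes
GROUP BY CustID;
-- 90 + 45 = 135 minutes for CustID 1
```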
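(Thanks to Vladimir Baranov): It is rather difficult to make a proper performance comparison, because the performance of the different methods is likely to depend on the data distribution. 1) How long the intervals are: the shorter the intervals, the better the Calendar table performs, because long intervals produce a lot of intermediate rows. 2) How often the intervals overlap: mostly non-overlapping intervals versus most intervals covering the same range. I think the performance of Itzik's solution depends on that. There could be other ways to skew the data, and it is hard to tell how the efficiency of the various methods would be affected.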
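One way to compare the candidates on your own data is to turn on the standard session statistics and run each query against the same table. This is only a sketch of a measurement setup, shown here with the Calendar query as the example; it says nothing about which method will win.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run each candidate query here, one after another, on the same data.
SELECT COUNT(DISTINCT CA.dt) AS TotalCount
FROM #Items AS T
CROSS APPLY
(
    SELECT C.dt
    FROM dbo.Calendar AS C
    WHERE C.dt >= T.StartDate
      AND C.dt <= T.EndDate
) AS CA
WHERE T.CustID = 11205;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

The logical reads and CPU/elapsed times appear in the Messages output, so they can be compared across interval lengths and overlap patterns.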
The first query creates distinct ranges of start date and end date with no overlap.
Note:
Your sample (id=0) is mixed with a sample from Ypercube (id=1).
SELECT DISTINCT its.id
, Start_Date = its.Start_Date
, End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
--, x1=itmax.End_Date, x2=itmin.Start_Date, x3=its.End_Date
FROM @Items its
OUTER APPLY (
SELECT Start_Date = MAX(End_Date) FROM @Items std
WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
) itmin
OUTER APPLY (
SELECT End_Date = MIN(Start_Date) FROM @Items std
WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
) itmax;
id | Start_Date | End_Date
0 | 2015-01-23 00:00:00.0000000 | 2015-01-23 00:00:00.0000000 => 1
0 | 2015-01-24 00:00:00.0000000 | 2015-01-27 00:00:00.0000000 => 4
0 | 2015-01-29 00:00:00.0000000 | 2015-01-30 00:00:00.0000000 => 2
1 | 2016-01-20 00:00:00.0000000 | 2016-01-22 00:00:00.0000000 => 3
1 | 2016-01-23 00:00:00.0000000 | 2016-01-24 00:00:00.0000000 => 2
1 | 2016-01-25 00:00:00.0000000 | 2016-01-29 00:00:00.0000000 => 5
If you use these start and end dates with DATEDIFF:
SELECT DATEDIFF(day
, its.Start_Date
, COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
) + 1
...
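The `+ 1` is needed because `DATEDIFF(day, ...)` counts the day boundaries crossed between two dates, while a visit range includes both its first and its last day. A tiny standalone check, using the second range from the output above:

```sql
SELECT DATEDIFF(day, '2015-01-24', '2015-01-27')     AS DayBoundaries, -- 3
       DATEDIFF(day, '2015-01-24', '2015-01-27') + 1 AS InclusiveDays; -- 4
```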
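The output (with duplicates) is 1, 4 and 2 for id 0 (SUM=7) and 3, 2 and 5 for id 1 (SUM=10).
You then only need to put everything together with a SUM and a GROUP BY: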
SELECT id
, Days = SUM(
DATEDIFF(day, Start_Date, End_Date)+1
)
FROM (
SELECT DISTINCT its.id
, Start_Date = its.Start_Date
, End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date), CASE WHEN itmin.Start_Date > its.End_Date THEN itmin.Start_Date ELSE its.End_Date END)
FROM @Items its
OUTER APPLY (
SELECT Start_Date = MAX(End_Date) FROM @Items std
WHERE std.Item_ID <> its.Item_ID AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
) itmin
OUTER APPLY (
SELECT End_Date = MIN(Start_Date) FROM @Items std
WHERE std.Item_ID <> its.Item_ID AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
) itmax
) as d
GROUP BY id;
id Days
0 7
1 10
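Note that the two OUTER APPLY subqueries above compare each row with rows of every id. That is harmless for this sample, where id 0 and id 1 cover different years, but if ranges of different ids can overlap in time you would presumably want to correlate on id as well. The sketch below adds that predicate; the table variable `@Demo` and its two rows are hypothetical and chosen so that the ids overlap.

```sql
DECLARE @Demo TABLE
(
    id         INT,
    Item_ID    INT,
    Start_Date DATETIME2,
    End_Date   DATETIME2
);

-- Two ids whose ranges overlap each other in time.
INSERT INTO @Demo (id, Item_ID, Start_Date, End_Date)
VALUES (0, 1, '2015-01-23', '2015-01-26'),
       (1, 2, '2015-01-25', '2015-01-28');

SELECT id,
       Days = SUM(DATEDIFF(day, Start_Date, End_Date) + 1)
FROM (
    SELECT DISTINCT its.id,
           Start_Date = its.Start_Date,
           End_Date = COALESCE(DATEADD(day, -1, itmax.End_Date),
                               CASE WHEN itmin.Start_Date > its.End_Date
                                    THEN itmin.Start_Date
                                    ELSE its.End_Date END)
    FROM @Demo its
    OUTER APPLY (
        SELECT Start_Date = MAX(End_Date) FROM @Demo std
        WHERE std.id = its.id              -- added: same id only
          AND std.Item_ID <> its.Item_ID
          AND std.Start_Date < its.Start_Date AND std.End_Date > its.Start_Date
    ) itmin
    OUTER APPLY (
        SELECT End_Date = MIN(Start_Date) FROM @Demo std
        WHERE std.id = its.id              -- added: same id only
          AND std.Item_ID <> its.Item_ID
          AND std.Start_Date > its.Start_Date AND std.Start_Date < its.End_Date
    ) itmax
) AS d
GROUP BY id;
-- With the id predicate: id 0 -> 4 days, id 1 -> 4 days.
-- Without it, the row for id 0 would be cut short at 2015-01-24 by the id 1 row.
```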
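Sample data:
-- The declaration below is reconstructed; the column types are assumed
-- (DATETIME2 matches the seven fractional digits in the output above).
DECLARE @Items TABLE
(
id INT,
Item_ID INT,
Start_Date DATETIME2,
End_Date DATETIME2
);
INSERT INTO @Items
(id, Item_ID, Start_Date, End_Date)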
VALUES
(0, 20009, '2015-01-23', '2015-01-26'),
(0, 20010, '2015-01-24', '2015-01-24'),
(0, 20011, '2015-01-23', '2015-01-26'),
(0, 20012, '2015-01-23', '2015-01-27'),
(0, 20013, '2015-01-23', '2015-01-27'),
(0, 20014, '2015-01-29', '2015-01-30'),
(1, 20009, '2016-01-20', '2016-01-24'),
(1, 20010, '2016-01-23', '2016-01-26'),
(1, 20011, '2016-01-25', '2016-01-29')