SWe*_*eko 4 sql t-sql sql-server
I have the following data:
StartDate | EndDate
-------------------------
1982.03.02 | 1982.09.30
1982.10.01 | 1985.01.17
1985.06.26 | 1985.07.26
1985.07.30 | 1991.12.31
1992.01.01 | 1995.12.31
1996.01.01 | 2004.05.31
2004.06.05 | 2006.01.31
2006.02.01 | 2011.05.20
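I need to merge any adjacent intervals. Both the start date and the end date are included in an interval, so an interval ending on 2003.05.06 is adjacent to one starting on 2003.05.07. For the data above, the result set should be: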
StartDate | EndDate
-------------------------
1982.03.02 | 1985.01.17
1985.06.26 | 1985.07.26
1985.07.30 | 2004.05.31
2004.06.05 | 2011.05.20
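The obvious way for me to do this is to iterate over the set with a cursor and build the result set row by row. However, this functionality will live in code that may be called thousands of times a day on a heavily loaded server, so I don't want any performance problems. Each data set is small (20 rows at most), but the date range is large, so any solution that generates every date in the range is not feasible.
Is there a better way that I'm not seeing?
Initialization code (from Damien's answer):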
CREATE TABLE Periods (
StartDate datetime NOT NULL CONSTRAINT PK_Periods PRIMARY KEY CLUSTERED,
EndDate datetime NOT NULL
)
INSERT INTO Periods(StartDate,EndDate)
SELECT '19820302', '19820930'
UNION ALL SELECT '19821001', '19850117'
UNION ALL SELECT '19850626', '19850726'
UNION ALL SELECT '19850730', '19911231'
UNION ALL SELECT '19920101', '19951231'
UNION ALL SELECT '19960101', '20040531'
UNION ALL SELECT '20040605', '20060131'
UNION ALL SELECT '20060201', '20110520'
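To make the adjacency rule concrete, here is a minimal sketch that lists each pair of periods the question treats as adjacent. It assumes the Periods table created above and that periods never overlap; the column aliases are only illustrative.

```sql
-- With inclusive end dates, two periods are adjacent when one starts
-- exactly one day after the other ends.
SELECT
    a.StartDate AS FirstStart,
    a.EndDate   AS FirstEnd,
    b.StartDate AS NextStart,
    b.EndDate   AS NextEnd
FROM Periods a
INNER JOIN Periods b
    ON b.StartDate = DateAdd(day, 1, a.EndDate)
ORDER BY a.StartDate;
```

For the sample data this returns four pairs. The period starting 1985.06.26 appears in none of them, which is why it survives unmerged in the expected result.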
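Here is a query that performs best of those submitted so far, with only two table accesses in the execution plan (instead of three or more). All the queries are of course helped by the index. Note that the execution plan rates this query as more expensive, yet its actual reads and CPU are significantly better. The estimated cost in an execution plan is not the same as actual performance.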
WITH Grps AS (
SELECT
(Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
P1.StartDate,
P1.EndDate
FROM
Periods P1
CROSS JOIN (SELECT -1 UNION ALL SELECT 1) D (Dir)
LEFT JOIN Periods P2 ON
DateAdd(Day, D.Dir, P1.StartDate) = P2.EndDate
OR DateAdd(Day, D.Dir, P1.EndDate) = P2.StartDate
WHERE
(Dir = -1 AND P2.EndDate IS NULL)
OR (Dir = 1 AND P2.StartDate IS NULL)
)
SELECT
Min(StartDate) StartDate,
Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;
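Because estimated plan cost and actual cost can disagree, one way to compare candidates is to look at the reads and CPU time SQL Server reports for each run. A minimal sketch, assuming it is run in a query window against the Periods table above; the SELECT is only a placeholder for whichever query is being measured:

```sql
SET STATISTICS IO ON;    -- report logical and physical reads per table
SET STATISTICS TIME ON;  -- report CPU time and elapsed time

-- Placeholder statement: replace with the query being measured.
SELECT StartDate, EndDate
FROM Periods;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

In SQL Server Management Studio the figures appear on the Messages tab rather than in the result grid.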
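Another thing I think is worth mentioning is that querying a date-period table is, in most cases, simpler and better-performing if you use exclusive end dates (also known as "open" end dates) instead of inclusive ones: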
StartDate | EndDate | EndDate
(Inclusive) | (Inclusive) | (Exclusive)
---------------------------------------
1982.03.02 | 1982.09.30 | 1982.10.01
1982.10.01 | 1985.01.17 | 1985.01.18
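As a rough illustration of the table above, the exclusive end date can be derived from the inclusive one by adding one unit of the column's resolution (one day here). This sketch assumes the Periods table from the question; the aliases EndDateInclusive and EndDateExclusive are made up for the example.

```sql
-- Show each period with both its inclusive and its exclusive end date.
SELECT
    StartDate,
    EndDate                  AS EndDateInclusive,
    DateAdd(day, 1, EndDate) AS EndDateExclusive
FROM Periods
ORDER BY StartDate;
```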
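Using exclusive end dates is (in my opinion) best practice in most cases, because it lets you change the data type of the date column or the resolution of the dates without affecting any queries, code, or other logic. For example, if your dates needed to be to the nearest 12 hours instead of 24, with inclusive end dates you would have a lot of work to do, whereas with exclusive end dates nothing would need to change!
With exclusive end dates, my query would look like this: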
WITH Grps AS (
SELECT
(Row_Number() OVER (ORDER BY P1.StartDate) - 1) / 2 Grp,
P1.StartDate,
P1.EndDate
FROM
Periods P1
CROSS JOIN (SELECT 1 UNION ALL SELECT 2) X (Which)
LEFT JOIN Periods P2 ON
(X.Which = 1 AND P1.StartDate = P2.EndDate)
OR (X.Which = 2 AND P1.EndDate = P2.StartDate)
WHERE
P2.EndDate IS NULL
OR P2.StartDate IS NULL
)
SELECT
Min(StartDate) StartDate,
Max(EndDate) EndDate
FROM Grps
GROUP BY Grp;
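Note that there is now no DateAdd or DateDiff with a hard-coded value of "1 day" that you would have to change if you switched to, say, 12-hour periods.
Here is an updated query that incorporates what I have learned in the nearly 5 years since. This query has no joins at all. It does have 3 sort operations, which could be a performance problem, but I think it will compete reasonably well and, in the absence of an index, will probably beat all the others.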
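A small sketch of that point, using a hypothetical table variable @HalfDays holding 12-hour periods with exclusive end dates (the name and the data are invented for illustration). The adjacency test is a plain equality and does not mention the period length at all:

```sql
-- Hypothetical 12-hour periods; EndDate is exclusive.
DECLARE @HalfDays TABLE (
    StartDate datetime NOT NULL,
    EndDate   datetime NOT NULL
);

INSERT INTO @HalfDays (StartDate, EndDate)
VALUES
    ('2011-05-20T00:00:00', '2011-05-20T12:00:00'),
    ('2011-05-20T12:00:00', '2011-05-21T00:00:00'),
    ('2011-05-22T00:00:00', '2011-05-22T12:00:00');

-- Same predicate as for whole days: one period ends where the next begins.
SELECT
    a.StartDate,
    a.EndDate,
    b.StartDate AS NextStart
FROM @HalfDays a
INNER JOIN @HalfDays b
    ON a.EndDate = b.StartDate;
```

Only the first two periods pair up here; the third starts a full day after the second ends.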
WITH Groups AS (
SELECT Grp = Row_Number() OVER (ORDER BY StartDate) / 2, *
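FROM
#Periods -- temp-table copy of the sample data (assumed to have the same columns as Periods)
CROSS JOIN (VALUES (0), (0)) X (Dup) -- duplicate each row so consecutive periods land in the same Grp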
), Ranges AS (
SELECT StartDate = Max(StartDate), EndDate = Min(EndDate)
FROM Groups
GROUP BY Grp
HAVING Max(StartDate) <> DateAdd(day, 1, Min(EndDate))
), ReGroups AS (
SELECT
Grp = Row_Number() OVER (ORDER BY StartDate) / 2,
StartDate,
EndDate
FROM
Ranges
CROSS JOIN (VALUES (0), (0)) X (Dup)
)
SELECT
StartDate = Min(StartDate),
EndDate = Max(EndDate)
FROM ReGroups
GROUP BY Grp
HAVING Count(*) = 2
;
Here is another version using windowing functions (which the previous query is, in effect, simulating):
WITH LeadLag AS (
SELECT
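-- Lag/Lead need SQL Server 2012 or later. The '00010101' sentinel needs a date or datetime2 column; datetime cannot represent dates before 1753.
PrevEndDate = Coalesce(Lag(EndDate) OVER (ORDER BY StartDate), '00010101'),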
NextStartDate = Coalesce(Lead(StartDate) OVER (ORDER BY StartDate), '99991231'),
*
FROM #Periods
), Dates AS (
SELECT
X.*
FROM
LeadLag
CROSS APPLY (
SELECT
StartDate = CASE WHEN DateAdd(day, 1, PrevEndDate) <> StartDate THEN StartDate ELSE NULL END,
EndDate = CASE WHEN DateAdd(day, 1, EndDate) <> NextStartDate THEN EndDate ELSE NULL END
) X
WHERE
X.StartDate IS NOT NULL
OR X.EndDate IS NOT NULL
), Final AS (
SELECT
StartDate,
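-- Order by Coalesce(StartDate, EndDate): ordering by EndDate alone sorts the NULL end dates first,
-- so every range-opening row would pick up the earliest EndDate instead of the next one.
EndDate = Min(EndDate) OVER (ORDER BY Coalesce(StartDate, EndDate) ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)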
FROM Dates
)
SELECT *
FROM Final
WHERE StartDate IS NOT NULL
;
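For comparison, here is a sketch of a different and commonly used windowing pattern (often called "gaps and islands"), not the query above: flag each row that starts a new range, then turn the flags into a group number with a running sum. It assumes SQL Server 2012 or later, the Periods table from the question with inclusive end dates, and periods that never overlap. The names Flagged, Islands, IsNewIsland, and Island are invented for the example.

```sql
WITH Flagged AS (
    SELECT
        StartDate,
        EndDate,
        -- 1 when this row does not directly follow the previous period.
        -- For the first row Lag() is NULL, the comparison is unknown,
        -- and the ELSE branch marks it as a new island.
        IsNewIsland = CASE
            WHEN DateAdd(day, 1, Lag(EndDate) OVER (ORDER BY StartDate)) = StartDate THEN 0
            ELSE 1
        END
    FROM Periods
), Islands AS (
    SELECT
        StartDate,
        EndDate,
        -- Running total of the flags gives every merged range its own number.
        Island = Sum(IsNewIsland) OVER (ORDER BY StartDate ROWS UNBOUNDED PRECEDING)
    FROM Flagged
)
SELECT
    StartDate = Min(StartDate),
    EndDate = Max(EndDate)
FROM Islands
GROUP BY Island
ORDER BY StartDate;
```

On the sample data this groups the eight rows into four islands, matching the expected result in the question.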
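Setting up the sample data took longer than writing the query. It would be better if you posted questions that include CREATE TABLE and INSERT/SELECT statements. I don't know what your table is called, so I've called mine Periods: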
create table Periods (
StartDate date not null,
EndDate date not null
)
go
insert into Periods(StartDate,EndDate)
select '19820302','19820930' union all
select '19821001','19850117' union all
select '19850626','19850726' union all
select '19850730','19911231' union all
select '19920101','19951231' union all
select '19960101','20040531' union all
select '20040605','20060131' union all
select '20060201','20110520'
go
; with MergedPeriods as (
Select p1.StartDate, p1.EndDate
from
Periods p1
left join
Periods p2
on
p1.StartDate = DATEADD(day,1,p2.EndDate)
where
p2.StartDate is null
union all
select p1.StartDate,p2.EndDate
from
MergedPeriods p1
inner join
Periods p2
on
p1.EndDate = DATEADD(day,-1,p2.StartDate)
)
select StartDate,MAX(EndDate) as EndDate
from MergedPeriods group by StartDate
Result:
StartDate EndDate
1982-03-02 1985-01-17
1985-06-26 1985-07-26
1985-07-30 2004-05-31
2004-06-05 2011-05-20