Nic*_*ico 10 performance sql-server-2008 sql-server-2008-r2 query-performance
In the main, I've got two kinds of time intervals:
presence time
and absence time
absence time
can be of different types (eg breaks, absences, special day and so on) and time intervals may overlap and / or intersect.
It is not for sure, that only plausible combinations of intervals exist in raw data, eg. overlapping presence-intervals don't make sense, but may exist. I've tried to identify resulting presence-time intervals in many ways now - for me, the most comfortable seems to be the follwing one.
;with "timestamps"
as
(
select
"id" = row_number() over ( order by "empId", "timestamp", "opening", "type" )
, "empId"
, "timestamp"
, "type"
, "opening"
from
(
select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
( select "empId", "starttime", "endtime", 1 as "type" from "worktime" ) as data
unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
union all
select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
( select "empId", "starttime", "endtime", 2 as "type" from "break" ) as data
unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
union all
select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
( select "empId", "starttime", "endtime", 3 as "type" from "absence" ) as data
unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
) as data
)
select
T1."empId"
, "starttime" = T1."timestamp"
, "endtime" = T2."timestamp"
from
"timestamps" as T1
left join "timestamps" as T2
on T2."empId" = T1."empId"
and T2."id" = T1."id" + 1
left join "timestamps" as RS
on RS."empId" = T2."empId"
and RS."id" <= T1."id"
group by
T1."empId", T1."timestamp", T2."timestamp"
having
(sum( power( 2, RS."type" ) * RS."opening" ) = 2)
order by
T1."empId", T1."timestamp";
Run Code Online (Sandbox Code Playgroud)
see SQL-Fiddle for some demo data.
The raw data exist in different tables in form of "starttime" - "endtime"
or "starttime" - "duration"
.
The idea was to get an ordered list of every timestamp with a "bitmasked" rolling sum of open intervals at each time to estimate presence time.
The fiddle works and gives estimated results, even if startimes of different intervals are equal. No indices are used in this example.
Is this the right way to achieve questioned task or is there a more elegant way for this?
If relevant for answering: amount of data will be up to several ten-thousand datasets per employee per table. sql-2012 is not available to calculate a rolling sum of predecessors inline in aggregate.
edit:
Just executed the query against larger amount of testdata ( 1000, 10.000, 100.000, 1 million) and can see that runtime increases exponentially. Obviously a warning flag, right?
I changed the query and removed aggregation of rolling sum by a quirky update.
I've added an auxiliary table:
create table timestamps
(
"id" int
, "empId" int
, "timestamp" datetime
, "type" int
, "opening" int
, "rolSum" int
)
create nonclustered index "idx" on "timestamps" ( "rolSum" ) include ( "id", "empId", "timestamp" )
Run Code Online (Sandbox Code Playgroud)
and I moved calculating rolling sum to this place:
declare @rolSum int = 0
update "timestamps" set @rolSum = "rolSum" = @rolSum + power( 2, "type" ) * "opening" from "timestamps"
Run Code Online (Sandbox Code Playgroud)
The runtime decreased to 3 sec regarding 1 million entries in the "worktime"-table.
Question stays the same: What's the most effective way to solve this?
我无法回答你关于绝对最佳方法的问题。但我可以提供一种不同的解决问题的方法,这可能会更好,也可能不会更好。它有一个相当平坦的执行计划,我认为它会表现良好。(我很想知道所以分享结果!)
我很抱歉使用我自己的语法风格而不是你的语法风格——当一切都在其通常的位置排列时,它有助于查询魔法出现。
该查询可在 SqlFiddle 中使用。我为 EmpID 1 添加了一个重叠部分,只是为了确保我已经涵盖了该内容。如果您最终发现存在数据中不会出现重叠,那么您可以删除最终的查询和计算Dense_Rank
。
WITH Points AS (
SELECT DISTINCT
T.EmpID,
P.TimePoint
FROM
(
SELECT * FROM dbo.WorkTime
UNION SELECT * FROM dbo.BreakTime
UNION SELECT * FROM dbo.Absence
) T
CROSS APPLY (VALUES (StartTime), (EndTime)) P (TimePoint)
), Groups AS (
SELECT
P.EmpID,
P.TimePoint,
Grp =
Row_Number()
OVER (PARTITION BY P.EmpID ORDER BY P.TimePoint, X.Which) / 2
FROM
Points P
CROSS JOIN (VALUES (1), (2)) X (Which)
), Ranges AS (
SELECT
G.EmpID,
StartTime = Min(G.TimePoint),
EndTime = Max(G.TimePoint)
FROM Groups G
GROUP BY
G.EmpID,
G.Grp
HAVING Count(*) = 2
), Presences AS (
SELECT
R.*,
P.Present,
Grp =
Dense_Rank() OVER (PARTITION BY R.EmpID ORDER BY R.StartTime)
- Dense_Rank() OVER (PARTITION BY R.EmpID, P.Present ORDER BY R.StartTime)
FROM
Ranges R
CROSS APPLY (
SELECT
CASE WHEN EXISTS (
SELECT *
FROM dbo.WorkTime W
WHERE
R.EmpID = W.EmpID
AND R.StartTime < W.EndTime
AND W.StartTime < R.EndTime
) AND NOT EXISTS (
SELECT *
FROM dbo.BreakTime B
WHERE
R.EmpID = B.EmpID
AND R.StartTime < B.EndTime
AND B.StartTime < R.EndTime
) AND NOT EXISTS (
SELECT *
FROM dbo.Absence A
WHERE
R.EmpID = A.EmpID
AND R.StartTime < A.EndTime
AND A.StartTime < R.EndTime
) THEN 1 ELSE 0 END
) P (Present)
)
SELECT
EmpID,
StartTime = Min(StartTime),
EndTime = Max(EndTime)
FROM Presences
WHERE Present = 1
GROUP BY
EmpID,
Grp
ORDER BY
EmpID,
StartTime;
Run Code Online (Sandbox Code Playgroud)
注意:如果合并三个表并添加一列来指示当前时间类型:工作、休息或缺勤,则此查询的性能将会得到提高。
您可能会问,为什么会有这么多 CTE?因为每一项都是我需要对数据做的事情所迫使的。有一个聚合,或者我需要在窗口函数上放置 WHERE 条件,或者在不允许使用窗口函数的子句中使用它。
现在我要出去看看我是否能想出另一种策略来实现这一目标。:)
为了好玩,我在这里添加了一个我为帮助解决问题而制作的“图表”:
------------
-----------------
---------------
-----------
--- ------ ------ ------------
---- ---- --- -------
Run Code Online (Sandbox Code Playgroud)
三组破折号(用空格分隔)按顺序表示:存在数据、缺席数据和所需结果。
归档时间: |
|
查看次数: |
1175 次 |
最近记录: |