query optimization: time intervals

Nic*_*ico 10 performance sql-server-2008 sql-server-2008-r2 query-performance

In the main, I've got two kinds of time intervals:

presence time and absence time

absence time can be of different types (eg breaks, absences, special day and so on) and time intervals may overlap and / or intersect.

It is not for sure, that only plausible combinations of intervals exist in raw data, eg. overlapping presence-intervals don't make sense, but may exist. I've tried to identify resulting presence-time intervals in many ways now - for me, the most comfortable seems to be the follwing one.

;with "timestamps"
as
(
    select
        "id" = row_number() over ( order by "empId", "timestamp", "opening", "type" )
        , "empId"
        , "timestamp"
        , "type"
        , "opening"
    from
    (
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 1 as "type" from "worktime" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
        union all
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 2 as "type" from "break" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
        union all
        select "empId", "timestamp", "type", case when "types" = 'starttime' then 1 else -1 end as "opening" from
        ( select "empId", "starttime", "endtime", 3 as "type" from "absence" ) as data
        unpivot ( "timestamp" for "types" in ( "starttime", "endtime" ) ) as pvt
    ) as data
)
select 
      T1."empId"
    , "starttime"   = T1."timestamp"
    , "endtime"     = T2."timestamp"
from 
    "timestamps" as T1
    left join "timestamps" as T2
        on T2."empId" = T1."empId"
        and T2."id" = T1."id" + 1
    left join "timestamps" as RS
        on RS."empId" = T2."empId"
        and RS."id" <= T1."id"      
group by
    T1."empId", T1."timestamp", T2."timestamp"
having
    (sum( power( 2, RS."type" ) * RS."opening" ) = 2)
order by 
    T1."empId", T1."timestamp";
Run Code Online (Sandbox Code Playgroud)

see SQL-Fiddle for some demo data.

The raw data exist in different tables in form of "starttime" - "endtime" or "starttime" - "duration".

The idea was to get an ordered list of every timestamp with a "bitmasked" rolling sum of open intervals at each time to estimate presence time.

The fiddle works and gives estimated results, even if startimes of different intervals are equal. No indices are used in this example.

Is this the right way to achieve questioned task or is there a more elegant way for this?

If relevant for answering: amount of data will be up to several ten-thousand datasets per employee per table. sql-2012 is not available to calculate a rolling sum of predecessors inline in aggregate.


edit:

Just executed the query against larger amount of testdata ( 1000, 10.000, 100.000, 1 million) and can see that runtime increases exponentially. Obviously a warning flag, right?

I changed the query and removed aggregation of rolling sum by a quirky update.

I've added an auxiliary table:

create table timestamps
(
  "id" int
  , "empId" int
  , "timestamp" datetime
  , "type" int
  , "opening" int
  , "rolSum" int
)

create nonclustered index "idx" on "timestamps" ( "rolSum" ) include ( "id", "empId", "timestamp" )
Run Code Online (Sandbox Code Playgroud)

and I moved calculating rolling sum to this place:

declare @rolSum int = 0
update "timestamps" set @rolSum = "rolSum" = @rolSum + power( 2, "type" ) * "opening" from "timestamps"
Run Code Online (Sandbox Code Playgroud)

see SQL-Fiddle here

The runtime decreased to 3 sec regarding 1 million entries in the "worktime"-table.

Question stays the same: What's the most effective way to solve this?

Eri*_*ikE 3

我无法回答你关于绝对最佳方法的问题。但我可以提供一种不同的解决问题的方法,这可能会更好,也可能不会更好。它有一个相当平坦的执行计划,我认为它会表现良好。(我很想知道所以分享结果!)

我很抱歉使用我自己的语法风格而不是你的语法风格——当一切都在其通常的位置排列时,它有助于查询魔法出现。

该查询可在 SqlFiddle 中使用。我为 EmpID 1 添加了一个重叠部分,只是为了确保我已经涵盖了该内容。如果您最终发现存在数据中不会出现重叠,那么您可以删除最终的查询和计算Dense_Rank

WITH Points AS (
  SELECT DISTINCT
    T.EmpID,
    P.TimePoint
  FROM
    (
      SELECT * FROM dbo.WorkTime
      UNION SELECT * FROM dbo.BreakTime
      UNION SELECT * FROM dbo.Absence
    ) T
    CROSS APPLY (VALUES (StartTime), (EndTime)) P (TimePoint)
), Groups AS (
  SELECT
    P.EmpID,
    P.TimePoint,
    Grp =
      Row_Number()
      OVER (PARTITION BY P.EmpID ORDER BY P.TimePoint, X.Which) / 2
  FROM
    Points P
    CROSS JOIN (VALUES (1), (2)) X (Which)
), Ranges AS (
  SELECT
    G.EmpID,
    StartTime = Min(G.TimePoint),
    EndTime = Max(G.TimePoint)
  FROM Groups G
  GROUP BY
    G.EmpID,
    G.Grp
  HAVING Count(*) = 2
), Presences AS (
  SELECT
    R.*,
    P.Present,
    Grp =
       Dense_Rank() OVER (PARTITION BY R.EmpID ORDER BY R.StartTime)
       - Dense_Rank() OVER (PARTITION BY R.EmpID, P.Present ORDER BY R.StartTime)
  FROM
    Ranges R
    CROSS APPLY (
      SELECT
        CASE WHEN EXISTS (
          SELECT *
          FROM dbo.WorkTime W
          WHERE
            R.EmpID = W.EmpID
            AND R.StartTime < W.EndTime
            AND W.StartTime < R.EndTime
        ) AND NOT EXISTS (
          SELECT *
          FROM dbo.BreakTime B
          WHERE
            R.EmpID = B.EmpID
            AND R.StartTime < B.EndTime
            AND B.StartTime < R.EndTime
        ) AND NOT EXISTS (
          SELECT *
          FROM dbo.Absence A
          WHERE
            R.EmpID = A.EmpID
            AND R.StartTime < A.EndTime
            AND A.StartTime < R.EndTime
        ) THEN 1 ELSE 0 END
    ) P (Present)
)
SELECT
  EmpID,
  StartTime = Min(StartTime),
  EndTime = Max(EndTime)
FROM Presences
WHERE Present = 1
GROUP BY
  EmpID,
  Grp
ORDER BY
  EmpID,
  StartTime;
Run Code Online (Sandbox Code Playgroud)

注意:如果合并三个表并添加一列来指示当前时间类型:工作、休息或缺勤,则此查询的性能将会得到提高。

您可能会问,为什么会有这么多 CTE?因为每一项都是我需要对数据做的事情所迫使的。有一个聚合,或者我需要在窗口函数上放置 WHERE 条件,或者在不允许使用窗口函数的子句中使用它。

现在我要出去看看我是否能想出另一种策略来实现这一目标。:)

为了好玩,我在这里添加了一个我为帮助解决问题而制作的“图表”:

------------
   -----------------
                ---------------
                           -----------

    ---    ------   ------       ------------

----   ----      ---      -------
Run Code Online (Sandbox Code Playgroud)

三组破折号(用空格分隔)按顺序表示:存在数据、缺席数据和所需结果。