SQL Server: finding gaps in employment - gaps and islands problem

Rek*_*eky 6 t-sql sql-server gaps-and-islands

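I have spent the last week going through Stack Overflow trying to solve this, and I still can't find a workable solution, so I was wondering whether anyone could offer some help or advice?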

Explanation of the data structure

I have the following tables:

A position table (zz_position) that holds the position details (position ID), including the date range over which the position is valid.

PosNo   Description                Date_From    Date_To 
---------------------------------------------------------
10001   System Administrator       20170101     20231231

A resource table (zz_resource) that holds the details of the resources (employees), including the dates they joined and left the company.

resID   description  date_from   date_to
------------------------------------------
100     Sam          20160101    20991231
101     Joe          20150101    20991231 

An employment table (zz_employment) that links positions to resources over a date range.

PosNo    resID       Date_From   Date_To     seqNo
---------------------------------------------------
10001    100         20180101    20180401    1
10001    101         20180601    20191231    2
10001    100         20200101    20991231    3

The problem

Because people change positions, a post may go unfilled for a period of time. What I want is a report that tells me the status of a post at any point in time.

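I know I could use a calendar table to generate a fully mapped calendar, but what I want is a report that produces the data in the following aggregated format: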

PosNo   resID      Date_From   Date_To    seqNo
-------------------------------------------------
10001   NULL       20170101    20171231   0
10001   100        20180101    20180401   1
10001   NULL       20180402    20180530   0
10001   101        20180601    20191231   2
10001   100        20200101    20231231   3



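(Note how the report has taken the lines in the table and produced a complete life of the employment, with the Date_From of the first NULL row pulled from the position start date and the Date_To of the last row pulled from the position end date.)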

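Ideally I would like this to be a view/function, but given the complexity I am more than happy to have a series of T-SQL statements that I can run nightly as part of a data warehouse routine.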

Rules

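  • All dates are datetime values truncated to the day, so date_to refers to the day something ends rather than the date and time it ends
  • If a position/employment/resource has no end date, it is represented as 20991231
  • If the employment itself is open-ended, the date in the employment table is represented as 20991231, even though the position itself may end on 20231231. Ideally I would like the result to respect the position end date.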

SQL code:

CREATE TABLE zz_position  
(
     posNo varchar(25) NOT NULL,  
     description varchar(25) NOT NULL,  
     date_from datetime NULL,  
     date_to datetime NULL
) 

insert into zz_position 
values ('10001', 'System Administrator', '2017-01-01 00:00:00.000', '2020-12-31 00:00:00.000')
go

CREATE TABLE zz_resource
(
     resID varchar(25) NOT NULL,  
     description varchar(25) NOT NULL,  
     date_from datetime NULL,  
     date_to datetime NULL
)  

insert into zz_resource 
values ('100', 'Sam', '2016-01-01 00:00:00.000', '2099-12-31 00:00:00.000'),
       ('101', 'Joe', '2015-01-01 00:00:00.000', '2099-12-31 00:00:00.000')
go

CREATE TABLE zz_employment
(
      posNo varchar(25) NOT NULL,  
      resID varchar(25) NOT NULL,  
      date_from datetime NULL,  
      date_to datetime NULL,
      seqNo int NULL
)  

insert into zz_employment 
values ('10001', '100', '2018-01-01 00:00:00.000', '2018-04-01 00:00:00.000', 1),
       ('10001', '101', '2018-06-01 00:00:00.000', '2019-12-31 00:00:00.000', 2),
       ('10001', '100', '2020-01-01 00:00:00.000', '2099-12-31 00:00:00.000', 3)

EzL*_*zLo 2

There are two considerations for this problem:

  • A calendar table.
  • A way to correctly group the unemployed periods when there are employed periods in between.

The following solution uses a calendar table (SQL included) and a DATEDIFF() anchor-date trick to handle the second point.

Full DB Fiddle here.

Solution (explanation below):

;WITH AllPositionDates AS
(
    SELECT
        T.posNo,
        C.GeneratedDate
    FROM
        zz_position AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
AllEmployedDates AS
(
    SELECT
        T.posNo,
        T.resID,
        T.seqNo,
        C.GeneratedDate
    FROM
        zz_employment AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
PositionsByEmployed AS
(
    SELECT
        P.posNo,
        P.GeneratedDate,
        E.resID,
        E.seqNo,
        NullRowNumber = ROW_NUMBER() OVER (
            PARTITION BY
                P.posNo,
                CASE WHEN E.posNo IS NULL THEN 1 ELSE 2 END
            ORDER BY
                P.GeneratedDate ASC)
    FROM
        AllPositionDates AS P
        LEFT JOIN AllEmployedDates AS E ON
            P.posNo = E.posNo AND
            P.GeneratedDate = E.GeneratedDate
)
SELECT
    P.posNo,
    P.resID,
    Date_From = MIN(P.GeneratedDate),
    Date_To = MAX(P.GeneratedDate),
    seqNo = ISNULL(P.seqNo, 0)
FROM
    PositionsByEmployed AS P
GROUP BY
    P.posNo,
    P.resID,
    P.seqNo,
    CASE WHEN P.resId IS NULL THEN P.NullRowNumber - DATEDIFF(DAY, '2000-01-01', P.GeneratedDate) END -- GroupingValue
ORDER BY
    P.posNo,
    Date_From,
    Date_To

Result:

posNo   resID   Date_From   Date_To     seqNo
10001   NULL    2017-01-01  2017-12-31  0
10001   100     2018-01-01  2018-04-01  1
10001   NULL    2018-04-02  2018-05-31  0
10001   101     2018-06-01  2019-12-31  2
10001   100     2020-01-01  2020-12-31  3
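
Note that the last row ends on 2020-12-31, the position's end date in the sample script, even though the employment row is open-ended (2099-12-31): only days inside the position's range survive the join through AllPositionDates. If only that clamping were needed, without the vacant rows, a minimal sketch (not part of the solution above, and assuming every employment row has a matching position) might look like this:

```sql
-- Sketch: clamp each employment period to its position's valid range.
-- Assumes the zz_employment and zz_position tables from the question.
SELECT
    E.posNo,
    E.resID,
    -- Start no earlier than the position starts
    Date_From = CASE WHEN E.date_from < P.date_from THEN P.date_from ELSE E.date_from END,
    -- End no later than the position ends (this trims 2099-12-31)
    Date_To = CASE WHEN E.date_to > P.date_to THEN P.date_to ELSE E.date_to END,
    E.seqNo
FROM
    zz_employment AS E
    INNER JOIN zz_position AS P ON P.posNo = E.posNo
WHERE
    -- Keep only employment periods that overlap the position at all
    E.date_from <= P.date_to AND
    E.date_to >= P.date_from
```

With the sample data this leaves the first two employment rows unchanged and trims the third to end on 2020-12-31.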

Explanation

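First, create the calendar table. It holds 1 row per day and, in this example, is limited to the range between the first and last day of the positions: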

DECLARE @DateStart DATE = (SELECT MIN(P.date_from) FROM zz_position AS P)
DECLARE @DateEnd DATE = (SELECT(MAX(P.date_to)) FROM zz_position AS P)

;WITH GeneratedDates AS
(
    SELECT
        GeneratedDate = @DateStart

    UNION ALL

    SELECT
        GeneratedDate = DATEADD(DAY, 1, G.GeneratedDate)
    FROM
        GeneratedDates AS G
    WHERE
        DATEADD(DAY, 1, G.GeneratedDate) <= @DateEnd
)
SELECT
    DateID = IDENTITY(INT, 1, 1),
    G.GeneratedDate
INTO
    Calendar
FROM
    GeneratedDates AS G
OPTION
    (MAXRECURSION 0)

This generates the following (up to 2020-12-31, which is the maximum date in the sample data):

DateID  GeneratedDate
1       2017-01-01
2       2017-01-02
3       2017-01-03
4       2017-01-04
5       2017-01-05
6       2017-01-06
7       2017-01-07
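
Everything that follows relies on the calendar holding every day in the range exactly once, so it may be worth checking that before going further. A quick sanity check, assuming the Calendar table created above, could be sketched as:

```sql
-- Sketch: all three values should be equal if the calendar has
-- no duplicate dates and no missing days between its first and last date.
SELECT
    TotalRows     = COUNT(*),
    DistinctDates = COUNT(DISTINCT C.GeneratedDate),
    ExpectedDays  = DATEDIFF(DAY, MIN(C.GeneratedDate), MAX(C.GeneratedDate)) + 1
FROM
    Calendar AS C
```

If TotalRows is larger than DistinctDates there are duplicates, and if DistinctDates is smaller than ExpectedDays there are missing days.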

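Now we use a join with BETWEEN to "spread" the position periods and the employment periods (in separate CTEs), so that we get 1 row per day for each position/employee.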

-- AllPositionDates
SELECT
    T.posNo,
    C.GeneratedDate
FROM
    zz_position AS T
    INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to

-- AllEmployedDates
SELECT
    T.posNo,
    T.resID,
    T.seqNo,
    C.GeneratedDate
FROM
    zz_employment AS T
    INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to

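With these, we join them together by position and date using a LEFT JOIN, so that we get all the dates of each position along with the matching employee (if one exists). We also compute a row number over all the NULL rows of each position, which will be used later. Note that within its partition this row number increases by 1 with each subsequent date.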

;WITH AllPositionDates AS
(
    SELECT
        T.posNo,
        C.GeneratedDate
    FROM
        zz_position AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
),
AllEmployedDates AS
(
    SELECT
        T.posNo,
        T.resID,
        T.seqNo,
        C.GeneratedDate
    FROM
        zz_employment AS T
        INNER JOIN Calendar AS C ON C.GeneratedDate BETWEEN T.date_from AND T.date_to
)
-- PositionsByEmployed
SELECT
    P.posNo,
    P.GeneratedDate,
    E.resID,
    E.seqNo,
    NullRowNumber = ROW_NUMBER() OVER (
        PARTITION BY
            P.posNo,
            CASE WHEN E.posNo IS NULL THEN 1 ELSE 2 END
        ORDER BY
            P.GeneratedDate ASC)
FROM
    AllPositionDates AS P
    LEFT JOIN AllEmployedDates AS E ON
        P.posNo = E.posNo AND
        P.GeneratedDate = E.GeneratedDate

Now the tricky part. If we compute the number of days between a hardcoded date and each day, we get a similar "row number" that increases by 1 with each date.

SELECT
    P.posNo,
    P.GeneratedDate,
    DateDiff = DATEDIFF(DAY, '2000-01-01', P.GeneratedDate),
    P.NullRowNumber
FROM
    PositionsByEmployed AS P -- Declared in the WITH clause (see the full solution)
ORDER BY
    P.posNo,
    P.GeneratedDate

We get the following:

posNo   GeneratedDate   DateDiff    NullRowNumber
10001   2017-01-01      6210        1
10001   2017-01-02      6211        2
10001   2017-01-03      6212        3
10001   2017-01-04      6213        4
10001   2017-01-05      6214        5
10001   2017-01-06      6215        6
10001   2017-01-07      6216        7
10001   2017-01-08      6217        8
10001   2017-01-09      6218        9

If we add another column with the difference between these two, you will see that the value stays constant:

SELECT
    P.posNo,
    P.GeneratedDate,
    DateDiff = DATEDIFF(DAY, '2000-01-01', P.GeneratedDate),
    P.NullRowNumber,
    GroupingValue = P.NullRowNumber - DATEDIFF(DAY, '2000-01-01', P.GeneratedDate)
FROM
    PositionsByEmployed AS P
ORDER BY
    P.posNo,
    P.GeneratedDate

We get:

posNo   GeneratedDate   DateDiff    NullRowNumber   GroupingValue
10001   2017-01-01      6210        1               -6209
10001   2017-01-02      6211        2               -6209
10001   2017-01-03      6212        3               -6209
10001   2017-01-04      6213        4               -6209
10001   2017-01-05      6214        5               -6209
10001   2017-01-06      6215        6               -6209
10001   2017-01-07      6216        7               -6209
10001   2017-01-08      6217        8               -6209
10001   2017-01-09      6218        9               -6209
10001   2017-01-10      6219        10              -6209

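However, if we scroll down to the next rows where the employee is NULL (these share a partition because of the E.posNo check in the PARTITION BY of ROW_NUMBER()), we will see that the difference changes: ROW_NUMBER() keeps increasing by 1, while DATEDIFF jumps because there were employed days in between: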

posNo   GeneratedDate   DateDiff    NullRowNumber   GroupingValue
10001   2017-12-28      6571        362             -6209
10001   2017-12-29      6572        363             -6209
10001   2017-12-30      6573        364             -6209
10001   2017-12-31      6574        365             -6209
...
10001   2018-04-02      6666        366             -6300
10001   2018-04-03      6667        367             -6300
10001   2018-04-04      6668        368             -6300
10001   2018-04-05      6669        369             -6300
10001   2018-04-06      6670        370             -6300
10001   2018-04-07      6671        371             -6300

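Use this "GroupingValue" as an additional GROUP BY expression to correctly separate the vacant intervals of a position that lie on either side of an employed interval.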
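
To see the grouping trick in isolation, here is a small self-contained sketch that needs none of the tables above. The five dates are hypothetical vacant days chosen for illustration (two at the end of 2017 and three in April 2018), and the DaysInRun column is an extra added only for this demonstration:

```sql
-- Sketch: ROW_NUMBER() minus DATEDIFF() is constant within a run of
-- consecutive dates, so grouping by it collapses each run into one row.
;WITH VacantDates AS
(
    -- Hypothetical sample: two separate runs of consecutive vacant days
    SELECT
        GeneratedDate = CONVERT(DATE, V.GeneratedDate)
    FROM
        (VALUES ('2017-12-30'), ('2017-12-31'),
                ('2018-04-02'), ('2018-04-03'), ('2018-04-04')) AS V (GeneratedDate)
),
Numbered AS
(
    SELECT
        D.GeneratedDate,
        GroupingValue = ROW_NUMBER() OVER (ORDER BY D.GeneratedDate)
                        - DATEDIFF(DAY, '2000-01-01', D.GeneratedDate)
    FROM
        VacantDates AS D
)
SELECT
    Date_From = MIN(N.GeneratedDate),
    Date_To = MAX(N.GeneratedDate),
    DaysInRun = COUNT(*)
FROM
    Numbered AS N
GROUP BY
    N.GroupingValue
ORDER BY
    Date_From
```

This returns two rows: 2017-12-30 to 2017-12-31 (2 days) and 2018-04-02 to 2018-04-04 (3 days). The anchor date '2000-01-01' is arbitrary; any fixed date gives the same grouping, since only the change in the difference between runs matters.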