How to get distinct non-overlapping intervals from a PostgreSQL table?

ERJ*_*JAN 8 time range-types postgresql-9.6

Using PostgreSQL 9.6.

The table holds user sessions, and I need to print the distinct, non-overlapping sessions.

CREATE TABLE SESSIONS(
            id serial NOT NULL PRIMARY KEY, 
            ctn INT NOT NULL, 
            day DATE NOT NULL,
            f_time TIME(0) NOT NULL,
            l_time TIME(0) NOT  NULL
        );     
    INSERT INTO SESSIONS(id, ctn, day, f_time, l_time)
    VALUES
    (1, 707, '2019-06-18', '10:48:25', '10:56:17'),
    (2, 707, '2019-06-18', '10:48:33', '10:56:17'),
    (3, 707, '2019-06-18', '10:53:17', '11:00:49'),
    (4, 707, '2019-06-18', '10:54:31', '10:57:37'),
    (5, 707, '2019-06-18', '11:03:59', '11:10:39'),
    (6, 707, '2019-06-18', '11:04:41', '11:08:02'),
    (7, 707, '2019-06-18', '11:11:04', '11:19:39');

SQL Fiddle

id  ctn day         f_time      l_time
1   707 2019-06-18  10:48:25    10:56:17
2   707 2019-06-18  10:48:33    10:56:17
3   707 2019-06-18  10:53:17    11:00:49
4   707 2019-06-18  10:54:31    10:57:37
5   707 2019-06-18  11:03:59    11:10:39
6   707 2019-06-18  11:04:41    11:08:02
7   707 2019-06-18  11:11:04    11:19:39

Now I need the distinct, non-overlapping user sessions, so it should give me:

1.  start_time: 10:48:25  end_time: 11:00:49  duration: 12min,24 sec
2.  start_time: 11:03:59  end_time: 11:10:39  duration: 6min,40 sec
3.  start_time: 11:11:04  end_time: 11:19:39  duration: 8min,35 sec

Vér*_*ace 9

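To solve this problem, I did the following:

The "easy" explanation:

For this part, I added slightly to the table definition supplied by the OP. I am a firm believer that DDL should be used to the greatest extent possible to "guide" the whole database programming process, and it could be more powerful still: one example is SQL in CHECK constraints, which so far is only provided by Firebird (example here) and H2 (see reference here).

That is all well and good, but we have to work with the capabilities of PostgreSQL 9.6, the OP's version. My adjusted DDL for the "easy" explanation (see the entire fiddle here):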

CREATE TABLE sessions
(
        id serial NOT NULL PRIMARY KEY, 
        ctn INT NOT NULL, 
        f_day DATE NOT NULL,
        f_time TIME(0) NOT NULL,
        l_time TIME(0) NOT  NULL,
        CONSTRAINT ft_less_than_lt_ck CHECK (f_time < l_time),
        CONSTRAINT ctn_f_day_f_time_uq UNIQUE (ctn, f_day, f_time),
        CONSTRAINT ctn_f_day_l_time_uq UNIQUE (ctn, f_day, l_time)
        -- could put in a DISTINCT somewhere if you don't have these constraints
        -- maybe has TIME(2) - but see complex solution
);

Indexes:

CREATE INDEX ctn_ix ON sessions USING BTREE (ctn ASC);
CREATE INDEX f_day_ix ON sessions USING BTREE (f_day ASC);
CREATE INDEX f_time_ix ON sessions USING BTREE (f_time ASC);

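One point to note: do not use SQL keywords as table or column names, and day is such a keyword! It can be confusing when debugging and so on, and it is simply not good practice. I have changed your original field name day to f_day; note the all lower case and snake_case! Whatever you do, have a standard method of naming variables and stick to it; there are many coding-standards documents out there.

The change to f_day has no effect on the rest of the SQL, since we are not taking into account sessions that span midnight. These could be accounted for relatively easily by doing the following (see the fiddle).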

SELECT (f_day + f_time)::TIMESTAMP FROM sessions;

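With the advent of GENERATED columns (PostgreSQL 12 and later, so not available in 9.6), you don't even have to worry about this: just have a GENERATED field defined with the expression above!

If a constraint at one-second resolution is not feasible (simultaneous logins), you could use TIME(2) (or 3..6) to ensure uniqueness. If [you don't want | can't have] UNIQUE constraints, you could put a DISTINCT in your SQL to deal with identical login and logout times, although these are unlikely.

The fact remains that some simple DDL like this greatly simplifies your subsequent SQL (see the discussion at the end of the "complex" explanation below).

You may also want to include ctn and/or f_day in your UNIQUE constraints, as shown. I have also added what I think may be good indexes! You may also wish to investigate the OVERLAPS operator.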
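
As a rough illustration of the OVERLAPS operator mentioned above, the sketch below first tests two literal pairs of times taken from the sample data, then self-joins sessions to list the pairs of sessions that overlap. It assumes the adjusted sessions table (with f_day); the aliases a and b and the output column names are only for illustration. Note that OVERLAPS treats each period as half-open (start <= time < end), so sessions that merely touch are not reported as overlapping.

```sql
-- Two sessions from the sample data that overlap, and two that do not.
SELECT (TIME '10:48:25', TIME '10:56:17') OVERLAPS
       (TIME '10:53:17', TIME '11:00:49') AS first_pair,   -- true
       (TIME '11:03:59', TIME '11:10:39') OVERLAPS
       (TIME '11:11:04', TIME '11:19:39') AS second_pair;  -- false

-- List every pair of overlapping sessions for the same ctn and day.
SELECT a.id AS id_a, b.id AS id_b
FROM sessions a
JOIN sessions b
  ON  a.ctn   = b.ctn
  AND a.f_day = b.f_day
  AND a.id    < b.id            -- report each pair only once
  AND (a.f_time, a.l_time) OVERLAPS (b.f_time, b.l_time)
ORDER BY a.id, b.id;
```

This only finds pairwise overlaps; merging chains of overlapping sessions into intervals still needs the window-function steps described in the rest of this answer.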

As for the sample data, I also added a few records to test my solution, as follows:

INSERT INTO sessions (id, ctn, f_day, f_time, l_time)
VALUES
( 1, 707, '2019-06-18', '10:48:25', '10:56:17'), 
( 2, 707, '2019-06-18', '10:53:17', '11:00:49'),
( 3, 707, '2019-06-18', '10:54:31', '10:59:43'),  -- record 3 is completely covered 
                                                  -- by record 2

( 4, 707, '2019-06-18', '11:03:59', '11:10:39'), 
( 5, 707, '2019-06-18', '11:04:41', '11:08:02'), -- GROUP 2 record 4 completely
                                                 -- covers record 5
                                                 
( 6, 707, '2019-06-18', '11:11:04', '11:19:39'), -- GROUP 3

( 7, 707, '2019-06-18', '12:15:15', '13:13:13'),
( 8, 707, '2019-06-18', '13:04:41', '13:20:02'), 
( 9, 707, '2019-06-18', '13:17:17', '13:22:22'), -- GROUP 4

(10, 707, '2019-06-18', '14:05:17', '14:14:14'); -- GROUP 5

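I'll go through my logic step by step. That is perhaps good for you, but it is also good for me, as it helps me clarify my thinking and ensures that the lessons I've learned from this exercise will stay with me: "I hear and I forget. I see and I remember. I do and I understand." - Confucius

Everything that follows is included in the fiddle.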

/**

So, the desired result is:

Interval 1 - start: 10:48:25 - end 11:00:49
Interval 2 - start: 11:03:59 - end 11:10:39 
Interval 3 - start: 11:11:04 - end 11:19:39
Interval 4 - start: 12:15:15 - end 13:22:22
Interval 5 - start: 14:05:17 - end 14:14:14

**/

The first step is to use the LAG function (documentation) as follows:

SELECT 
  s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
  CASE
    WHEN LAG(s.l_time) OVER (ORDER BY s.f_time) > s.f_time THEN 0
    ELSE 1
  END AS ovl
FROM sessions s

Result:

id  ctn     ft  lt  ovl
1   707     10:48:25    10:56:17    1
2   707     10:53:17    11:00:49    0
3   707     10:54:31    10:59:43    0
4   707     11:03:59    11:10:39    1
5   707     11:04:41    11:08:02    0
6   707     11:11:04    11:19:39    1
7   707     12:15:15    13:13:13    1
8   707     13:04:41    13:20:02    0
9   707     13:17:17    13:22:22    0
10  707     14:05:17    14:14:14    1

So, every time a new interval starts, there is a 1 in the ovl (overlap) column.
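
One caveat, shown here as a hedged sketch rather than as part of the original solution: LAG only looks at the immediately preceding session, so a long session that covers several shorter ones can cause a false "new interval" flag. Comparing against the running maximum of all earlier end times avoids that. The demo rows and the column names ovl_lag and ovl_running_max below are invented for illustration and do not come from the OP's data.

```sql
WITH demo (ft, lt) AS (
  VALUES (TIME '10:00:00', TIME '11:00:00'),  -- long session
         (TIME '10:10:00', TIME '10:20:00'),  -- inside the long session
         (TIME '10:30:00', TIME '10:40:00')   -- also inside the long session
)
SELECT
  ft, lt,
  -- compares with the previous row only
  CASE
    WHEN LAG(lt) OVER (ORDER BY ft) > ft THEN 0
    ELSE 1
  END AS ovl_lag,
  -- compares with the greatest end time seen on any earlier row
  CASE
    WHEN MAX(lt) OVER (ORDER BY ft
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) > ft THEN 0
    ELSE 1
  END AS ovl_running_max
FROM demo
ORDER BY ft;
-- ovl_lag:         1, 0, 1  (third row wrongly starts a new interval)
-- ovl_running_max: 1, 0, 0
```

With the sample data in this answer both expressions give the same flags, so the difference only matters for data shaped like the demo rows.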

Next, we take the cumulative SUM of these 1s as follows:

SELECT 
  t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
  SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                          AND CURRENT ROW) AS s
FROM
(
  SELECT 
    s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
    CASE
      WHEN LAG(s.l_time) OVER (ORDER BY s.f_time) > s.f_time THEN 0
      ELSE 1
    END AS ovl
  FROM sessions s
) AS t1
ORDER BY lt, id

Result:

id  ctn     ft  lt  ovl     s
1   707     10:48:25    10:56:17    1   1
3   707     10:54:31    10:59:43    0   1
2   707     10:53:17    11:00:49    0   1
5   707     11:04:41    11:08:02    0   2
4   707     11:03:59    11:10:39    1   2
6   707     11:11:04    11:19:39    1   3
7   707     12:15:15    13:13:13    1   4
8   707     13:04:41    13:20:02    0   4
9   707     13:17:17    13:22:22    0   4
10  707     14:05:17    14:14:14    1   5

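So, we have now "split" the data and have a way of telling our intervals apart: each interval has a different value of s, 1..5.

So, now we want the lowest f_time and the highest l_time for each of these intervals. My first attempt, using MAX() and MIN(), went as follows: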

SELECT 
  ROW_NUMBER() OVER (PARTITION BY s ORDER BY ft, lt) AS rn,
  MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt) AS min_f, 
  MAX(lt) OVER (PARTITION BY s ORDER BY ft, lt) AS max_l,
  s
FROM
(
  SELECT 
    t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
    SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                           AND CURRENT ROW) AS s
  FROM
  (
    SELECT 
      s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
      CASE
        WHEN LAG(s.l_time) OVER (ORDER BY s.f_time) > s.f_time THEN 0
        ELSE 1
      END AS ovl
    FROM sessions s
  ) AS t1
  ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f;

Result:

rn  min_f   max_l   s
1   10:48:25    10:56:17    1
2   10:48:25    11:00:49    1
3   10:48:25    11:00:49    1
1   11:03:59    11:10:39    2
2   11:03:59    11:10:39    2
1   11:11:04    11:19:39    3
1   12:15:15    13:13:13    4
2   12:15:15    13:20:02    4
3   12:15:15    13:22:22    4
1   14:05:17    14:14:14    5

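Notice that we would have to pick rn = 3 for the first interval, rn = 3 for the fourth interval, and different values of rn for different intervals. If an interval were made up of 7 sub-intervals, we would have to retrieve rn = 7. This had me stumped for a while!

Then the power of window functions comes into play: if you sort MAX() and MIN() differently, the correct result drops into our laps: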

SELECT 
  ROW_NUMBER() OVER (PARTITION BY s ORDER BY ft, lt DESC) AS rn,
  MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
  MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
  s
FROM
(
  SELECT 
    t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
    SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                           AND CURRENT ROW) AS s
  FROM
  (
    SELECT 
      s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
      CASE
        WHEN LAG(s.l_time) OVER (ORDER BY s.f_time) > s.f_time THEN 0
        ELSE 1
      END AS ovl
    FROM sessions s
  ) AS t1
  ORDER BY id, lt
)AS t2
ORDER BY s, rn ASC, min_f;

Result:

rn  min_f   max_l   s
1   10:48:25    11:00:49    1
2   10:48:25    11:00:49    1
3   10:48:25    10:59:43    1
1   11:03:59    11:10:39    2
2   11:03:59    11:08:02    2
1   11:11:04    11:19:39    3
1   12:15:15    13:22:22    4
2   12:15:15    13:22:22    4
3   12:15:15    13:22:22    4
1   14:05:17    14:14:14    5

Note that rn = 1 is now always the record we want. This is a result of the following:

  MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
  MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,

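Note that MIN() is sorted by ft, lt DESC, while MAX() (partitioned by interval, i.e. s) is sorted by ft DESC, lt. Because a window with an ORDER BY only looks at rows up to the current one by default, MIN(ft) sorted by ascending ft is already the interval's minimum on every row, whereas MAX(lt) sorted by descending ft only reaches the interval's maximum on the row with the smallest ft. That row therefore pairs the smallest ft with the largest lt, which is exactly what we want.

This is basically our desired result: just add a bit of tidying up and formatting as per the OP's request and we're good to go. This part also demonstrates another very useful window function, ROW_NUMBER().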

SELECT 
  ROW_NUMBER() OVER (ORDER BY t3.s) AS "Interval No.", 
  ' Start time: ' AS " ",
  t3.min_f AS "Interval start" , 
  ' End time: ' AS " ",
  t3.max_l AS "Interval stop", 
  ' Duration: ' AS " ",
  (t3.max_l - t3.min_f) AS "Duration"
FROM
(
  SELECT 
    ROW_NUMBER() OVER (PARTITION BY s ORDER BY ft, lt DESC) AS rn,
    MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
    MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
    s
  FROM
  (
    SELECT 
      t1.id, t1.ctn, t1.ft, t1.lt, t1.ovl,
      SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                             AND CURRENT ROW) AS s
    FROM
    (
      SELECT 
        s.id AS id, s.ctn AS ctn, s.f_time AS ft, s.l_time AS lt, 
        CASE
          WHEN LAG(s.l_time) OVER (ORDER BY s.f_time) > s.f_time THEN 0
          ELSE 1
        END AS ovl
      FROM sessions s
    ) AS t1
    ORDER BY id, lt
  )AS t2
  ORDER BY s, rn ASC, min_f
) AS t3 
WHERE t3.rn = 1
ORDER BY t3.s;

Final result:

Interval No.        Interval start      Interval stop       Duration
1    Start time:    10:48:25     End time:  11:00:49     Duration:  00:12:24
2    Start time:    11:03:59     End time:  11:10:39     Duration:  00:06:40
3    Start time:    11:11:04     End time:  11:19:39     Duration:  00:08:35
4    Start time:    12:15:15     End time:  13:22:22     Duration:  01:07:07
5    Start time:    14:05:17     End time:  14:14:14     Duration:  00:08:57

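I can't vouch for the performance of this query if there are a large number of records; see the result of EXPLAIN (ANALYZE, BUFFERS) at the end of the fiddle. However, I'm assuming that since the output is in a report-style format, it would be run for a given value of ctn and/or f_day, i.e. not too many records?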
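
For comparison, here is a shorter variant as a sketch only; it is not the query from the fiddle. It swaps the rn = 1 trick for a plain GROUP BY on the interval number, uses a running MAX of earlier end times instead of LAG, and partitions by ctn and f_day so that several users or days can be handled in one pass. The names flagged, grouped, grp, interval_no, interval_start, interval_stop and duration are my own. Like the queries above, it assumes the adjusted sessions table and sessions that do not span midnight.

```sql
SELECT
  ctn,
  f_day,
  ROW_NUMBER() OVER (PARTITION BY ctn, f_day ORDER BY MIN(ft)) AS interval_no,
  MIN(ft)           AS interval_start,
  MAX(lt)           AS interval_stop,
  MAX(lt) - MIN(ft) AS duration
FROM
(
  SELECT
    ctn, f_day, ft, lt,
    -- running count of "new interval" flags = interval number
    SUM(ovl) OVER (PARTITION BY ctn, f_day ORDER BY ft, lt) AS grp
  FROM
  (
    SELECT
      s.ctn, s.f_day, s.f_time AS ft, s.l_time AS lt,
      CASE
        -- 0 when any earlier session of the same ctn/day is still open
        WHEN MAX(s.l_time) OVER (PARTITION BY s.ctn, s.f_day
                                 ORDER BY s.f_time, s.l_time
                                 ROWS BETWEEN UNBOUNDED PRECEDING
                                          AND 1 PRECEDING) > s.f_time
        THEN 0
        ELSE 1
      END AS ovl
    FROM sessions s
  ) AS flagged
) AS grouped
GROUP BY ctn, f_day, grp
ORDER BY ctn, f_day, interval_start;
```

Because the aggregation is per group, rows with duplicate start or end times need no special treatment, so this form should also work without the UNIQUE constraints. I have not benchmarked it against the query above.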

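The "complex" explanation:

I won't show every step here: once the duplicate f_times and l_times have been eliminated, the steps are identical.

Here, the table definition and data are slightly different (fiddle available here):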

CREATE TABLE sessions
(
        id serial NOT NULL PRIMARY KEY, 
        ctn INT NOT NULL, 
        f_day DATE NOT NULL,
        f_time TIME(0) NOT NULL,
        l_time TIME(0) NOT  NULL,
        CONSTRAINT ft_lt_lt CHECK (f_time < l_time),
        -- CONSTRAINT ft_uq UNIQUE (f_time),
        -- CONSTRAINT lt_uq UNIQUE (l_time)
        CONSTRAINT ft_lt_uq UNIQUE(f_time, l_time) 
        -- could put in a DISTINCT somewhere to counter this possibility or
        -- maybe have TIME(2) to ensure no duplicates?
);

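The only constraints I have kept are CHECK (f_time < l_time) (it can't be any other way) and UNIQUE (f_time, l_time) (possibly adding f_day and/or ctn to it). The remarks about TIME(2) or (3..6) also apply.

I leave it to the reader to extend the UNIQUE constraint to combinations with ctn and f_day as applicable!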
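
One possible way of doing that, as a sketch: replace the two-column constraint with one that also covers ctn and f_day. The constraint name ctn_f_day_ft_lt_uq is made up for this example; ft_lt_uq is the name used in the DDL above.

```sql
-- Drop the (f_time, l_time) constraint from the "complex" DDL ...
ALTER TABLE sessions DROP CONSTRAINT ft_lt_uq;

-- ... and make the combination unique per ctn and day instead.
ALTER TABLE sessions
  ADD CONSTRAINT ctn_f_day_ft_lt_uq UNIQUE (ctn, f_day, f_time, l_time);
```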

INSERT INTO sessions (id, ctn, f_day, f_time, l_time)
VALUES
( 1, 707, '2019-06-18', '10:48:25', '10:56:17'), -- note - same l_times
( 2, 707, '2019-06-18', '10:48:33', '10:56:17'), -- need one with lowest f_time
( 3, 707, '2019-06-18', '10:53:17', '11:00:49'),
( 4, 707, '2019-06-18', '10:54:31', '10:59:43'), -- note - same f_times
                                                 -- need one with greatest l_time
( 5, 707, '2019-06-18', '10:54:31', '10:57:37'), -- GROUP 1

( 6, 707, '2019-06-18', '11:03:59', '11:10:39'), 
( 7, 707, '2019-06-18', '11:04:41', '11:08:02'), -- GROUP 2, record 6 completely
                                                 -- covers record 7
( 8, 707, '2019-06-18', '11:11:04', '11:19:39'), -- GROUP 3

( 9, 707, '2019-06-18', '12:15:15', '13:13:13'),
(10, 707, '2019-06-18', '13:04:41', '13:20:02'), 
(11, 707, '2019-06-18', '13:17:17', '13:22:22'), -- GROUP 4

(12, 707, '2019-06-18', '14:05:17', '14:14:14'); -- GROUP 5

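I have added a couple of potentially "troublesome" records that share an f_time or an l_time with another record in the same interval (ids 2 and 5 here, which are ids 2 and 4 in the OP's data). So, where the f_time is the same, we want the sub-interval with the greatest l_time, and vice versa where the l_time is the same (i.e. the smallest f_time).

So, what I did in this case was to eliminate the duplicates by chaining CTEs (also known as WITH clauses) as follows: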

WITH cte1 AS 
(
  SELECT s.*, t.mt, t.lt
  FROM sessions s
  JOIN
  (
    SELECT
      DISTINCT 
      ctn,
      MIN(f_time) AS mt,
      l_time AS lt
    FROM sessions
    GROUP BY ctn, l_time
    ORDER BY l_time
  ) AS t
  ON (s.ctn, s.f_time, s.l_time) = (t.ctn, t.mt, t.lt)
  ORDER BY s.l_time
), 
cte2 AS
(
  SELECT
    DISTINCT
    ctn,
    f_time AS ft,
    MAX(lt) AS lt
  FROM cte1
  GROUP BY ctn, f_time
  ORDER BY f_time
)
SELECT * FROM cte2
ORDER BY ft;

Result:

ctn     ft  lt
707     10:48:25    10:56:17
707     10:53:17    11:00:49
707     10:54:31    10:59:43
707     11:03:59    11:10:39
707     11:04:41    11:08:02
707     11:11:04    11:19:39
707     12:15:15    13:13:13
707     13:04:41    13:20:02
707     13:17:17    13:22:22
707     14:05:17    14:14:14

I then treat cte2 as the starting point for the process described in the "easy" explanation.

The final SQL is as follows:

WITH cte1 AS 
(
  SELECT s.*, t.mt, t.lt
  FROM sessions s
  JOIN
  (
    SELECT
      DISTINCT 
      ctn,
      MIN(f_time) AS mt,
      l_time AS lt
    FROM sessions
    GROUP BY ctn, l_time
    ORDER BY l_time
  ) AS t
  ON (s.ctn, s.f_time, s.l_time) = (t.ctn, t.mt, t.lt)
  ORDER BY s.l_time
), 
cte2 AS
(
  SELECT
    DISTINCT
    ctn,
    f_time AS ft,
    MAX(lt) AS lt
  FROM cte1
  GROUP BY ctn, f_time
  ORDER BY f_time
)
SELECT 
  ROW_NUMBER() OVER (ORDER BY t3.s) AS "Interval No.", 
  ' Start time: ' AS " ",
  t3.min_f AS "Interval start" , 
  ' End time: ' AS " ",
  t3.max_l AS "Interval stop", 
  ' Duration: ' AS " ",
  (t3.max_l - t3.min_f) AS "Duration"
FROM
(
  SELECT 
    -- rn is ordered explicitly so that rn = 1 is the row with the smallest ft,
    -- which is the only row guaranteed to carry the interval's largest lt
    ROW_NUMBER() OVER (PARTITION BY s ORDER BY ft, lt DESC) AS rn,
    MIN(ft) OVER (PARTITION BY s ORDER BY ft, lt DESC) AS min_f, 
    MAX(lt) OVER (PARTITION BY s ORDER BY ft DESC, lt) AS max_l,
    s
  FROM
  (
    SELECT 
    t1.ctn, t1.ft, t1.lt, t1.ovl,
    SUM(ovl) OVER (ORDER BY t1.ft ASC ROWS BETWEEN UNBOUNDED PRECEDING 
                                           AND CURRENT ROW) AS s
    FROM
    (
      SELECT 
        c.ctn AS ctn, c.ft AS ft, c.lt AS lt, 
        CASE
          WHEN LAG(c.lt) OVER (ORDER BY c.ft) > c.ft THEN 0
          ELSE 1
        END AS ovl
      FROM cte2 c
    ) AS t1
    ORDER BY t1.lt
  ) AS t2
  ORDER BY s, rn ASC, min_f
) AS t3
WHERE t3.rn = 1
ORDER BY t3.s;

Result:

Interval No.        Interval start      Interval stop       Duration
1    Start time:    10:48:25     End time:  11:00:49     Duration:  00:12:24
2    Start time:    11:03:59     End time:  11:10:39     Duration:  00:06:40
3    Start time:    11:11:04     End time:  11:19:39     Duration:  00:08:35
4    Start time:    12:15:15     End time:  13:22:22     Duration:  01:07:07
5    Start time:    14:05:17     End time:  14:14:14     Duration:  00:08:57

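As you can see, this is a very cumbersome business: not having UNIQUE constraints in the DDL doubles the length of the SQL and the time taken by the planning and execution stages, and makes the whole thing pretty horrible.

For the plans of both queries, see the end of the fiddles! There is a lesson to be learned there! As a rule of thumb, the longer the plan, the slower the query!

I'm not sure that indexes can play any part here, since we are selecting from the entire table and it is very small! If we were filtering a large table by ctn and/or f_day and/or f_time, I'm pretty sure we would start to see differences in the plans (and timings!) without indexes!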
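
To see whether an index is actually used once a filter is applied, something along these lines could be run against a larger table. This is only a sketch: the index name ctn_f_day_f_time_ix is invented, and the literal values are taken from the sample data. In the "easy" DDL the UNIQUE (ctn, f_day, f_time) constraint already creates a B-tree index on those columns, so the CREATE INDEX is only relevant to the "complex" table.

```sql
-- Composite index matching the filter and sort below (hypothetical name).
CREATE INDEX ctn_f_day_f_time_ix ON sessions (ctn, f_day, f_time);

-- Compare the plan and timings with and without the index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, f_time, l_time
FROM sessions
WHERE ctn = 707
  AND f_day = DATE '2019-06-18'
ORDER BY f_time;
```

On a table as small as the sample one, the planner will most likely choose a sequential scan regardless.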