在 Postgresql 查询中有效地选择多个连续范围的开始和结束

Question

在 Postgresql 查询中有效地选择多个连续范围的开始和结束

我在一个表中有大约 10 亿行数据，其中有一个名称和一个 1-288 范围内的整数。对于给定的name，每个int都是唯一的，并且并非该范围内的每个可能的整数都存在——因此存在间隙。

此查询生成一个示例案例：

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) AS baz ("name", "int")

Run Code Online (Sandbox Code Playgroud)

我想为每个名称和连续整数序列生成一个查找表。每个这样的行将包含：

name -- name列的值
start -- 连续序列中的第一个整数
end --连续序列中的最后一个值
span -- end - start + 1

此查询为上述示例生成示例输出：

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
              ('foo', 10, 11, 2),
              ('foo', 13, 13, 1),
              ('bar', 1, 3, 3)
     ) AS contiguous_ranges ("name", "start", "end", span)

Run Code Online (Sandbox Code Playgroud)

因为我有这么多行，效率越高越好。也就是说，我只需要运行一次这个查询，所以这不是一个绝对的要求。

提前致谢！

编辑：

我应该补充一点，PL/pgSQL 解决方案是受欢迎的（请解释任何花哨的技巧——我还是 PL/pgSQL 的新手）。

Answer 1

Jac*_*las 9

怎么用 with recursive

测试视图：

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");

Run Code Online (Sandbox Code Playgroud)

询问：

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                     union all
                                     select "name", v."int", t.span+1 as span
                                     from v join t using ("name")
                                     where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
      from ( select "name", "int", max(span) as span 
             from t
             group by "name", "int" ) z
      group by "name", ("int"-span+1) ) z;

Run Code Online (Sandbox Code Playgroud)

结果：

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)

Run Code Online (Sandbox Code Playgroud)

我很想知道它在您的十亿行表上的表现如何。

Answer 2

小智 8

您可以使用窗口函数来做到这一点。基本思想是使用lead和lag窗口函数将行拉到当前行的前后。然后我们可以计算是否有序列的开始或结束：

create temp view temp_view as
    select
        n,
        val,
        (lead <> val + 1 or lead is null) as islast,
        (lag <> val - 1 or lag is null) as isfirst,
        (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
    from
    (
        select
            n,
            lead(val, 1) over( partition by n order by n, val),
            lag(val, 1) over(partition by n order by n, val ),
            val
        from test
        order by n, val
    ) as t
;  
select * from temp_view;
 n  | val | islast | isfirst | orphan 
-----+-----+--------+---------+--------
 bar |   1 | f      | t       | f
 bar |   2 | f      | f       | f
 bar |   3 | t      | f       | f
 bar |  24 | t      | t       | t
 bar |  42 | t      | t       | t
 foo |   2 | f      | t       | f
 foo |   3 | f      | f       | f
 foo |   4 | t      | f       | f
 foo |  10 | f      | t       | f
 foo |  11 | t      | f       | f
 foo |  13 | t      | t       | t
(11 rows)

Run Code Online (Sandbox Code Playgroud)

（我使用了一个视图，所以下面的逻辑会更容易理解。）所以现在我们知道该行是开始还是结束。我们必须将其折叠成行：

select
    n as "name",
    first,
    coalesce (last, first) as last,
    coalesce (last - first + 1, 1) as span
from
(
    select
    n,
    val as first,
    -- this will not be excellent perf. since were calling the view
    -- for each row sequence found. Changing view into temp table 
    -- will probably help with lots of values.
    (
        select min(val)
        from temp_view as last
        where islast = true
        -- need this since isfirst=true, islast=true on an orphan sequence
        and last.orphan = false
        and first.val < last.val
        and first.n = last.n
    ) as last
    from
        (select * from temp_view where isfirst = true) as first
) as t
;

 name | first | last | span 
------+-------+------+------
 bar  |     1 |    3 |    3
 bar  |    24 |   24 |    1
 bar  |    42 |   42 |    1
 foo  |     2 |    4 |    3
 foo  |    10 |   11 |    2
 foo  |    13 |   13 |    1
(6 rows)

Run Code Online (Sandbox Code Playgroud)

对我来说看起来正确:)

Answer 3

小智 0

一个粗略的计划：

为每个名称选择最小值（按名称分组）
为每个名称选择最小值 2，其中 min2 > min1 并且不存在（子查询： SEL min2-1 ）。
选择 max val1 > min val1，其中 max val1 < min val2。

重复2.直到不再发生更新。从那里开始，事情就变得复杂了，Gordian，对最大分钟和最小最大进行分组。我想我会选择一种编程语言。

PS：一个包含一些样本值的漂亮样本表就可以了，每个人都可以使用它，所以并不是每个人都从头开始创建他的测试数据。

归档时间：	14 年，6 月前
查看次数：	10055 次
最近记录：	6 年，3 月前