在 Postgresql 查询中有效地选择多个连续范围的开始和结束

Ste*_*tew 20 postgresql query

我在一个表中有大约 10 亿行数据,其中有一个名称和一个 1-288 范围内的整数。对于给定的name,每个int都是唯一的,并且并非该范围内的每个可能的整数都存在——因此存在间隙。

此查询生成一个示例案例:

--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) AS baz ("name", "int")
Run Code Online (Sandbox Code Playgroud)

我想为每个名称和连续整数序列生成一个查找表。每个这样的行将包含:

name -- name列的值
start -- 连续序列中的第一个整数
end --连续序列中的最后一个值
span -- end - start + 1

此查询为上述示例生成示例输出:

--what I need:
SELECT * 
FROM ( VALUES ('foo', 2, 4, 3),
              ('foo', 10, 11, 2),
              ('foo', 13, 13, 1),
              ('bar', 1, 3, 3)
     ) AS contiguous_ranges ("name", "start", "end", span)
Run Code Online (Sandbox Code Playgroud)

因为我有这么多行,效率越高越好。也就是说,我只需要运行一次这个查询,所以这不是一个绝对的要求。

提前致谢!

编辑:

我应该补充一点,PL/pgSQL 解决方案是受欢迎的(请解释任何花哨的技巧——我还是 PL/pgSQL 的新手)。

Jac*_*las 9

怎么用 with recursive

测试视图:

create view v as 
select *
from ( values ('foo', 2),
              ('foo', 3),
              ('foo', 4),
              ('foo', 10),
              ('foo', 11),
              ('foo', 13),
              ('bar', 1),
              ('bar', 2),
              ('bar', 3)
     ) as baz ("name", "int");
Run Code Online (Sandbox Code Playgroud)

询问:

with recursive t("name", "int") as ( select "name", "int", 1 as span from v
                                     union all
                                     select "name", v."int", t.span+1 as span
                                     from v join t using ("name")
                                     where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
      from ( select "name", "int", max(span) as span 
             from t
             group by "name", "int" ) z
      group by "name", ("int"-span+1) ) z;
Run Code Online (Sandbox Code Playgroud)

结果:

 name | start | end | span
------+-------+-----+------
 foo  |     2 |   4 |    3
 foo  |    13 |  13 |    1
 bar  |     1 |   3 |    3
 foo  |    10 |  11 |    2
(4 rows)
Run Code Online (Sandbox Code Playgroud)

我很想知道它在您的十亿行表上的表现如何。


小智 8

您可以使用窗口函数来做到这一点。基本思想是使用leadlag窗口函数将行拉到当前行的前后。然后我们可以计算是否有序列的开始或结束:

create temp view temp_view as
    select
        n,
        val,
        (lead <> val + 1 or lead is null) as islast,
        (lag <> val - 1 or lag is null) as isfirst,
        (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
    from
    (
        select
            n,
            lead(val, 1) over( partition by n order by n, val),
            lag(val, 1) over(partition by n order by n, val ),
            val
        from test
        order by n, val
    ) as t
;  
select * from temp_view;
 n  | val | islast | isfirst | orphan 
-----+-----+--------+---------+--------
 bar |   1 | f      | t       | f
 bar |   2 | f      | f       | f
 bar |   3 | t      | f       | f
 bar |  24 | t      | t       | t
 bar |  42 | t      | t       | t
 foo |   2 | f      | t       | f
 foo |   3 | f      | f       | f
 foo |   4 | t      | f       | f
 foo |  10 | f      | t       | f
 foo |  11 | t      | f       | f
 foo |  13 | t      | t       | t
(11 rows)
Run Code Online (Sandbox Code Playgroud)

(我使用了一个视图,所以下面的逻辑会更容易理解。)所以现在我们知道该行是开始还是结束。我们必须将其折叠成行:

select
    n as "name",
    first,
    coalesce (last, first) as last,
    coalesce (last - first + 1, 1) as span
from
(
    select
    n,
    val as first,
    -- this will not be excellent perf. since were calling the view
    -- for each row sequence found. Changing view into temp table 
    -- will probably help with lots of values.
    (
        select min(val)
        from temp_view as last
        where islast = true
        -- need this since isfirst=true, islast=true on an orphan sequence
        and last.orphan = false
        and first.val < last.val
        and first.n = last.n
    ) as last
    from
        (select * from temp_view where isfirst = true) as first
) as t
;

 name | first | last | span 
------+-------+------+------
 bar  |     1 |    3 |    3
 bar  |    24 |   24 |    1
 bar  |    42 |   42 |    1
 foo  |     2 |    4 |    3
 foo  |    10 |   11 |    2
 foo  |    13 |   13 |    1
(6 rows)
Run Code Online (Sandbox Code Playgroud)

对我来说看起来正确:)


小智 0

一个粗略的计划:

  • 为每个名称选择最小值(按名称分组)
  • 为每个名称选择最小值 2,其中 min2 > min1 并且不存在(子查询: SEL min2-1 )。
  • 选择 max val1 > min val1,其中 max val1 < min val2。

重复2.直到不再发生更新。从那里开始,事情就变得复杂了,Gordian,对最大分钟和最小最大进行分组。我想我会选择一种编程语言。

PS:一个包含一些样本值的漂亮样本表就可以了,每个人都可以使用它,所以并不是每个人都从头开始创建他的测试数据。