我在一个表中有大约 10 亿行数据,其中有一个名称和一个 1-288 范围内的整数。对于给定的name,每个int都是唯一的,并且并非该范围内的每个可能的整数都存在——因此存在间隙。
此查询生成一个示例案例:
--what I have:
SELECT *
FROM ( VALUES ('foo', 2),
('foo', 3),
('foo', 4),
('foo', 10),
('foo', 11),
('foo', 13),
('bar', 1),
('bar', 2),
('bar', 3)
) AS baz ("name", "int")
Run Code Online (Sandbox Code Playgroud)
我想为每个名称和连续整数序列生成一个查找表。每个这样的行将包含:
name -- name列的值
start -- 连续序列中的第一个整数
end --连续序列中的最后一个值
span -- end - start + 1
此查询为上述示例生成示例输出:
--what I need:
SELECT *
FROM ( VALUES ('foo', 2, 4, 3),
('foo', 10, 11, 2),
('foo', 13, 13, 1),
('bar', 1, 3, 3)
) AS contiguous_ranges ("name", "start", "end", span)
Run Code Online (Sandbox Code Playgroud)
因为我有这么多行,效率越高越好。也就是说,我只需要运行一次这个查询,所以这不是一个绝对的要求。
提前致谢!
编辑:
我应该补充一点,PL/pgSQL 解决方案是受欢迎的(请解释任何花哨的技巧——我还是 PL/pgSQL 的新手)。
怎么用 with recursive
测试视图:
create view v as
select *
from ( values ('foo', 2),
('foo', 3),
('foo', 4),
('foo', 10),
('foo', 11),
('foo', 13),
('bar', 1),
('bar', 2),
('bar', 3)
) as baz ("name", "int");
Run Code Online (Sandbox Code Playgroud)
询问:
with recursive t("name", "int") as ( select "name", "int", 1 as span from v
union all
select "name", v."int", t.span+1 as span
from v join t using ("name")
where v."int"=t."int"+1 )
select "name", "start", "start"+span-1 as "end", span
from( select "name", ("int"-span+1) as "start", max(span) as span
from ( select "name", "int", max(span) as span
from t
group by "name", "int" ) z
group by "name", ("int"-span+1) ) z;
Run Code Online (Sandbox Code Playgroud)
结果:
name | start | end | span
------+-------+-----+------
foo | 2 | 4 | 3
foo | 13 | 13 | 1
bar | 1 | 3 | 3
foo | 10 | 11 | 2
(4 rows)
Run Code Online (Sandbox Code Playgroud)
我很想知道它在您的十亿行表上的表现如何。
小智 8
您可以使用窗口函数来做到这一点。基本思想是使用lead
和lag
窗口函数将行拉到当前行的前后。然后我们可以计算是否有序列的开始或结束:
create temp view temp_view as
select
n,
val,
(lead <> val + 1 or lead is null) as islast,
(lag <> val - 1 or lag is null) as isfirst,
(lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
from
(
select
n,
lead(val, 1) over( partition by n order by n, val),
lag(val, 1) over(partition by n order by n, val ),
val
from test
order by n, val
) as t
;
select * from temp_view;
n | val | islast | isfirst | orphan
-----+-----+--------+---------+--------
bar | 1 | f | t | f
bar | 2 | f | f | f
bar | 3 | t | f | f
bar | 24 | t | t | t
bar | 42 | t | t | t
foo | 2 | f | t | f
foo | 3 | f | f | f
foo | 4 | t | f | f
foo | 10 | f | t | f
foo | 11 | t | f | f
foo | 13 | t | t | t
(11 rows)
Run Code Online (Sandbox Code Playgroud)
(我使用了一个视图,所以下面的逻辑会更容易理解。)所以现在我们知道该行是开始还是结束。我们必须将其折叠成行:
select
n as "name",
first,
coalesce (last, first) as last,
coalesce (last - first + 1, 1) as span
from
(
select
n,
val as first,
-- this will not be excellent perf. since were calling the view
-- for each row sequence found. Changing view into temp table
-- will probably help with lots of values.
(
select min(val)
from temp_view as last
where islast = true
-- need this since isfirst=true, islast=true on an orphan sequence
and last.orphan = false
and first.val < last.val
and first.n = last.n
) as last
from
(select * from temp_view where isfirst = true) as first
) as t
;
name | first | last | span
------+-------+------+------
bar | 1 | 3 | 3
bar | 24 | 24 | 1
bar | 42 | 42 | 1
foo | 2 | 4 | 3
foo | 10 | 11 | 2
foo | 13 | 13 | 1
(6 rows)
Run Code Online (Sandbox Code Playgroud)
对我来说看起来正确:)
小智 0
一个粗略的计划:
重复2.直到不再发生更新。从那里开始,事情就变得复杂了,Gordian,对最大分钟和最小最大进行分组。我想我会选择一种编程语言。
PS:一个包含一些样本值的漂亮样本表就可以了,每个人都可以使用它,所以并不是每个人都从头开始创建他的测试数据。
归档时间: |
|
查看次数: |
10055 次 |
最近记录: |