Group by array overlap

Question

I have a PostgreSQL table of ids and clusters, like this:

CREATE TABLE w (id bigint, clst int);
INSERT INTO w (id,clst)
VALUES 
  (1,0),
  (1,4),
  (2,1),
  (2,2),
  (2,3),
  (3,2),
  (4,2),
  (5,4),
  (6,5);

If you aggregate the clusters grouped by id, you can see that the cluster arrays contain overlapping values:

select id, array_agg(clst) clst from w group by id order by id;
 id |  clst
----+---------
  1 | {0,4}
  2 | {1,2,3}
  3 | {2}
  4 | {2}
  5 | {4}
  6 | {5}

That is, cluster 4 covers ids 1 and 5, cluster 2 covers ids 2, 3 and 4, and cluster 5 corresponds to a single id.
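
The same overlap can be seen from the other direction by aggregating ids per cluster. This is only an illustrative query against the table `w` defined above; the `ids` alias is a name picked for this sketch:

```sql
-- List the ids that share each cluster; any cluster with more than
-- one id links those ids into the same group.
select clst, array_agg(id order by id) as ids
from w
group by clst
order by clst;
--  clst |   ids
-- ------+---------
--     0 | {1}
--     1 | {2}
--     2 | {2,3,4}
--     3 | {2}
--     4 | {1,5}
--     5 | {6}
```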

How can I now aggregate the ids, grouped by overlap of their cluster arrays? The expected result is:

 id      | clst
---------+-------
 {1,5}   | {0,4,4}
 {2,3,4} | {1,2,3,2,2}
 {6}     | {5}

I don't really care about the clst column; I just need the ids aggregated correctly.

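There is no limit on the number of possible overlaps. The number of clusters per id is also unlimited (it can be hundreds or even more). Clusters are not assigned to ids in any particular order.

The table has millions of rows!

Using PostgreSQL 11.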

Answer 1

Jac*_*las 4

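"I don't really care about the clst column; I just need the ids aggregated correctly."

In that case we can use the intarray extension (it provides the uniq and sort functions, and the | array-union operator that the query below relies on):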

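-- requires: create extension intarray;  (provides the int[] | int[] union operator)
with recursive a as (
  -- one row per id with its distinct clusters
  select id, array_agg(distinct clst) clst from w group by id)
, t(id,pid,clst) as (
  select id,id,clst from a
  union all
  -- grow each id's cluster set with any overlapping set that adds new clusters
  select t.id,a.id,t.clst|a.clst
  from t join a on a.id<>t.pid and t.clst&&a.clst and not t.clst@>a.clst)
, d as (
  -- keep the largest cluster set reached for each id
  select distinct on(id) id, clst from t order by id, cardinality(clst) desc)
select array_agg(id), clst from d group by clst;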

array_agg | clst
:-------- | :------
{6} | {5}
{2,3,4} | {1,2,3}
{1,5} | {0,4}

db<>fiddle here
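
The intarray pieces mentioned above can be tried in isolation. This is a minimal sketch, assuming the extension is available on the server; the literal arrays are made up for illustration and do not come from the table:

```sql
-- Install the extension once per database (needs sufficient privileges).
create extension if not exists intarray;

-- sort() orders the array, uniq() then drops adjacent duplicates.
select uniq(sort('{4,0,4}'::int[]));   -- {0,4}

-- The | operator returns the union of two integer arrays,
-- sorted and without duplicates.
select '{0,4}'::int[] | '{4,5}'::int[];   -- {0,4,5}
```

Because the union comes back sorted and de-duplicated, every id in the same group should end up with an identical array, which is what lets the final `group by clst` collect the ids together.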

Bear in mind that this is unlikely to perform well on millions of rows.
