SQL分组交互/重叠行

bog*_*man 5 sql postgresql recursive-query common-table-expression

我在那有重叠的两列数据的Postgres下表a_snob_sno.

create table data
( a_sno integer not null,  
  b_sno integer not null,
  PRIMARY KEY (a_sno,b_sno)
);

insert into data (a_sno,b_sno) values
  ( 4, 5 )
, ( 5, 4 )
, ( 5, 6 )
, ( 6, 5 )
, ( 6, 7 )
, ( 7, 6 )
, ( 9, 10)
, ( 9, 13)
, (10, 9 )
, (13, 9 )
, (10, 13)
, (13, 10)
, (10, 14)
, (14, 10)
, (13, 14)
, (14, 13)
, (11, 15)
, (15, 11);
Run Code Online (Sandbox Code Playgroud)

从前6行可以看出,两列中的数据值4,5,6和7相交/重叠需要分区到一个组.行7-16和行17-18也是如此,它们将分别标记为组2和3.

结果输出应如下所示:

group | value
------+------
1     | 4
1     | 5
1     | 6
1     | 7
2     | 9
2     | 10
2     | 13
2     | 14
3     | 11
3     | 15
Run Code Online (Sandbox Code Playgroud)

Erw*_*ter 5

假设所有对在它们的镜像的组合中也存在(4,5)(5,4).但是,以下解决方案也可以在没有镜像的情况下工作.

简单的案例

所有的连接都可以按照一个升序顺序排列,并且我在小提琴中添加的复杂功能是不可能的,我们可以在rCTE中使用此解决方案而不重复:

我从a_sno每组获得最小值开始,最小关联b_sno:

SELECT row_number() OVER (ORDER BY a_sno) AS grp
     , a_sno, min(b_sno) AS b_sno
FROM   data d
WHERE  a_sno < b_sno
AND    NOT EXISTS (
   SELECT 1 FROM data
   WHERE  b_sno = d.a_sno
   AND    a_sno < b_sno
   )
GROUP  BY a_sno;
Run Code Online (Sandbox Code Playgroud)

这只需要一个查询级别,因为可以在聚合上构建窗口函数:

结果:

grp  a_sno  b_sno
1    4      5
2    9      10
3    11     15
Run Code Online (Sandbox Code Playgroud)

我避免使用分支和重复(多重)行 - 使用长链可能得多.我ORDER BY b_sno LIMIT 1在一个相关的子查询中使用它来进行递归CTE.

性能的关键是一个匹配的索引,它已经由PK约束提供了PRIMARY KEY (a_sno,b_sno):不是反过来(b_sno, a_sno):

WITH RECURSIVE t AS (
   SELECT row_number() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, min(b_sno) AS b_sno  -- the smallest one
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   GROUP  BY a_sno
   )

, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp
       , (SELECT b_sno  -- correlated subquery
          FROM   data
          WHERE  a_sno = c.sno
          AND    a_sno < b_sno
          ORDER  BY b_sno
          LIMIT  1)
   FROM   cte  c
   WHERE  c.sno IS NOT NULL
   )
SELECT * FROM cte
WHERE  sno IS NOT NULL   -- eliminate row with NULL
UNION  ALL               -- no duplicates
SELECT grp, a_sno FROM t
ORDER  BY grp, sno;
Run Code Online (Sandbox Code Playgroud)

不太简单的情况

所有节点都可以按升序到达,其中一个或多个分支来自根(最小sno).

这一次,获取所有sno可能在最后访问多次的更大和重复数据删除的节点UNION:

WITH RECURSIVE t AS (
   SELECT rank() OVER (ORDER BY d.a_sno) AS grp
        , a_sno, b_sno  -- get all rows for smallest a_sno
   FROM   data d
   WHERE  a_sno < b_sno
   AND    NOT EXISTS (
      SELECT 1 FROM data
      WHERE  b_sno = d.a_sno
      AND    a_sno < b_sno
      )
   )   
, cte AS (
   SELECT grp, b_sno AS sno FROM t

   UNION ALL
   SELECT c.grp, d.b_sno
   FROM   cte  c
   JOIN   data d ON d.a_sno = c.sno
                AND d.a_sno < d.b_sno  -- join to all connected rows
   )
SELECT grp, sno FROM cte
UNION                     -- eliminate duplicates
SELECT grp, a_sno FROM t  -- add first rows
ORDER  BY grp, sno;
Run Code Online (Sandbox Code Playgroud)

与第一个解决方案不同,我们在这里没有得到NULL的最后一行(由相关子查询引起).

两者都应该表现得非常好 - 特别是对于长链/许多分支.结果符合要求:

SQL Fiddle(添加了行来演示难度).

无向图

如果通过递增遍历从根目录无法到达本地最小值,则上述解决方案将不起作用.在这种情况下考虑Farhęg的解决方案.