For example, starting from a table like this:
create table t as
select 'A' as x, level as y from dual connect by level<=5
union all
select 'B' as x, level+2 as y from dual connect by level<=5
union all
select 'C' as x, level as y from dual connect by level<=3
union all
select 'D' as x, level+2 as y from dual connect by level<=3;
alter table t add primary key (x, y);
select * from t;
X Y
- -
A 1
A 2
A 3
A 4
A 5
B 3
B 4
B 5
B 6
B 7
C 1
C 2
C 3
D 3
D 4
D 5
How can I get this:
SUBSET_X SUPERSET_X
-------- ----------
D A
C A
D B
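I am posting my own effort as an answer, but I would like to know whether there is some other, fancier way, perhaps with analytics or a set operator I am not aware of.
- EDIT
My test data unintentionally implies that the sets always consist of contiguous integers. Unfortunately, that is not the case for my real data.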
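To make that caveat testable, the sample table could be extended with a set that has gaps. The rows below are hypothetical and not part of the original test data: a set E = {1, 4} is contained in A but not in B, C or D, so a correct query should return one additional row, E A.

```sql
-- Hypothetical extra rows (not in the original data): E = {1, 4}.
-- Both values occur in A (1..5); 1 is missing from B and D, 4 is missing from C.
insert into t (x, y) values ('E', 1);
insert into t (x, y) values ('E', 4);
```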
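Jack, this is similar to your first approach, but with a few differences. The main one is that the size comparison is done in a join rather than in the HAVING clause. HAVING is a post-aggregation filter, which is why sticking the COUNT in there is slow. If you are querying the whole table at once, your indexing strategy should not matter much, but if you are interested in specific sets I would recommend having indexes on both (x, y) and (y, x).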
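As a rough sketch of the indexing suggestion: the primary key declared on (x, y) in the question is already backed by an index in that column order, so only the reversed one would need to be created. The index name t_y_x_idx is made up for illustration.

```sql
-- The primary key on (x, y) already provides an index in that column order.
-- Hypothetical index name; supports lookups that start from y.
create index t_y_x_idx on t (y, x);
```

Whether the optimizer actually uses either index depends on the data and on how selective the filter on a specific set is, so it is worth checking the execution plan.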
Either way, this query should run quite fast:
WITH set_sizes AS (
    SELECT x, COUNT(*) AS set_size  -- "size" is a reserved keyword in Oracle
    FROM t
    GROUP BY x
)
, intersection_sizes AS (
    SELECT
          sub.x    sub_x
        , super.x  super_x
        , COUNT(*) intersection_size
    FROM
        t sub
        INNER JOIN t super
            ON  sub.y = super.y
            AND sub.x <> super.x
    GROUP BY
          sub.x
        , super.x
)
SELECT xs.sub_x, xs.super_x
FROM
    set_sizes ss
    INNER JOIN intersection_sizes xs
        ON  ss.x = xs.sub_x
        AND ss.set_size = xs.intersection_size
;
Edit: based on your tests against a large data set, it looks like this query is the fastest.
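On the set-operator part of the question: a subset test can also be phrased with MINUS, because one set is contained in another exactly when its y values MINUS the other's y values is empty. The following is only a sketch for cross-checking results on small tables, not a replacement for the join query. It evaluates a set difference for every pair of sets, so it is unlikely to compete on large data, and that has not been measured here.

```sql
-- Sketch: a pair (sub, super) qualifies when no y of sub is missing from super.
select sub.x as subset_x, super.x as superset_x
from (select distinct x from t) sub
cross join (select distinct x from t) super
where sub.x <> super.x
and not exists (
    select y from t where x = sub.x
    minus
    select y from t where x = super.x
);
```

Like the join query, this reports two identical sets as subsets of each other, once in each direction.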