How to filter SQL results in a has-many-through relation

Xeo*_*oss 95 mysql sql postgresql relational-division sql-match-all

Assuming I have the tables student, club, and student_club:

student {
    id
    name
}
club {
    id
    name
}
student_club {
    student_id
    club_id
}

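I want to know how to find all students who are in both the soccer (30) and the baseball (50) club.
This query doesn't work, but it's the closest I have got so far: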

SELECT student.*
FROM   student
INNER  JOIN student_club sc ON student.id = sc.student_id
LEFT   JOIN club c ON c.id = sc.club_id
WHERE  c.id = 30 AND c.id = 50

Erw*_*ter 133

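I was curious. And as we all know, curiosity has a reputation for killing cats.

So, which is the fastest way to skin a cat?

The precise cat-skinning environment for this test:

  • PostgreSQL 9.0 on Debian Squeeze with decent RAM and settings.
  • 6,000 students, 24,000 club memberships (data copied from a similar database with real-life data).
  • Slight deviation from the naming schema in the question: student.id is student.stud_id and club.id is club.club_id here.
  • I named the queries after their authors in this thread, with an index number where an author has two.
  • I ran all queries a couple of times to populate the cache, then picked the best of 5 with EXPLAIN ANALYZE.
  • Relevant indexes (should be the optimum - as long as we lack foreknowledge of which clubs will be queried):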

    ALTER TABLE student ADD CONSTRAINT student_pkey PRIMARY KEY(stud_id );
    ALTER TABLE student_club ADD CONSTRAINT sc_pkey PRIMARY KEY(stud_id, club_id);
    ALTER TABLE club       ADD CONSTRAINT club_pkey PRIMARY KEY(club_id );
    CREATE INDEX sc_club_id_idx ON student_club (club_id);
    

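    club_pkey is not required by most queries here.
    Primary keys implement unique indexes automatically in PostgreSQL.
    The last index is there to make up for this known shortcoming of multi-column indexes in PostgreSQL:

A multicolumn B-tree index can be used with query conditions that involve any subset of the index's columns, but the index is most efficient when there are constraints on the leading (leftmost) columns.

Results:

Total runtimes from EXPLAIN ANALYZE.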

1) Martin 2: 44.594 ms

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club sc USING (stud_id)
WHERE  sc.club_id IN (30, 50)
GROUP  BY 1,2
HAVING COUNT(*) > 1;

2) Erwin 1: 33.217 ms

SELECT s.stud_id, s.name
FROM   student s
JOIN   (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id IN (30, 50)
   GROUP  BY 1
   HAVING COUNT(*) > 1
   ) sc USING (stud_id);

3) Martin 1: 31.735 ms

SELECT s.stud_id, s.name
FROM   student s
WHERE  s.stud_id IN (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id = 30
   INTERSECT
   SELECT stud_id
   FROM   student_club
   WHERE  club_id = 50);

4) Derek: 2.287 ms

SELECT s.stud_id,  s.name
FROM   student s
WHERE  s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
AND    s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50);

5) Erwin 2: 2.181 ms

SELECT s.stud_id,  s.name
FROM   student s
WHERE  EXISTS (SELECT * FROM student_club
               WHERE  stud_id = s.stud_id AND club_id = 30)
AND    EXISTS (SELECT * FROM student_club
               WHERE  stud_id = s.stud_id AND club_id = 50);

6) Sean: 2.043 ms

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club x ON s.stud_id = x.stud_id
JOIN   student_club y ON s.stud_id = y.stud_id
WHERE  x.club_id = 30
AND    y.club_id = 50;

The last three perform pretty much the same. 4) and 5) result in the same query plan.
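A quick way to convince yourself that the three fast variants are interchangeable is to run them against the same small data set. The sketch below is only a sanity check of the results, not a benchmark: it assumes SQLite (through Python's sqlite3 module) as a stand-in for PostgreSQL, and the four students and their memberships are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for PostgreSQL,
# and the sample rows are invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (stud_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        stud_id INTEGER,
        club_id INTEGER,
        PRIMARY KEY (stud_id, club_id)
    );
    CREATE INDEX sc_club_id_idx ON student_club (club_id);

    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Di');
    INSERT INTO student_club VALUES
        (1, 30), (1, 50),            -- Ann: soccer and baseball
        (2, 30),                     -- Bob: soccer only
        (3, 50), (3, 70),            -- Cy: baseball and another club
        (4, 30), (4, 50), (4, 70);   -- Di: both, plus another club
""")

QUERIES = {
    # 4) Derek: two IN subqueries
    "derek_in": """
        SELECT s.stud_id, s.name
        FROM   student s
        WHERE  s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 30)
        AND    s.stud_id IN (SELECT stud_id FROM student_club WHERE club_id = 50)
    """,
    # 5) Erwin 2: two EXISTS subqueries
    "erwin_exists": """
        SELECT s.stud_id, s.name
        FROM   student s
        WHERE  EXISTS (SELECT * FROM student_club
                       WHERE  stud_id = s.stud_id AND club_id = 30)
        AND    EXISTS (SELECT * FROM student_club
                       WHERE  stud_id = s.stud_id AND club_id = 50)
    """,
    # 6) Sean: join student_club twice
    "sean_join": """
        SELECT s.stud_id, s.name
        FROM   student s
        JOIN   student_club x ON s.stud_id = x.stud_id
        JOIN   student_club y ON s.stud_id = y.stud_id
        WHERE  x.club_id = 30
        AND    y.club_id = 50
    """,
}

# Sort each result so the comparison does not depend on row order.
results = {label: sorted(conn.execute(sql).fetchall())
           for label, sql in QUERIES.items()}

for label, rows in results.items():
    print(label, rows)
```

On this sample data all three return the same two students (Ann and Di). The self-join in 6) yields no duplicates only because (stud_id, club_id) is unique.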

Late additions:

Fancy SQL, but the performance can't keep up.

7) ypercube 1: 148.649 ms

SELECT s.stud_id,  s.name
FROM   student AS s
WHERE  NOT EXISTS (
   SELECT *
   FROM   club AS c 
   WHERE  c.club_id IN (30, 50)
   AND    NOT EXISTS (
      SELECT *
      FROM   student_club AS sc 
      WHERE  sc.stud_id = s.stud_id
      AND    sc.club_id = c.club_id  
      )
   );

8) ypercube 2: 147.497 ms

SELECT s.stud_id,  s.name
FROM   student AS s
WHERE  NOT EXISTS (
   SELECT *
   FROM  (
      SELECT 30 AS club_id  
      UNION  ALL
      SELECT 50
      ) AS c
   WHERE NOT EXISTS (
      SELECT *
      FROM   student_club AS sc 
      WHERE  sc.stud_id = s.stud_id
      AND    sc.club_id = c.club_id  
      )
   );

As expected, those two perform almost the same. The query plan results in table scans; the planner doesn't find a way to use the indexes here.


9) wildplasser 1: 49.849 ms

WITH RECURSIVE two AS (
   SELECT 1::int AS level
        , stud_id
   FROM   student_club sc1
   WHERE  sc1.club_id = 30
   UNION
   SELECT two.level + 1 AS level
        , sc2.stud_id
   FROM   student_club sc2
   JOIN   two USING (stud_id)
   WHERE  sc2.club_id = 50
   AND    two.level = 1
   )
SELECT s.stud_id, s.name
FROM   student s
JOIN   two USING (stud_id)
WHERE  two.level > 1;

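Fancy SQL, decent performance for a CTE. Very exotic query plan.
Again, it would be interesting to see how 9.1 handles this. I am going to upgrade the database cluster used here to 9.1 soon. Maybe I'll rerun the whole shebang ...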


10) wildplasser 2: 36.986 ms

WITH sc AS (
   SELECT stud_id
   FROM   student_club
   WHERE  club_id IN (30,50)
   GROUP  BY stud_id
   HAVING COUNT(*) > 1
   )
SELECT s.*
FROM   student s
JOIN   sc USING (stud_id);

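CTE variant of query 2). Surprisingly, it can result in a slightly different query plan with the exact same data. I found a sequential scan on student, where the subquery variant used the index.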


11) ypercube 3: 101.482 ms

Another late addition by @ypercube. It is positively amazing how many ways there are.

SELECT s.stud_id, s.name
FROM   student s
JOIN   student_club sc USING (stud_id)
WHERE  sc.club_id = 10                 -- member in 1st club ...
AND    NOT EXISTS (
   SELECT *
   FROM  (SELECT 14 AS club_id) AS c  -- can't be excluded for missing the 2nd
   WHERE  NOT EXISTS (
      SELECT *
      FROM   student_club AS d
      WHERE  d.stud_id = sc.stud_id
      AND    d.club_id = c.club_id
      )
   )

12) Erwin 3: 2.377 ms

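@ypercube's 11) is actually just the mind-twisting reverse approach of this simpler variant, which was also still missing. It performs almost as fast as the top cats.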

SELECT s.*
FROM   student s
JOIN   student_club x USING (stud_id)
WHERE  x.club_id = 10                  -- member in 1st club ...
AND    EXISTS (                        -- ... and membership in 2nd exists
   SELECT *
   FROM   student_club AS y
   WHERE  y.stud_id = s.stud_id
   AND    y.club_id = 14
   )

13) Erwin 4: 2.375 ms

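Hard to believe, but here is another, genuinely new variant. I see potential for more than two memberships, but it also ranks among the top cats with just two.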

SELECT s.*
FROM   student AS s
WHERE  EXISTS (
   SELECT *
   FROM   student_club AS x
   JOIN   student_club AS y USING (stud_id)
   WHERE  x.stud_id = s.stud_id
   AND    x.club_id = 14
   AND    y.club_id = 10
   )

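Dynamic number of club memberships

In other words: a varying number of filters. This question asked for exactly two club memberships, but many use cases have to prepare for a varying number.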
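One way to handle a varying number of clubs is the GROUP BY / HAVING COUNT pattern from queries 1), 2) and 10), with the list of clubs and its length passed in as parameters. The following is a minimal sketch under a few assumptions: SQLite via Python's sqlite3 stands in for PostgreSQL, the helper name students_in_all_clubs and the sample rows are hypothetical, and the count comparison is only valid because (stud_id, club_id) is unique.

```python
import sqlite3

# Assumption: in-memory SQLite with made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student_club (
        stud_id INTEGER,
        club_id INTEGER,
        PRIMARY KEY (stud_id, club_id)   -- uniqueness makes COUNT(*) reliable
    );
    INSERT INTO student_club VALUES
        (1, 30), (1, 50),
        (2, 30),
        (3, 50), (3, 70),
        (4, 30), (4, 50), (4, 70);
""")


def students_in_all_clubs(conn, club_ids):
    """Return ids of students who are members of every club in club_ids.

    Hypothetical helper for illustration; not part of the original answers.
    """
    wanted = sorted(set(club_ids))        # duplicates would inflate the target count
    if not wanted:
        return []                         # no filter given: return nothing
    placeholders = ", ".join("?" * len(wanted))
    sql = f"""
        SELECT stud_id
        FROM   student_club
        WHERE  club_id IN ({placeholders})
        GROUP  BY stud_id
        HAVING COUNT(*) = ?
        ORDER  BY stud_id
    """
    # Bind the club ids first, then the number of distinct clubs requested.
    return [row[0] for row in conn.execute(sql, (*wanted, len(wanted)))]


print(students_in_all_clubs(conn, [30, 50]))
print(students_in_all_clubs(conn, [30, 50, 70]))
```

Only the placeholder list is built dynamically; the club ids themselves are always bound as parameters, so the query text never contains user input.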

There is a detailed discussion in a related later answer.

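  • Considering the domain in question and the sample size, I think anything under 200 ms is acceptable performance. Am I wrong? Out of personal interest, I ran my own tests on SQL Server 2008 R2 using the same structure, indexes and (I think) spread of data, but scaled up to a million students (a reasonably large set for the given domain, I feel), and there still wasn't much to separate the different approaches, IMO. Of course, the ones based on relational division could target a base table, giving them the advantage of "extensibility". (3 upvotes)
  • Brandstetter, very nice work. I started a bounty on this question to give you extra points (but I have to wait 24 hours). Anyway, I wonder how these queries fare when you start adding multiple club_ids instead of just two ... (2 upvotes)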

Sea*_*ean 18

SELECT s.*
FROM student s
INNER JOIN student_club sc_soccer ON s.id = sc_soccer.student_id
INNER JOIN student_club sc_baseball ON s.id = sc_baseball.student_id
WHERE 
 sc_baseball.club_id = 50 AND 
 sc_soccer.club_id = 30


Der*_*omm 10

select *
from student
where id in (select student_id from student_club where club_id = 30)
and id in (select student_id from student_club where club_id = 50)

  • I like this query best because it has a clean style, like Python in SQL. I'd happily trade 0.44 ms (the difference from Sean's query) for this kind of code. (5 upvotes)

Pau*_*gan 5

If you just want the student_id, then:

    Select student_id
      from student_club
     where club_id in ( 30, 50 )
  group by student_id
    having count( student_id ) = 2

If you also need the student's name, then:

Select s.id, s.name
  from student s
 where exists( select *
                 from student_club sc
                where s.id = sc.student_id
                  and sc.club_id in ( 30, 50 )
             group by sc.student_id
               having count( sc.student_id ) = 2 )

If you have more than two clubs, listed in a club_selection table, then:

Select s.id, s.name
  from student s
 where exists( select *
                 from student_club sc
                where s.id = sc.student_id
                  and exists( select *
                                from club_selection cs
                               where sc.club_id = cs.club_id )
             group by sc.student_id
               having count( sc.student_id ) = ( select count( * )
                                                   from club_selection ) )
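To see the club_selection idea in action, here is a small sketch that changes the contents of that table between runs while the query text stays fixed. It assumes SQLite via Python's sqlite3 as a stand-in database, uses the question's column names (student.id, student_club.student_id), and the helper students_for_selection and all sample rows are made up for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite with invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_club (
        student_id INTEGER,
        club_id    INTEGER,
        PRIMARY KEY (student_id, club_id)
    );
    CREATE TABLE club_selection (club_id INTEGER PRIMARY KEY);

    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Di');
    INSERT INTO student_club VALUES
        (1, 30), (1, 50),
        (2, 30),
        (3, 50), (3, 70),
        (4, 30), (4, 50), (4, 70);
""")

# The query never changes; only the rows in club_selection do.
SQL = """
    SELECT s.id, s.name
    FROM   student s
    WHERE  EXISTS (
        SELECT 1
        FROM   student_club sc
        WHERE  sc.student_id = s.id
        AND    EXISTS (SELECT 1
                       FROM   club_selection cs
                       WHERE  cs.club_id = sc.club_id)
        GROUP  BY sc.student_id
        HAVING COUNT(sc.student_id) = (SELECT COUNT(*) FROM club_selection)
    )
    ORDER  BY s.id
"""


def students_for_selection(club_ids):
    """Replace the selection with club_ids and rerun the fixed query.

    Hypothetical helper for illustration only.
    """
    conn.execute("DELETE FROM club_selection")
    conn.executemany("INSERT INTO club_selection VALUES (?)",
                     [(club_id,) for club_id in club_ids])
    return conn.execute(SQL).fetchall()


print(students_for_selection([30, 50]))
print(students_for_selection([30, 50, 70]))
```

With an empty club_selection no student qualifies, because no membership row passes the inner EXISTS and so no group is formed.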