Subselect takes "ages" - EXCEPT is much faster

gue*_*tli 5 performance subquery postgresql-9.3 except postgresql-performance

Script to create the tables

DROP TABLE IF EXISTS history;
CREATE TABLE history (
    id integer NOT NULL,
    ticket_id integer NOT NULL);
ALTER TABLE ONLY history ADD CONSTRAINT history_pkey PRIMARY KEY (id);
CREATE INDEX history_ticket_id ON history USING btree (ticket_id);
DROP TABLE IF EXISTS ticket;
CREATE TABLE ticket (
    id integer NOT NULL
);
ALTER TABLE ONLY ticket ADD CONSTRAINT ticket_pkey PRIMARY KEY (id);

Dummy data

INSERT INTO history values (generate_series(1, 30000), generate_series(1, 30000));
ANALYZE history;

INSERT INTO ticket values (generate_series(1, 40000));
ANALYZE ticket;

Query with subselect

explain analyze select distinct ticket_id from history
       where ticket_id not in (select id from ticket);

EXPLAIN ANALYZE of the slow subselect

 HashAggregate  (cost=15510545.50..15510695.50 rows=15000 width=4) (actual time=170892.668..170892.668 rows=0 loops=1)
   ->  Seq Scan on history  (cost=0.00..15510508.00 rows=15000 width=4) (actual time=170892.644..170892.644 rows=0 loops=1)
         Filter: (NOT (SubPlan 1))
         Rows Removed by Filter: 30000
         SubPlan 1
           ->  Materialize  (cost=0.00..934.00 rows=40000 width=4) (actual time=0.006..2.685 rows=15000 loops=30000)
                 ->  Seq Scan on ticket  (cost=0.00..577.00 rows=40000 width=4) (actual time=0.038..21.347 rows=30000 loops=1)
 Total runtime: 170892.965 ms

Query with EXCEPT

explain analyze select distinct ticket_id from history
       except select id from ticket;

EXPLAIN ANALYZE with EXCEPT

HashSetOp Except  (cost=0.29..2449.29 rows=30000 width=4) (actual time=41.641..41.641 rows=0 loops=1)
   ->  Append  (cost=0.29..2274.29 rows=70000 width=4) (actual time=0.024..27.835 rows=70000 loops=1)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.29..1297.29 rows=30000 width=4) (actual time=0.024..14.527 rows=30000 loops=1)
               ->  Unique  (cost=0.29..997.29 rows=30000 width=4) (actual time=0.022..10.856 rows=30000 loops=1)
                     ->  Index Only Scan using history_ticket_id on history  (cost=0.29..922.29 rows=30000 width=4) (actual time=0.021..6.031 rows=30000 loops=1)
                           Heap Fetches: 30000
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..977.00 rows=40000 width=4) (actual time=0.018..8.364 rows=40000 loops=1)
               ->  Seq Scan on ticket  (cost=0.00..577.00 rows=40000 width=4) (actual time=0.018..3.808 rows=40000 loops=1)
 Total runtime: 41.702 ms

DBMS version

  • PostgreSQL 9.3.10

Question

  • Why does one query take so much longer than the other?

小智 1

IN is better suited to lists of constant values. Try NOT EXISTS instead.

Query:

explain analyze select distinct ticket_id from history h
       where not EXISTS (select id from ticket t where t.id = h.ticket_id);

And the execution plan:

Unique  (cost=0.58..2294.04 rows=1 width=4) (actual time=23.140..23.140 rows=0 loops=1)
  ->  Merge Anti Join  (cost=0.58..2294.04 rows=1 width=4) (actual time=23.139..23.139 rows=0 loops=1)
        Merge Cond: (h.ticket_id = t.id)
        ->  Index Only Scan using history_ticket_id on history h  (cost=0.29..922.29 rows=30000 width=4) (actual time=0.037..6.848 rows=30000 loops=1)
              Heap Fetches: 30000
        ->  Index Only Scan using ticket_pkey on ticket t  (cost=0.29..1228.29 rows=40000 width=4) (actual time=0.026..6.970 rows=30000 loops=1)
              Heap Fetches: 30000
Total runtime: 23.189 ms

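I think the reason is that for NOT IN, Postgres has to build the list of distinct values from the ticket table first and only then filter history. NOT EXISTS does not need to build that list; it can simply check whether the value exists in the ticket primary key index.

As a rule of thumb, when you don't get an "Anti Join" for this kind of query, something is badly written.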
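One semantic difference that may help explain the missing anti join: NOT IN has to follow SQL's three-valued logic, so a single NULL coming out of the subquery makes the predicate unknown for every non-matching row, whereas NOT EXISTS ignores NULLs. As far as I know, this is why a planner cannot simply treat NOT IN as an anti join. The sketch below only demonstrates those semantics, not the Postgres planner. It assumes Python's built-in sqlite3 module as a stand-in engine, and a hypothetical ticket table whose id is nullable (in the question it is NOT NULL).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (id INTEGER PRIMARY KEY, ticket_id INTEGER NOT NULL);
    -- Hypothetical variant: id is nullable here, unlike the question's ticket table.
    CREATE TABLE ticket (id INTEGER);
    INSERT INTO history VALUES (1, 1), (2, 2), (3, 3);
    INSERT INTO ticket VALUES (1), (2), (NULL);
""")

# 3 NOT IN (1, 2, NULL) evaluates to NULL, so the row for ticket 3 is filtered out.
not_in = conn.execute("""
    SELECT DISTINCT ticket_id FROM history
    WHERE ticket_id NOT IN (SELECT id FROM ticket)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists; the NULL never matches.
not_exists = conn.execute("""
    SELECT DISTINCT ticket_id FROM history h
    WHERE NOT EXISTS (SELECT 1 FROM ticket t WHERE t.id = h.ticket_id)
""").fetchall()

print("NOT IN:    ", not_in)
print("NOT EXISTS:", not_exists)
```

With the nullable column, NOT IN returns no rows while NOT EXISTS returns ticket 3, so the two forms are not interchangeable in general.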

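  • We generally appreciate more detailed answers and substantiated claims. Is IN always slow when combined with a subquery? Why? Which plans are produced? Care to share some analysis or benchmarks? Why do you suggest "NOT EXISTS"? Besides "IN" and "NOT EXISTS", there are at least two other basic ways to write this kind of query: "LEFT JOIN / IS NULL" and "EXCEPT" (which the OP used). And you have not answered the OP's question: why is the "EXCEPT" solution so much faster in this case? (4 upvotes)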
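For reference, the four formulations mentioned in this thread can be written side by side. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine and a small hypothetical data set (history tickets 1 to 5, ticket ids 1 to 3) so that the result is not empty. It only shows that the queries agree when no NULLs are involved; it says nothing about how Postgres plans each of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (id INTEGER PRIMARY KEY, ticket_id INTEGER NOT NULL);
    CREATE TABLE ticket (id INTEGER PRIMARY KEY);
""")
# Hypothetical sample data: tickets 4 and 5 appear in history but not in ticket.
conn.executemany("INSERT INTO history VALUES (?, ?)", [(i, i) for i in range(1, 6)])
conn.executemany("INSERT INTO ticket VALUES (?)", [(i,) for i in range(1, 4)])

queries = {
    "NOT IN": """
        SELECT DISTINCT ticket_id FROM history
        WHERE ticket_id NOT IN (SELECT id FROM ticket)""",
    "NOT EXISTS": """
        SELECT DISTINCT ticket_id FROM history h
        WHERE NOT EXISTS (SELECT 1 FROM ticket t WHERE t.id = h.ticket_id)""",
    "LEFT JOIN / IS NULL": """
        SELECT DISTINCT h.ticket_id FROM history h
        LEFT JOIN ticket t ON t.id = h.ticket_id
        WHERE t.id IS NULL""",
    "EXCEPT": """
        SELECT DISTINCT ticket_id FROM history
        EXCEPT SELECT id FROM ticket""",
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}
for name, rows in results.items():
    print(f"{name:20} {rows}")
```

To compare the actual plans, each of the four statements would have to be run under EXPLAIN ANALYZE on the question's Postgres tables.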