Improving the efficiency of set difference in PostgreSQL

Bar*_*lly 2 sql database postgresql performance

I'm trying to do some set operations in PostgreSQL 9.3.

I have two tables; for simplicity, let's call them table_a and table_b:

create table table_a(id varchar primary key);
create table table_b(id varchar primary key);

I have a simple query (this is its simplest form; in practice it is the source for an insert):

(select id from table_a) except (select id from table_b);

Before I started using PostgreSQL, I would have done an operation like this:

set-diff table_a.csv table_b.csv > table_c.csv

set-diff looks roughly like this:

while (not eof(a)) and (not eof(b)):
  line_a <- peek_line(a)
  line_b <- peek_line(b)
  if line_a < line_b:
    output read_line(a)
  else if line_a == line_b:
    read_line(a)
  else:
    read_line(b)
while not eof(a):
  output read_line(a)

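This doesn't take long, has negligible memory requirements, and makes maximally efficient use of sequential disk I/O. That matters because this machine doesn't have much memory: it can't hold all of the data in RAM.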
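For anyone who wants to try the merge outside the database, the pseudocode above can be sketched as a small Python generator. This is only an illustration, not the original tool: the name `set_diff` is hypothetical, and it assumes both inputs are already sorted in the same order.

```python
from typing import Iterable, Iterator


def set_diff(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Yield the items of sorted input `a` that do not occur in sorted input `b`."""
    end = object()  # sentinel marking that b is exhausted
    it_b = iter(b)
    cur_b = next(it_b, end)
    for item in a:
        # Advance b until it catches up with the current item of a.
        while cur_b is not end and cur_b < item:
            cur_b = next(it_b, end)
        if cur_b is end or item < cur_b:
            yield item  # item is only in a
        # otherwise item == cur_b: present in both inputs, so skip it


if __name__ == "__main__":
    print(list(set_diff(["1", "2", "3", "5", "8"], ["2", "3", "4"])))
    # prints ['1', '5', '8']
```

Because it is a generator, only one item from each input is held at a time. To run it over files, pass lazily read lines with the trailing newline stripped, for example `(line.rstrip("\n") for line in f)`, so that comparisons match the order the files were sorted in.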

However, PostgreSQL comes up with this kind of plan (taken from some real tables):

                                    QUERY PLAN
----------------------------------------------------------------------------------
 SetOp Except  (cost=3184554.28..3238904.44 rows=9434298 width=51)
   ->  Sort  (cost=3184554.28..3211729.36 rows=10870032 width=51)
         Sort Key: "*SELECT* 1".id
         ->  Append  (cost=0.00..428039.64 rows=10870032 width=51)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..345707.96 rows=9434298 width=54)
                     ->  Seq Scan on table_a  (cost=0.00..251364.98 rows=9434298 width=54)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..82331.68 rows=1435734 width=32)
                     ->  Seq Scan on table_b  (cost=0.00..67974.34 rows=1435734 width=32)

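The query takes far too long: several minutes.

I'm sure PostgreSQL could use the same kind of merge strategy I outlined above, using only index scans and no sort. Instead, it appears to concatenate the two table scans and sort the result, a bit like this command line, albeit without reading table_b twice: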

sort table_a.csv table_b.csv table_b.csv | uniq -u

This involves a fair amount of extra work: for a start, roughly a factor of log(n) more I/O when everything doesn't fit in memory.
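To make the comparison concrete, here is a rough Python emulation of that `sort ... | uniq -u` pipeline. It is a sketch under assumptions: the function name `sort_uniq_u` is hypothetical, and the first input is assumed to contain no duplicates (as with a primary key). It yields the same rows as the merge, but only after materializing and sorting both inputs together.

```python
from itertools import groupby


def sort_uniq_u(a, b):
    """Emulate `sort a b b | uniq -u`: keep values that occur exactly once."""
    # The whole combined input is built and sorted before any row is emitted.
    combined = sorted(list(a) + 2 * list(b))
    # Values from b occur at least twice, so only a-only values survive.
    return [key for key, run in groupby(combined) if len(list(run)) == 1]
```

The result is still the set difference, but the sort touches every row of both inputs, which is where the extra cost comes from when the data does not fit in memory.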

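The columns involved have btree indexes. The only column selected by the query is the same column that is indexed and being merged on. The locale is C everywhere.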

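Previously I used a lot of text files and some custom indexing tools. I'm trying out a database to gain extra query flexibility and to avoid maintaining custom indexes. However, the performance is so appalling that I'm considering doing merges and most other large-scale update operations outside the database, round-tripping the data through csv.

What am I missing?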

poz*_*ozs 5

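Initial thoughts:

  • A plain EXCEPT means EXCEPT DISTINCT, which means it eliminates duplicate rows from the result. Use EXCEPT ALL if you can; it should be faster.
  • If you have other options, don't use combining queries (UNION, INTERSECT, EXCEPT); they are known to be slow.
  • Judging from your EXPLAIN, it looks like an ordering is also applied, which takes more time too (especially with combining queries).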
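To illustrate the first point, the semantics of the two variants can be sketched in Python. This models only what each operator returns, not how PostgreSQL executes it, and the function names are hypothetical.

```python
from collections import Counter


def except_distinct(left, right):
    """EXCEPT: rows of `left` absent from `right`, each reported once."""
    return sorted(set(left) - set(right))


def except_all(left, right):
    """EXCEPT ALL: a row with m copies in `left` and n copies in `right`
    appears max(m - n, 0) times in the result."""
    return sorted((Counter(left) - Counter(right)).elements())


if __name__ == "__main__":
    left = ["x", "x", "x", "y", "z"]
    right = ["x", "z"]
    print(except_distinct(left, right))  # prints ['y']
    print(except_all(left, right))       # prints ['x', 'x', 'y']
```

When the left-hand column is a primary key, as in the question, there are no duplicates to eliminate, so both variants return the same rows.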

My results on 9.2:

EXCEPT

explain select id from table_a except (select id from table_b);

Result:

HashSetOp Except  (cost=0.00..947.00 rows=20000 width=5)
  ->  Append  (cost=0.00..872.00 rows=30000 width=5)
        ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..563.00 rows=20000 width=5)
              ->  Seq Scan on table_a  (cost=0.00..363.00 rows=20000 width=5)
        ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..309.00 rows=10000 width=4)
              ->  Seq Scan on table_b  (cost=0.00..209.00 rows=10000 width=4)

EXCEPT & ORDER BY

explain select id from table_a except (select id from table_b) order by id;

Result:

Sort  (cost=2375.77..2425.77 rows=20000 width=5)
  Sort Key: "*SELECT* 1".id
  ->  HashSetOp Except  (cost=0.00..947.00 rows=20000 width=5)
        ->  Append  (cost=0.00..872.00 rows=30000 width=5)
              ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..563.00 rows=20000 width=5)
                    ->  Seq Scan on table_a  (cost=0.00..363.00 rows=20000 width=5)
              ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..309.00 rows=10000 width=4)
                    ->  Seq Scan on table_b  (cost=0.00..209.00 rows=10000 width=4)

JOIN & ORDER BY

explain select table_a.id from table_a
left outer join table_b on table_a.id = table_b.id
where table_b.id is null order by table_a.id;

explain select id from table_a
where not exists (select * from table_b where table_b.id = table_a.id) order by id;

Result (the same for both queries):

Merge Anti Join  (cost=0.57..1213.57 rows=10000 width=5)
  Merge Cond: ((table_a.id)::text = (table_b.id)::text)
  ->  Index Only Scan using table_a_pkey on table_a  (cost=0.29..688.29 rows=20000 width=5)
  ->  Index Only Scan using table_b_pkey on table_b  (cost=0.29..350.29 rows=10000 width=4)

NOT IN & ORDER BY

explain select id from table_a where id not in (select id from table_b) order by id;

Result (my winner):

Seq Scan on table_a  (cost=234.00..647.00 rows=10000 width=5)
  Filter: (NOT (hashed SubPlan 1))
  SubPlan 1
    ->  Seq Scan on table_b  (cost=0.00..209.00 rows=10000 width=4)

Setup used:

create table table_a(id varchar primary key, rnd float default random());
create table table_b(id varchar primary key, rnd float default random());

do language plpgsql $$
begin
    for i in 1 .. 10000 loop
        insert into table_a(id) values (i);
        insert into table_b(id) values (i);
    end loop;
    for i in 10001 .. 20000 loop
        insert into table_a(id) values (i);
    end loop;
end;
$$;