使用 SIMILAR TO 和许多前缀优化查询

Fon*_*exn 5 postgresql performance index postgresql-performance

我在 PostgreSQL 9.6中有一个查询需要很长时间才能运行:

SELECT DISTINCT ON (gc.number)
       gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc, client_hierarchy ch
WHERE gc.code = ch.client_id
AND gc.number in (SELECT NUMB FROM directory
                  WHERE ACTIVE = TRUE
                  AND NUMB SIMILAR TO '(ABTR|GREW|POEW)%');
Run Code Online (Sandbox Code Playgroud)

SIMILAR TO接收许多不同NUMB的作为参数。ABTR|GREW|POEW大约 3.500 多。

表定义:

CREATE TABLE directory (
    id BIGINT PRIMARY KEY NOT NULL,
    active BOOLEAN NOT NULL,
    numb VARCHAR(8),
    branch VARCHAR(20),
    city VARCHAR(50),
    modified_date TIMESTAMP NOT NULL,
    name VARCHAR(200)
);
CREATE INDEX dir_numb_index ON directory (numb);
CREATE INDEX dir_branch_index ON bank_directory (branch);
CREATE INDEX dir_city_index ON bank_directory (city);
CREATE INDEX dir_name_index ON bank_directory (name);

CREATE TABLE client (
    id BIGINT PRIMARY KEY NOT NULL,
    active BOOLEAN NOT NULL,
    number VARCHAR(8),
    code VARCHAR(9) NOT NULL,
    modified_date TIMESTAMP NOT NULL,
    name VARCHAR(200),
);
CREATE INDEX client_number_index ON client (number);
CREATE INDEX client_code_index ON client (code);
CREATE INDEX client_name_index ON client (name);

CREATE TABLE client_hierarchy (
    id BIGINT PRIMARY KEY NOT NULL,
    active BOOLEAN NOT NULL,
    client_id VARCHAR(9),
    client_parent_id VARCHAR(9),
    modified_date TIMESTAMP NOT NULL
);
CREATE INDEX client_id_hierarchy_index ON client_hierarchy (client_id);
CREATE INDEX client_parent_id_hierarchy_index ON client_hierarchy (client_parent_id);
Run Code Online (Sandbox Code Playgroud)

client有 5K 行和client_hierarchy900K 行。

的输出EXPLAIN (ANALYZE, BUFFERS)

Unique  (cost=21869.20..21890.98 rows=3760 width=23) (actual time=3117.734..3118.110 rows=707 loops=1)
  Buffers: shared hit=7786
  ->  Sort  (cost=21869.20..21880.09 rows=4357 width=23) (actual time=3117.733..3117.869 rows=998 loops=1)
        Sort Key: gc.number
        Sort Method: quicksort  Memory: 102kB
        Buffers: shared hit=7786
        ->  Hash Semi Join  (cost=1179.86..21605.84 rows=4357 width=23) (actual time=2825.705..3115.586 rows=998 loops=1)
              Hash Cond: ((gc.number)::text = (directory.numb)::text)
              Buffers: shared hit=7786
              ->  Hash Join  (cost=158.03..20524.10 rows=4357 width=23) (actual time=2.101..311.200 rows=3777 loops=1)
                    Hash Cond: ((ch.client_id)::text = (gc.code)::text)
                    Buffers: shared hit=7280
                    ->  Seq Scan on client_hierarchy ch  (cost=0.00..15955.00 rows=873500 width=10) (actual time=0.008..161.576 rows=873349 loops=1)
                          Filter: active
                          Buffers: shared hit=7220
                    ->  Hash  (cost=103.57..103.57 rows=4357 width=13) (actual time=1.969..1.969 rows=4391 loops=1)
                          Buckets: 8192  Batches: 1  Memory Usage: 262kB
                          Buffers: shared hit=60
                          ->  Seq Scan on gbo_client gc  (cost=0.00..103.57 rows=4357 width=13) (actual time=0.004..1.016 rows=4391 loops=1)
                                Filter: active
                                Buffers: shared hit=60
              ->  Hash  (cost=953.95..953.95 rows=5430 width=9) (actual time=2803.112..2803.112 rows=5563 loops=1)
                    Buckets: 8192  Batches: 1  Memory Usage: 287kB
                    Buffers: shared hit=506
                    ->  Seq Scan on directory  (cost=0.00..953.95 rows=5430 width=9) (actual time=1.353..2800.528 rows=5563 loops=1)
                          Filter: (active AND ((numb)::text ~ '^(?:(?:AACM|AACO).*)$'::text))
                          Rows Removed by Filter: 30318
                          Buffers: shared hit=506
Planning time: 8.529 ms
Execution time: 3118.224 ms
Run Code Online (Sandbox Code Playgroud)

不需要保证顺序,只需要删除重复的行 DISTINCT ON,因为是同一个客户端......它们是脏数据。

有没有可能优化这个?

解决方案

根据这里的反馈,我的新查询:

SELECT client.numb, client.code, ch.client_id, ch.client_parent_id
 FROM (
      SELECT DISTINCT ON (gc.number) gc.number, gc.code
      FROM client gc
      WHERE EXISTS (
                    SELECT *
                    FROM client_hierarchy ch
                    WHERE ch.client_id = gc.code
                    )
      AND EXISTS (
                    SELECT *
                    FROM bank_directory
                    WHERE active AND numb = gc.number
      )
      AND numb ~ '^(?:AACM|AACO)'
      ) as client
 INNER JOIN client_hierarchy ch ON client.code = ch.client_id;
Run Code Online (Sandbox Code Playgroud)

EXPLAIN (ANALYZE, BUFFERS)

Nested Loop  (cost=11080.16..20448.74 rows=1251 width=23) (actual time=368.303..373.574 rows=707 loops=1)
  Buffers: shared hit=8236
  ->  Unique  (cost=11079.73..11086.15 rows=1251 width=13) (actual time=368.286..368.684 rows=707 loops=1)
        Buffers: shared hit=5402
        ->  Sort  (cost=11079.73..11082.94 rows=1284 width=13) (actual time=368.285..368.410 rows=998 loops=1)
              Sort Key: gc.number
              Sort Method: quicksort  Memory: 71kB
              Buffers: shared hit=5402
              ->  Hash Semi Join  (cost=1312.74..11013.44 rows=1284 width=13) (actual time=19.826..366.642 rows=998 loops=1)
                    Hash Cond: ((gc.number)::text = (directory.numb)::text)
                    Buffers: shared hit=5402
                    ->  Nested Loop Semi Join  (cost=0.42..9683.47 rows=1284 width=13) (actual time=0.987..346.895 rows=1070 loops=1)
                          Buffers: shared hit=4896
                          ->  Seq Scan on client gc  (cost=0.00..114.46 rows=1284 width=13) (actual time=0.857..327.979 rows=1252 loops=1)
                                Filter: ((numb)::text ~ '^(?:AACM|AACO)'::text)
                                Rows Removed by Filter: 3139
                                Buffers: shared hit=60
                          ->  Index Only Scan using client_id_hierarchy_index on client_hierarchy ch_1  (cost=0.42..7.44 rows=1 width=5) (actual time=0.014..0.014 rows=1 loops=1252)
                                Index Cond: (client_id = (gc.code)::text)
                                Heap Fetches: 1070
                                Buffers: shared hit=4836
                    ->  Hash  (cost=864.36..864.36 rows=35836 width=9) (actual time=18.790..18.790 rows=35881 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 1949kB
                          Buffers: shared hit=506
                          ->  Seq Scan on directory  (cost=0.00..864.36 rows=35836 width=9) (actual time=0.010..11.200 rows=35881 loops=1)
                                Filter: active
                                Buffers: shared hit=506
  ->  Index Scan using client_id_hierarchy_index on client_hierarchy ch  (cost=0.42..7.46 rows=1 width=10) (actual time=0.006..0.006 rows=1 loops=707)
        Index Cond: ((client_id)::text = (gc.code)::text)
        Buffers: shared hit=2834
Planning time: 63.201 ms
Execution time: 373.721 ms
Run Code Online (Sandbox Code Playgroud)

现在我在 300-400 毫秒内得到结果。从 3.5 秒下降。

Eva*_*oll 1

我们实际上没有查询的选择性和成本来优化它。您需要EXPLAIN (ANALYZE, BUFFERS)像评论中提到的那样。说了这么多,有几点..

  • WHERE ACTIVE = TRUE正在触发 seq 扫描(WHERE active为了清楚起见,您也可以这样说)。您可以在活动上添加索引,这可能会产生更快的结果。
  • 您可以重写查询,根据内部查询的返回EXISTS方式,使用哪个查询可能会更快。numb
  • 根据 的域numb, 可能适合将 变成enum更快。
  • 你打电话DISTINCT ON。您可能会发现在这里进行两个现有测试会更快。

像这样

SELECT gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc
WHERE EXISTS (
  SELECT *
  FROM client_hierarchy ch
  WHERE ch.client_id = gc.code
)
AND EXISTS (
  SELECT *
  FROM directory
  WHERE active
    AND numb = gc.number
    AND numb SIMILAR TO '(ABTR|GREW|POEW)%'
);
Run Code Online (Sandbox Code Playgroud)

而且,这样写就很明显,这很尴尬,可以推下来。

    AND numb = gc.number
    AND numb SIMILAR TO '(ABTR|GREW|POEW)%'
Run Code Online (Sandbox Code Playgroud)

进入这个..

SELECT gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc
WHERE EXISTS (
  SELECT *
  FROM client_hierarchy ch
  WHERE ch.client_id = gc.code
)
AND EXISTS (
  SELECT *
  FROM directory
  WHERE active
    AND numb = gc.number
)
AND number SIMILAR TO '(ABTR|GREW|POEW)%';
Run Code Online (Sandbox Code Playgroud)

最后SIMILAR TO可能永远不应该使用。有关更多信息,请参阅 Erwin 的这篇精彩文章