Postgres LIKE query using a GiST index is as slow as a full scan

Question

Har*_*rry 4 postgresql performance gist-index postgresql-9.5 query-performance

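I have a very simple database that stores the path, extension, and name of files from a UNC share. For testing, I inserted about 1.5 million rows. The query below uses the GiST index but still takes 5 seconds to return. I expected a few milliseconds (around 100).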

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE 'somevalue'


When I use %% in the pattern, the query does not take any longer, even though it uses a sequential scan (?!)

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  parentpath LIKE '%a%'


I have the same setup for the name (filename) column. A similar query on that column takes only half the time, even with %%:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles  WHERE  name LIKE '%a%'


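What I have already tried is too much to describe briefly here. Whatever I do, it gets slow starting at around 1 million rows. Since basically nothing is ever deleted, VACUUM and REINDEX of course do not help at all. I cannot really use any kind of search other than LIKE %% with a GIN or GiST index, because I need to be able to find any characters in the columns of interest, not just "words of a specific human language".

Is my expectation wrong that this should run in about 100 ms, even with several million rows?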

More info

Database creation script

Retried without any text or other indexes, 1.7 million unique entries:

EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Seq Scan on residentfiles  (cost=0.00..78162.76 rows=33225 width=232) (actual time=0.076..3195.965 rows=45301 loops=1)
  Filter: ((name)::text ~~ '%12345%'::text)
  Rows Removed by Filter: 1604780
Planning time: 0.596 ms
Execution time: 3318.595 ms


Tried with trigram GIN indexes:

CREATE INDEX IF NOT EXISTS  name_gin_idx ON residentfiles USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS  parentpath_gin_idx ON residentfiles USING gin (parentpath gin_trgm_ops);
CREATE INDEX IF NOT EXISTS  ext_gin_idx ON residentfiles USING gin (extension gin_trgm_ops);
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'

Aggregate  (cost=53717.59..53717.60 rows=1 width=0) (actual time=1694.223..1694.224 rows=1 loops=1)
  ->  Bitmap Heap Scan on residentfiles  (cost=341.89..53631.82 rows=34308 width=0) (actual time=72.010..1615.007 rows=46532 loops=1)
        Recheck Cond: ((name)::text ~~ '%12345%'::text)
        Rows Removed by Index Recheck: 111
        Heap Blocks: exact=46372
        ->  Bitmap Index Scan on name_gin_idx  (cost=0.00..333.31 rows=34308 width=0) (actual time=52.287..52.287 rows=46643 loops=1)
              Index Cond: ((name)::text ~~ '%12345%'::text)
Planning time: 10.881 ms
Execution time: 1694.755 ms


Tried with varchar_pattern_ops:

CREATE INDEX idx_varchar_pattern_parentpath ON residentfiles (parentpath varchar_pattern_ops);
CREATE INDEX idx_varchar_pattern_name ON residentfiles (name varchar_pattern_ops);
CREATE INDEX idx_varchar_pattern_extension ON residentfiles (extension varchar_pattern_ops);
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'

Aggregate  (cost=89718.74..89718.75 rows=1 width=0) (actual time=1995.206..1995.207 rows=1 loops=1)
  ->  Seq Scan on residentfiles  (cost=0.00..89574.98 rows=57507 width=0) (actual time=0.060..1913.114 rows=52232 loops=1)
        Filter: ((name)::text ~~ '%12345%'::text)
        Rows Removed by Filter: 1852103
Planning time: 8.280 ms
Execution time: 1995.255 ms


Answer 1

小智 5

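In my opinion, it is hard to give you an answer unless you share how you ran your tests. Let me show an example of what I mean. Sorry, I used Postgres 11, but the conclusions are the same.

This is a fresh database; nothing else has been run against the instance: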

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEX

Insert 500000 rows using a very simple loop:

test=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
 count
--------
 500000
(1 row)

test=# select * from test_trgmidx limit 20;
col1        |    col2
------------+------------
 ABCD1EFGH  | abcd1efgh
 ABCD2EFGH  | abcd2efgh
 ABCD3EFGH  | abcd3efgh
 ABCD4EFGH  | abcd4efgh
 ABCD5EFGH  | abcd5efgh
 ABCD6EFGH  | abcd6efgh
 ABCD7EFGH  | abcd7efgh
 ABCD8EFGH  | abcd8efgh
 ABCD9EFGH  | abcd9efgh
 ABCD10EFGH | abcd10efgh
 ABCD11EFGH | abcd11efgh
 ABCD12EFGH | abcd12efgh
 ABCD13EFGH | abcd13efgh
 ABCD14EFGH | abcd14efgh
 ABCD15EFGH | abcd15efgh
 ABCD16EFGH | abcd16efgh
 ABCD17EFGH | abcd17efgh
 ABCD18EFGH | abcd18efgh
 ABCD19EFGH | abcd19efgh
 ABCD20EFGH | abcd20efgh
(20 rows)
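The contents of loop_long.sql are not shown here. As a rough cross-check, the Python sketch below assumes the loop inserted i = 1..500000 as 'ABCD' || i || 'EFGH' and 'abcd' || i || 'efgh' (an assumption inferred from the sample rows) and counts how many generated col2 values match the two LIKE patterns used in the plans further down. The helper names like_to_regex, generate_rows and count_matches are made up for this illustration.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (only % and _, no escapes) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def generate_rows(n: int = 500_000):
    """Yield (col1, col2) pairs in the assumed shape of the test data."""
    for i in range(1, n + 1):
        yield f"ABCD{i}EFGH", f"abcd{i}efgh"


def count_matches(pattern: str, n: int = 500_000) -> int:
    """Count generated col2 values that match the LIKE pattern as a whole."""
    rx = like_to_regex(pattern)
    return sum(1 for _, col2 in generate_rows(n) if rx.fullmatch(col2))


if __name__ == "__main__":
    total = 500_000
    wide = count_matches("%d2%", total)
    exact = count_matches("abcd345678efgh", total)
    # matching rows, rows filtered out, rows for the wildcard-free pattern
    print(wide, total - wide, exact)
```

Under that assumption the counts come out as 111111 matching rows and 388889 filtered rows for '%d2%', and a single row for 'abcd345678efgh', which agrees with the row counts reported in the plans below. Roughly 22% of the table matches '%d2%', so that query is not selective at all.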

Now I restart the instance to get a clean buffer cache, then run EXPLAIN twice for the first select to see how the cache "interferes" with our results:

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
   Output: col1, col2
   Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
   Buffers: shared hit=19 read=237
 Planning Time: 0.303 ms
 Execution Time: 4.934 ms
(6 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
   Output: col1, col2
   Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
   Buffers: shared hit=232
 Planning Time: 0.072 ms
 Execution Time: 2.317 ms
(6 rows)
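To compare the Buffers lines of two runs without eyeballing them, a small helper can turn them into numbers. This is only a sketch: parse_buffers and summarize are hypothetical names, only the "shared" counters are read, and the 8192-byte block size is PostgreSQL's default and is assumed here rather than checked against this instance.

```python
import re

# Assumption: the server uses PostgreSQL's default block size of 8 KiB.
BLOCK_SIZE = 8192


def parse_buffers(line: str) -> dict:
    """Extract the shared hit/read counters from an EXPLAIN 'Buffers:' line."""
    counters = {"hit": 0, "read": 0}
    shared = re.search(r"shared((?:\s+\w+=\d+)+)", line)
    if shared:
        for key, value in re.findall(r"(\w+)=(\d+)", shared.group(1)):
            if key in counters:  # ignore dirtied/written counters
                counters[key] = int(value)
    return counters


def summarize(line: str) -> dict:
    """Return page count, cache hit ratio and KiB fetched from outside shared buffers."""
    c = parse_buffers(line)
    total = c["hit"] + c["read"]
    return {
        "pages": total,
        "hit_ratio": c["hit"] / total if total else 0.0,
        "read_kib": c["read"] * BLOCK_SIZE // 1024,
    }


if __name__ == "__main__":
    for line in ("Buffers: shared hit=19 read=237", "Buffers: shared hit=232"):
        print(line, "->", summarize(line))
```

Keep in mind that "read" only means the page was not in shared buffers. It may still have been served from the operating system cache rather than from disk.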

Clearly, the first run had to fetch the rows from disk (read=237), while the second only needed the buffer cache (shared hit=232, no reads). Now let's do the same for the second select: restart the instance and run EXPLAIN twice:

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared read=3676
 Planning Time: 0.719 ms
 Execution Time: 94.942 ms
(7 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=3676
 Planning Time: 0.081 ms
 Execution Time: 65.878 ms
(7 rows)

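As you can see, the first run reads from disk and the second from the buffer cache. Things get even more complicated because of the operating system cache. You can clear the OS cache and re-run everything, and you get different results again: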

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      929920     4105056       93960     2879628     6748994
Swap:       4063228           0     4063228

# echo 3 > /proc/sys/vm/drop_caches

# free
              total        used        free      shared  buff/cache   available
Mem:        7914604      802204     6846392       93960      266008     6951156
Swap:       4063228           0     4063228
#

Look at the buff/cache column: it dropped from 2879628 to 266008. Now run the EXPLAINs again (each select twice):

postgres=# \c test
You are now connected to database "test" as user "postgres".
test=#  explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=130.858..140.403 rows=1 loops=1)
   Output: col1, col2
   Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
   Buffers: shared hit=19 read=237
 Planning Time: 38.448 ms
 Execution Time: 140.466 ms
(6 rows)

test=#  explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using trgm_idx_test_col2 on public.test_trgmidx  (cost=0.29..8.30 rows=1 width=28) (actual time=4.386..4.759 rows=1 loops=1)
   Output: col1, col2
   Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
   Buffers: shared hit=232
 Planning Time: 0.115 ms
 Execution Time: 4.787 ms
(6 rows)

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=9.214..161.243 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=1 read=3675
 Planning Time: 0.090 ms
 Execution Time: 165.354 ms
(7 rows)

test=#  explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.test_trgmidx  (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.013..62.043 rows=111111 loops=1)
   Output: col1, col2
   Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
   Rows Removed by Filter: 388889
   Buffers: shared hit=3676
 Planning Time: 0.078 ms
 Execution Time: 66.101 ms
(7 rows)


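You can see how different the statistics are now: the first run reads from disk, the second from the buffer cache. All of this is just to say that, in my opinion and experience, without full information about how you ran each EXPLAIN it is almost impossible to get a clear picture of what is happening in your environment. Even with all the information it is sometimes hard to find the answer, because of all the variables in play. My 2 cents.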
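A side note on the '%d2%' plans above, which stayed on a Seq Scan even though the trigram index exists. A trigram index can only narrow a LIKE search when the pattern contains at least one extractable trigram; the pg_trgm documentation notes that a pattern with no extractable trigrams degenerates to a full-index scan. The Python sketch below is a simplified approximation of that extraction, not pg_trgm's actual code: it treats % and _ as wildcards, ignores backslash escapes, lowercases the pattern, and pads a word with two blanks on the left and one on the right only where it is not adjacent to a wildcard. The function name like_pattern_trigrams is made up for this illustration.

```python
WILDCARDS = {"%", "_"}


def like_pattern_trigrams(pattern: str) -> set:
    """Approximate the set of trigrams extractable from a LIKE pattern."""
    trigrams = set()
    word = ""
    after_wildcard = False  # True if the previous character was a wildcard

    def flush(ended_by_wildcard: bool) -> None:
        """Emit the trigrams of the current word and reset it."""
        nonlocal word
        if word:
            # A word that ends at a boundary (not a wildcard) gets one blank.
            text = word if ended_by_wildcard else word + " "
            trigrams.update(text[i:i + 3] for i in range(len(text) - 2))
        word = ""

    for ch in pattern.lower():
        if ch.isalnum():
            if not word and not after_wildcard:
                word = "  "  # word starts at a boundary: pad with two blanks
            word += ch
            after_wildcard = False
        else:
            flush(ended_by_wildcard=ch in WILDCARDS)
            after_wildcard = ch in WILDCARDS
    flush(ended_by_wildcard=False)  # the end of the pattern is a boundary
    return trigrams


if __name__ == "__main__":
    for p in ("%a%", "%d2%", "%12345%", "abcd345678efgh"):
        print(repr(p), sorted(like_pattern_trigrams(p)))
```

In this approximation '%a%' and '%d2%' yield no trigrams at all, so an index has nothing to filter on, while '%12345%' yields 123, 234 and 345. Even when trigrams are available, the planner may still prefer a sequential scan for a pattern that matches a large share of the table.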
