Har*_*rry 4 postgresql performance gist-index postgresql-9.5 query-performance
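What I have is a very simple database that stores the path, extension and name of files from UNC shares. For testing I inserted about 1.5 million rows. The query below uses a GiST index but still takes 5 seconds to return. I expected a few (say 100) milliseconds.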
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE parentpath LIKE 'somevalue'
When using %% in the query, it does not take as long, even though (?!) it uses a sequential scan:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE parentpath LIKE '%a%'
I have the same setup for the name (filename) column. A similar query on that column takes only half the time, even when using %%:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE name LIKE '%a%'
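Everything I have already tried is too much to list here briefly. Whatever I do, it gets slow starting at about 1 million rows. Since basically nothing is ever deleted, vacuuming and reindexing of course do not help at all. I really cannot use any kind of search other than LIKE %% with a GIN or GiST index, because I need to be able to find any characters in the columns of interest, not just "words of a specific human language".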
Is my expectation that this should run in around 100 milliseconds, even with several million rows, wrong?
Retried without any text or other indexes, 1.7 million unique entries:
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Seq Scan on residentfiles (cost=0.00..78162.76 rows=33225 width=232) (actual time=0.076..3195.965 rows=45301 loops=1)
Filter: ((name)::text ~~ '%12345%'::text)
Rows Removed by Filter: 1604780+
Planning time: 0.596 ms
Execution time: 3318.595 ms
Tried with trigram GIN indexes:
CREATE INDEX IF NOT EXISTS name_gin_idx ON residentfiles USING gin (name gin_trgm_ops);
CREATE INDEX IF NOT EXISTS parentpath_gin_idx ON residentfiles USING gin (parentpath gin_trgm_ops);
CREATE INDEX IF NOT EXISTS ext_gin_idx ON residentfiles USING gin (extension gin_trgm_ops);
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Aggregate (cost=53717.59..53717.60 rows=1 width=0) (actual time=1694.223..1694.224 rows=1 loops=1)
-> Bitmap Heap Scan on residentfiles (cost=341.89..53631.82 rows=34308 width=0) (actual time=72.010..1615.007 rows=46532 loops=1)
Recheck Cond: ((name)::text ~~ '%12345%'::text)
Rows Removed by Index Recheck: 111
Heap Blocks: exact=46372
-> Bitmap Index Scan on name_gin_idx (cost=0.00..333.31 rows=34308 width=0) (actual time=52.287..52.287 rows=46643 loops=1)
Index Cond: ((name)::text ~~ '%12345%'::text)
Planning time: 10.881 ms
Execution time: 1694.755 ms
Tried with varchar_pattern_ops:
CREATE INDEX idx_varchar_pattern_parentpath ON residentfiles (parentpath varchar_pattern_ops);
CREATE INDEX idx_varchar_pattern_name ON residentfiles (name varchar_pattern_ops);
CREATE INDEX idx_varchar_pattern_extension ON residentfiles (extension varchar_pattern_ops);
EXPLAIN ANALYZE select * from residentfiles where name like '%12345%'
Aggregate (cost=89718.74..89718.75 rows=1 width=0) (actual time=1995.206..1995.207 rows=1 loops=1)
-> Seq Scan on residentfiles (cost=0.00..89574.98 rows=57507 width=0) (actual time=0.060..1913.114 rows=52232 loops=1)
Filter: ((name)::text ~~ '%12345%'::text)
Rows Removed by Filter: 1852103
Planning time: 8.280 ms
Execution time: 1995.255 ms
小智 5
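In my opinion it is hard to give you an answer if you do not share how you ran your tests. Let me show an example of what I mean. Sorry, I used Postgres 11, but the conclusions are the same: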
This is a fresh database, and nothing has been run against the instance yet:
test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50));
CREATE TABLE
test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING gist (col2 gist_trgm_ops);
CREATE INDEX
Insert 500000 rows with a very simple loop:
test=# \i loop_long.sql
DO
test=# select count(1) from test_trgmidx;
count
--------
500000
(1 row)
test=# select * from test_trgmidx limit 20;
col1 | col2
------------+------------
ABCD1EFGH | abcd1efgh
ABCD2EFGH | abcd2efgh
ABCD3EFGH | abcd3efgh
ABCD4EFGH | abcd4efgh
ABCD5EFGH | abcd5efgh
ABCD6EFGH | abcd6efgh
ABCD7EFGH | abcd7efgh
ABCD8EFGH | abcd8efgh
ABCD9EFGH | abcd9efgh
ABCD10EFGH | abcd10efgh
ABCD11EFGH | abcd11efgh
ABCD12EFGH | abcd12efgh
ABCD13EFGH | abcd13efgh
ABCD14EFGH | abcd14efgh
ABCD15EFGH | abcd15efgh
ABCD16EFGH | abcd16efgh
ABCD17EFGH | abcd17efgh
ABCD18EFGH | abcd18efgh
ABCD19EFGH | abcd19efgh
ABCD20EFGH | abcd20efgh
(20 rows)
Now I restart the instance to get a clean buffer cache, then run EXPLAIN twice on the first select to see how the cache "interferes" with our results:
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=19 read=237
Planning Time: 0.303 ms
Execution Time: 4.934 ms
(6 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.072 ms
Execution Time: 2.317 ms
(6 rows)
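Clearly, the first run had to fetch pages from disk (read=237), while the second only needed the buffer cache (shared hit=232, no reads). Now let's do the same for the second select: restart the instance and run EXPLAIN twice: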
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared read=3676
Planning Time: 0.719 ms
Execution Time: 94.942 ms
(7 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared hit=3676
Planning Time: 0.081 ms
Execution Time: 65.878 ms
(7 rows)
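To make the comparison less eyeball-driven, the `Buffers:` lines from the four runs above can be reduced to a hit ratio. This is only a rough Python sketch, not something PostgreSQL provides: the function names are made up for illustration, and it only looks at the `shared hit` and `shared read` counters that appear in these plans.

```python
import re

# Buffers lines copied verbatim from the four EXPLAIN outputs above.
BUFFER_LINES = {
    "exact match, run 1": "Buffers: shared hit=19 read=237",
    "exact match, run 2": "Buffers: shared hit=232",
    "'%d2%', run 1": "Buffers: shared read=3676",
    "'%d2%', run 2": "Buffers: shared hit=3676",
}


def shared_buffer_counts(line):
    """Return (hit, read) page counts from the 'shared' part of a Buffers line.

    Hypothetical helper: counters other than hit/read (dirtied, written)
    and the local/temp sections are ignored.
    """
    counts = {"hit": 0, "read": 0}
    match = re.search(r"shared((?:\s+\w+=\d+)+)", line)
    if match:
        for key, value in re.findall(r"(\w+)=(\d+)", match.group(1)):
            if key in counts:
                counts[key] = int(value)
    return counts["hit"], counts["read"]


def hit_ratio(line):
    """Fraction of shared-buffer page requests that were already cached."""
    hit, read = shared_buffer_counts(line)
    total = hit + read
    return hit / total if total else 0.0


for label, line in BUFFER_LINES.items():
    hit, read = shared_buffer_counts(line)
    print(f"{label:20s} hit={hit:5d} read={read:5d} ratio={hit_ratio(line):.1%}")
```

A ratio near 0% on the first run and 100% on the second is the pattern to look for when deciding whether two timings are comparable at all.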
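As you can see, the first run now reads from disk and the second from the buffers. Things get more complicated because of the OS buffers. It is possible to clear the OS cache, re-run everything, and get different results: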
# free
total used free shared buff/cache available
Mem: 7914604 929920 4105056 93960 2879628 6748994
Swap: 4063228 0 4063228
# echo 3 > /proc/sys/vm/drop_caches
# free
total used free shared buff/cache available
Mem: 7914604 802204 6846392 93960 266008 6951156
Swap: 4063228 0 4063228
#
Look at the buff/cache column: it dropped from 2879628 to 266008. Now run EXPLAIN again (each select twice):
postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=130.858..140.403 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=19 read=237
Planning Time: 38.448 ms
Execution Time: 140.466 ms
(6 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=4.386..4.759 rows=1 loops=1)
Output: col1, col2
Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text)
Buffers: shared hit=232
Planning Time: 0.115 ms
Execution Time: 4.787 ms
(6 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=9.214..161.243 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared hit=1 read=3675
Planning Time: 0.090 ms
Execution Time: 165.354 ms
(7 rows)
test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.013..62.043 rows=111111 loops=1)
Output: col1, col2
Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text)
Rows Removed by Filter: 388889
Buffers: shared hit=3676
Planning Time: 0.078 ms
Execution Time: 66.101 ms
(7 rows)
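You can see how different the statistics are now: the first run reads from disk, the second from the buffer cache. All of this is just to say that, in my opinion and experience, without full information about how you ran each EXPLAIN it is almost impossible to get a clear picture of what is happening in your environment. Even with all the information it is sometimes hard to find an answer, because of all the variables in play. My 2 cents.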
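One thing in the plans above does not depend on caching: `LIKE '%d2%'` ends up as a Seq Scan even though a trigram index exists. As far as I understand pg_trgm, the index can only narrow the search when the pattern contains at least one extractable three-character sequence, and `d2` is too short (the pattern also matches 111111 of the 500000 rows, so a scan is cheap by comparison). The Python sketch below is my own simplified approximation of that extraction, not pg_trgm's actual code: the function name is hypothetical, and it ignores escape characters and multibyte details.

```python
def like_pattern_trigrams(pattern):
    """Approximate the trigrams a pg_trgm index could use for a LIKE pattern.

    Assumptions (simplified model): alphanumeric runs are words, '%' and '_'
    are wildcards, a word edge that does not touch a wildcard is padded
    (two blanks on the left, one on the right), and text is lowercased.
    """
    trigrams = set()
    n = len(pattern)
    i = 0
    while i < n:
        if not pattern[i].isalnum():
            i += 1
            continue
        start = i
        while i < n and pattern[i].isalnum():
            i += 1
        word = pattern[start:i].lower()
        left_wild = start > 0 and pattern[start - 1] in "%_"
        right_wild = i < n and pattern[i] in "%_"
        padded = ("" if left_wild else "  ") + word + ("" if right_wild else " ")
        # Every 3-character window of the padded word is one trigram.
        for j in range(len(padded) - 2):
            trigrams.add(padded[j:j + 3])
    return trigrams


for pattern in ["%a%", "%d2%", "%12345%", "abcd345678efgh"]:
    found = sorted(like_pattern_trigrams(pattern))
    print(f"{pattern!r}: {len(found)} trigram(s) {found}")
```

Under this model `'%a%'` and `'%d2%'` yield no trigrams at all, while `'%12345%'` yields three and the exact-match pattern yields many, which matches which queries used the trigram index in the plans quoted in this thread.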