How can I improve or speed up a Postgres query using pg_trgm?

Question

Dmi*_*ich 6 postgresql postgresql-performance pg-trgm

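Are there any other steps I can take to speed up query execution?

I have a table with more than 100m rows and I need to search it for matching strings. For that I evaluated two options:

1. Comparing text with to_tsvector @@ (to_tsquery or plainto_tsquery). This works very fast (under 1 second on all data), but it has some problems finding similar text.
2. Comparing text with pg_trgm similarity. This works well for text comparison, but performs poorly on large amounts of data.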
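As a rough illustration of how the two options behave differently, here is a minimal sketch on literal strings. The sample strings and the 'simple' text search configuration are assumptions made for the example, not taken from the real data.

```sql
-- Assumes the pg_trgm extension can be installed; all strings are made-up samples.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Option 1: full text search compares whole lexemes, so word order does not matter.
SELECT to_tsvector('simple', 'MSI GF63 Thin 10SC')
       @@ plainto_tsquery('simple', 'gf63 msi') AS fts_match;

-- A typo inside a word ('mssi') produces a lexeme that is not in the document.
SELECT to_tsvector('simple', 'MSI GF63 Thin 10SC')
       @@ plainto_tsquery('simple', 'gf63 mssi') AS fts_match_with_typo;

-- Option 2: trigram similarity scores the overlap of three-character sequences,
-- and % tests that score against pg_trgm.similarity_threshold.
SELECT similarity('MSI GF63 Thin 10SC', 'MSSI GF63 Thin')     AS sm
     , 'MSI GF63 Thin 10SC'::text % 'MSSI GF63 Thin'::text    AS above_threshold;
```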

I found that I can use indexes to improve performance. For my GiST index I tried increasing siglen from small values up to 2024, but for some reason Postgres uses 512 and nothing higher.

CREATE INDEX trgm_idx_512_gg ON table USING GIST (name gist_trgm_ops(siglen=512));

Query:

SELECT name, similarity(name, 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086') as sm
FROM table
WHERE name % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'

EXPLAIN output:

Bitmap Heap Scan on table (cost=1632.01..40051.57 rows=9737 width=126)
  Recheck Cond: ((name)::text % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'::text)
  ->  Bitmap Index Scan on trgm_idx_512_gg  (cost=0.00..1629.57 rows=9737 width=0)
        Index Cond: ((name)::text % 'ноутбук MSI GF63 Thin 10SC 086XKR 9S7 16R512 086'::text)

Execution time is about 120 seconds.

Question

How can I improve or speed up the query? Maybe I need a different approach, or just need to add something else?

Output of EXPLAIN (ANALYZE, BUFFERS) (searching for a different name, so that the search is fresh and not served from cache):

Bitmap Heap Scan on table (cost=1632.01..40051.57 rows=9737 width=126) (actual time=159119.258..159960.251 rows=5645 loops=1)
  Recheck Cond: ((name)::text % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'::text)
  Heap Blocks: exact=3795
  Buffers: shared read=1289378
  ->  Bitmap Index Scan on trgm_idx_512_gg  (cost=0.00..1629.57 rows=9737 width=0) (actual time=159118.616..159118.616 rows=5645 loops=1)
        Index Cond: ((name)::text % 'Чехол на realme C25s / Реалми Ц25с c рисунком / прозрачный с принтом, Andy&Paul'::text)
        Buffers: shared read=1285583
Planning:
  Buffers: shared read=5
Planning Time: 4.063 ms
Execution Time: 159961.121 ms

I also created a GIN index (but Postgres keeps using the GiST index):

CREATE INDEX gin_gg ON table USING GIN (name gin_trgm_ops);

GIN index size: 12 GB.

GiST index size: 31 GB.

Answer 1

Erw*_*ter 3

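A trigram GiST index with siglen=512 on 100m rows is very big and can probably never be cached effectively. (The default is siglen=12, i.e. 12 bytes.) What makes you think such a large signature is a good choice? The manual:

"Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index."

Looks like you overdid the size.

I have had better experience with trigram GIN indexes, especially in current versions of Postgres. If the query planner is confused by the presence of the additional GiST index, you have to drop that index to test results with the GIN index.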
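One way to test the GIN index without permanently losing the GiST index is to drop it inside a transaction and roll back afterwards. This is only a sketch: tbl and the search term are placeholders, and the approach is mainly suited to a test system because of the lock it takes.

```sql
BEGIN;

-- Hides the GiST index from the planner for this transaction only.
-- Caution: DROP INDEX holds an ACCESS EXCLUSIVE lock on the table until ROLLBACK,
-- so concurrent queries on the table are blocked in the meantime.
DROP INDEX trgm_idx_512_gg;

EXPLAIN (ANALYZE, BUFFERS)
SELECT name, similarity(name, 'search term') AS sm
FROM   tbl                      -- placeholder table name
WHERE  name % 'search term';    -- placeholder search term

ROLLBACK;  -- the GiST index is restored without a rebuild
```

If the plan now shows a Bitmap Index Scan on gin_gg, its timing and buffer counts can be compared directly with the GiST plan.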

But first, for a size comparison, look at the output of:

SELECT i.indexrelid::regclass::text AS idx
     , pg_get_indexdef(i.indexrelid) AS idx_def
     , pg_size_pretty(pg_relation_size(i.indexrelid)) AS idx_size
FROM   pg_class t
JOIN   pg_index i ON i.indrelid = t.oid
WHERE  t.oid = 'public.tbl'::regclass  -- your table name here!
ORDER  BY 1;

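(Ideally, add the result to the question.)

Your query plan shows a huge number of Buffers: shared read for both the index and the main relation (heap). So nothing was found in the cache. The key to better performance is to read fewer data pages to satisfy the query, and to read more of them from cache: hit instead of read in the query plan.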
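To put the buffer count into perspective, the shared read=1289378 figure from the question's plan can be converted to bytes. This sketch assumes nothing beyond the server's own block_size setting; with the usual 8 kB blocks it comes to roughly 10 GB read from outside shared buffers for a single query.

```sql
-- 1289378 is the "shared read" figure reported on the top plan node.
-- block_size is normally 8192 bytes unless Postgres was built with a different value.
SELECT pg_size_pretty(1289378 * current_setting('block_size')::bigint) AS read_outside_cache;
```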

Reducing the size of the table and the index helps in this respect.
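As a sketch of what a smaller index could look like, the GiST index can be rebuilt with the default signature length and compared against the existing one. The index name trgm_idx_default and the table name tbl are placeholders, and whether the smaller index is actually faster for this workload has to be measured.

```sql
-- Placeholder names: trgm_idx_default, tbl.
-- Omitting siglen uses the default signature length of 12 bytes.
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY trgm_idx_default ON tbl USING gist (name gist_trgm_ops);

-- Compare the on-disk sizes of the two GiST indexes.
SELECT pg_size_pretty(pg_relation_size('trgm_idx_default')) AS default_siglen
     , pg_size_pretty(pg_relation_size('trgm_idx_512_gg'))  AS siglen_512;
```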

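The selectivity of the trigram similarity operator % is set by the customized option pg_trgm.similarity_threshold. The default of 0.3 is rather lax and allows many hits. A higher similarity threshold filters fewer (better matching) result rows. What do you do with rows=5645 result rows anyway? Try: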

SET pg_trgm.similarity_threshold = 0.5;  -- or higher

Then retry your query.
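If the stricter threshold should apply to a single query only, it can be scoped to one transaction with SET LOCAL. The following is a sketch, with tbl, the search term, and the LIMIT value as placeholders chosen for illustration.

```sql
BEGIN;

-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
SET LOCAL pg_trgm.similarity_threshold = 0.5;

SELECT name, similarity(name, 'search term') AS sm
FROM   tbl                      -- placeholder table name
WHERE  name % 'search term'     -- placeholder search term
ORDER  BY sm DESC               -- best matches first
LIMIT  20;                      -- arbitrary cap, an assumption for this example

COMMIT;
```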

A more recent version of Postgres, a better server configuration, and more RAM can help in this respect, too. You have disclosed nothing about any of these.
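The missing details can be collected with a few standard commands. Which settings matter most is a judgment call; the selection below is only a suggestion.

```sql
SELECT version();               -- Postgres version and platform
SHOW shared_buffers;            -- size of Postgres' own buffer cache
SHOW effective_cache_size;      -- planner's estimate of total available cache
SHOW work_mem;                  -- memory per sort/hash/bitmap operation

-- Installed version of the pg_trgm extension.
SELECT extversion FROM pg_extension WHERE extname = 'pg_trgm';
```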

Asked:	3 years ago
Viewed:	1607 times
Last active:	3 years ago