PostgreSQL FTS and trigram-similarity query optimization

Question


Ank*_*pli 4 postgresql index optimization full-text-search pattern-matching

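I recently started working with PostgreSQL. I have about 12 million rows to process and want to apply Full Text Search to them. I have no prior experience with databases of this size. I have tried to optimize my queries, but I doubt they are fully optimized.

Right now I am using a GiST index, because I read that updates are slower with a GIN index and my database will be updated regularly.

For now I only need to focus on two columns: merchant varchar(80) and product varchar(400).

I need to find products using FTS, and I also want to find the product even when the merchant name is misspelled.

I ran some queries on a sample database of about 30K rows and got the following results:

First, I ran a basic FTS query to analyze the results.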

explain analyze
select count(*) from products
where to_tsvector('english', product) @@ to_tsquery('hat');


Aggregate  (cost=2027.27..2027.28 rows=1 width=0) (actual time=349.032..349.032 rows=1 loops=1)  
->  Seq Scan on products  (cost=0.00..2026.90 rows=147 width=0) (actual time=43.322..348.961 rows=307 loops=1)
 Filter: (to_tsvector((product)::text) @@ to_tsquery('hat'::text))
Total runtime: 349.140 ms


Then I created a GiST index and ran the same query to see the improvement. The result was very good, at least for me.
```
create index product_gist on products using gist(to_tsvector('english', product));
```

Aggregate  (cost=447.17..447.18 rows=1 width=0) (actual time=12.911..12.911 rows=1 loops=1)
->  Bitmap Heap Scan on products  (cost=9.40..446.80 rows=147 width=0) (actual time=2.256..12.776 rows=307 loops=1)
 Recheck Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
 ->  Bitmap Index Scan on pn  (cost=0.00..9.37 rows=147 width=0) (actual time=2.111..2.111 rows=307 loops=1)
       Index Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
Total runtime: 13.051 ms


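I also tested a GIN index and the result was surprising: total runtime 0.583 ms. But I cannot use a GIN index, so let's get back to the GiST index.

In addition, to measure the similarity between two words (for misspelled merchants), I am using the pg_trgm module.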

create index merchant_trgm on products using gist(merchant gist_trgm_ops);

select count(*) from products
where to_tsvector('english', product) @@ to_tsquery('hat')
AND   similarity(merchant,'fashion') > 0.2;


Aggregate  (cost=447.64..447.65 rows=1 width=0) (actual time=14.644..14.645 rows=1 loops=1)
->  Bitmap Heap Scan on products  (cost=9.38..447.51 rows=49 width=0) (actual time=2.187..14.635 rows=12 loops=1)
 Recheck Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
 Filter: (similarity((merchant)::text, 'fashion'::text) > 0.2::double precision)
 ->  Bitmap Index Scan on product_gist  (cost=0.00..9.37 rows=147 width=0) (actual time=2.055..2.055 rows=307 loops=1)
       Index Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
Total runtime: 14.705 ms


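When I run these queries on the database with 12 million rows, they obviously take much longer. Can anyone help me reduce the total runtime further?

I also have a couple more questions:

How can I search for a query like "walmart bag" so that it returns bag products from the merchant Walmart first, followed by bags from other merchants?

Can I use GIN and GiST indexes together?

Edit:

I also ran this query last night and got the following result. I have created a GiST index and verified that it is being used. The performance is still not what I expected.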

select count(*) from products 
where (setweight(to_tsvector('english', merchant || ' ' || product), 'A') || 
setweight(to_tsvector('english', product), 'B') ||
setweight(to_tsvector('english', merchant), 'C')) @@ to_tsquery('hat')
AND similarity(merchant,'fashion') > 0.2;


Aggregate  (cost=450.97..450.98 rows=1 width=0) (actual time=18.228..18.228 rows=1 loops=1)
->  Bitmap Heap Scan on products  (cost=9.40..450.84 rows=49 width=0) (actual time=2.399..18.220 rows=12 loops=1)
 Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (((merchant)::text || ' '::text) || (product)::text)), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (product)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, (merchant)::text), 'C'::"char")) @@ to_tsquery('hat'::text))
 Filter: (similarity((merchant)::text, 'fashion'::text) > 0.2::double precision)
 ->  Bitmap Index Scan on products_weighted_index  (cost=0.00..9.39 rows=147 width=0) (actual time=2.206..2.206 rows=307 loops=1)
       Index Cond: (((setweight(to_tsvector('english'::regconfig, (((merchant)::text || ' '::text) || (product)::text)), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (product)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, (merchant)::text), 'C'::"char")) @@ to_tsquery('hat'::text))
Total runtime: 18.289 ms
(7 rows)

Answer 1

Erw*_*ter 9

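Assessment

In your last query, the bitmap index scan looking for 'hat' yields 307 hits.
Postgres then runs a bitmap heap scan to filter merchants that are similar enough (similarity(...) > 0.2), which leaves 12 rows. Your test contained 30K rows, so against 12 million rows your real-life query will yield roughly 400 times as many hits, i.e. about 120k / 5k rows for the test case at hand. An additional index on merchant would help.

Suggestions

I suggest you create an additional trigram index for the similarity search. Be sure to read the chapter on trigram index support in the manual. You need the additional module pg_trgm installed (as you obviously have).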
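
As a minimal sketch of that setup step (assuming PostgreSQL 9.1 or later, where `CREATE EXTENSION` is available, and sufficient privileges), the following installs pg_trgm and inspects how it treats two merchant spellings. The string literals are illustrative only and do not come from the question's data.

```sql
-- Install the trigram module once per database (no-op if already present).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Show the trigrams pg_trgm extracts from a string.
SELECT show_trgm('walmart');

-- Similarity score between 0 and 1 for a misspelled merchant name.
SELECT similarity('walmart', 'wallmart');

-- Current threshold used by the % operator (changed with set_limit()).
SELECT show_limit();
```

Checking a few real misspellings this way can help decide whether a threshold such as the 0.2 used in the question is too loose or too strict.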

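For your first request:

How can I search for a query like "walmart bag" so that it returns bag products from the merchant Walmart first, followed by bags from other merchants?

I suggest using the similarity operator % for this query: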

-- SELECT set_limit(0.2)  -- Adjust similarity operator only if needed

SELECT *
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('bag')
AND    merchant % 'walmart'
ORDER  BY merchant <-> 'walmart'
--    LIMIT  n; -- possibly limit to top n results


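Again, you can choose between GiST and GIN, but this time GiST has a decisive advantage. The manual says about ordering by the distance operator <->:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.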
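
One possible variant, sketched here as an assumption rather than as part of the original answer: if bags from other merchants should still appear after Walmart's, the `%` filter can be dropped so that the distance operator only ranks the rows. The `dist` alias, the `LIMIT` value and the two-argument `to_tsquery` call are illustrative choices. Whether the planner uses a GiST trigram index for the ordering depends on the data and statistics.

```sql
-- Rank all matching bags by how close the merchant is to 'walmart'.
-- No "merchant % 'walmart'" filter, so other merchants are kept,
-- they simply sort later because their distance is larger.
SELECT merchant,
       product,
       merchant <-> 'walmart' AS dist   -- distance = 1 - similarity
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('english', 'bag')
ORDER  BY dist
LIMIT  20;                              -- illustrative cut-off
```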

Therefore, I suggest this index:

CREATE INDEX prod_merchant_trgm_idx ON products USING gist (merchant gist_trgm_ops);
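
To check whether the new index is actually chosen, a sketch along these lines may help. It assumes the index above exists and that the table has been analyzed; the `LIMIT` value is illustrative, and the plan you get can differ with table size and statistics.

```sql
-- Refresh planner statistics after creating the index.
ANALYZE products;

-- Nearest-merchant lookup; with a GiST trigram index the plan can show
-- an Index Scan with an "Order By: (merchant <-> ...)" line instead of a Sort.
EXPLAIN (ANALYZE, BUFFERS)
SELECT merchant, product
FROM   products
WHERE  merchant % 'walmart'
ORDER  BY merchant <-> 'walmart'
LIMIT  10;
```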


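As for your second request:

Can I use GIN and GiST indexes together?

Yes, you can. Using both types on the same (combination of) columns makes hardly any sense, but Postgres can combine GiST and GIN indexes in the same query just fine. I quote the excellent manual again, this time on combining multiple indexes:

To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. Finally, the actual table rows are visited and returned. The table rows are visited in physical order, because that is how the bitmap is laid out; this means that any ordering of the original indexes is lost, and so a separate sort step will be needed if the query has an ORDER BY clause. For this reason, and because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well.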
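
A sketch of what such a combination could look like, under stated assumptions: the GIN index name `prod_product_fts_gin_idx` is hypothetical, the GiST trigram index on merchant is the one suggested earlier, and the planner is free to pick a different plan than the one described in the comments.

```sql
-- Hypothetical GIN index for the full text search on product.
CREATE INDEX prod_product_fts_gin_idx
ON products USING gin (to_tsvector('english', product));

-- With both indexes in place, the plan may contain a BitmapAnd node
-- over two Bitmap Index Scans (one per index), or use just one index
-- and apply the other condition as a filter.
EXPLAIN
SELECT count(*)
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('english', 'hat')
AND    merchant % 'fashion';
```

The query expression has to match the indexed expression, including the 'english' configuration, for the GIN index to be considered.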
