Postgres full-text search on an array column, with an index

Question

sti*_*ure 10 postgresql full-text-search tsvector

Using Postgres, I want to run a full-text search that includes an array column, backed by an index. Let's start with a hypothetical schema:

CREATE TABLE book (title TEXT, tags TEXT[]);
-- tags are lowercase a-z, dashes, and $

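We want a query that searches for some text in both the title and the tags. The simple query with reasonable semantics, recommended by many SO answers, is: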

SELECT * 
FROM book 
WHERE to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)
      @@ to_tsquery('simple', 'mysearchterm');

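This works. The dashes and dollar signs in the tags are effectively dropped, but that is fine for this application. However, we have millions of records and need an index: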

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((to_tsvector('simple', array_to_string(tags, ' ')) || to_tsvector('simple', title)));

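Uh oh! This fails because array_to_string is not IMMUTABLE (it is declared STABLE), and an index expression may only call IMMUTABLE functions. Some answers suggest wrapping array_to_string in an immutable function: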

CREATE FUNCTION my_array_to_string(arr ANYARRAY, sep TEXT) 
RETURNS text LANGUAGE SQL IMMUTABLE
AS $$
    SELECT array_to_string(arr, sep);
$$;

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((to_tsvector('simple', my_array_to_string(tags, ' ')) || to_tsvector('simple', title)));
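
The volatility labels behind that error can be checked in the system catalog. This is a minimal sketch that assumes nothing beyond the built-in functions named in the question; the provolatile column holds i (immutable), s (stable), or v (volatile):

```sql
-- List the declared volatility of the functions involved.
SELECT p.oid::regprocedure AS signature,
       p.provolatile       AS volatility   -- i = immutable, s = stable, v = volatile
FROM pg_proc AS p
WHERE p.proname IN ('array_to_string', 'array_to_tsvector', 'to_tsvector')
ORDER BY p.proname, p.oid;
```

array_to_string should show up as stable, because its result depends on the element type's output function. For a TEXT[] column that result should not vary with session settings, which is the assumption the IMMUTABLE wrapper relies on. The single-argument to_tsvector(text) is also only stable, since it reads default_text_search_config, which is why every call here passes 'simple' explicitly.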

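Index creation works! But the index is never used. EXPLAIN ANALYZE on the SELECT above (rewritten to call my_array_to_string, so that it matches the index expression) always results in a sequential scan. Postgres is apparently too smart for this kind of trickery.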

Aggregate  (cost=4348818.79..4348818.80 rows=1 width=8) (actual time=107489.124..107489.125 rows=1 loops=1)
  ->  Seq Scan on book  (cost=0.00..4348543.45 rows=110135 width=0) (actual time=50.689..107477.408 rows=24641 loops=1)
        Filter: ((to_tsvector('simple'::regconfig, my_array_to_string(tags, ' '::text)) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)
        Rows Removed by Filter: 5354819
Planning Time: 0.144 ms
Execution Time: 107489.157 ms

I'm stumped. Is there any way to improve this?

New strategy: use `array_to_tsvector`

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((array_to_tsvector(tags) || to_tsvector('simple', title)));

SELECT * 
FROM book 
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
      @@ to_tsquery('simple', 'mysearchterm');

This works! The index is used! It's fast!

Bitmap Heap Scan on book  (cost=2005.04..76150.11 rows=26973 width=147) (actual time=5.281..425.128 rows=946 loops=1)
  Recheck Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''apple'''::tsquery)
  Heap Blocks: exact=790
  ->  Bitmap Index Scan on book_fulltext_idx  (cost=0.00..1998.30 rows=26973 width=0) (actual time=4.468..4.468 rows=957 loops=1)
        Index Cond: ((array_to_tsvector(tags) || to_tsvector('simple'::regconfig, title)) @@ '''mysearchterm'''::tsquery)
Planning Time: 0.113 ms
Execution Time: 425.371 ms

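But the search semantics are broken. array_to_tsvector treats the array elements as raw lexemes, while to_tsquery strips the $ and the dashes. This means that tags containing dollar signs or dashes cannot be found.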

-- This can NEVER match the tag `$mysearchterm`
SELECT * 
FROM book 
WHERE array_to_tsvector(tags) || to_tsvector('simple', title)
      @@ to_tsquery('simple', '$mysearchterm');
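
The mismatch can be inspected by putting the three values side by side. This is a small sketch that uses literal values in place of the book table; the 'sci-fi' tag is a made-up example:

```sql
SELECT array_to_tsvector(ARRAY['$mysearchterm', 'sci-fi']) AS raw_tag_lexemes,  -- array elements stored verbatim
       to_tsvector('simple', '$mysearchterm sci-fi')       AS parsed_lexemes,   -- text run through the parser
       to_tsquery('simple', '$mysearchterm')               AS parsed_query;     -- query run through the parser
```

Only the first column keeps the tags exactly as written. As described above, a query that has been through the parser loses the $ and so has no way to produce a lexeme such as $mysearchterm.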

Is there any way to make this do what I want? It seems like I want something like array_to_tsvector('simple', tags), but that function doesn't exist.

New strategy: two indexes and OR

CREATE INDEX book_tags_fulltext_idx
    ON book using GIN (array_to_tsvector(tags));

CREATE INDEX book_title_fulltext_idx
    ON book using GIN (to_tsvector('simple', title));

SELECT * 
FROM book 
WHERE array_to_tsvector(tags) @@ '$mysearchterm' OR to_tsvector('simple', title)
      @@ to_tsquery('simple', '$mysearchterm');

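This produces correct answers in a reasonable time, but it breaks the semantics of the search. You can't search for titleword tagword, that is, one word that appears only in the title combined with one that appears only in the tags. The WHERE clause needs both words in the tags, or both words in the title. No bueno.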
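
The difference can be reproduced on a single in-memory row. This is a minimal sketch with made-up data (one book whose title is 'titleword' and whose only tag is 'tagword'), comparing the two-index OR form against the concatenated vector:

```sql
WITH book(title, tags) AS (
    VALUES ('titleword', ARRAY['tagword'])
)
SELECT
    -- Each side is asked for BOTH words on its own, so neither side matches.
    array_to_tsvector(tags) @@ 'titleword & tagword'::tsquery
        OR to_tsvector('simple', title) @@ to_tsquery('simple', 'titleword & tagword')
        AS two_index_or,
    -- The combined vector contains both lexemes, so the AND query matches.
    (array_to_tsvector(tags) || to_tsvector('simple', title))
        @@ to_tsquery('simple', 'titleword & tagword')
        AS concatenated
FROM book;
-- Expected: two_index_or = false, concatenated = true
```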

Conclusion

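It looks like I either need to figure out how to index the tags array concatenated with the title, or somehow modify the values passed to array_to_tsvector. I'm not sure how to do either. Any ideas?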

We're on PG11, but I can upgrade if that makes a difference.

Answer 1

sti*_*ure 3

I found a solution that makes the index work. I can't explain it.

This doesn't work:

CREATE INDEX book_fulltext_idx
    ON book using GIN 
        ((to_tsvector('simple', immutable_array_to_string(tags, ' ')) || to_tsvector('simple', title)));

SELECT * 
FROM book 
WHERE to_tsvector('simple', immutable_array_to_string(tags, ' ')) || to_tsvector('simple', title)
      @@ to_tsquery('simple', 'mysearchterm');

This, however, does work:

CREATE INDEX book_fulltext_idx
    ON book using GIN (to_tsvector('simple', title || ' ' || immutable_array_to_string(tags, ' ')));

SELECT * 
FROM book 
WHERE to_tsvector('simple', title || ' ' || immutable_array_to_string(tags, ' '))
      @@ to_tsquery('simple', 'mysearchterm');

Aggregate  (cost=81092.50..81092.51 rows=1 width=8) (actual time=129.780..129.781 rows=1 loops=1)
  ->  Bitmap Heap Scan on book  (cost=296.49..81025.24 rows=26902 width=0) (actual time=1.990..129.519 rows=1576 loops=1)
        Recheck Cond: (to_tsvector('simple'::regconfig, ((title || ' '::text) || immutable_array_to_string(tags, ' '::text))) @@ '''mysearchterm'''::tsquery)
        Heap Blocks: exact=1302
        ->  Bitmap Index Scan on book_fulltext_idx  (cost=0.00..289.76 rows=26902 width=0) (actual time=1.605..1.606 rows=1576 loops=1)
              Index Cond: (to_tsvector('simple'::regconfig, ((title || ' '::text) || immutable_array_to_string(tags, ' '::text))) @@ '''mysearchterm'''::tsquery)
Planning Time: 0.509 ms
Execution Time: 129.906 ms

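I can't explain why the query planner thinks "take the vector of the concatenated strings" is different from "concatenate the vectors of the strings", but there you have it.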
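
One caveat with the single-string form: || on text yields NULL when either operand is NULL, so a row with a NULL title or NULL tags would produce a NULL vector and never match. Below is a minimal sketch of a NULL-tolerant variant. It assumes immutable_array_to_string is defined the same way as my_array_to_string in the question, and the index name is hypothetical:

```sql
-- Assumed definition of the wrapper used in this answer (same body as my_array_to_string).
CREATE OR REPLACE FUNCTION immutable_array_to_string(arr ANYARRAY, sep TEXT)
RETURNS text LANGUAGE SQL IMMUTABLE
AS $$
    SELECT array_to_string(arr, sep);
$$;

-- coalesce() keeps a NULL title or NULL tags from nulling out the whole string.
CREATE INDEX book_fulltext_nullsafe_idx
    ON book USING GIN (
        to_tsvector('simple',
                    coalesce(title, '') || ' ' ||
                    coalesce(immutable_array_to_string(tags, ' '), ''))
    );

-- The WHERE clause uses the same expression as the index so the planner can match it.
SELECT *
FROM book
WHERE to_tsvector('simple',
                  coalesce(title, '') || ' ' ||
                  coalesce(immutable_array_to_string(tags, ' '), ''))
      @@ to_tsquery('simple', 'mysearchterm');
```

If both columns are declared NOT NULL, the simpler expression shown earlier in this answer is enough.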

Archived:	4 years, 3 months ago
Views:	2446
Last activity:	4 years, 3 months ago
