How to query Postgres efficiently to select special words?

Question

Ali*_*eza 3 postgresql select pattern-matching unaccent

Suppose I have a table words with a very large number of records.
The columns are id and name.

The words table contains, for example:

 'systematic', '????','gear','synthesis','mysterious', etc.

Note: we also have UTF-8 words.
How can I query efficiently to find which words contain the letters 's', 'm' and 'e' (all of them)?

The output would be:

systematic,mysterious

I have no idea how to do something like this. It needs to be efficient, otherwise our server will suffer.

Answer 1

Dan*_*ité 5

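A simple approach is to consider the array of letters corresponding to each word and search it with the @> (contains) array operator. As the example in the manual shows, this does not depend on the position of the letters, i.e. ARRAY[1,4,3] @> ARRAY[3,1] is true.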

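This array can easily be obtained with regexp_split_to_array(name, '').
[Edit: according to @Erwin's answer, string_to_array(name, NULL) is faster, so it is better to use that. It is a drop-in replacement in the rest of this answer.]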
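As a minimal sketch of the idea, the following standalone statements split a few of the question's sample words into letter arrays and test them with @>. The literal words are taken from the question; nothing here depends on an existing table.

```sql
-- Containment ignores element order: every element on the right
-- must appear somewhere in the array on the left.
SELECT ARRAY[1,4,3] @> ARRAY[3,1];                                   -- true

-- Both functions turn a word into an array of single characters.
SELECT regexp_split_to_array('gear', '');                            -- {g,e,a,r}
SELECT string_to_array('gear', NULL);                                -- {g,e,a,r}

-- A word matches only if it contains all of 's', 'm' and 'e'.
SELECT string_to_array('systematic', NULL) @> ARRAY['s','m','e'];    -- true
SELECT string_to_array('mysterious', NULL) @> ARRAY['s','m','e'];    -- true
SELECT string_to_array('synthesis',  NULL) @> ARRAY['s','m','e'];    -- false (no 'm')
```

Passing NULL as the delimiter is what makes string_to_array split the string into individual characters.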

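Here is a demo. It first materializes the array as a column in a test table containing English and French words (~511000 rows, average length = 13 characters), and then uses a second test table that does not store the array as a column.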

=> CREATE TABLE tstword AS
    SELECT word_id as id,
    wordtext as name,
    regexp_split_to_array(wordtext, '') as arr FROM words;

A search that matches a relatively large number of words:

=> select count(*) from tstword where arr @> array['s','m','e'];
 count 
-------
 42268
(1 row)

Time: 268.809 ms

This does a sequential scan, as EXPLAIN ANALYZE shows:

 explain analyze select name from tstword where arr @> array['s','m','e'];
                                                   QUERY PLAN                                                   
----------------------------------------------------------------------------------------------------------------
 Seq Scan on tstword  (cost=0.00..17554.46 rows=21256 width=14) (actual time=0.020..268.525 rows=42268 loops=1)
   Filter: (arr @> '{s,m,e}'::text[])
   Rows Removed by Filter: 468729
 Total runtime: 269.927 ms
(4 rows)

Time: 270.414 ms

But we can index the array with a GIN index:

CREATE INDEX idx_tst on tstword using gin(arr);

The same query is then faster (total runtime drops from about 270 ms to about 62 ms):

explain analyze select name from tstword where arr @> array['s','m','e'];
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tstword  (cost=252.74..11815.73 rows=21256 width=14) (actual time=46.378..60.203 rows=42268 loops=1)
   Recheck Cond: (arr @> '{s,m,e}'::text[])
   ->  Bitmap Index Scan on idx_tst  (cost=0.00..247.42 rows=21256 width=0) (actual time=45.202..45.202 rows=42268 loops=1)
         Index Cond: (arr @> '{s,m,e}'::text[])
 Total runtime: 61.677 ms
(5 rows)

Time: 70.185 ms

We can even avoid materializing the array as a column by indexing the expression directly, since Postgres supports functional (expression) indexes.

create table tstword2 as select word_id as id,wordtext as name from words;
create index idx_tst2 on tstword2  using gin(regexp_split_to_array(name, ''));

The search must then use exactly the same expression, and the index is used:

 explain analyze select name from tstword2 where regexp_split_to_array(name, '') @> array['s','m','e'];
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on tstword2  (cost=40.00..44.02 rows=1 width=14) (actual time=39.390..48.435 rows=42268 loops=1)
   Recheck Cond: (regexp_split_to_array((name)::text, ''::text) @> '{s,m,e}'::text[])
   ->  Bitmap Index Scan on idx_tst2  (cost=0.00..40.00 rows=1 width=0) (actual time=39.053..39.053 rows=42268 loops=1)
         Index Cond: (regexp_split_to_array((name)::text, ''::text) @> '{s,m,e}'::text[])
 Total runtime: 49.748 ms
(5 rows)

Time: 50.193 ms
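The edit note near the top of this answer suggests string_to_array(name, NULL) as a drop-in replacement. A minimal sketch of the same expression-index pattern with that function might look like the following. The table words_demo and the index name idx_words_demo_chars are hypothetical names made up for this illustration, and no timings are claimed for this variant.

```sql
-- Hypothetical demo table mirroring the question's words(id, name) layout.
CREATE TEMP TABLE words_demo (id serial PRIMARY KEY, name text);
INSERT INTO words_demo (name)
VALUES ('systematic'), ('gear'), ('synthesis'), ('mysterious');

-- Expression index on the per-character array; no extra column is stored.
CREATE INDEX idx_words_demo_chars
    ON words_demo USING gin (string_to_array(name, NULL));

-- The WHERE clause repeats the indexed expression exactly,
-- so the planner is able to consider the GIN index.
SELECT name
FROM words_demo
WHERE string_to_array(name, NULL) @> ARRAY['s','m','e']
ORDER BY id;
-- returns: systematic, mysterious
```

On a table this small the planner will most likely still choose a sequential scan; the index only pays off at sizes like the ~511000-row test table above.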

For caveats about these kinds of indexes, see GiST and GIN Index Types in the manual.
