Poor performance of word-matching regular expressions on Postgres

Question

Abd*_*gui 3 regex sql postgresql postgresql-9.3

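I have a list of blocked phrases, and I want to check whether any of them occur in text entered by the user, but the performance is very bad.

I am using this query: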

SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text' )) ~* ('[[:<:]]' || value || '[[:>:]]') LIMIT 1;

After some investigation, I found that the word boundaries [[:<:]] and [[:>:]] perform very badly, given that blocked_items has 24k records.

For example, when I try to run this:

SELECT value FROM blocked_items WHERE lower(unaccent( 'my input text ' )) ilike ('%' || value || '%') LIMIT 1;

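it is very fast compared to the first one. The problem is that I need to keep the word-boundary test.

This check runs frequently in a large program, so performance matters a lot to me.

Do you have any suggestions for making this faster?

EXPLAIN ANALYZE screenshot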

Answer 1

fel*_*ann 5

Since you know that the LIKE (~~) query is fast and the regex (~) query is slow, the simplest solution is to combine both conditions (\m / \M are equivalent to [[:<:]] / [[:>:]]):

SELECT value FROM blocked_items
WHERE lower(unaccent('my input text')) ~~ ('%'||value||'%')
  AND lower(unaccent('my input text')) ~ ('\m'||value||'\M')
LIMIT 1;

This way, the fast condition filters out most of the rows, and the slow condition then only has to discard the few that remain.
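
As a rough illustration of why the prefilter is safe: every whole-word match is also a substring match, so the LIKE condition can only discard rows that the regex would reject anyway. This holds under the assumption that value contains plain text, with no LIKE wildcards (% or _) and no regex metacharacters; otherwise the two conditions can disagree. The input string and the three values below are made up for the example.

```sql
-- Hypothetical sample values, for illustration only.
-- Requires standard_conforming_strings = on (the default), so that '\m'
-- is passed to the regex engine as a literal backslash followed by m.
SELECT v AS value,
       'my cat sat' ~~ ('%' || v || '%')   AS like_match,  -- cheap substring test
       'my cat sat' ~  ('\m' || v || '\M') AS word_match   -- whole-word test
FROM (VALUES ('cat'), ('at'), ('dog')) AS t(v);
-- 'cat': like_match = t, word_match = t
-- 'at' : like_match = t, word_match = f  (substring of "cat" and "sat" only)
-- 'dog': like_match = f, word_match = f
```

Only rows like 'at', which pass the substring test without being a whole word, still cost a regex evaluation that ends in a rejection.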

I'm using the faster case-sensitive operators, assuming that value is already normalized. If that is not the case, drop the (then redundant) lower() and use the case-insensitive versions, as in the original query.

On my test set of 370k rows, this speeds the query up from 6 seconds (warm) to 90 milliseconds:

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..1651.85 rows=1 width=10) (actual time=89.702..89.702 rows=1 loops=1)
   ->  Seq Scan on blocked_items  (cost=0.00..14866.61 rows=9 width=10) (actual time=89.701..89.701 rows=1 loops=1)
         Filter: ((lower(unaccent('my input text'::text)) ~~ (('%'::text || value) || '%'::text)) AND (lower(unaccent('my input text'::text)) ~ (('\m'::text || value) || '\M'::text)))
         Rows Removed by Filter: 153281
 Planning Time: 0.097 ms
 Execution Time: 89.717 ms
(6 rows)

However, we are still doing a full table scan, so performance will vary depending on where the first matching row sits in the table.

Ideally, we could use an index to answer the query in near-constant time.

Let's rewrite the query to use the text search functions and operators:

SELECT value FROM blocked_items
WHERE to_tsvector('simple', unaccent('my input text'))
   @@ phraseto_tsquery('simple', value)
LIMIT 1;

First we split the input into a search vector, then we check whether a blocked phrase matches that vector.
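
To make the two steps concrete, the sketch below shows what each function produces under the 'simple' configuration. The input string and the phrases are made up for the example. Note that phraseto_tsquery requires PostgreSQL 9.6 or later, so it is not available on the 9.3 release the question is tagged with.

```sql
-- Step 1: the input becomes a tsvector of lexemes with positions.
SELECT to_tsvector('simple', 'my input text');
-- 'input':2 'my':1 'text':3

-- Step 2: a blocked phrase becomes a tsquery that requires adjacent lexemes.
SELECT phraseto_tsquery('simple', 'input text');
-- 'input' <-> 'text'

-- The phrase matches only if its words appear adjacent and in order.
SELECT to_tsvector('simple', 'my input text')
    @@ phraseto_tsquery('simple', 'input text');   -- t
SELECT to_tsvector('simple', 'my input text')
    @@ phraseto_tsquery('simple', 'text input');   -- f (wrong order)
```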

This takes about 440 ms for the test query, which is slower than our combined query:

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.00..104.01 rows=1 width=10) (actual time=437.761..437.761 rows=1 loops=1)
   ->  Seq Scan on blocked_items  (cost=0.00..192516.05 rows=1851 width=10) (actual time=437.760..437.760 rows=1 loops=1)
         Filter: (to_tsvector('simple'::regconfig, unaccent('my input text'::text)) @@ phraseto_tsquery('simple'::regconfig, value))
         Rows Removed by Filter: 153281
 Planning Time: 0.063 ms
 Execution Time: 437.772 ms
(6 rows)

Since no index can be used for tsvector @@ tsquery when the table column is on the tsquery side, we can rewrite the query once more to check whether the blocked phrase is a sub-query of the input phrase, using the tsquery @> tsquery text search operator. That operator can be indexed with the tsquery_ops GiST operator class:

CREATE INDEX blocked_items_search ON blocked_items
  USING gist (phraseto_tsquery('simple', value));

ANALYZE blocked_items; -- update query planner stats

SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
   @> phraseto_tsquery('simple', value)
LIMIT 1;

The query can now use an index scan and takes 20 ms on the same data.

Since GiST is a lossy index, query times can vary depending on how many rechecks are needed:

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=0.54..4.23 rows=1 width=10) (actual time=19.215..19.215 rows=1 loops=1)
   ->  Index Scan using blocked_items_search on blocked_items  (cost=0.54..1367.01 rows=370 width=10) (actual time=19.214..19.214 rows=1 loops=1)
         Index Cond: (phraseto_tsquery('simple'::regconfig, value) <@ phraseto_tsquery('simple'::regconfig, unaccent('my input text'::text)))
         Rows Removed by Index Recheck: 4028
 Planning Time: 0.093 ms
 Execution Time: 19.236 ms
(6 rows)
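
One caveat is worth checking before relying on the indexed form. The PostgreSQL documentation describes tsquery containment as considering only whether all lexemes of one query appear in the other, ignoring the combining operators. A blocked phrase whose words occur in the input in a different order, or not next to each other, may therefore still match. The sketch below is one possible way to handle that, not part of the measured queries above: it keeps the indexable condition as a prefilter and adds the word-boundary regex back as the exact check. It assumes value is already normalized (lower case, unaccented) and free of regex metacharacters; the phrase 'text my' is a made-up test value.

```sql
-- Probe the containment semantics with a reordered phrase (hypothetical value).
-- If this returns true on your server, word order is not being enforced.
SELECT phraseto_tsquery('simple', 'my input text')
    @> phraseto_tsquery('simple', 'text my') AS reordered_phrase_matches;

-- Sketch: indexable prefilter plus exact whole-phrase check.
SELECT value FROM blocked_items
WHERE phraseto_tsquery('simple', unaccent('my input text'))
   @> phraseto_tsquery('simple', value)                            -- can use the GiST index
  AND lower(unaccent('my input text')) ~ ('\m' || value || '\M')   -- exact check
LIMIT 1;
```

Whether the planner still chooses the index scan with the extra condition is best confirmed with EXPLAIN ANALYZE on real data.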

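A big advantage of using full-text search is that you can now use language-specific word matching through search configurations (regconfig).

The queries above all use the built-in 'simple' regconfig to match the behavior of the original query. By switching to 'english' you can also match variations of the same word, such as cat and cats (stemming), and common words that carry no meaning, such as the or my, will be ignored (stop words).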
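
The difference between the two configurations can be seen directly on a small made-up input:

```sql
-- 'simple' only lower-cases and splits; every word is kept as is.
SELECT to_tsvector('simple', 'the cats');
-- 'cats':2 'the':1

-- 'english' drops stop words and stems the remaining words.
SELECT to_tsvector('english', 'the cats');
-- 'cat':2

-- With 'english', a blocked phrase "the cat" matches the input "my cats".
SELECT to_tsvector('english', 'my cats')
    @@ phraseto_tsquery('english', 'the cat');   -- t
```

If the configuration is changed, the index expression and the query have to use the same regconfig, otherwise the index on phraseto_tsquery('simple', value) no longer applies.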

Archived:	7 years, 7 months ago
Views:	1083
Last activity:	5 years, 3 months ago