rdb*_*oob 5 postgresql index regular-expression regex
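I have a query that loops over addresses searching for duplicates; the query uses REGEXP_REPLACE. I am trying to put an index on the regex because, according to EXPLAIN, it does a sequential scan on the user_property table, filtering on the regex.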
EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) with user_detail AS (
SELECT user_id,
max(user_property_value) FILTER (WHERE user_property_type_id = 6 ) AS FIRST_NAME,
max(user_property_value) FILTER (WHERE user_property_type_id = 7 ) AS LAST_NAME,
max(TO_DATE(user_property_value, 'YYYY-MM-DD')) FILTER (WHERE user_property_type_id = 8 ) AS DOB,
max(user_property_value) FILTER (WHERE user_property_type_id = 33 ) AS BIRTH_NUMBER
FROM PUBLIC.user_property cp
JOIN PUBLIC.user c using (user_id)
WHERE c.user_group_id= '38'
AND cp.user_property_is_active
GROUP BY user_id
),
duplicate as (
SELECT COALESCE(MAX(
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
THEN 1 ELSE 0 END), 0) AS dup_name_address,
COALESCE(MAX(CASE WHEN REGEXP_REPLACE(UPPER(address_postcode), E'\\_|\\W','','g') = 'WD17 1JY' THEN 1 ELSE 0 END), 0) AS dup_name_postcode
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
(REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Len')
OR
(REGEXP_REPLACE(UPPER(cd.LAST_NAME), E'\\_|\\W', '', 'g') = 'Clyde'
AND REGEXP_REPLACE(UPPER(cd.FIRST_NAME), E'\\_|\\W', '', 'g') = 'Len')
)
AND cd.user_id != '2589384'
), dup_dob_address AS (
SELECT
COALESCE(MAX(CASE WHEN
(cd.DOB IS NOT NULL AND cd.DOB = '1982-06-14 00:00:00') OR (cd.BIRTH_NUMBER IS NOT NULL AND cd.BIRTH_NUMBER = null )
THEN 1 ELSE 0 END), 0) AS dob
FROM
user_detail cd
LEFT JOIN PUBLIC.address ad ON cd.user_id = ad.user_id
WHERE (
REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'
AND (
COALESCE(REGEXP_REPLACE((address_line2), E'\\_|\\W','','g'), '') = ''
OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'
)
AND REGEXP_REPLACE((address_place), E'\\_|\\W','','g') = 'Dartford'
AND address_country_code = 'GB'
)
AND cd.user_id != '2589384'
)
SELECT * FROM duplicate, dup_dob_address;
EXPLAIN output:
Nested Loop (cost=492738.45..492738.50 rows=1 width=12) (actual time=7589.136..7590.933 rows=1 loops=1)
Output: (COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)), (COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0))
Buffers: shared hit=931500 read=103761
CTE user_detail
-> Finalize HashAggregate (cost=423105.99..426854.87 rows=374888 width=104) (actual time=6110.633..6172.107 rows=115625 loops=1)
Output: cp.user_id, max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
Group Key: cp.user_id
Buffers: shared hit=908203 read=103761
-> Gather (cost=335007.31..413733.79 rows=749776 width=104) (actual time=6024.383..6062.501 rows=115625 loops=1)
Output: cp.user_id, (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7))), (PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8))), (PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33)))
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=908203 read=103761
-> Partial HashAggregate (cost=334007.31..337756.19 rows=374888 width=104) (actual time=6017.847..6037.215 rows=38542 loops=3)
Output: cp.user_id, PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 6)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 7)), PARTIAL max(to_date((cp.user_property_value)::text, 'YYYY-MM-DD'::text)) FILTER (WHERE (cp.user_property_type_id = 8)), PARTIAL max((cp.user_property_value)::text) FILTER (WHERE (cp.user_property_type_id = 33))
Group Key: cp.user_id
Buffers: shared hit=908203 read=103761
Worker 0: actual time=6017.372..6035.986 rows=37261 loops=1
Buffers: shared hit=292969 read=33275
Worker 1: actual time=6012.321..6032.378 rows=40788 loops=1
Buffers: shared hit=320593 read=35787
-> Nested Loop (cost=1630.78..321001.76 rows=520222 width=30) (actual time=48.770..5900.888 rows=434730 loops=3)
Output: cp.user_id, cp.user_property_value, cp.user_property_type_id
Buffers: shared hit=908203 read=103761
Worker 0: actual time=45.466..5905.504 rows=420402 loops=1
Buffers: shared hit=292969 read=33275
Worker 1: actual time=44.758..5889.927 rows=459654 loops=1
Buffers: shared hit=320593 read=35787
-> Parallel Bitmap Heap Scan on public.user c (cost=1630.22..22201.58 rows=48268 width=4) (actual time=26.536..39.410 rows=38542 loops=3)
Output: c.user_id, c.currency_code, c.user_group_id, c.user_created_on, c.user_status_id, c.user_max_credit, c.user_last_updated_on, c.user_version
Recheck Cond: (c.user_group_id = 38)
Heap Blocks: exact=2249
Buffers: shared hit=6896 read=319
Worker 0: actual time=22.735..35.486 rows=37261 loops=1
Buffers: shared hit=2303
Worker 1: actual time=22.766..36.418 rows=40788 loops=1
Buffers: shared hit=2343
-> Bitmap Index Scan on idx_user_user_group_id (cost=0.00..1601.26 rows=115844 width=0) (actual time=33.224..33.224 rows=115625 loops=1)
Index Cond: (c.user_group_id = 38)
Buffers: shared hit=1 read=319
-> Index Scan using idx_user_id_user_property on public.user_property cp (cost=0.56..5.51 rows=68 width=30) (actual time=0.036..0.150 rows=11 loops=115625)
Output: cp.user_id, cp.user_property_type_id, cp.user_property_created_on, cp.user_property_is_active, cp.user_property_value, cp.user_property_upper_value, cp.user_property_version
Index Cond: (cp.user_id = c.user_id)
Buffers: shared hit=901307 read=103442
Worker 0: actual time=0.038..0.156 rows=11 loops=37261
Buffers: shared hit=290666 read=33275
Worker 1: actual time=0.034..0.142 rows=11 loops=40788
Buffers: shared hit=318250 read=35787
-> Aggregate (cost=19766.95..19766.96 rows=1 width=8) (actual time=6882.602..6882.605 rows=1 loops=1)
Output: COALESCE(max(CASE WHEN ((regexp_replace((ad.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND ((COALESCE(regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)) AND (regexp_replace((ad.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((ad.address_country_code)::text = 'GB'::text)) THEN 1 ELSE 0 END), 0), COALESCE(max(CASE WHEN (regexp_replace(upper((ad.address_postcode)::text), '\\_|\\W'::text, ''::text, 'g'::text) = 'WD17 1JY'::text) THEN 1 ELSE 0 END), 0)
Buffers: shared hit=908203 read=103761
-> Nested Loop Left Join (cost=0.42..19766.22 rows=21 width=110) (actual time=6882.596..6882.597 rows=0 loops=1)
Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode
Buffers: shared hit=908203 read=103761
-> CTE Scan on user_detail cd (cost=0.00..19681.62 rows=19 width=4) (actual time=6882.595..6882.595 rows=0 loops=1)
Output: cd.user_id, cd.first_name, cd.last_name, cd.dob, cd.birth_number
Filter: ((cd.user_id <> 2589384) AND (((regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text)) OR ((regexp_replace(upper(cd.last_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Clyde'::text) AND (regexp_replace(upper(cd.first_name), '\\_|\\W'::text, ''::text, 'g'::text) = 'Len'::text))))
Rows Removed by Filter: 115625
Buffers: shared hit=908203 read=103761
-> Index Scan using address_idx_01 on public.address ad (cost=0.42..4.44 rows=1 width=114) (never executed)
Output: ad.address_line1, ad.address_line2, ad.address_place, ad.address_country_code, ad.address_postcode, ad.user_id
Index Cond: (ad.user_id = cd.user_id)
-> Aggregate (cost=46116.63..46116.64 rows=1 width=4) (actual time=706.525..707.941 rows=1 loops=1)
Output: COALESCE(max(CASE WHEN (((cd_1.dob IS NOT NULL) AND (cd_1.dob = '1982-06-14'::date)) OR ((cd_1.birth_number IS NOT NULL) AND NULL::boolean)) THEN 1 ELSE 0 END), 0)
Buffers: shared hit=23297
-> Hash Join (cost=36282.83..46116.62 rows=1 width=36) (actual time=706.520..707.934 rows=0 loops=1)
Output: cd_1.dob, cd_1.birth_number
Hash Cond: (cd_1.user_id = ad_1.user_id)
Buffers: shared hit=23297
-> CTE Scan on user_detail cd_1 (cost=0.00..8434.98 rows=373014 width=40) (actual time=0.002..0.003 rows=1 loops=1)
Output: cd_1.user_id, cd_1.first_name, cd_1.last_name, cd_1.dob, cd_1.birth_number
Filter: (cd_1.user_id <> 2589384)
-> Hash (cost=36282.82..36282.82 rows=1 width=4) (actual time=706.499..707.911 rows=0 loops=1)
Output: ad_1.user_id
Buckets: 1024 Batches: 1 Memory Usage: 8kB
Buffers: shared hit=23297
-> Gather (cost=1000.00..36282.82 rows=1 width=4) (actual time=706.496..707.907 rows=0 loops=1)
Output: ad_1.user_id
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=23297
-> Parallel Seq Scan on public.address ad_1 (cost=0.00..35282.72 rows=1 width=4) (actual time=701.969..701.970 rows=0 loops=3)
Output: ad_1.user_id
Filter: (((ad_1.address_country_code)::text = 'GB'::text) AND (regexp_replace((ad_1.address_line1)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Flat 25 Arliss Court 24'::text) AND (regexp_replace((ad_1.address_place)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Dartford'::text) AND ((COALESCE(regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text), ''::text) = ''::text) OR (regexp_replace((ad_1.address_line2)::text, '\\_|\\W'::text, ''::text, 'g'::text) = 'Calderon Road'::text)))
Rows Removed by Filter: 295033
Buffers: shared hit=23297
Worker 0: actual time=699.642..699.644 rows=0 loops=1
Buffers: shared hit=7331
Worker 1: actual time=700.498..700.499 rows=0 loops=1
Buffers: shared hit=7984
Planning Time: 17.292 ms
Execution Time: 7601.934 ms
https://explain.depesz.com/s/cbmv#html
I looked at a similar post about using the pg_trgm extension, but the index I tried made no difference:
create index concurrently on address using gin (address_place gin_trgm_ops);
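The user_property table is around 2.5 million rows, and the address table is also fairly small at < 1.7 million rows.
Is there an efficient way to index Regex_replace? Or does the query need to be redesigned?
Any help is much appreciated.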
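I found this an interesting question, +1!

To answer it, I ran a few tests the old-fashioned way.

All of the code below is available on the fiddle here. The strategy adopted uses GENERATED columns (see the manual). You can also use expression (a.k.a. functional) indexes; see the notes at the bottom of this answer.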
CREATE TABLE test
(
  t_id INTEGER NOT NULL PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  address TEXT,
  post_code TEXT,
  add_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(address, 'Building', 'BUILDING')) STORED,
  p_c_bis TEXT GENERATED ALWAYS AS (REGEXP_REPLACE(post_code, 'abc', 'ABC')) STORED
);

and populate it with a few records:

INSERT INTO test (address, post_code) VALUES
('The Building, Apt 13, Flr 6', 'abc123'),
('The Building, Apt 45, Flr 8', 'abc456'),
('The Building, Apt 45, Flr 9', 'abc789');

Now for the secret sauce: index these GENERATED columns:
CREATE INDEX my_field_regexp_idx ON test (add_bis);

CREATE INDEX post_code_regexp_idx ON test (p_c_bis);

Just to check (always check):

SELECT * FROM test;

Result:

t_id | address                     | post_code | add_bis                     | p_c_bis
1    | The Building, Apt 13, Flr 6 | abc123    | The BUILDING, Apt 13, Flr 6 | ABC123
2    | The Building, Apt 45, Flr 8 | abc456    | The BUILDING, Apt 45, Flr 8 | ABC456
3    | The Building, Apt 45, Flr 9 | abc789    | The BUILDING, Apt 45, Flr 9 | ABC789

First, we run this:
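SET enable_seqscan = OFF;

This doesn't actually disable sequential table scans; it just makes them very expensive (see the discussion below).

Don't do this on a production system, or at least not globally. If, and only if, you fully understand the consequences, you can do it case by case, query by query, but it is not recommended. Today's query hint is tomorrow's bug, so use it sparingly.

The reason I do it here is to force the optimizer to choose the index over a sequential scan. Without enable_seqscan = OFF, the very small sample table here would lead the optimizer to pick a sequential scan automatically. With the large number of records on a production system, this shouldn't be a problem.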
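From the documentation here:

enable_seqscan (boolean)
Enables or disables the query planner's use of sequential scan plan types. It is impossible to suppress sequential scans entirely, but turning this variable off discourages the planner from using one if there are other methods available. The default is on.

[Emphasis mine, on "It is impossible to suppress sequential scans entirely".] Also, see the discussion below.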
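
A session-wide SET is easy to forget about. As a rough sketch of the "query by query" approach mentioned above, the setting can be scoped to a single transaction with SET LOCAL, so that it reverts automatically at COMMIT or ROLLBACK. This assumes the test table and indexes created earlier in this answer; the lookup value is simply one of the sample rows.

```sql
BEGIN;

-- SET LOCAL lasts only until the end of the current transaction.
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT *
FROM test
WHERE add_bis = 'The BUILDING, Apt 13, Flr 6';

ROLLBACK;  -- COMMIT would revert the setting just the same

-- Back to whatever the session-level value was before BEGIN.
SHOW enable_seqscan;
```

If the session-level value was changed earlier with a plain SET, RESET enable_seqscan; restores the default.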
Then we run:

EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT
  *
FROM
  test
WHERE add_bis LIKE 'The BUILDING';

Result:

QUERY PLAN
Index Scan using address_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)
  Output: t_id, address, post_code, add_bis, p_c_bis
  Index Cond: (test.add_bis = 'The BUILDING'::text)
  Filter: (test.add_bis ~~ 'The BUILDING'::text)
  Buffers: shared read=1
Planning:
  Buffers: shared hit=22
Planning Time: 0.150 ms
Execution Time: 0.042 ms

Very nice, the result we wanted, an Index Scan:

Index Scan using address_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=132) (actual time=0.022..0.022 rows=0 loops=1)

Then, we run:
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
  p_c_bis
FROM test
WHERE p_c_bis = 'ABC123';

Here the requested data is entirely contained in the index (on the REGEXP_REPLACE()d column), et voilà:

Index Only Scan using post_code_regexp_idx on public.test (cost=0.13..8.15 rows=1 width=32) (actual time=0.039..0.040 rows=1 loops=1)
  Output: p_c_bis
  Index Cond: (test.p_c_bis = 'ABC123'::text)
  Heap Fetches: 1
  Buffers: shared hit=1 read=1
Planning Time: 0.043 ms
Execution Time: 0.064 ms

We have the Index Only Scan we wanted! There is another Index Only Scan example in the fiddle.
Interestingly, when I re-ran this at the end:

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT
  *
FROM test
WHERE add_bis LIKE 'The BUILDING%';

Result:

Seq Scan on public.test (cost=10000000000.00..10000000001.04 rows=1 width=132) (actual time=0.008..0.009 rows=3 loops=1)
  Output: t_id, address, post_code, add_bis, p_c_bis
  Filter: (test.add_bis ~~ 'The BUILDING%'::text)
  Buffers: shared hit=1
Planning Time: 0.038 ms
Execution Time: 0.027 ms

I can only speculate (a fancy word for "guess"!) that, at this point, the optimizer knows the table is in the memory buffers and that a Seq Scan would be the fastest/lowest-cost option anyway (note the very high cost, cost=10000000000.00). This fits well with the documentation above ("It is impossible to suppress sequential scans entirely").

I've seen this before; why it uses the index in my first query but not in the second is a mystery to me. I'm afraid the source code is a bit above my pay grade.
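
One plausible explanation, offered as an assumption rather than something verified on the fiddle: if the database does not use the C collation, a default B-tree index cannot serve a LIKE prefix pattern such as 'The BUILDING%'. The first query had no wildcard, so the planner could treat it as a plain equality (note the Index Cond with = in its plan), which any B-tree index supports. With no usable index for the second query, the Seq Scan is the only plan left, and the huge cost is just the enable_seqscan = OFF penalty. A sketch of an index that can support prefix matching, using the text_pattern_ops operator class (the index name is made up):

```sql
-- Hypothetical index name. text_pattern_ops compares character by character,
-- which lets a B-tree index support LIKE 'prefix%' in non-C collations.
CREATE INDEX add_bis_pattern_idx ON test (add_bis text_pattern_ops);

EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
SELECT *
FROM test
WHERE add_bis LIKE 'The BUILDING%';
```

Whether the planner actually picks this index on a three-row table still depends on the cost settings in effect.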
"Is there an efficient way to index Regex_replace?"

The REGEXP_REPLACE() strategy works if you use GENERATED columns.
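
Applied to the pattern from the question, the same idea might look like the sketch below. The table is a throwaway stand-in (address_gen_demo, with invented rows), and the *_norm column names and the index name are made up for illustration. Generated columns need PostgreSQL 12 or later.

```sql
-- Throwaway demo table; only two of the question's address columns are modelled.
CREATE TEMP TABLE address_gen_demo (
    user_id            integer,
    address_line1      text,
    address_place      text,
    -- Normalised copies: underscores and non-word characters stripped.
    address_line1_norm text GENERATED ALWAYS AS
        (REGEXP_REPLACE(address_line1, '_|\W', '', 'g')) STORED,
    address_place_norm text GENERATED ALWAYS AS
        (REGEXP_REPLACE(address_place, '_|\W', '', 'g')) STORED
);

INSERT INTO address_gen_demo (user_id, address_line1, address_place) VALUES
    (1, 'Flat 25, Arliss Court 24', 'Dartford'),
    (2, '1 High_Street',            'Wat-ford');

-- An ordinary B-tree index on the normalised columns.
CREATE INDEX address_gen_demo_norm_idx
    ON address_gen_demo (address_line1_norm, address_place_norm);

-- The comparison values must be stripped the same way (no spaces, no punctuation).
SELECT user_id
FROM address_gen_demo
WHERE address_line1_norm = 'Flat25ArlissCourt24'
  AND address_place_norm = 'Dartford';
```

On a real table, adding STORED generated columns rewrites the table, so it is worth trying on a copy first.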
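1. I also tried expression indexes (see the manual), i.e. without GENERATED columns, with the following syntax:

CREATE INDEX address_regexp_idx ON test (REGEXP_REPLACE(address, 'Building', 'BUILDING'));

but I couldn't get an Index Scan to work in either case (on the fiddle); YMMV. I strongly recommend that you check this strategy too, particularly if the extra space is a trade-off you care about; you may or may not get the result you're after.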
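
For what it's worth, the planner only considers an expression index when the WHERE clause uses the same expression as the index definition, including the pattern and the flags. A minimal sketch under that assumption, using a throwaway table (address_demo) and a made-up index name rather than the real address table:

```sql
-- Throwaway demo table with invented rows.
CREATE TEMP TABLE address_demo (
    user_id       integer,
    address_line1 text
);

INSERT INTO address_demo VALUES
    (1, 'Flat 25, Arliss_Court 24'),
    (2, '10 Downing Street');

-- Index the normalised value directly; no extra column is stored in the table.
CREATE INDEX address_demo_line1_norm_idx
    ON address_demo (REGEXP_REPLACE(address_line1, '_|\W', '', 'g'));

ANALYZE address_demo;

BEGIN;
-- Only needed because the demo table is tiny.
SET LOCAL enable_seqscan = off;

-- The WHERE clause repeats the indexed expression character for character.
EXPLAIN
SELECT user_id
FROM address_demo
WHERE REGEXP_REPLACE(address_line1, '_|\W', '', 'g') = 'Flat25ArlissCourt24';
ROLLBACK;
```

A predicate written against a different expression, or with a different pattern string, would not match this index.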
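2. Please always include your PostgreSQL version in your question.

3. The DDL for the tables in your query (e.g. cd) would be useful. I just followed general principles, but as a rule of thumb, the more information you include in your question, the better.

4. Your query contains lines such as the following: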
CASE WHEN REGEXP_REPLACE((address_line1), E'\\_|\\W','','g') = 'Flat 25 Arliss Court 24'

and

OR REGEXP_REPLACE((address_line2), E'\\_|\\W','','g') = 'Calderon Road'

These conditions can never be satisfied.

See the end of the fiddle:

SELECT REGEXP_REPLACE(('_Flat 25 Arl_iss Co__urt 24'), E'\\_|\\W','','g');

Result:

regexp_replace
Flat25ArlissCourt24 -- != Flat 25 Arliss Court 24

So a string with its spaces stripped can never equal a string that contains spaces.
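
If the comparison value is pushed through the same REGEXP_REPLACE() call (or stripped by hand), the two sides become comparable again. Here is a small sketch using the sample string from the fiddle. The same reasoning would apply to the UPPER(...) comparisons in the question, where an upper-cased column is compared with mixed-case literals such as 'Clyde'.

```sql
-- Normalise both sides the same way before comparing.
-- '_|\W' is equivalent to the E'\\_|\\W' used in the question (see note 5).
SELECT REGEXP_REPLACE('_Flat 25 Arl_iss Co__urt 24', '_|\W', '', 'g') AS stored_side,
       REGEXP_REPLACE('Flat 25 Arliss Court 24',     '_|\W', '', 'g') AS literal_side,
       REGEXP_REPLACE('_Flat 25 Arl_iss Co__urt 24', '_|\W', '', 'g')
     = REGEXP_REPLACE('Flat 25 Arliss Court 24',     '_|\W', '', 'g') AS is_match;
-- stored_side and literal_side are both Flat25ArlissCourt24, so is_match is true

-- Same idea for case: UPPER() on one side needs an upper-case value on the other.
SELECT UPPER('Clyde') = 'Clyde'        AS mixed_case_literal,  -- false
       UPPER('Clyde') = UPPER('Clyde') AS upper_both;          -- true
```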
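5. Finally, you're using the outdated E'\\_|\\W' notation; '_|\W' is cleaner. There's no need to escape the _ (underscore) character: it has no special meaning in regular expressions, and all those backslashes make the regex hard to read (IMHO, regexes are bad enough as they are!).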
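
A quick way to check that the two spellings behave the same, assuming standard_conforming_strings is on (the default since PostgreSQL 9.1); the test string is made up:

```sql
-- What the regex engine actually receives in each case.
SELECT E'\\_|\\W' AS escape_string_form,  -- \_|\W
       '_|\W'     AS plain_string_form;   -- _|\W

-- Both patterns strip underscores and non-word characters.
SELECT REGEXP_REPLACE('a_b c-d', E'\\_|\\W', '', 'g') AS old_style,  -- abcd
       REGEXP_REPLACE('a_b c-d', '_|\W',     '', 'g') AS new_style;  -- abcd
```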