Take the following two tables:
Table "public.contacts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------+-----------------------------+-------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('contacts_id_seq'::regclass) | plain | |
created_at | timestamp without time zone | not null | plain | |
updated_at | timestamp without time zone | not null | plain | |
external_id | integer | | plain | |
email_address | character varying | | extended | |
first_name | character varying | | extended | |
last_name | character varying | | extended | |
company | character varying | | extended | |
industry | character varying | | extended | |
country | character varying | | extended | |
region | character varying | | extended | |
ext_instance_id | integer | | plain | |
title | character varying | | extended | |
Indexes:
"contacts_pkey" PRIMARY KEY, btree (id)
"index_contacts_on_ext_instance_id_and_external_id" UNIQUE, btree (ext_instance_id, external_id)
and
Table "public.members"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------+-----------------------------+--------------------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('members_id_seq'::regclass) | plain | |
step_id | integer | | plain | |
contact_id | integer | | plain | |
rule_id | integer | | plain | |
request_id | integer | | plain | |
sync_id | integer | | plain | |
status | integer | not null default 0 | plain | |
matched_targeted_rule | boolean | default false | plain | |
external_fields | jsonb | | extended | |
imported_at | timestamp without time zone | | plain | |
campaign_id | integer | | plain | |
ext_instance_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
Indexes:
"members_pkey" PRIMARY KEY, btree (id)
"index_members_on_contact_id_and_step_id" UNIQUE, btree (contact_id, step_id)
"index_members_on_campaign_id" btree (campaign_id)
"index_members_on_step_id" btree (step_id)
"index_members_on_sync_id" btree (sync_id)
"index_members_on_request_id" btree (request_id)
"index_members_on_status" btree (status)
Indexes exist on both primary keys as well as on members.contact_id.
I need to delete any contact that has no related members. There are roughly 3MM contact and 25MM member records.
I'm trying the following two queries:
DELETE FROM "contacts"
WHERE "contacts"."id" IN (SELECT "contacts"."id"
FROM "contacts"
LEFT OUTER JOIN members
ON
members.contact_id = contacts.id
WHERE members.id IS NULL);
DELETE 0
Time: 173033.801 ms
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.354..188717.354 rows=0 loops=1)
-> Nested Loop (cost=2654306.79..2654307.86 rows=1 width=18) (actual time=188717.351..188717.351 rows=0 loops=1)
-> HashAggregate (cost=2654306.36..2654306.37 rows=1 width=16) (actual time=188717.349..188717.349 rows=0 loops=1)
Group Key: contacts_1.id
-> Hash Right Join (cost=161177.46..2654306.36 rows=1 width=16) (actual time=188717.345..188717.345 rows=0 loops=1)
Hash Cond: (members.contact_id = contacts_1.id)
Filter: (members.id IS NULL)
Rows Removed by Filter: 26725870
-> Seq Scan on members (cost=0.00..1818698.96 rows=25322396 width=14) (actual time=0.043..160226.686 rows=26725870 loops=1)
-> Hash (cost=105460.65..105460.65 rows=3205265 width=10) (actual time=1962.612..1962.612 rows=3196180 loops=1)
Buckets: 262144 Batches: 4 Memory Usage: 34361kB
-> Seq Scan on contacts contacts_1 (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.011..950.657 rows=3196180 loops=1)
-> Index Scan using contacts_pkey on contacts (cost=0.43..1.48 rows=1 width=10) (never executed)
Index Cond: (id = contacts_1.id)
Planning time: 0.488 ms
Execution time: 188718.862 ms
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
DELETE 0
Time: 170871.219 ms
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.034..177523.034 rows=0 loops=1)
-> Hash Anti Join (cost=2258873.91..2954594.50 rows=1895601 width=12) (actual time=177523.029..177523.029 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105460.65 rows=3205265 width=10) (actual time=0.018..1068.357 rows=3196180 loops=1)
-> Hash (cost=1818698.96..1818698.96 rows=25322396 width=10) (actual time=169587.802..169587.802 rows=26725870 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 36228kB
-> Seq Scan on members c (cost=0.00..1818698.96 rows=25322396 width=10) (actual time=0.052..160081.880 rows=26725870 loops=1)
Planning time: 0.901 ms
Execution time: 177524.526 ms
As you can see, both queries show similar performance, taking about 3 minutes each, even though no records are actually deleted.
The server's disk I/O spikes to 100%, so I assume data is being spilled to disk because of the sequential scans on both contacts and members.
The server is an EC2 r3.large (15GB RAM).
Any ideas on how to optimize this query?
After running vacuum analyze on both tables and making sure enable_mergejoin is set to on, there is no difference in query time:
DELETE FROM contacts
WHERE NOT EXISTS (SELECT 1
FROM members c
WHERE c.contact_id = contacts.id);
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on contacts (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.342..209406.342 rows=0 loops=1)
-> Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Hash Cond: (contacts.id = c.contact_id)
-> Seq Scan on contacts (cost=0.00..105683.28 rows=3227528 width=10) (actual time=0.008..1010.643 rows=3227462 loops=1)
-> Hash (cost=1814029.74..1814029.74 rows=24855474 width=10) (actual time=198054.302..198054.302 rows=27307060 loops=1)
Buckets: 262144 Batches: 32 Memory Usage: 37006kB
-> Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Planning time: 0.328 ms
Execution time: 209408.040 ms
PG version:
PostgreSQL 9.4.4 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo Hardened 4.5.4 p1.0, pie-0.4.7) 4.5.4, 64-bit
Relation sizes:
Table | Size | External Size
-----------------------+---------+---------------
members | 23 GB | 11 GB
contacts | 944 MB | 371 MB
Settings:
work_mem
----------
64MB
random_page_cost
------------------
4
Attempting to do this in batches doesn't seem to help with I/O usage (it still hits 100%), and it doesn't seem to improve the time either, despite using index-based plans.
DO $do$
BEGIN
  FOR i IN 57..668 LOOP
    DELETE FROM contacts
    WHERE contacts.id IN (
      SELECT contacts.id
      FROM contacts
      LEFT OUTER JOIN members
        ON members.contact_id = contacts.id
      WHERE members.id IS NULL
        AND contacts.id >= (i * 10000)
        AND contacts.id <  ((i+1) * 10000));
  END LOOP;
END $do$;
I had to kill the query after Time: 1203492.326 ms, and disk I/O stayed at 100% for the entire time the query was running. I also tried chunks of 1,000 and 5,000 but did not see any gain in performance.
Note: the range 57..668 was used because I know those cover the existing contact IDs (i.e. min(id) and max(id)).
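(For reference, bounds like these can be derived with something along the lines of the following; the /10000 divisor matches the batch size used above:)
SELECT min(id) / 10000 AS first_bucket,
       max(id) / 10000 AS last_bucket
FROM   contacts;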
Any ideas on what I can do to optimize this query?
Your queries are fine. I would use the NOT EXISTS variant.
Your index index_members_on_contact_id_and_step_id is also useful for this, but see below regarding BRIN indexes.
You can tune your server, table, and index configuration.
Since you hardly ever actually update or delete rows (per your comment), focus on optimizing read performance.
You provided:
The server is an EC2 r3.large (15GB RAM).
and:
PostgreSQL 9.4.4
Your version is seriously outdated. At the very least, upgrade to the latest point release. Better yet, upgrade to the current major version. Postgres 9.5 and 9.6 brought major improvements for big data, which is exactly what you need here.
There is an unexpected ~10% mismatch between expected and actual row counts in the basic sequential scan:
Seq Scan on members c (cost=0.00..1814029.74 rows=24855474 width=10) (actual time=1.132..188654.555 rows=27307060 loops=1)
Not dramatic at all, but it still should not occur in this query. It indicates you may have to tune your autovacuum settings, possibly per table for the very large members.
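To see whether any per-table autovacuum overrides are already in place, a quick catalog check (a sketch) is:
SELECT relname, reloptions
FROM   pg_class
WHERE  relname IN ('members', 'contacts');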
More of a problem:
Hash Anti Join (cost=2246088.17..2966677.08 rows=1875003 width=12) (actual time=209406.338..209406.338 rows=0 loops=1)
Postgres expects to find 1875003 rows to delete, while actually 0 rows are found. That's unexpected. Maybe substantially increasing the statistics target on members.contact_id and contacts.id can help narrow the gap, which might allow better query plans. See:
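For example, raising the targets could look like this (a sketch; 1000 is just an example value, the default is 100):
ALTER TABLE members  ALTER COLUMN contact_id SET STATISTICS 1000;
ALTER TABLE contacts ALTER COLUMN id         SET STATISTICS 1000;
ANALYZE members;
ANALYZE contacts;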
Your ~25MM rows in members occupy 23 GB - almost 1 kB per row, which seems excessive for the table definition you provided (even if the total size you reported includes indexes):
 4 bytes item identifier
24 tuple header
 8 null bitmap
36 9x integer
16 2x ts
 1 1x bool
?? 1x jsonb
See:
That's 89 bytes per row - or less with some NULL values - and hardly any alignment padding, so 96 bytes max, plus your jsonb column.
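To check whether the jsonb column accounts for the difference, something like this could be run (pg_column_size() reports the stored, possibly compressed size; the LIMIT is there so it does not scan all 25MM rows):
SELECT avg(pg_column_size(external_fields)) AS avg_jsonb_bytes,
       max(pg_column_size(external_fields)) AS max_jsonb_bytes
FROM  (SELECT external_fields FROM members LIMIT 100000) s;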
Either that jsonb column is very big, which would make me suggest normalizing the data into separate columns or a separate table. Consider:
Or your table is bloated, which can be solved with VACUUM FULL ANALYZE or, while being at it:
CLUSTER members USING index_members_on_contact_id_and_step_id;
VACUUM members;
But either takes an exclusive lock on the table, which you say you cannot afford. pg_repack can do the same without an exclusive lock. See:
Even when we account for index size, your table seems too big: you have 7 small indexes, each 36 - 44 bytes per row without bloat, less with NULL values, so < 300 bytes per row in total.
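Actual index sizes can be verified with a quick check along these lines (a sketch):
SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'members'
ORDER  BY pg_relation_size(indexrelid) DESC;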
Either way, consider more aggressive autovacuum settings for your members table. Related:
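A per-table override could look roughly like this (the thresholds are illustrative, not values to copy verbatim):
ALTER TABLE members SET (
    autovacuum_vacuum_scale_factor  = 0.01,   -- vacuum after ~1 % dead rows
    autovacuum_analyze_scale_factor = 0.005   -- analyze after ~0.5 % changed rows
);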
And/or stop bloating the table to begin with. Do you update rows a lot? Is there a particular column you update a lot, maybe that jsonb column? You could move it to a separate (1:1) table, just to stop bloating the main table with dead tuples and keeping autovacuum from doing its job.
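A sketch of such a split (table and column names are purely illustrative):
CREATE TABLE member_external_fields (
    member_id       integer PRIMARY KEY REFERENCES members(id),
    external_fields jsonb
);

INSERT INTO member_external_fields (member_id, external_fields)
SELECT id, external_fields
FROM   members
WHERE  external_fields IS NOT NULL;
-- then, once everything reads from the new table:
-- ALTER TABLE members DROP COLUMN external_fields;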
Block range indexes require Postgres 9.5 or later and dramatically reduce index size. I was too optimistic in my first draft. A BRIN index is perfect for your use case if you have many rows in members for each contact.id - after physically clustering your table at least once (the CLUSTER command above fits). In that case Postgres can rule out whole data pages quickly. But your numbers suggest only around 8 rows per contact.id, so data pages would often contain multiple values, which voids much of the effect. Depends on the actual details of your data distribution ...
On the other hand, as things stand, the tuple size is around 1 kB, so there are only ~8 rows per data page (typically 8 kB). If that is not mostly bloat, a BRIN index might help after all.
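The actual rows-per-page density can be read from the catalog (a sketch; these are estimates maintained by ANALYZE):
SELECT relpages,
       reltuples,
       round(reltuples::numeric / NULLIF(relpages, 0), 1) AS rows_per_page
FROM   pg_class
WHERE  relname = 'members';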
But you need to upgrade your server version first. See the note about your Postgres version above.
\nCREATE INDEX members_contact_id_brin_idx ON members USING BRIN (contact_id);\nRun Code Online (Sandbox Code Playgroud)\n