Cla*_*ley 9 postgresql performance execution-plan query-performance
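I'm running into a problem on PostgreSQL 9.3.4 with data loaded using the StackOverflow schema. One query takes about 10 times longer than it should because the planner chooses a hash join instead of a nested loop with index scans. For example, if I select 500 users in the query, a hash join is used instead of the id and type indexes on the post_tokenized table: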
explain
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
join posts as posts_tbl
on posts_tbl.id = tokenized_tbl.id
where type = 'tag'
and user_screen_name is not null
and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500)
and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500))
Hash Join  (cost=29570.13..751852.55 rows=119954 width=21)
  Hash Cond: (tokenized_tbl.id = posts_tbl.id)
  ->  Index Scan using type_index_post_tokenized on post_tokenized tokenized_tbl  (cost=0.44..646219.29 rows=20281711 width=8)
        Index Cond: (type = 'tag'::text)
  ->  Hash  (cost=29561.73..29561.73 rows=637 width=25)
        ->  Hash Join  (cost=15576.75..29561.73 rows=637 width=25)
              Hash Cond: (posts_tbl.id = posts.id)
              ->  Nested Loop  (cost=48.20..12824.71 rows=106853 width=21)
                    ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                          ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                      Index Cond: (reputation > 100000)
                    ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.40 rows=214 width=25)
                          Index Cond: (owner_user_id = users.id)
                          Filter: (user_screen_name IS NOT NULL)
              ->  Hash  (cost=14181.63..14181.63 rows=107754 width=4)
                    ->  HashAggregate  (cost=13104.09..14181.63 rows=107754 width=4)
                          ->  Nested Loop  (cost=48.20..12834.71 rows=107754 width=4)
                                ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                                      ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                            ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                                  Index Cond: (reputation > 100000)
                                ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.40 rows=216 width=8)
                                      Index Cond: (owner_user_id = users_1.id)
However, if I reduce the number of users to 200, a nested loop using the indexes is chosen (much faster):
explain
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
join posts as posts_tbl
on posts_tbl.id = tokenized_tbl.id
where type = 'tag'
and user_screen_name is not null
and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 200)
and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 200))
Nested Loop  (cost=6633.63..466114.15 rows=47982 width=21)
  ->  Hash Join  (cost=6291.07..11836.00 rows=102 width=25)
        Hash Cond: (posts_tbl.id = posts.id)
        ->  Nested Loop  (cost=19.80..5189.72 rows=42741 width=21)
              ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                    ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                          ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                Index Cond: (reputation > 100000)
              ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.70 rows=214 width=25)
                    Index Cond: (owner_user_id = users.id)
                    Filter: (user_screen_name IS NOT NULL)
        ->  Hash  (cost=5732.50..5732.50 rows=43102 width=4)
              ->  HashAggregate  (cost=5301.48..5732.50 rows=43102 width=4)
                    ->  Nested Loop  (cost=19.80..5193.72 rows=43102 width=4)
                          ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                                ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                                      ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                            Index Cond: (reputation > 100000)
                          ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.70 rows=216 width=8)
                                Index Cond: (owner_user_id = users_1.id)
  ->  Bitmap Heap Scan on post_tokenized tokenized_tbl  (cost=342.56..4448.69 rows=502 width=8)
        Recheck Cond: (id = posts_tbl.id)
        Filter: (type = 'tag'::text)
        ->  Bitmap Index Scan on id_index_post_tokenized  (cost=0.00..342.44 rows=43656 width=0)
              Index Cond: (id = posts_tbl.id)
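How can I get the same plan (nested loop using the indexes) when selecting 500 users? I tried adjusting the following parameters: cpu_tuple_cost, seq_page_cost, random_page_cost, effective_cache_size (ref), but I could not get the plan to change. The plan appears to switch as the number of requested users grows, yet testing in my environment shows the query would be much faster even at 500 users if Postgres kept the same plan.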
Erw*_*ter 12
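This closely related answer on SO should answer your main question:
Set enable_seqscan = off in a single SELECT query
In a similar fashion, you can disable hash joins for the current transaction: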
SET LOCAL enable_hashjoin=off;
But that is not what I recommend. Read the answer over there.
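For diagnosis only, the transaction-scoped setting might be used as in the sketch below. It reuses the 500-user query from the question. Wrapping it in BEGIN / ROLLBACK is an assumption about how you would run the test, not something the linked answer prescribes. Note that enable_hashjoin = off only discourages hash joins through a large cost penalty; the planner can still fall back to one if it sees no alternative.

```sql
BEGIN;

-- SET LOCAL only lasts until the end of the current transaction.
SET LOCAL enable_hashjoin = off;

EXPLAIN
SELECT creation_epoch, user_screen_name, chunk
FROM   post_tokenized AS tokenized_tbl
JOIN   posts AS posts_tbl ON posts_tbl.id = tokenized_tbl.id
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL
AND    owner_user_id IN (
          SELECT id FROM users
          WHERE  reputation > 100000
          ORDER  BY reputation ASC
          LIMIT  500)
AND    tokenized_tbl.id IN (
          SELECT id FROM posts
          WHERE  owner_user_id IN (
                    SELECT id FROM users
                    WHERE  reputation > 100000
                    ORDER  BY reputation ASC
                    LIMIT  500));

-- The setting reverts on ROLLBACK as well as on COMMIT.
ROLLBACK;
```

If the nested loop plan shows up here with a higher estimated cost than the hash join, that points back at the cost settings or row estimates rather than at the join method itself.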
This is also about statistics and cost settings.
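A rough sketch of what checking statistics and cost settings could look like follows. The table and column names come from the question; the statistics target of 1000 and the random_page_cost value of 2 are purely illustrative assumptions, not recommended values, and should be chosen from measurements on your own hardware.

```sql
-- Refresh planner statistics for the tables involved in the join.
ANALYZE users;
ANALYZE posts;
ANALYZE post_tokenized;

-- Optionally collect finer-grained statistics for the join column.
-- 1000 is an assumed, illustrative target.
ALTER TABLE post_tokenized ALTER COLUMN id SET STATISTICS 1000;
ANALYZE post_tokenized;

-- Try a cost setting for the current session only, then re-run EXPLAIN.
-- 2 is an assumed value for mostly cached data or fast storage.
SET random_page_cost = 2;
SHOW random_page_cost;

-- Return to the configured default when done experimenting.
RESET random_page_cost;
```

Session-level SET keeps the experiment away from other connections; a value that helps can later be made permanent in postgresql.conf.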
More importantly, untangle your query first:
SELECT creation_epoch, user_screen_name, chunk
FROM (
SELECT id AS owner_user_id
FROM users
WHERE reputation > 100000
ORDER BY reputation
LIMIT 500
) u
JOIN posts p USING (owner_user_id)
JOIN post_tokenized t USING (id)
WHERE type = 'tag'
AND user_screen_name IS NOT NULL;
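This should be much faster, and it also makes it easier for the query planner to pick the best plan (given reasonable cost settings and table statistics).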
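To check whether the rewritten query really behaves better on your data, you could compare estimated and actual row counts with EXPLAIN (ANALYZE, BUFFERS). This is a minimal sketch; keep in mind that, unlike plain EXPLAIN, the ANALYZE option actually executes the query.

```sql
-- Executes the query and reports actual times, row counts and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT creation_epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;
```

Large gaps between the estimated "rows=" figures and the actual row counts in the output usually indicate that the table statistics deserve attention first.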