Getting the PostgreSQL query planner to use a nested loop with an index instead of a hash join

Cla*_*ley 9 postgresql performance execution-plan query-performance

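I'm running into problems with PostgreSQL 9.3.4 after loading data that follows the StackOverflow schema. One query takes about 10x longer than it should because the planner chooses a hash join instead of a nested loop with an index. For example, if I select 500 users in the query, a hash join is used instead of the id and type indexes on the post_tokenized table: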

explain 
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
                    join posts as posts_tbl
                    on posts_tbl.id = tokenized_tbl.id
                    where type = 'tag'
                    and user_screen_name is not null
                    and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500)
                    and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 500))

 Hash Join  (cost=29570.13..751852.55 rows=119954 width=21)
   Hash Cond: (tokenized_tbl.id = posts_tbl.id)
   ->  Index Scan using type_index_post_tokenized on post_tokenized tokenized_tbl  (cost=0.44..646219.29 rows=20281711 width=8)
         Index Cond: (type = 'tag'::text)
   ->  Hash  (cost=29561.73..29561.73 rows=637 width=25)
         ->  Hash Join  (cost=15576.75..29561.73 rows=637 width=25)
               Hash Cond: (posts_tbl.id = posts.id)
               ->  Nested Loop  (cost=48.20..12824.71 rows=106853 width=21)
                     ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                           ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                 ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                       Index Cond: (reputation > 100000)
                     ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.40 rows=214 width=25)
                           Index Cond: (owner_user_id = users.id)
                           Filter: (user_screen_name IS NOT NULL)
               ->  Hash  (cost=14181.63..14181.63 rows=107754 width=4)
                     ->  HashAggregate  (cost=13104.09..14181.63 rows=107754 width=4)
                           ->  Nested Loop  (cost=48.20..12834.71 rows=107754 width=4)
                                 ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                                       ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                             ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                                   Index Cond: (reputation > 100000)
                                 ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.40 rows=216 width=8)
                                       Index Cond: (owner_user_id = users_1.id)

However, if I reduce the number of users to 200, a nested loop with an index is used, which is faster:

explain 
select creation_epoch, user_screen_name, chunk from post_tokenized as tokenized_tbl
                    join posts as posts_tbl
                    on posts_tbl.id = tokenized_tbl.id
                    where type = 'tag'
                    and user_screen_name is not null
                    and owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 200)
                    and tokenized_tbl.id in (select id from posts where owner_user_id in (select id from users where reputation > 100000 order by reputation asc limit 200))

 Nested Loop  (cost=6633.63..466114.15 rows=47982 width=21)
   ->  Hash Join  (cost=6291.07..11836.00 rows=102 width=25)
         Hash Cond: (posts_tbl.id = posts.id)
         ->  Nested Loop  (cost=19.80..5189.72 rows=42741 width=21)
               ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                     ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                           ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                 Index Cond: (reputation > 100000)
               ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.70 rows=214 width=25)
                     Index Cond: (owner_user_id = users.id)
                     Filter: (user_screen_name IS NOT NULL)
         ->  Hash  (cost=5732.50..5732.50 rows=43102 width=4)
               ->  HashAggregate  (cost=5301.48..5732.50 rows=43102 width=4)
                     ->  Nested Loop  (cost=19.80..5193.72 rows=43102 width=4)
                           ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                                 ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                                       ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                             Index Cond: (reputation > 100000)
                           ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.70 rows=216 width=8)
                                 Index Cond: (owner_user_id = users_1.id)
   ->  Bitmap Heap Scan on post_tokenized tokenized_tbl  (cost=342.56..4448.69 rows=502 width=8)
         Recheck Cond: (id = posts_tbl.id)
         Filter: (type = 'tag'::text)
         ->  Bitmap Index Scan on id_index_post_tokenized  (cost=0.00..342.44 rows=43656 width=0)
               Index Cond: (id = posts_tbl.id)

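How can I get the same plan (a nested loop using the index) when selecting 500 users? I tried tuning the following parameters: cpu_tuple_cost, seq_page_cost, random_page_cost, effective_cache_size (ref), but I could not work out how to change the plan. The plan seems to change as the number of requested users grows, but from tests in my environment the query would be much faster if Postgres kept the same plan even at 500 users.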

Erw*_*ter 12

This closely related answer on SO should answer your main question:
Setting enable_seqscan = off in a single SELECT query

The same approach can be used to disable hash joins for the current transaction:

SET LOCAL enable_hashjoin=off;
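A minimal sketch of how that setting might be scoped, assuming the 500-user query from the question. SET LOCAL only takes effect inside a transaction block and reverts at COMMIT or ROLLBACK, so the rest of the session keeps its defaults. Note that EXPLAIN ANALYZE actually executes the query, and that with hash joins discouraged the planner may still pick a merge join rather than the nested loop, so the resulting plan is worth checking.

```sql
BEGIN;

-- Discourage hash joins for this transaction only.
SET LOCAL enable_hashjoin = off;

-- Run the question's query and show the plan that was actually used.
EXPLAIN ANALYZE
SELECT creation_epoch, user_screen_name, chunk
FROM   post_tokenized AS tokenized_tbl
JOIN   posts AS posts_tbl ON posts_tbl.id = tokenized_tbl.id
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL
AND    owner_user_id IN (
          SELECT id
          FROM   users
          WHERE  reputation > 100000
          ORDER  BY reputation
          LIMIT  500
       );

-- The setting reverts automatically here.
COMMIT;
```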

But that is not what I recommend. Read the answer over there.
It also covers statistics and cost settings.

More importantly, untangle your query first:

SELECT creation_epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation 
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;

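This should be considerably faster, and it also makes it easier for the query planner to pick the best plan (given reasonable cost settings and table statistics).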
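As a rough illustration of what checking "cost settings and table statistics" could look like for this query, here is a sketch. The random_page_cost value is an assumption chosen for illustration, not a recommendation from the answer; the PostgreSQL default is 4.0, and lower values make index scans look cheaper to the planner. Comparing estimated and actual row counts in the output is one way to spot stale or misleading statistics.

```sql
-- Refresh planner statistics for the tables involved.
ANALYZE users;
ANALYZE posts;
ANALYZE post_tokenized;

-- Inspect the current values of two settings mentioned in the question.
SHOW random_page_cost;
SHOW effective_cache_size;

-- Try a lower random_page_cost for this session only.
-- The value 2 is an illustrative assumption; tune it to your storage.
SET random_page_cost = 2;

-- Run the rewritten query and compare estimated vs. actual rows per node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT creation_epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;

-- Restore the configured default for the rest of the session.
RESET random_page_cost;
```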