DISTINCT ON query with multiple joins is really slow

Nat*_*ert 2 postgresql join

Originally posted: https://stackoverflow.com/questions/11173717/expensive-query-on-select-distinct-with-multiple-inner-join-in-postgres

The songs table has only about 4k rows; posts and stations have fewer. Running the query without DISTINCT ON fixes it.

Running Postgres on Mac OS X Lion.

Song Load (7358.2ms)

EXPLAIN (426.2ms)

EXPLAIN for: 
SELECT  DISTINCT ON (songs.rank, songs.shared_id) songs.*, 
        songs.*, 
        posts.url as post_url, 
        posts.excerpt as post_excerpt, 
        stations.title as station_title, 
        stations.slug as station_slug 
FROM "songs" 
    INNER JOIN "posts" ON "posts"."id" = "songs"."post_id" 
    inner join stations on stations.blog_id = songs.blog_id 
WHERE "songs"."processed" = 't' 
  AND "songs"."working" = 't' 
ORDER BY songs.rank desc 
LIMIT 24 OFFSET 0

                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit  (cost=546147.28..546159.16 rows=24 width=2525)
   ->  Unique  (cost=546147.28..547360.75 rows=2452 width=2525)
         ->  Sort  (cost=546147.28..546551.77 rows=161796 width=2525)
               Sort Key: songs.rank, songs.shared_id
               ->  Hash Join  (cost=466.50..2906.84 rows=161796 width=2525)
                     Hash Cond: (songs.blog_id = stations.blog_id)
                     ->  Hash Join  (cost=249.41..587.52 rows=2452 width=2499)
                           Hash Cond: (songs.post_id = posts.id)
                           ->  Seq Scan on songs  (cost=0.00..304.39 rows=2452 width=2223)
                                 Filter: (processed AND working)
                           ->  Hash  (cost=230.85..230.85 rows=1485 width=280)
                                 ->  Seq Scan on posts  (cost=0.00..230.85 rows=1485 width=280)
                     ->  Hash  (cost=140.93..140.93 rows=6093 width=30)
                           ->  Seq Scan on stations  (cost=0.00..140.93 rows=6093 width=30)

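I've tried a few things... first an index on (rank, shared_id). Then I removed it and added separate indexes on rank and on shared_id, and then a combination of the three... no luck.

Are the indexes not being used for some reason? Or do I need to do something after adding an index to make sure it works?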

ype*_*eᵀᴹ 5

Since the WHERE conditions of the query involve only equality checks:

WHERE "songs"."processed" = 't' 
  AND "songs"."working" = 't'

and then you have:

SELECT  DISTINCT ON (songs.rank, songs.shared_id) ...

which is similar to GROUP BY songs.rank, songs.shared_id

I would first try adding a compound index (first the columns in the WHERE, then the columns in the DISTINCT ON):

(processed, working, rank, shared_id)

The ordering, ORDER BY rank DESC, might be better optimized if you define the index as:

(processed, working, rank DESC, shared_id)

I'm not sure whether this will help with efficiency, but you can test it.
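One way to run that test, as a rough sketch: create the index, refresh the table statistics, and look at the plan again. The index name songs_filter_rank_idx is made up for illustration, and the query is the one from the question with songs.* listed only once. Whether the planner actually picks the index depends on its cost estimates, so the thing to look for is whether the Sort node over the joined rows goes away or gets cheaper.

```sql
-- Hypothetical index name; the column order follows the suggestion above.
CREATE INDEX songs_filter_rank_idx
    ON songs (processed, working, rank DESC, shared_id);

-- Refresh planner statistics for the table.
ANALYZE songs;

-- EXPLAIN ANALYZE runs the query and reports the actual plan and timings.
EXPLAIN ANALYZE
SELECT DISTINCT ON (songs.rank, songs.shared_id)
       songs.*,
       posts.url      AS post_url,
       posts.excerpt  AS post_excerpt,
       stations.title AS station_title,
       stations.slug  AS station_slug
FROM   songs
JOIN   posts    ON posts.id = songs.post_id
JOIN   stations ON stations.blog_id = songs.blog_id
WHERE  songs.processed = 't'
AND    songs.working = 't'
ORDER  BY songs.rank DESC
LIMIT  24 OFFSET 0;
```

If the plan is no better, DROP INDEX songs_filter_rank_idx removes the test index again.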


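Addition by @Erwin

As requested in the comment

In principle, a (default) b-tree index can be scanned forward and backward at the same speed. But sort order can make a difference in multi-column indexes, where you combine the sort order of multiple columns. The query starts with: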

SELECT  DISTINCT ON (songs.rank, songs.shared_id)

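Combined with ORDER BY rank DESC, this means the result is effectively sorted by rank DESC, shared_id. That sort happens after the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT can be applied.
I have my doubts whether the DISTINCT clause is actually useful. But while it is there, the optimal index for the query should be (just as @ypercube suspected):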

CREATE INDEX songs_special_idx
ON songs (processed, working, rank DESC, shared_id);

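This looks like one of the rare cases where an explicit sort order on index columns benefits the query. There is an excellent explanation in the chapter Indexes and ORDER BY of the manual.

If the WHERE conditions are stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster still: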

CREATE INDEX songs_special_idx
ON songs (rank DESC, shared_id)
WHERE processed AND working;
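A quick way to check that the partial index is a candidate at all, as a sketch: the planner only considers a partial index when the query's WHERE clause implies the index predicate. The query below is a cut-down, songs-only version of the original, used purely for illustration, with the ORDER BY written out in the order that DISTINCT ON implies.

```sql
-- Assumes the partial index songs_special_idx defined above exists.
-- The WHERE clause repeats the index predicate, so the index is eligible;
-- the ORDER BY matches the index column order (rank DESC, shared_id).
EXPLAIN
SELECT DISTINCT ON (rank, shared_id) *
FROM   songs
WHERE  processed AND working
ORDER  BY rank DESC, shared_id
LIMIT  24;
```

If the plan shows an index scan on songs_special_idx without a separate Sort step, the index is delivering rows already in the required order. The full query with its joins may still be planned differently.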

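  • I'm being picky here, because @ErwinBrandstetter is fundamentally right, and the fine point I'm about to make would probably only be noticeable in rare cases, under a high-contention workload with deep, volatile indexes; but because of the locking technique used, a backward index scan cannot follow sibling pointers without visiting the parent pages. The difference is probably small unless blocking occurs during that attempt. Something to keep in mind in the environment I just described. (2 upvotes)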