Originally posted: https://stackoverflow.com/questions/11173717/expensive-query-on-select-distinct-with-multiple-inner-join-in-postgres
The songs table has only about 4k rows; posts and stations have fewer. Running the query without the DISTINCT ON fixes the problem.
Running Postgres on Mac OS X Lion.
Song Load (7358.2ms)
EXPLAIN (426.2ms)
EXPLAIN for:
SELECT DISTINCT ON (songs.rank, songs.shared_id) songs.*,
songs.*,
posts.url as post_url,
posts.excerpt as post_excerpt,
stations.title as station_title,
stations.slug as station_slug
FROM "songs"
INNER JOIN "posts" ON "posts"."id" = "songs"."post_id"
inner join stations on stations.blog_id = songs.blog_id
WHERE "songs"."processed" = 't'
AND "songs"."working" = 't'
ORDER BY songs.rank desc
LIMIT 24 OFFSET 0
QUERY PLAN
------------------------------------------------------------------------------------------------
 Limit  (cost=546147.28..546159.16 rows=24 width=2525)
   ->  Unique  (cost=546147.28..547360.75 rows=2452 width=2525)
         ->  Sort  (cost=546147.28..546551.77 rows=161796 width=2525)
               Sort Key: songs.rank, songs.shared_id
               ->  Hash Join  (cost=466.50..2906.84 rows=161796 width=2525)
                     Hash Cond: (songs.blog_id = stations.blog_id)
                     ->  Hash Join  (cost=249.41..587.52 rows=2452 width=2499)
                           Hash Cond: (songs.post_id = posts.id)
                           ->  Seq Scan on songs  (cost=0.00..304.39 rows=2452 width=2223)
                                 Filter: (processed AND working)
                           ->  Hash  (cost=230.85..230.85 rows=1485 width=280)
                                 ->  Seq Scan on posts  (cost=0.00..230.85 rows=1485 width=280)
                     ->  Hash  (cost=140.93..140.93 rows=6093 width=30)
                           ->  Seq Scan on stations  (cost=0.00..140.93 rows=6093 width=30)
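I tried a few things: first an index on (rank, shared_id). Then I removed that and added separate indexes on rank and on shared_id, as well as a combination of all three... no luck.
Are the indexes not being used for some reason? Or is there something I need to do after adding an index to make sure it works?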
Since the WHERE conditions of the query involve only equality checks:
WHERE "songs"."processed" = 't'
AND "songs"."working" = 't'
And then you have:
SELECT DISTINCT ON (songs.rank, songs.shared_id) ...
This is similar to GROUP BY songs.rank, songs.shared_id.
I would first try adding a composite index (first the columns from the WHERE clause, then the columns from the DISTINCT ON):
(processed, working, rank, shared_id)
The ordering, ORDER BY rank DESC, might be optimized better if you define the index as:
(processed, working, rank DESC, shared_id)
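I am not sure whether this will improve efficiency, but you can test it.
As requested in the comments:
In principle, a (default) b-tree index can be scanned forward and backward at the same speed. But sort order can make a difference in multi-column indexes, where you combine the sort order of several columns. The query starts with: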
SELECT DISTINCT ON (songs.rank, songs.shared_id)
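Combined with ORDER BY rank DESC, this dictates that the result is effectively ordered by rank DESC, shared_id. That ordering happens after the (simplified) WHERE clause WHERE processed AND working has been applied and before LIMIT can be applied.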
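To make that evaluation order concrete, here is a minimal Python sketch that emulates it on in-memory rows. It is only an illustration of the semantics, not of how Postgres executes the query: the function name `distinct_on_page`, the dict layout, and the sample rows are all hypothetical, and `rank` is assumed to be numeric. Which row survives within a duplicate group is arbitrary in Postgres unless ORDER BY says more; the sketch simply keeps the first one in input order.

```python
from itertools import groupby

def distinct_on_page(rows, limit=24, offset=0):
    """Emulate WHERE processed AND working, DISTINCT ON (rank, shared_id),
    ORDER BY rank DESC, LIMIT/OFFSET on a list of dicts (hypothetical layout)."""
    # 1. WHERE: keep only rows that are both processed and working.
    kept = [r for r in rows if r["processed"] and r["working"]]
    # 2. Effective ordering: rank DESC, then shared_id ascending.
    kept.sort(key=lambda r: (-r["rank"], r["shared_id"]))
    # 3. DISTINCT ON: keep the first row of each (rank, shared_id) group.
    firsts = [next(group)
              for _, group in groupby(kept, key=lambda r: (r["rank"], r["shared_id"]))]
    # 4. LIMIT / OFFSET are applied last, after sorting and de-duplication.
    return firsts[offset:offset + limit]

# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "rank": 5, "shared_id": 10, "processed": True, "working": True},
    {"id": 2, "rank": 5, "shared_id": 10, "processed": True, "working": True},   # duplicate key
    {"id": 3, "rank": 9, "shared_id": 7,  "processed": True, "working": True},
    {"id": 4, "rank": 9, "shared_id": 3,  "processed": True, "working": False},  # filtered out
    {"id": 5, "rank": 7, "shared_id": 2,  "processed": True, "working": True},
]
page = distinct_on_page(rows, limit=2)
```

The point of the sketch is step 2: without a suitable index, every row that passes the filter has to be sorted before a single row can be returned, which matches the Sort node over 161796 rows in the plan.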
I doubt that the DISTINCT clause is actually useful. But while it is there, the best index for the query should be (as @ypercube suspected):
CREATE INDEX songs_special_idx
ON songs (processed, working, rank DESC, shared_id);
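This looks like one of the rare cases where explicit ordering of index columns benefits the query. There is a good explanation in the chapter Indexes and ORDER BY of the manual.
If the WHERE conditions are stable (always WHERE processed AND working), a partial multi-column index would be smaller and faster still: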
CREATE INDEX songs_special_idx
ON songs (rank DESC, shared_id)
WHERE processed AND working;
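As a rough intuition for why such a partial index can help, the following Python sketch models the index as a pre-sorted list that contains only the rows matching the predicate. Walking it in order lets the scan stop as soon as enough distinct keys have been found, instead of sorting everything first. The names `build_partial_index` and `first_distinct` and the generated sample rows are hypothetical, and the sketch ignores the joins entirely; whether the Postgres planner actually picks such a plan has to be checked with EXPLAIN ANALYZE.

```python
def build_partial_index(rows):
    """Model a partial index on (rank DESC, shared_id) WHERE processed AND working
    as a sorted list of (-rank, shared_id, row id) tuples."""
    entries = [(-r["rank"], r["shared_id"], r["id"])
               for r in rows if r["processed"] and r["working"]]
    entries.sort()  # the index is kept in this order, so queries need no extra sort
    return entries

def first_distinct(index_entries, limit=24):
    """Walk the pre-sorted entries and stop once `limit` distinct keys were seen.
    Returns the chosen row ids and the number of index entries visited."""
    result, last_key, visited = [], None, 0
    for neg_rank, shared_id, row_id in index_entries:
        visited += 1
        key = (neg_rank, shared_id)
        if key != last_key:  # first entry of a new (rank, shared_id) group
            result.append(row_id)
            last_key = key
            if len(result) == limit:
                break  # LIMIT reached: the rest of the index is never read
    return result, visited

# Hypothetical data: 1000 rows, of which only the even ids count as processed.
rows = [{"id": i, "rank": i % 50, "shared_id": i % 7,
         "processed": i % 2 == 0, "working": True} for i in range(1000)]
index = build_partial_index(rows)
ids, visited = first_distinct(index, limit=3)
```

In this model the index holds only half of the rows, and the scan touches just a handful of entries before the limit is satisfied.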