eri*_*kcw 2 sql django postgresql performance aggregate-functions
我有一个由Django的ORM生成的查询,这需要花费数小时才能运行.
该report_rank
表(5000万行)与report_profile
(100k行)的一对多关系.我正在尝试检索report_rank
每个的最新版本report_profile
.
我在一个额外的大型Amazon EC2服务器上运行Postgres 9.1,它有足够的可用内存(使用2GB/15GB).磁盘IO当然非常糟糕.
我有索引report_rank.created
以及所有外键字段.
我该怎么做才能加快查询速度?我很乐意尝试使用查询的不同方法,如果它将是高性能的,或者调整所需的任何数据库配置参数.
EXPLAIN
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
, "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
, "report_rank"."source", "report_rank"."country", "report_rank"."created"
, MAX(T7."created") AS "max"
FROM "report_rank"
LEFT OUTER JOIN "report_site"
ON ("report_rank"."site_id" = "report_site"."id")
INNER JOIN "report_profile"
ON ("report_site"."id" = "report_profile"."site_id")
INNER JOIN "crm_client"
ON ("report_profile"."client_id" = "crm_client"."id")
INNER JOIN "auth_user"
ON ("crm_client"."user_id" = "auth_user"."id")
LEFT OUTER JOIN "report_rank" T7
ON ("report_site"."id" = T7."site_id")
WHERE ("auth_user"."is_active" = True AND "crm_client"."is_deleted" = False )
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
, "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
, "report_rank"."source", "report_rank"."country", "report_rank"."created"
HAVING MAX(T7."created") = "report_rank"."created";
Run Code Online (Sandbox Code Playgroud)
产量EXPLAIN
:
GroupAggregate (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
Filter: (max(t7.created) = report_rank.created)
-> Sort (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
-> Hash Join (cost=1323766.36..6107863.59 rows=4658113881 width=72)
Hash Cond: (report_rank.site_id = report_site.id)
-> Seq Scan on report_rank (cost=0.00..1076119.27 rows=48133327 width=64)
-> Hash (cost=1312601.51..1312601.51 rows=893188 width=16)
-> Hash Right Join (cost=47050.38..1312601.51 rows=893188 width=16)
Hash Cond: (t7.site_id = report_site.id)
-> Seq Scan on report_rank t7 (cost=0.00..1076119.27 rows=48133327 width=12)
-> Hash (cost=46692.28..46692.28 rows=28648 width=8)
-> Nested Loop (cost=2201.98..46692.28 rows=28648 width=8)
-> Hash Join (cost=2201.98..5733.23 rows=28648 width=4)
Hash Cond: (crm_client.user_id = auth_user.id)
-> Hash Join (cost=2040.73..5006.71 rows=44606 width=8)
Hash Cond: (report_profile.client_id = crm_client.id)
-> Seq Scan on report_profile (cost=0.00..1706.09 rows=93009 width=8)
-> Hash (cost=1761.98..1761.98 rows=22300 width=8)
-> Seq Scan on crm_client (cost=0.00..1761.98 rows=22300 width=8)
Filter: (NOT is_deleted)
-> Hash (cost=126.85..126.85 rows=2752 width=4)
-> Seq Scan on auth_user (cost=0.00..126.85 rows=2752 width=4)
Filter: is_active
-> Index Scan using report_site_pkey on report_site (cost=0.00..1.42 rows=1 width=4)
Index Cond: (id = report_profile.site_id)
Run Code Online (Sandbox Code Playgroud)
最重要的一点是你JOIN
和GROUP
所有的东西都是为了获得max(created)
.单独获取此值.
您提到了此处所需的所有索引:report_rank.created
在外键上和上.你在那里做得很好.(如果你对"好"感兴趣,请继续阅读!)
该LEFT JOIN report_site
会被迫纯JOIN
由WHERE
子句.我取代了一个平原JOIN
.我也简化了你的语法.
2015年7月更新,更简单,更快速的查询和更智能的功能.
report_rank.created
是不是唯一的,并且希望所有的最新行.在子查询中
使用窗口函数rank()
.
SELECT r.id, r.keyword_id, r.site_id
, r.rank, r.url, r.competition
, r.source, r.country, r.created -- same as "max"
FROM (
SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
FROM report_rank r
WHERE EXISTS (
SELECT *
FROM report_site s
JOIN report_profile p ON p.site_id = s.id
JOIN crm_client c ON c.id = p.client_id
JOIN auth_user u ON u.id = c.user_id
WHERE s.id = r.site_id
AND u.is_active
AND c.is_deleted = FALSE
)
) sub
WHERE rnk = 1;
Run Code Online (Sandbox Code Playgroud)
为什么DESC NULLS LAST
?
如果report_rank.created
是独特的,或者您满意的任何1列有max(created)
:
SELECT id, keyword_id, site_id
, rank, url, competition
, source, country, created -- same as "max"
FROM report_rank r
WHERE EXISTS (
SELECT 1
FROM report_site s
JOIN report_profile p ON p.site_id = s.id
JOIN crm_client c ON c.id = p.client_id
JOIN auth_user u ON u.id = c.user_id
WHERE s.id = r.site_id
AND u.is_active
AND c.is_deleted = FALSE
)
-- AND r.created > f_report_rank_cap()
ORDER BY r.created DESC NULLS LAST
LIMIT 1;
Run Code Online (Sandbox Code Playgroud)
应该更快,仍然.更多的选择:
您可能已经注意到上一个查询中的注释部分:
AND r.created > f_report_rank_cap()
Run Code Online (Sandbox Code Playgroud)
你提到50 mio.行,这很多.这是一种加快速度的方法:
IMMUTABLE
函数,返回一个时间戳,该时间戳保证比感兴趣的行更老,同时尽可能年轻.WHERE
的查询条件相匹配的指数条件.这是一个完整的工作演示.
@erikcw,您必须按照以下说明激活注释部分.
CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());
-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$; -- or as high as you can safely bet.
-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE created > f_report_rank_cap();
-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
RETURNS void AS
$func$
DECLARE
_secure_margin CONSTANT interval := interval '1 day'; -- adjust to your case
_cap timestamp; -- exclude older rows than this from partial index
BEGIN
SELECT max(created) - _secure_margin
FROM report_rank
WHERE created > f_report_rank_cap() + _secure_margin
/* not needed for the demo; @erikcw needs to activate this
AND EXISTS (
SELECT *
FROM report_site s
JOIN report_profile p ON p.site_id = s.id
JOIN crm_client c ON c.id = p.client_id
JOIN auth_user u ON u.id = c.user_id
WHERE s.id = r.site_id
AND u.is_active
AND c.is_deleted = FALSE)
*/
INTO _cap;
IF FOUND THEN
-- recreate function
EXECUTE format('
CREATE OR REPLACE FUNCTION f_report_rank_cap()
RETURNS timestamp LANGUAGE sql IMMUTABLE AS
$y$SELECT %L::timestamp$y$', _cap);
-- reindex
REINDEX INDEX report_rank_recent_idx;
END IF;
END
$func$ LANGUAGE plpgsql;
COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
and reindex partial index on report_rank.';
Run Code Online (Sandbox Code Playgroud)
呼叫:
SELECT f_report_rank_set_cap();
Run Code Online (Sandbox Code Playgroud)
看到:
SELECT f_report_rank_cap();
Run Code Online (Sandbox Code Playgroud)
取消注释AND r.created > f_report_rank_cap()
上面查询中的子句并观察其差异.验证索引是否与之一起使用EXPLAIN ANALYZE
.
要在不干扰生产的情况下构建索引,应删除索引并重新发出
CREATE INDEX CONCURRENTLY
命令.