Fetching the latest child per parent from a big table - query is too slow

eri*_*kcw 2 sql django postgresql performance aggregate-functions

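I have a query generated by Django's ORM that takes hours to run.

The report_rank table (50 million rows) is in a one-to-many relation to report_profile (100k rows). I'm trying to retrieve the latest report_rank for each report_profile.

I'm running Postgres 9.1 on an extra large Amazon EC2 server with plenty of available RAM (2GB/15GB used). Disk IO is pretty bad, of course.

I have indexes on report_rank.created as well as on all foreign key fields.

What can I do to speed up this query? I'd be happy to try a different approach to the query if it will be performant, or to tune any database configuration parameters as needed.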

EXPLAIN 
SELECT "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created"
     , MAX(T7."created") AS "max" 
FROM "report_rank" 
LEFT OUTER JOIN "report_site" 
  ON ("report_rank"."site_id" = "report_site"."id") 
INNER JOIN "report_profile" 
  ON ("report_site"."id" = "report_profile"."site_id") 
INNER JOIN "crm_client" 
  ON ("report_profile"."client_id" = "crm_client"."id") 
INNER JOIN "auth_user" 
  ON ("crm_client"."user_id" = "auth_user"."id") 
LEFT OUTER JOIN "report_rank" T7 
  ON ("report_site"."id" = T7."site_id") 
WHERE ("auth_user"."is_active" = True  AND "crm_client"."is_deleted" = False ) 
GROUP BY "report_rank"."id", "report_rank"."keyword_id", "report_rank"."site_id"
     , "report_rank"."rank", "report_rank"."url", "report_rank"."competition"
     , "report_rank"."source", "report_rank"."country", "report_rank"."created" 
HAVING MAX(T7."created") =  "report_rank"."created";

EXPLAIN output:

GroupAggregate  (cost=1136244292.46..1276589375.47 rows=48133327 width=72)
  Filter: (max(t7.created) = report_rank.created)
  ->  Sort  (cost=1136244292.46..1147889577.16 rows=4658113881 width=72)
        Sort Key: report_rank.id, report_rank.keyword_id, report_rank.site_id, report_rank.rank, report_rank.url, report_rank.competition, report_rank.source, report_rank.country, report_rank.created
        ->  Hash Join  (cost=1323766.36..6107863.59 rows=4658113881 width=72)
              Hash Cond: (report_rank.site_id = report_site.id)
              ->  Seq Scan on report_rank  (cost=0.00..1076119.27 rows=48133327 width=64)
              ->  Hash  (cost=1312601.51..1312601.51 rows=893188 width=16)
                    ->  Hash Right Join  (cost=47050.38..1312601.51 rows=893188 width=16)
                          Hash Cond: (t7.site_id = report_site.id)
                          ->  Seq Scan on report_rank t7  (cost=0.00..1076119.27 rows=48133327 width=12)
                          ->  Hash  (cost=46692.28..46692.28 rows=28648 width=8)
                                ->  Nested Loop  (cost=2201.98..46692.28 rows=28648 width=8)
                                      ->  Hash Join  (cost=2201.98..5733.23 rows=28648 width=4)
                                            Hash Cond: (crm_client.user_id = auth_user.id)
                                            ->  Hash Join  (cost=2040.73..5006.71 rows=44606 width=8)
                                                  Hash Cond: (report_profile.client_id = crm_client.id)
                                                  ->  Seq Scan on report_profile  (cost=0.00..1706.09 rows=93009 width=8)
                                                  ->  Hash  (cost=1761.98..1761.98 rows=22300 width=8)
                                                        ->  Seq Scan on crm_client  (cost=0.00..1761.98 rows=22300 width=8)
                                                              Filter: (NOT is_deleted)
                                            ->  Hash  (cost=126.85..126.85 rows=2752 width=4)
                                                  ->  Seq Scan on auth_user  (cost=0.00..126.85 rows=2752 width=4)
                                                        Filter: is_active
                                      ->  Index Scan using report_site_pkey on report_site  (cost=0.00..1.42 rows=1 width=4)
                                            Index Cond: (id = report_profile.site_id)

Erw*_*ter 7

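The major point is most likely that you JOIN and GROUP over everything just to get max(created). Get this value separately.

You mentioned all the indexes needed here: on report_rank.created and on the foreign keys. You are doing alright there. (If you are interested in better than "alright", keep reading!)

The LEFT JOIN to report_site is forced to behave like a plain JOIN by the subsequent inner joins and the WHERE clause, so I substituted a plain JOIN. I also simplified your syntax a lot.

Updated July 2015 with simpler, faster queries and smarter functions.

Solution for multiple rows

report_rank.created is not unique and you want all of the latest rows. Use the window function rank() in a subquery.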

SELECT id, keyword_id, site_id
     , rank, url, competition
     , source, country, created  -- same as "max"
FROM  (
   SELECT *, rank() OVER (ORDER BY created DESC NULLS LAST) AS rnk
   FROM   report_rank r
   WHERE  EXISTS (
      SELECT *
      FROM   report_site    s
      JOIN   report_profile p ON p.site_id = s.id
      JOIN   crm_client     c ON c.id      = p.client_id
      JOIN   auth_user      u ON u.id      = c.user_id
      WHERE  s.id = r.site_id
      AND    u.is_active
      AND    c.is_deleted = FALSE
      )
   ) sub
WHERE  rnk = 1;

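Why DESC NULLS LAST? In PostgreSQL, NULL values sort first in descending order by default, so NULLS LAST keeps rows with a NULL created from ranking ahead of real timestamps.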
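A small illustration of the default sort order may help here. The inline VALUES list below is made-up sample data, not taken from report_rank:

```sql
-- Default for DESC is NULLS FIRST: the NULL row comes out on top.
SELECT created
FROM  (VALUES (timestamp '2015-07-01'), (NULL), (timestamp '2015-07-02')) t(created)
ORDER  BY created DESC;

-- With NULLS LAST, the newest real timestamp comes first and the NULL row last.
SELECT created
FROM  (VALUES (timestamp '2015-07-01'), (NULL), (timestamp '2015-07-02')) t(created)
ORDER  BY created DESC NULLS LAST;
```

If created is declared NOT NULL, the clause changes nothing in the result, but it must still match the index definition for the index to be usable for the sort.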

Solution for one row

If report_rank.created is unique, or you are happy with any one row with max(created):

SELECT id, keyword_id, site_id
     , rank, url, competition
     , source, country, created  -- same as "max"
FROM   report_rank r
WHERE  EXISTS (
    SELECT 1
    FROM   report_site    s
    JOIN   report_profile p ON p.site_id = s.id
    JOIN   crm_client     c ON c.id      = p.client_id
    JOIN   auth_user      u ON u.id      = c.user_id
    WHERE  s.id = r.site_id
    AND    u.is_active
    AND    c.is_deleted = FALSE
   )
-- AND  r.created > f_report_rank_cap()
ORDER  BY r.created DESC NULLS LAST
LIMIT  1;
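The queries above return the newest row(s) overall. If the goal is one latest row per site, as the question title suggests, PostgreSQL's DISTINCT ON is one possible way to express it. This is only a sketch under that assumption, not part of the original answer; the multicolumn index name report_rank_site_created_idx is hypothetical, and whether such an index helps at 50 million rows would need to be checked with EXPLAIN ANALYZE.

```sql
-- Hypothetical supporting index (name is an assumption):
-- CREATE INDEX report_rank_site_created_idx
--     ON report_rank (site_id, created DESC NULLS LAST);

-- One row per site_id: the first row of each group in ORDER BY order,
-- i.e. the row with the latest created.
SELECT DISTINCT ON (r.site_id)
       r.id, r.keyword_id, r.site_id
     , r.rank, r.url, r.competition
     , r.source, r.country, r.created
FROM   report_rank r
WHERE  EXISTS (
    SELECT 1
    FROM   report_site    s
    JOIN   report_profile p ON p.site_id = s.id
    JOIN   crm_client     c ON c.id      = p.client_id
    JOIN   auth_user      u ON u.id      = c.user_id
    WHERE  s.id = r.site_id
    AND    u.is_active
    AND    c.is_deleted = FALSE
   )
ORDER  BY r.site_id, r.created DESC NULLS LAST;
```

The DISTINCT ON expressions have to match the leading ORDER BY expressions. If several rows of a site share the latest created, an arbitrary one of them is returned unless more ORDER BY columns are added as tie-breakers.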

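This should be faster still. One more option follows.

Ultimate speed with a dynamically adjusted partial index

You may have noticed the commented-out part in the last query: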

AND  r.created > f_report_rank_cap()

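You mention 50 million rows, which is a lot. Here is a way to speed things up:

  • Create a simple IMMUTABLE function returning a timestamp that is guaranteed to be older than the rows of interest while being as recent as possible.
  • Create a partial index on the more recent rows only, based on this function.
  • Use a WHERE condition in queries that matches the index condition.
  • Create another function that updates these objects to the latest row with dynamic DDL. (Subtract a safety margin in case the newest rows get deleted or deactivated, if that can happen.)
  • Invoke this secondary function at off-hours with a minimum of concurrent activity, per cron job or on demand. Run it as often as you like; it can't do harm, it just needs a short exclusive lock on the table.

Here is a complete working demo.
@erikcw, you'll have to activate the commented-out part as instructed below.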

CREATE TABLE report_rank(created timestamp);
INSERT INTO report_rank VALUES ('2011-11-11 11:11'),(now());

-- initial function
CREATE OR REPLACE FUNCTION f_report_rank_cap()
  RETURNS timestamp LANGUAGE sql COST 1 IMMUTABLE AS
$y$SELECT timestamp '-infinity'$y$;  -- or as high as you can safely bet.

-- initial index; 1st run indexes whole tbl if starting with '-infinity'
CREATE INDEX report_rank_recent_idx ON report_rank (created DESC NULLS LAST)
WHERE  created > f_report_rank_cap();

-- function to update function & reindex
CREATE OR REPLACE FUNCTION f_report_rank_set_cap()
  RETURNS void AS
$func$
DECLARE
   _secure_margin CONSTANT interval := interval '1 day';  -- adjust to your case
   _cap timestamp;  -- exclude older rows than this from partial index
BEGIN
   SELECT max(created) - _secure_margin
   FROM   report_rank r  -- alias is referenced by the commented-out EXISTS below
   WHERE  created > f_report_rank_cap() + _secure_margin
   /*  not needed for the demo; @erikcw needs to activate this
   AND    EXISTS (
     SELECT *
     FROM   report_site    s
     JOIN   report_profile p ON p.site_id = s.id
     JOIN   crm_client     c ON c.id      = p.client_id
     JOIN   auth_user      u ON u.id      = c.user_id
     WHERE  s.id = r.site_id
     AND    u.is_active
     AND    c.is_deleted = FALSE)
   */
   INTO   _cap;

   IF _cap IS NOT NULL THEN  -- max() always returns a row, so test the value, not FOUND
     -- recreate function
     EXECUTE format('
     CREATE OR REPLACE FUNCTION f_report_rank_cap()
       RETURNS timestamp LANGUAGE sql IMMUTABLE AS
     $y$SELECT %L::timestamp$y$', _cap);

     -- reindex
     REINDEX INDEX report_rank_recent_idx;
   END IF;
END
$func$  LANGUAGE plpgsql;

COMMENT ON FUNCTION f_report_rank_set_cap()
IS 'Dynamically recreate function f_report_rank_cap()
    and reindex partial index on report_rank.';

Call:

SELECT f_report_rank_set_cap();

See the current cap:

SELECT f_report_rank_cap();

Uncomment the clause AND r.created > f_report_rank_cap() in the query above and observe the difference. Verify with EXPLAIN ANALYZE that the index gets used.
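As a rough sketch, the check against the demo table could look like this. It assumes the demo objects report_rank, f_report_rank_cap() and report_rank_recent_idx from above already exist; no plan output is shown because it depends on the data and the server version.

```sql
EXPLAIN ANALYZE
SELECT created
FROM   report_rank r
WHERE  r.created > f_report_rank_cap()   -- same condition as the partial index predicate
ORDER  BY r.created DESC NULLS LAST      -- same sort order as the index definition
LIMIT  1;
```

Look for report_rank_recent_idx in the resulting plan. On the two-row demo table the planner may well prefer a sequential scan, so the effect is only meaningful on a table of realistic size.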

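The manual on concurrency and REINDEX:

To build the index without interfering with production, you should drop the index and reissue the CREATE INDEX CONCURRENTLY command.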
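Applied to the demo index, that advice might be sketched as follows, in place of the REINDEX step. The statements are meant to be run as separate top-level commands, since CREATE INDEX CONCURRENTLY cannot run inside a transaction block or a function:

```sql
-- Drop the old partial index; this takes a brief exclusive lock on the table.
DROP INDEX IF EXISTS report_rank_recent_idx;

-- Rebuild it without blocking concurrent writes to report_rank.
CREATE INDEX CONCURRENTLY report_rank_recent_idx
ON     report_rank (created DESC NULLS LAST)
WHERE  created > f_report_rank_cap();
```

Between the two statements the partial index does not exist, so queries fall back to whatever other indexes are available. A concurrent build also takes longer than a plain one and can leave an invalid index behind if it fails, in which case it has to be dropped and created again.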