Aggregating the most recent joined records per week

Eri*_*low 10 sql postgresql greatest-n-per-group

I have an updates table in Postgres 9.4.5 like this:

goal_id    | created_at | status
1          | 2016-01-01 | green
1          | 2016-01-02 | red
2          | 2016-01-02 | amber

And a goals table like this:

id | company_id
1  | 1
2  | 2

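I'd like to create a chart for each company that shows the status of all of its goals, per week.

(example chart)

I imagine this requires generating a series of the past 8 weeks, finding the most recent update for each goal that came before that week, and then counting the different statuses of the updates found.

What I have so far: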

SELECT EXTRACT(year from generate_series) AS year, 
       EXTRACT(week from generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
  SELECT DISTINCT ON(year, week)
         goals.company_id,
         updates.status, 
         EXTRACT(week from updates.created_at) week,
         EXTRACT(year from updates.created_at) AS year,
         updates.created_at 
  FROM updates
  JOIN goals ON goals.id = updates.goal_id
  ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3

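But this has two problems. The join on u doesn't seem to work the way I thought it would: it appears to join on every row (?) returned from the inner query. And it only selects the most recent update that happened within that week, whereas it should fall back to the most recent update from before that week if necessary.

This is some fairly complicated SQL, and I'd love some input on how to pull it off.

Table structures and info

The goals table has around 1000 goals at the moment and is growing by about 100 a week: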

                                           Table "goals"
     Column      |            Type             |                         Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
 id              | integer                     | not null default nextval('goals_id_seq'::regclass)
 company_id      | integer                     | not null
 name            | text                        | not null
 created_at      | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at      | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goals_pkey" PRIMARY KEY, btree (id)
    "entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
    "goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT

The updates table has around 1000 rows and is growing by about 100 a week:

                                         Table "updates"
   Column   |            Type             |                            Modifiers
------------+-----------------------------+------------------------------------------------------------------
 id         | integer                     | not null default nextval('updates_id_seq'::regclass)
 status     | entity.goalstatus           | not null
 goal_id    | integer                     | not null
 created_at | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goal_updates_pkey" PRIMARY KEY, btree (id)
    "entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
    "updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE

 Schema |       Name        | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
 entity | entity.goalstatus | goalstatus    | 4    | green   +|                   |
        |                   |               |      | amber   +|                   |
        |                   |               |      | red      |                   |

Erw*_*ter 7

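You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is getting the current state from updates for each of them. As @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.

And simplify the date handling with date_trunc().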

SELECT w_start
     , g.company_id
     , count(*) FILTER (WHERE u.status = 'green') AS green_count
     , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , count(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(date_trunc('week', NOW() - interval '2 months')
                     , date_trunc('week', NOW())
                     , interval '1 week') w_start
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;
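As a rough illustration of the week grid that the query above joins against, the following sketch uses fixed timestamps instead of NOW() so the result is reproducible. The literal dates are assumptions chosen for the example; they do not come from the question.

```sql
-- date_trunc('week', ...) snaps a timestamp back to Monday 00:00 of its ISO week.
SELECT date_trunc('week', timestamp '2016-01-06 15:30') AS w_start;
-- 2016-01-04 00:00:00  (2016-01-06 is a Wednesday)

-- The same series as in the query, pinned to an assumed "now" of 2016-03-02.
SELECT w_start
FROM   generate_series(date_trunc('week', timestamp '2016-03-02' - interval '2 months')
                     , date_trunc('week', timestamp '2016-03-02')
                     , interval '1 week') w_start;
-- one row per Monday, from 2015-12-28 through 2016-02-29
```

Because both bounds are truncated to the start of a week, every generated value is a week boundary, which is what lets the main query group by w_start directly.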

To make this fast, you need a multicolumn index:

CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);

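Descending order for created_at is best, but not strictly necessary. Postgres can scan an index backwards almost as fast. (That does not apply to inverted sort orders across multiple columns, though.)

Index columns in that order. Why? The equality condition on goal_id comes first, so all entries for one goal sit next to each other in the index, already sorted by created_at.

The third column, status, is only appended to allow fast index-only scans on updates.

1k goals for 9 weeks (your 2-month interval overlaps with at least 9 weeks) only require 9k index look-ups on the second table, which itself has only 1k rows. For small tables like these, performance shouldn't be much of a problem. But once each table holds a couple of thousand more rows, performance will deteriorate with sequential scans.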
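One way to check whether the planner actually uses the index for a single look-up is to run the inner subquery on its own under EXPLAIN. This is only a sketch: the goal_id and the cutoff timestamp are placeholder values, and on a table with only about 1k rows the planner may still prefer a sequential scan because that is cheaper.

```sql
-- Inspect the plan for one look-up of the kind the LATERAL subquery performs.
-- goal_id = 1 and the cutoff date are placeholder values for illustration.
EXPLAIN (ANALYZE, BUFFERS)
SELECT status
FROM   updates
WHERE  goal_id = 1
AND    created_at < timestamp '2016-01-04'
ORDER  BY created_at DESC
LIMIT  1;
```

An index-only scan also depends on the visibility map being reasonably current, so the plan can differ before and after the table has been vacuumed.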

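w_start represents the start of each week. Consequently, the counts are as of the start of the week. You can still extract year and week (or any other detail that represents your week), if you insist: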

   EXTRACT(isoyear from w_start) AS year
 , EXTRACT(week    from w_start) AS week

Best with ISOYEAR, like @Paul explained.
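A small example of why ISOYEAR pairs better with week: around New Year, the calendar year and the ISO week number can belong to different years. The date below is taken from the sample data in the question.

```sql
-- 2016-01-01 is a Friday and still belongs to ISO week 53 of 2015.
SELECT EXTRACT(year    FROM d) AS year     -- 2016
     , EXTRACT(isoyear FROM d) AS isoyear  -- 2015
     , EXTRACT(week    FROM d) AS week     -- 53
FROM  (VALUES (date '2016-01-01')) t(d);
```

Combining year with week would label this day as week 53 of 2016, while isoyear with week gives the consistent pair (2015, 53).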

SQL Fiddle.