Scalable query for running counts of events within the previous x days

Jul*_*don 5 postgresql performance scalability window-functions postgresql-performance

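I have already posted this question on Stack Overflow, but I think I might get a better answer here.
I have a table storing millions of events that happen to users: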

                                       Table "public.events"
   Column   |           Type           |                         Modifiers                         
------------+--------------------------+-----------------------------------------------------------
 event_id   | integer                  | not null default nextval('events_event_id_seq'::regclass)
 user_id    | bigint                   | 
 event_type | integer                  | 
 ts         | timestamp with time zone | 

event_type has 5 distinct values. There are millions of users, and the number of events per user per event_type varies, typically between 1 and 50.

Sample data:

+-----------+----------+-------------+----------------------------+
| event_id  | user_id  | event_type  |         timestamp          |
+-----------+----------+-------------+----------------------------+
|        1  |       1  |          1  | January, 01 2015 00:00:00  |
|        2  |       1  |          1  | January, 10 2015 00:00:00  |
|        3  |       1  |          1  | January, 20 2015 00:00:00  |
|        4  |       1  |          1  | January, 30 2015 00:00:00  |
|        5  |       1  |          1  | February, 10 2015 00:00:00 |
|        6  |       1  |          1  | February, 21 2015 00:00:00 |
|        7  |       1  |          1  | February, 22 2015 00:00:00 |
+-----------+----------+-------------+----------------------------+

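For every event, I would like to get the count of events of the same user and event_type that occurred within the 30 days before that event.

It should look like this: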

+-----------+----------+-------------+-----------------------------+-------+
| event_id  | user_id  | event_type  |         timestamp           | count |
+-----------+----------+-------------+-----------------------------+-------+
|        1  |       1  |          1  | January, 01 2015 00:00:00   |     1 |
|        2  |       1  |          1  | January, 10 2015 00:00:00   |     2 |
|        3  |       1  |          1  | January, 20 2015 00:00:00   |     3 |
|        4  |       1  |          1  | January, 30 2015 00:00:00   |     4 |
|        5  |       1  |          1  | February, 10 2015 00:00:00  |     3 |
|        6  |       1  |          1  | February, 21 2015 00:00:00  |     3 |
|        7  |       1  |          1  | February, 22 2015 00:00:00  |     4 |
+-----------+----------+-------------+-----------------------------+-------+

So far I have managed this with two different queries (tested on 1000 generated sample rows on PostgreSQL 9.4.1):

SELECT 
  event_id, user_id,event_type,"timestamp", 
  (
    SELECT count(*) 
    FROM events 
    WHERE timestamp >= e.timestamp - interval '30 days'
    AND timestamp <= e.timestamp
    AND user_id = e.user_id 
    AND event_type = e.event_type
    GROUP BY event_type, user_id
  ) as "count"
FROM events e;

SQL Fiddle for the first query

It works reasonably well, especially since I have an index on the timestamp:

Index Scan using pk_event_id on events e  (cost=0.28..12018.74 rows=1000 width=24)
SubPlan 1
  ->  GroupAggregate  (cost=4.33..11.97 rows=1 width=20)
        Group Key: events.event_type, events.user_id
        ->  Bitmap Heap Scan on events  (cost=4.33..11.95 rows=1 width=20)
              Recheck Cond: (("timestamp" >= (e."timestamp" - '30 days'::interval)) AND ("timestamp" <= e."timestamp"))
              Filter: ((user_id = e.user_id) AND (event_type = e.event_type))
              ->  Bitmap Index Scan on idx_events_timestamp  (cost=0.00..4.33 rows=5 width=0)
                    Index Cond: (("timestamp" >= (e."timestamp" - '30 days'::interval)) AND ("timestamp" <= e."timestamp"))

Nevertheless, it still does not scale well, and I thought a window function might improve performance:

SELECT toto.event_id,toto.user_id,toto.event_type,toto.lv as time,COUNT(*)
FROM(
    SELECT e.event_id, e.user_id,e.event_type,"timestamp",
    last_value("timestamp") OVER w as lv,
    unnest(array_agg(e."timestamp") OVER w) as agg
    FROM events e
    WINDOW w AS (PARTITION BY e.user_id,e.event_type ORDER BY e."timestamp"
    ROWS UNBOUNDED PRECEDING)) AS toto
WHERE toto.agg >= toto.lv - interval '30 days'
GROUP by event_id,user_id,event_type,lv;

SQL Fiddle for the second query

Since I have to use unnest and a subquery, performance actually gets worse:

Sort  (cost=5344.41..5427.74 rows=33333 width=24)
  Sort Key: toto.event_id
  ->  HashAggregate  (cost=2506.99..2840.32 rows=33333 width=24)
        Group Key: toto.event_id, toto.user_id, toto.event_type, toto.lv
        ->  Subquery Scan on toto  (cost=67.83..2090.33 rows=33333 width=24)
              Filter: (toto.agg >= (toto.lv - '30 days'::interval))
              ->  WindowAgg  (cost=67.83..590.33 rows=100000 width=24)
                    ->  Sort  (cost=67.83..70.33 rows=1000 width=24)
                          Sort Key: e.user_id, e.event_type, e."timestamp"
                          ->  Seq Scan on events e  (cost=0.00..18.00 rows=1000 width=24)

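I am wondering whether it is possible to get rid of the subquery and somehow modify the window frame so that it only keeps timestamps 30 days or less before the timestamp of the current row. Do you think this query can be scaled to very large tables without switching to a MapReduce framework?

As a second step, I would like to exclude duplicate events, i.e. events with the same event_type and timestamp.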

Erw*_*ter 5

Assuming this cleaned-up table definition:

CREATE TABLE events (
  event_id   serial PRIMARY KEY
, user_id    int
, event_type int
, ts         timestamp  -- don't use reserved word as identifier
);
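To try the queries below, the seven sample rows from the question can be loaded into this table. A minimal sketch, assuming a freshly created, empty events table so that the serial column assigns event_id 1 through 7:

```sql
-- Sample rows copied from the question (user 1, event_type 1).
-- Assumes the events table defined above exists and is empty,
-- so that event_id is assigned 1..7 in insertion order.
INSERT INTO events (user_id, event_type, ts) VALUES
  (1, 1, '2015-01-01 00:00:00')
, (1, 1, '2015-01-10 00:00:00')
, (1, 1, '2015-01-20 00:00:00')
, (1, 1, '2015-01-30 00:00:00')
, (1, 1, '2015-02-10 00:00:00')
, (1, 1, '2015-02-21 00:00:00')
, (1, 1, '2015-02-22 00:00:00')
;
```

Against these rows, each query below should return the counts from the question's expected output: 1, 2, 3, 4, 3, 3, 4.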

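Your comparison seems unfair: the first query has ORDER BY event_id, but the second does not. The EXPLAIN output does not fit the first query (no sort step). Be sure to run all tests with the same ORDER BY clause to get valid results. Ideally, run each query a few times and compare the best of five to eliminate caching effects.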
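One way to run such a test is EXPLAIN (ANALYZE, BUFFERS), which executes the statement and reports actual timings and buffer hits. A sketch using the correlated-subquery variant against the table definition above; swap in each candidate query and keep the ORDER BY identical:

```sql
-- EXPLAIN ANALYZE really executes the query, so repeat it several times
-- and compare the best run-time figure printed at the end of the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id, user_id, event_type, ts
     , (SELECT count(*)
        FROM   events
        WHERE  user_id    = e.user_id
        AND    event_type = e.event_type
        AND    ts >= e.ts - interval '30 days'
        AND    ts <= e.ts
       ) AS ct
FROM   events e
ORDER  BY event_id;  -- same ORDER BY for every candidate
```

Runs that mostly show shared buffer hits rather than reads are served from cache, which is the state the best-of-five comparison aims for.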

Index

The key to performance is this multicolumn index:

CREATE INDEX events_fast_idx ON events (user_id, event_type, ts);

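Column order matters! Why? The columns compared with equality (user_id, event_type) come first and the range column (ts) comes last, so each lookup reads one contiguous slice of the index.

Queries

Each of your queries can be improved:

Query 1

Remove GROUP BY event_type, user_id without replacement. The subquery already filters both columns with equality, so the grouping does nothing: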

SELECT event_id, user_id, event_type, ts
    , (SELECT count(*) 
       FROM   events 
       WHERE  user_id    = e.user_id 
       AND    event_type = e.event_type
       AND    ts >= e.ts - interval '30 days'
       AND    ts <= e.ts
      ) AS  ct
FROM   events e
ORDER  BY event_id;

This is equivalent to the more modern LATERAL join (Postgres 9.3+):

SELECT *
FROM   events e
    ,  LATERAL (
   SELECT count(*) AS ct
   FROM   events 
   WHERE  user_id    = e.user_id 
   AND    event_type = e.event_type
   AND    ts >= e.ts - interval '30 days'
   AND    ts <= e.ts
   ) ct
ORDER  BY event_id;

This is probably also the fastest query in combination with the index above.
Related answer with more explanation:

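Query 2

  • last_value(ts) OVER w AS lv is just an expensive copy of ts.
  • ROWS UNBOUNDED PRECEDING is nearly the default frame (RANGE UNBOUNDED PRECEDING, which differs only for rows with identical ts), so it is mostly noise.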

SELECT event_id, user_id, event_type, ts, count(*) AS ct
FROM  (
   SELECT event_id, user_id, event_type, ts
        , unnest(array_agg(ts) OVER (PARTITION BY user_id, event_type
                                     ORDER BY ts)) AS agg
   FROM   events   
   ) e
WHERE  agg >= ts - interval '30 days'
GROUP  BY event_id, user_id, event_type, ts
ORDER  BY event_id;

But this is needlessly complex. The same logic can be had more cheaply with a join instead of a subquery with window functions:

SELECT e.*, count(*) AS ct
FROM   events e
JOIN   events x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id
ORDER  BY e.event_id;

This is my other favorite for top performance. Again, use the index above.

Other query

Here is another idea, but I doubt it can compete. Try it anyway:

WITH cte AS (
   SELECT event_id, user_id, event_type, ts
        , row_number() OVER (PARTITION BY user_id, event_type
                              ORDER BY ts) AS rn
   FROM   events
   )
SELECT e.event_id, e.user_id, e.event_type, e.ts, e.rn - min(x.rn) + 1 AS ct
FROM   cte e
JOIN   cte x USING (user_id, event_type)
WHERE  x.ts >= e.ts - interval '30 days'
AND    x.ts <= e.ts
GROUP  BY e.event_id, e.user_id, e.event_type, e.ts, e.rn
ORDER  BY e.event_id;

SQL Fiddle demonstrating all of the above in Postgres 9.3.
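On the question of shrinking the window frame itself: Postgres 11 and later accept RANGE frames with an interval offset, which is not available in the 9.x versions discussed here. A sketch under that assumption, not benchmarked against the queries above, and assuming no NULL values in user_id, event_type or ts:

```sql
-- Requires PostgreSQL 11 or later (RANGE frame with an offset).
-- Counts rows of the same user_id and event_type whose ts lies
-- between (current ts - 30 days) and the current ts, both inclusive.
SELECT event_id, user_id, event_type, ts
     , count(*) OVER (PARTITION BY user_id, event_type
                      ORDER BY ts
                      RANGE BETWEEN interval '30 days' PRECEDING
                                AND CURRENT ROW) AS ct
FROM   events
ORDER  BY event_id;
```

In RANGE mode, CURRENT ROW includes all rows with an identical ts, so ties are counted the same way as with ts <= e.ts in the join. The multicolumn index above matches the sort order the window needs, though whether the planner uses it for a full-table scan depends on the data.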