Improving sort performance for a GROUP BY clause

Jam*_*Hay 6 postgresql performance execution-plan group-by postgresql-9.4 postgresql-performance

I have two tables in Postgres 9.4.1, events and event_refs, with the following schemas:

The events table:

CREATE TABLE events (
  id serial NOT NULL PRIMARY KEY,
  event_type text NOT NULL,
  event_path jsonb,
  event_data jsonb,
  created_at timestamp with time zone NOT NULL
);

-- Index on type and created time

CREATE INDEX events_event_type_created_at_idx
  ON events (event_type, created_at);

The event_refs table:

CREATE TABLE event_refs (
  event_id integer NOT NULL,
  reference_key text NOT NULL,
  reference_value text NOT NULL,
  CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
  CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
      REFERENCES events (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
);

Both tables have 2 million rows. This is the query I'm running:

SELECT
  EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) as funnel_time
FROM
  events
INNER JOIN
  event_refs
ON
  event_refs.event_id = events.id AND
  event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1

I'm aware of the operator precedence in the WHERE clause. It is only supposed to filter events of type 'event2' by date.

Here is the EXPLAIN ANALYZE output:

GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
   Group Key: event_refs.reference_value
   Filter: (count(*) > 1)
   Rows Removed by Filter: 41315
   ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
         Sort Key: event_refs.reference_value
         Sort Method: external merge  Disk: 3904kB
         ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
               Hash Cond: (event_refs.event_id = events.id)
               ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                     Filter: (reference_key = 'project'::text)
               ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                     Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                     ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                           Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                           Heap Blocks: exact=14911
                           ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                 ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                       Index Cond: (event_type = 'event1'::text)
                                 ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                       Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
 Planning time: 0.493 ms
 Execution time: 4178.517 ms

I know the filter on the event_refs scan isn't removing anything; that's an artifact of my test data. Different reference keys will be added later.

Everything up to and including the Hash Join seems reasonable for my test data, but I'm wondering whether the Sort step for the GROUP BY clause can be sped up.

I've tried adding a B-tree index on the reference_value column, but the planner doesn't seem to use it. If I'm not mistaken (and I may well be, please correct me), it is sorting 153,766 rows. Wouldn't an index benefit this sort?

Erw*_*ter 11

work_mem

This is what makes your sort expensive:

Sort Method: external merge  Disk: 3904kB

The sort spills to disk, which kills performance. You need more RAM. Specifically, you need to increase the setting for work_mem. The manual:

work_mem (integer)

Specifies the amount of memory to be used by internal sort operations and hash tables before writing to temporary disk files.
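You can check the current value before changing anything; on a stock install it defaults to 4MB, which is consistent with the spill shown above:

```sql
SHOW work_mem;

-- Or with more detail, including where the setting came from:
SELECT name, setting, unit, source
FROM   pg_settings
WHERE  name = 'work_mem';
```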

In this particular case, raising the setting by a few MB should do the job. But since you'll need much more for your full deployment with 60M rows, and since a general setting of work_mem that is too high can backfire (read the manual I linked to!), consider setting it high enough locally, for your query only, like:

BEGIN;
SET LOCAL work_mem = '500MB';  -- adapt to your needs
SELECT ...;
COMMIT;

Note that SET LOCAL stays in effect until the end of the transaction. If you run more statements in the same transaction, you may want to reset it:

RESET work_mem;

Or wrap the query in a function with a function-local setting. Related answer with an example function:
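A minimal sketch of such a function, reusing the query from the question (the function name and the 500MB value are placeholders, adapt to your needs):

```sql
CREATE OR REPLACE FUNCTION funnel_times()
  RETURNS TABLE (funnel_time double precision)
  LANGUAGE sql
  SET work_mem = '500MB'   -- applies only while the function executes
AS $$
SELECT EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at)))
FROM   events
JOIN   event_refs ON event_refs.event_id = events.id
                 AND event_refs.reference_key = 'project'
WHERE (events.event_type = 'event1' OR events.event_type = 'event2')
AND    events.created_at >= '2015-07-01 00:00:00+08:00'
AND    events.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY event_refs.reference_value
HAVING count(*) > 1;
$$;
```

The setting reverts automatically when the function returns, so callers don't need to manage the transaction themselves.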

Indexes

I would also try this index:

CREATE INDEX events_event_type_created_at_idx ON events (event_type, created_at, id);

Adding id as the last column only makes sense if you get index-only scans out of it. See:

And a partial index on event_refs:

CREATE INDEX event_refs_foo_idx ON event_refs (event_id, reference_value)
WHERE  reference_key = 'project';

The predicate WHERE reference_key = 'project' doesn't help much in your test case (except maybe for query planning), but it should help a lot in your full deployment, where "there will be different types added later".

This should also allow index-only scans.
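To verify, run EXPLAIN on the part of the query the index is meant to serve and look for an Index Only Scan node (index name matches the sketch above; heap fetches only drop once the visibility map is current, e.g. after a VACUUM):

```sql
VACUUM event_refs;  -- refreshes the visibility map, a prerequisite for cheap index-only scans

EXPLAIN (ANALYZE, BUFFERS)
SELECT event_id, reference_value
FROM   event_refs
WHERE  reference_key = 'project';
-- Look for: "Index Only Scan using event_refs_foo_idx" and a low "Heap Fetches" count
```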

Possible alternative query

Since you are selecting most of events, this alternative query may be faster (it depends a lot on data distribution):

SELECT EXTRACT(EPOCH FROM (MAX(e.created_at) - MIN(e.created_at))) as funnel_time
FROM   events e
JOIN  (
   SELECT event_id, reference_value, count(*) AS ct
   FROM   event_refs
   WHERE  reference_key = 'project'                   
   GROUP  BY event_id, reference_value
   ) r ON r.event_id = e.id
WHERE (e.event_type = 'event1' OR
       e.event_type = 'event2')        -- see below !
AND    e.created_at >= '2015-07-01 00:00:00+08:00'
AND    e.created_at <  '2015-12-01 00:00:00+08:00'
GROUP  BY r.reference_value
HAVING sum(r.ct) > 1;

I suspect a bug in your query and that you want parentheses in the WHERE clause like I added. Per operator precedence, AND binds before OR.
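The difference is easy to see with constants (a one-off illustration, not part of the original query):

```sql
SELECT (true OR false AND false)   AS no_parens,   -- AND binds first: true OR (false AND false) = true
       ((true OR false) AND false) AS with_parens; -- (true OR false) AND false = false
```

In your original query, this means the date range is applied only to 'event2' rows, while every 'event1' row passes unfiltered.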

Aggregating first in the subquery only makes sense if there are many rows per (event_id, reference_value) in event_refs. Again, the above index would help.