PostgreSQL aggregate query on a table with 50M+ rows

doc*_*why 5 sql postgresql aggregate query-optimization

Problem statement


I have a table "event_statistics", defined as follows:

CREATE TABLE public.event_statistics (
    id int4 NOT NULL DEFAULT nextval('event_statistics_id_seq'::regclass),
    client_id int4 NULL,
    session_id int4 NULL,
    action_name text NULL,
    value text NULL,
    product_id int8 NULL,
    product_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    url text NULL,
    url_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    visit int4 NULL DEFAULT 0,
    date_update timestamptz NULL,
    CONSTRAINT event_statistics_pkey PRIMARY KEY (id),
    CONSTRAINT event_statistics_client_id_session_id_sessions_client_id_id_for
        FOREIGN KEY (client_id, session_id) REFERENCES <?>() ON DELETE CASCADE ON UPDATE CASCADE
)
WITH (
    OIDS=FALSE
);
CREATE INDEX regdate ON public.event_statistics (date_update timestamptz_ops);

And a table "clients":

CREATE TABLE public.clients (
    id int4 NOT NULL DEFAULT nextval('clients_id_seq'::regclass),
    client_name text NULL,
    client_hash text NULL,
    CONSTRAINT clients_pkey PRIMARY KEY (id)
)
WITH (
    OIDS=FALSE
);
CREATE INDEX clients_client_name_idx ON public.clients (client_name text_ops);

What I need is, for a specific client and a specific "date_update" range, the count of events in "event_statistics" for each "action_name", grouped by "action_name" and by a specific time step.


The goal is a dashboard on our website showing, for each client, statistics for all relevant events, with a selectable report date range. The grouping step on the chart should vary with the length of the interval, for example:

  • current day: count per hour;
  • more than 1 day and up to 1 month: count per day;
  • more than 1 month and up to 6 months: count per week;
  • more than 6 months: count per month.
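One way to implement the variable step described above is to derive a single bucket unit from the length of the requested range and reuse it both as the `generate_series` step and as the `date_trunc` unit, so that series buckets and truncated event timestamps line up exactly. This is only a sketch, not from the original question; the range bounds in `params` are placeholders:

```sql
-- Sketch (assumed logic): pick the bucket unit from the report range length,
-- then use it for both the calendar series and the truncation of date_update.
WITH params AS (
    SELECT timestamptz '2017-01-01' AS t_from,   -- placeholder range start
           timestamptz '2017-03-01' AS t_to      -- placeholder range end
), step AS (
    SELECT CASE
             WHEN t_to - t_from <= interval '1 day'    THEN 'hour'
             WHEN t_to - t_from <= interval '1 month'  THEN 'day'
             WHEN t_to - t_from <= interval '6 months' THEN 'week'
             ELSE 'month'
           END AS unit, t_from, t_to
    FROM params
)
SELECT g.bucket, count(e.id)
FROM step s
CROSS JOIN LATERAL generate_series(
        date_trunc(s.unit, s.t_from), s.t_to,
        ('1 ' || s.unit)::interval) AS g(bucket)
LEFT JOIN event_statistics e
       ON date_trunc(s.unit, e.date_update) = g.bucket
      AND e.date_update BETWEEN s.t_from AND s.t_to
GROUP BY g.bucket
ORDER BY g.bucket;
```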

What I have done:

SELECT t.date, A.actionName, count(E.id)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as t(date)
cross join
(values
('page_open'),
('product_add'),
('product_buy'),
('product_event'),
('product_favourite'),
('product_open'),
('product_share'),
('session_start')) as A(actionName)
left join
(select action_name, date_trunc('day', e.date_update) as dateTime, e.id
 from event_statistics as e
 where e.client_id = (select id from clients as c where c.client_name = 'client name')
   and (date_update between (current_date - interval '1 week') and now())) E
on t.date = E.dateTime and A.actionName = E.action_name
group by A.actionName, t.date
order by A.actionName, t.date;

Counting events by type per day for just the last week already takes too long, more than 10 seconds. I need it to do the same thing much faster over wider time ranges, such as weeks, months, or years, with different grouping intervals (per hour for a day, per day for a month, then per week and per month).


The query plan:

GroupAggregate  (cost=171937.16..188106.84 rows=1600 width=44)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on clients c  (cost=0.00..1.07 rows=1 width=4)
          Filter: (client_name = 'client name'::text)
  ->  Merge Left Join  (cost=171936.08..183784.31 rows=574060 width=44)
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40)
              Sort Key: "*VALUES*".column1, t.date
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40)
                    ->  Function Scan on generate_series t (cost=0.02..10.02 rows=1000 width=8)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32)
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32)
        ->  Materialize  (cost=171307.32..171881.38 rows=114812 width=24)
              ->  Sort  (cost=171307.32..171594.35 rows=114812 width=24)
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    ->  Index Scan using regdate on event_statistics e (cost=0.57..159302.49 rows=114812 width=24)
                          Index Cond: ((date_update > (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= now()))
                          Filter: (client_id = $0)

The "event_statistics" table has more than 50 million rows; it only grows as clients are added, and existing records never change.
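Because the table is append-only and rows arrive roughly in `date_update` order, one option worth noting here (an assumption on my part, not something tried in the original question) is a BRIN index on the timestamp, which stays tiny compared to a btree while still supporting wide date-range scans:

```sql
-- Assumption: rows are physically inserted roughly in date_update order
-- (append-only table). A BRIN index (PostgreSQL 9.5+) is orders of magnitude
-- smaller than a btree and handles broad range predicates on such tables well.
CREATE INDEX event_statistics_date_brin
    ON event_statistics USING brin (date_update);
```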


I have already tried many different query plans and indexes, but cannot reach acceptable speed when aggregating over wider date ranges. I have spent a whole week studying different aspects of this problem and possible solutions on Stack Overflow and in some blogs, but am still unsure which approach is best:

  • partitioning by client_id or by date range;
  • pre-aggregating into a separate result table and updating it daily (also unsure how best to do this: a trigger on insert into the raw table, a separate scheduled application for that result table or a materialized view, or on request from the website);
  • changing the database schema design to one schema per client, or applying sharding;
  • upgrading the server hardware (CPU Intel Xeon E7-4850 2.00 GHz, 6 GB RAM; it hosts both the web application and the database);
  • using a separate analytics database with OLAP capabilities, such as Postgres-XL, or something else?
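The pre-aggregation option from the list above can be sketched as a daily rollup materialized view, refreshed once a day. All names here are illustrative, not from the original:

```sql
-- Sketch of the pre-aggregation option: one row per client/action/day.
CREATE MATERIALIZED VIEW event_statistics_daily AS
SELECT client_id,
       action_name,
       date_trunc('day', date_update)::date AS day,
       count(*) AS event_count
FROM event_statistics
GROUP BY 1, 2, 3;

-- Unique index: speeds up dashboard lookups and is required for
-- REFRESH ... CONCURRENTLY (which avoids locking out readers).
CREATE UNIQUE INDEX ON event_statistics_daily (client_id, action_name, day);

-- Run daily from a scheduler (cron, pg_cron, the application, ...):
REFRESH MATERIALIZED VIEW CONCURRENTLY event_statistics_daily;
```

Dashboard queries then aggregate this small table (summing `event_count` per week or month as needed) instead of scanning 50M+ raw rows.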

I also tried a btree index on event_statistics (client_id asc, action_name asc, date_update asc, id). Index-only scans are faster, but still not fast enough, and it is not great in terms of disk usage either.
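For reference, the covering index just described (reconstructed from the description above; the index name is illustrative) would be:

```sql
-- Multicolumn btree covering every column the query touches, so the count
-- can be answered by an index-only scan on a well-vacuumed table.
CREATE INDEX es_client_action_date_idx
    ON event_statistics (client_id, action_name, date_update, id);
```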


What is the best way to solve this problem?


Update


As requested, the output of explain (analyze, verbose):

GroupAggregate  (cost=860934.44..969228.46 rows=1600 width=44) (actual time=52388.678..54671.187 rows=64 loops=1)
  Output: t.date, "*VALUES*".column1, count(e.id)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on public.clients c  (cost=0.00..1.07 rows=1 width=4) (actual time=0.058..0.059 rows=1 loops=1)
          Output: c.id
          Filter: (c.client_name = 'client name'::text)
          Rows Removed by Filter: 5
  ->  Merge Left Join  (cost=860933.36..940229.77 rows=3864215 width=44) (actual time=52388.649..54388.698 rows=799737 loops=1)
        Output: t.date, "*VALUES*".column1, e.id
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40) (actual time=0.190..0.244 rows=64 loops=1)
              Output: t.date, "*VALUES*".column1
              Sort Key: "*VALUES*".column1, t.date
              Sort Method: quicksort  Memory: 30kB
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40) (actual time=0.059..0.080 rows=64 loops=1)
                    Output: t.date, "*VALUES*".column1
                    ->  Function Scan on pg_catalog.generate_series t  (cost=0.02..10.02 rows=1000 width=8) (actual time=0.043..0.043 rows=8 loops=1)
                          Output: t.date
                          Function Call: generate_series((((('now'::cstring)::date - '7 days'::interval)))::timestamp with time zone, now(), '1 day'::interval)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32) (actual time=0.002..0.003 rows=8 loops=8)
                          Output: "*VALUES*".column1
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32) (actual time=0.004..0.005 rows=8 loops=1)
                                Output: "*VALUES*".column1
        ->  Materialize  (cost=860304.60..864168.81 rows=772843 width=24) (actual time=52388.441..54053.748 rows=799720 loops=1)
              Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
              ->  Sort  (cost=860304.60..862236.70 rows=772843 width=24) (actual time=52388.432..53703.531 rows=799720 loops=1)
                    Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Method: external merge  Disk: 39080kB
                    ->  Index Scan using regdate on public.event_statistics e  (cost=0.57..753018.26 rows=772843 width=24) (actual time=31.423..44284.363 rows=799720 loops=1)
                          Output: e.id, e.date_update, e.action_name, date_trunc('day'::text, e.date_update)
                          Index Cond: ((e.date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (e.date_update <= now()))
                          Filter: (e.client_id = $0)
                          Rows Removed by Filter: 2983424
Planning time: 7.278 ms
Execution time: 54708.041 ms

wil*_*ser 1

First step: pre-aggregate in a subquery:


EXPLAIN
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') as cal(theday) -- calendar pseudo-table
CROSS JOIN (VALUES
        ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) AS act(action_name)
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) as theday
                , COUNT(DISTINCT es.id ) AS the_count
        FROM event_statistics as es
        WHERE es.client_id = (SELECT c.id FROM clients AS c
                        WHERE c.client_name = 'client name')
        AND (es.date_update BETWEEN (current_date - interval '1 week') AND now())
        GROUP BY 1,2
        ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name,cal.theday
ORDER BY act.action_name,cal.theday
        ;

Next step: put the VALUES list into a CTE and reference it from the aggregating subquery. (The gain depends on how many action names can be skipped.)


EXPLAIN
WITH act(action_name) AS (VALUES
        ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        )
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 day') AS cal(theday)
CROSS JOIN act
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) AS theday
                , COUNT(DISTINCT es.id ) AS the_count
        FROM event_statistics AS es
        WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
        AND EXISTS (SELECT * FROM clients cli  WHERE cli.id= es.client_id AND cli.client_name = 'client name')
        AND EXISTS (SELECT * FROM act WHERE act.action_name = es.action_name)
        GROUP BY 1,2
        ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name,cal.theday
ORDER BY act.action_name,cal.theday
        ;

Update: using a physical (temp) table will give better estimates.


    -- Final attempt: materialize the carthesian product (timeseries*action_name)
    -- into a temp table
CREATE TEMP TABLE grid AS
(SELECT act.action_name, cal.theday
FROM generate_series(current_date - interval '1 week', now(), interval '1 day')
    AS cal(theday)
CROSS JOIN
    (VALUES ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) act(action_name)
    );
CREATE UNIQUE INDEX ON grid(action_name, theday);

    -- Index will force statistics to be collected
    -- ,and will generate better estimates for the numbers of rows
CREATE INDEX iii ON event_statistics (action_name, date_update ) ;
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT grid.action_name, grid.theday, SUM(sub.the_count) AS the_count
FROM grid
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) AS theday
                , COUNT(*) AS the_count
        FROM event_statistics AS es
        WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
        AND EXISTS (SELECT * FROM clients cli  WHERE cli.id= es.client_id AND cli.client_name = 'client name')
        -- AND EXISTS (SELECT * FROM grid WHERE grid.action_name = es.action_name)
        GROUP BY 1,2
        ORDER BY 1,2 --nonsense!
        ) sub ON grid.theday = sub.theday AND grid.action_name = sub.action_name
GROUP BY grid.action_name,grid.theday
ORDER BY grid.action_name,grid.theday
        ;

Update #3 (sorry, I created the indexes on the base table here; you will need to edit that. I also removed a column from the timestamp index.)


    -- attempt#4:
    -- - materialize the carthesian product (timeseries*action_name)
    -- - sanitize date interval -logic

CREATE TEMP TABLE grid AS
(SELECT act.action_name, cal.theday::date
FROM generate_series(current_date - interval '1 week', now(), interval '1 day')
    AS cal(theday)
CROSS JOIN
    (VALUES ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) act(action_name)
    );

    -- Index will force statistics to be collected
    -- ,and will generate better estimates for the numbers of rows
-- CREATE UNIQUE INDEX ON grid(action_name, theday);
-- CREATE INDEX iii ON event_statistics (action_name, date_update ) ;
CREATE UNIQUE INDEX ON grid(theday, action_name);
CREATE INDEX iii ON event_statistics (date_update, action_name) ;
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT gr.action_name, gr.theday
            , COUNT(*) AS the_count
FROM grid gr
LEFT JOIN event_statistics AS es
    ON es.action_name = gr.action_name
    AND date_trunc('day',es.date_update)::date = gr.theday
    AND es.date_update BETWEEN (current_date - interval '1 week') AND current_date
JOIN clients cli  ON cli.id= es.client_id AND cli.client_name = 'client name'
GROUP BY gr.action_name,gr.theday
ORDER BY 1,2
        ;
                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8.33..8.35 rows=1 width=17)
   Group Key: gr.action_name, gr.theday
   ->  Sort  (cost=8.33..8.34 rows=1 width=17)
         Sort Key: gr.action_name, gr.theday
         ->  Nested Loop  (cost=1.40..8.33 rows=1 width=17)
               ->  Nested Loop  (cost=1.31..7.78 rows=1 width=40)
                     Join Filter: (es.client_id = cli.id)
                     ->  Index Scan using clients_client_name_key on clients cli  (cost=0.09..2.30 rows=1 width=4)
                           Index Cond: (client_name = 'client name'::text)
                     ->  Bitmap Heap Scan on event_statistics es  (cost=1.22..5.45 rows=5 width=44)
                           Recheck Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
                           ->  Bitmap Index Scan on iii  (cost=0.00..1.22 rows=5 width=0)
                                 Index Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
               ->  Index Only Scan using grid_theday_action_name_idx on grid gr  (cost=0.09..0.54 rows=1 width=17)
                     Index Cond: ((theday = (date_trunc('day'::text, es.date_update))::date) AND (action_name = es.action_name))
(15 rows)