Improve PARTITION BY performance?

Ski*_*rou 6 postgresql query-optimization database-performance window-functions

Consider the following table:

 foo | bar
-----+-----
  3  |  1
  8  |  1
  2  |  1
  8  |  5
  6  |  5
  5  |  5
  4  |  5
  5  |  7
  4  |  7

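foo can contain anything. The bar column is almost ordered, and rows sharing a common bar value follow each other. The table contains about 1.7 million rows, with about 15 rows per distinct bar value.

I find PARTITION BY quite slow, and I wonder whether there is anything I can do to improve its performance?

I tried CREATE INDEX bar_idx ON foobar(bar), but it had no effect on performance (IRL the table already has a primary key on another column). I am using PostgreSQL 9.3.5.

Here are the EXPLAIN ANALYZE results for a simple query with and without PARTITION BY: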

> EXPLAIN ANALYZE SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
                                                           QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=262947.92..293133.35 rows=1724882 width=8) (actual time=2286.082..3504.372 rows=1724882 loops=1)
   ->  Sort  (cost=262947.92..267260.12 rows=1724882 width=8) (actual time=2286.068..2746.047 rows=1724882 loops=1)
         Sort Key: bar
         Sort Method: external merge  Disk: 27176kB
         ->  Seq Scan on foobar  (cost=0.00..37100.82 rows=1724882 width=8) (actual time=0.019..441.827 rows=1724882 loops=1)
 Total runtime: 3606.695 ms
(6 rows)

> EXPLAIN ANALYZE SELECT foo FROM foobar;
                                                     QUERY PLAN                                                 
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..37100.82 rows=1724882 width=4) (actual time=0.014..385.931 rows=1724882 loops=1)
 Total runtime: 458.776 ms
(2 rows)

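First improvement: increase work_mem

Increasing work_mem, as suggested by hbn, should help in most cases. In my case I am using an SSD, so moving the sort into RAM (by increasing work_mem to 1 GB) only reduces the processing time by a factor of about 1.5:

> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;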
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=215781.92..245967.35 rows=1724882 width=8) (actual time=933.575..1931.656 rows=1724882 loops=1)
   Buffers: shared hit=2754 read=17098
   ->  Sort  (cost=215781.92..220094.12 rows=1724882 width=8) (actual time=933.558..1205.314 rows=1724882 loops=1)
         Sort Key: bar
         Sort Method: quicksort  Memory: 130006kB
         Buffers: shared hit=2754 read=17098
         ->  Seq Scan on foobar  (cost=0.00..37100.82 rows=1724882 width=8) (actual time=0.023..392.446 rows=1724882 loops=1)
               Buffers: shared hit=2754 read=17098
 Total runtime: 2051.494 ms
(9 rows)

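Second improvement: use CLUSTER

I tried some of the suggestions from this post; increasing the statistics had no significant effect in my case. The only one that helped and was not already in place was "rewrite the table in the physical order of the index", using CLUSTER (you may prefer pg_repack; read the original post):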

> CLUSTER foobar USING bar_idx;
CLUSTER
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
                                                                  QUERY PLAN                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.43..150079.25 rows=1724882 width=8) (actual time=0.031..1372.416 rows=1724882 loops=1)
   Buffers: shared hit=64 read=24503
   ->  Index Scan using bar_idx on foobar  (cost=0.43..124206.02 rows=1724882 width=8) (actual time=0.018..581.665 rows=1724882 loops=1)
         Buffers: shared hit=64 read=24503
 Total runtime: 1484.974 ms
(5 rows)
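
A short sketch of the housekeeping around CLUSTER, assuming the table and index names used above. CLUSTER is a one-time operation: rows inserted or updated later are not kept in index order, so it may need to be repeated, and it holds an exclusive lock on the table while it runs. The pg_stats query is only an illustrative way to see how closely the physical row order follows bar; it is not part of the original post.

```sql
-- Rewrite the table in the order of bar_idx, then refresh planner statistics.
CLUSTER foobar USING bar_idx;
ANALYZE foobar;

-- A correlation close to 1 (or -1) means the physical row order follows bar closely.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'foobar' AND attname = 'bar';

-- Later runs can reuse the index remembered from the previous CLUSTER.
CLUSTER foobar;
```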

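Third improvement: a subset of the table

In my case I ultimately need to restrict this table to the bar values present in another table, so it seemed to make sense to create that subset as its own table: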

CREATE TABLE subfoobar AS (SELECT * FROM foobar WHERE bar IN (SELECT DISTINCT bar FROM othertable) ORDER BY bar);

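The new table has only 700k rows instead of 1.7 million, and the query time seems (after re-creating the index on bar) roughly proportional, so the gain is significant: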

> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM subfoobar;
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.42..37455.61 rows=710173 width=8) (actual time=0.025..543.437 rows=710173 loops=1)
   Buffers: shared hit=10290
   ->  Index Scan using bar_sub_idx on subfoobar  (cost=0.42..26803.02 rows=710173 width=8) (actual time=0.015..222.211 rows=710173 loops=1)
         Buffers: shared hit=10290
 Total runtime: 590.063 ms
(5 rows)
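
CREATE TABLE ... AS does not copy indexes, which is why the index on bar had to be re-created on the new table. A minimal sketch of that step: the name bar_sub_idx is taken from the plan above, and the ANALYZE call is an addition that the original post does not mention.

```sql
-- Re-create the index on the subset table so the planner can use an ordered index scan.
CREATE INDEX bar_sub_idx ON subfoobar (bar);

-- Collect statistics for the freshly created table.
ANALYZE subfoobar;
```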

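Fourth improvement: a summary table

Since, IRL, the window function is involved several times in the query, the query itself will be executed many times (data mining), and the aggregate results over the partitions will always be the same, I decided to go for a more efficient approach: I extracted all these values into a new "summary table" (not sure my definition matches the "official" one?).

In our simple example, this would give: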

CREATE TABLE summary_foobar AS SELECT DISTINCT ON (bar) count(foo) OVER (PARTITION BY bar) AS cfoo, bar FROM foobar;

Actually, as hbn suggested in the comments, it is better to create a MATERIALIZED VIEW rather than a new table, so that it can be updated at any time with REFRESH MATERIALIZED VIEW summary_foobar; :

CREATE MATERIALIZED VIEW summary_foobar AS SELECT DISTINCT ON (bar) count(foo) OVER (PARTITION BY bar) AS cfoo, bar FROM foobar;
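
For this simple example, the same per-bar counts can also be written with a plain GROUP BY, which avoids computing the window value for every row before discarding the duplicates. This is a sketch of an alternative, not what the original post used; the names summary_foobar_grouped and summary_foobar_grouped_bar_idx are hypothetical. Whether it is actually faster on a given dataset is something to check with EXPLAIN ANALYZE.

```sql
-- One row per distinct bar, with the count of non-NULL foo values in that group.
CREATE MATERIALIZED VIEW summary_foobar_grouped AS
SELECT bar, count(foo) AS cfoo
FROM foobar
GROUP BY bar;

-- Optional: index the join column of the summary (hypothetical index name).
CREATE UNIQUE INDEX summary_foobar_grouped_bar_idx
    ON summary_foobar_grouped (bar);

-- Recompute the summary after foobar changes.
REFRESH MATERIALIZED VIEW summary_foobar_grouped;
```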

Then, applying the initial query to my real-case tables:

> EXPLAIN (ANALYZE, BUFFERS) SELECT cfoo FROM subfoobar,summary_foobar WHERE subfoobar.bar=summary_foobar.bar;
                                                          QUERY PLAN                                                      
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=1254.64..28939.67 rows=424685 width=73) (actual time=9.893..268.704 rows=370393 loops=1)
   Hash Cond: (subfoobar.bar = summary_foobar.bar)
   Buffers: shared hit=8916
   ->  Seq Scan on subfoobar  (cost=0.00..15448.73 rows=710173 width=4) (actual time=0.003..70.850 rows=710173 loops=1)
         Buffers: shared hit=8347
   ->  Hash  (cost=873.73..873.73 rows=30473 width=77) (actual time=9.872..9.872 rows=30473 loops=1)
         Buckets: 4096  Batches: 1  Memory Usage: 3347kB
         Buffers: shared hit=569
         ->  Seq Scan on summary_foobar  (cost=0.00..873.73 rows=30473 width=77) (actual time=0.003..4.569 rows=30473 loops=1)
               Buffers: shared hit=569
 Total runtime: 286.910 ms [~550 ms if using foobar instead of subfoobar]
(11 rows)

All in all, for my real-case queries, I went from 5000+ ms per query down to about 150 ms (less than in the example, because of the WHERE clauses).

hbn*_*hbn 3

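You may need to increase work_mem. Your query is sorting on disk. It is using 27MB; try setting work_mem to 64MB or so and see how it performs.

You can set it per session or per transaction, as well as in postgresql.conf.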

SET work_mem TO '64MB';

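This sets it for your current session.

Obviously, a sensible value depends on how much RAM your machine has and how many concurrent connections you expect to have.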
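
As a minimal sketch of the per-transaction variant, SET LOCAL limits the change to the current transaction, so other queries in the session keep the default. The 64MB value is only the starting point suggested here, not a tuned number; the in-memory sort reported in the question's follow-up used about 130 MB, so this table may need more.

```sql
BEGIN;
-- Applies only until COMMIT or ROLLBACK.
SET LOCAL work_mem TO '64MB';
EXPLAIN ANALYZE SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
COMMIT;

-- Check the value in effect for the session afterwards.
SHOW work_mem;
```

If the Sort node then reports "Sort Method: quicksort  Memory: ..." instead of "external merge  Disk: ...", the sort fit in memory.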