Ski*_*rou 6 postgresql query-optimization database-performance window-functions
Consider the following table:
foo | bar
-----+-----
3 | 1
8 | 1
2 | 1
8 | 5
6 | 5
5 | 5
4 | 5
5 | 7
4 | 7
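Column foo can contain anything. Column bar is almost ordered, and rows sharing a common value of bar follow each other. The table contains about 1.7 million rows, with about 15 rows for each distinct value of bar.
I find PARTITION BY slow, and I am wondering whether I can do anything to improve its performance.
I tried CREATE INDEX bar_idx ON foobar(bar), but it had no effect on performance (IRL there is already a primary key on another column of the table). I am using PostgreSQL 9.3.5.
Here is the EXPLAIN ANALYZE output for a simple query with and without PARTITION BY: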
> EXPLAIN ANALYZE SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=262947.92..293133.35 rows=1724882 width=8) (actual time=2286.082..3504.372 rows=1724882 loops=1)
-> Sort (cost=262947.92..267260.12 rows=1724882 width=8) (actual time=2286.068..2746.047 rows=1724882 loops=1)
Sort Key: bar
Sort Method: external merge Disk: 27176kB
-> Seq Scan on foobar (cost=0.00..37100.82 rows=1724882 width=8) (actual time=0.019..441.827 rows=1724882 loops=1)
Total runtime: 3606.695 ms
(6 lignes)
> EXPLAIN ANALYZE SELECT foo FROM foobar;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..37100.82 rows=1724882 width=4) (actual time=0.014..385.931 rows=1724882 loops=1)
Total runtime: 458.776 ms
(2 lignes)
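In most cases, increasing work_mem, as hbn suggested, should help. In my case I am working on an SSD, so moving the sort into RAM (by raising work_mem to 1 GB) only reduces the processing time by a factor of about 1.5: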
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=215781.92..245967.35 rows=1724882 width=8) (actual time=933.575..1931.656 rows=1724882 loops=1)
Buffers: shared hit=2754 read=17098
-> Sort (cost=215781.92..220094.12 rows=1724882 width=8) (actual time=933.558..1205.314 rows=1724882 loops=1)
Sort Key: bar
Sort Method: quicksort Memory: 130006kB
Buffers: shared hit=2754 read=17098
-> Seq Scan on foobar (cost=0.00..37100.82 rows=1724882 width=8) (actual time=0.023..392.446 rows=1724882 loops=1)
Buffers: shared hit=2754 read=17098
Total runtime: 2051.494 ms
(9 lignes)
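CLUSTER: I tried some of the suggestions from this post. Increasing statistics had no significant effect in my case. The only one that helped and was not already in place was "rewriting the table in the physical order of the index", using CLUSTER (you may prefer pg_repack; read the original post):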
> CLUSTER foobar USING bar_idx;
CLUSTER
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=0.43..150079.25 rows=1724882 width=8) (actual time=0.031..1372.416 rows=1724882 loops=1)
Buffers: shared hit=64 read=24503
-> Index Scan using bar_idx on foobar (cost=0.43..124206.02 rows=1724882 width=8) (actual time=0.018..581.665 rows=1724882 loops=1)
Buffers: shared hit=64 read=24503
Total runtime: 1484.974 ms
(5 lignes)
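As a rough sketch of the surrounding housekeeping (not part of the original test run): CLUSTER reorders the table once, takes an exclusive lock while it runs, and does not keep later inserts in order, so it is usually followed by ANALYZE. The correlation column of pg_stats gives an idea of how well the physical row order still matches bar.

```sql
-- One-time physical reorder of the table along the index on bar.
CLUSTER foobar USING bar_idx;

-- Refresh planner statistics after the rewrite.
ANALYZE foobar;

-- A correlation close to 1 (or -1) means the rows are stored
-- almost in bar order, which favours the index scan plan.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'foobar'
  AND attname = 'bar';
```

If the correlation drifts away from 1 after many writes, re-running CLUSTER foobar (it remembers the index) restores the order.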
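In my case, I ultimately need to select from this table based on another table, so it seemed sensible to create that subset as a table of its own: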
CREATE TABLE subfoobar AS (SELECT * FROM foobar WHERE bar IN (SELECT DISTINCT bar FROM othertable) ORDER BY bar);
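The new table has only 700k rows instead of 1.7 million, and the query time seems (after recreating an index on bar) roughly proportional, so the gain is substantial: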
> EXPLAIN (ANALYZE, BUFFERS) SELECT count(foo) OVER (PARTITION BY bar) FROM subfoobar;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
WindowAgg (cost=0.42..37455.61 rows=710173 width=8) (actual time=0.025..543.437 rows=710173 loops=1)
Buffers: shared hit=10290
-> Index Scan using bar_sub_idx on subfoobar (cost=0.42..26803.02 rows=710173 width=8) (actual time=0.015..222.211 rows=710173 loops=1)
Buffers: shared hit=10290
Total runtime: 590.063 ms
(5 lignes)
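The plan above uses an index named bar_sub_idx, but the statement that created it is not shown. A plausible sketch, assuming a plain b-tree index on bar mirroring bar_idx on the original table, would be:

```sql
-- Assumed definition: only the index name bar_sub_idx appears in the plan,
-- so the exact statement used is a guess.
CREATE INDEX bar_sub_idx ON subfoobar (bar);

-- Collect statistics for the freshly created table so the planner
-- has accurate row estimates.
ANALYZE subfoobar;
```

Because subfoobar was created with ORDER BY bar, its rows are already stored in bar order, so no separate CLUSTER step should be needed right after creation.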
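IRL the window functions appear several times in the query, the query itself is executed many times (data mining), and the aggregate results over the partitions are always the same. I therefore went for a more efficient approach: I extracted all these values into a new "summary table" (not sure whether my definition matches the "official" one?).
In our simple example, this gives: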
CREATE TABLE summary_foobar AS SELECT DISTINCT ON (bar) count(foo) OVER (PARTITION BY bar) AS cfoo, bar FROM foobar;
Actually, as hbn suggested in the comments, it is better to create a MATERIALIZED VIEW rather than a new table, so that it can be refreshed at any time with REFRESH MATERIALIZED VIEW summary_foobar:
CREATE MATERIALIZED VIEW summary_foobar AS SELECT DISTINCT ON (bar) count(foo) OVER (PARTITION BY bar) AS cfoo, bar FROM foobar;
Then, applying the initial query to my real-case table:
> EXPLAIN (ANALYZE, BUFFERS) SELECT cfoo FROM subfoobar,summary_foobar WHERE subfoobar.bar=summary_foobar.bar;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=1254.64..28939.67 rows=424685 width=73) (actual time=9.893..268.704 rows=370393 loops=1)
Hash Cond: (subfoobar.bar = summary_foobar.bar)
Buffers: shared hit=8916
-> Seq Scan on subfoobar (cost=0.00..15448.73 rows=710173 width=4) (actual time=0.003..70.850 rows=710173 loops=1)
Buffers: shared hit=8347
-> Hash (cost=873.73..873.73 rows=30473 width=77) (actual time=9.872..9.872 rows=30473 loops=1)
Buckets: 4096 Batches: 1 Memory Usage: 3347kB
Buffers: shared hit=569
-> Seq Scan on summary_foobar (cost=0.00..873.73 rows=30473 width=77) (actual time=0.003..4.569 rows=30473 loops=1)
Buffers: shared hit=569
Total runtime: 286.910 ms [~550 ms if using foobar instead of subfoobar]
(11 lignes)
All in all, for my real-case queries, I went from 5000+ ms per query down to about 150 ms (less than in the example because of the WHERE clauses).
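For a per-partition count like this one, the summary can also be built with a plain GROUP BY, which yields one row per bar directly instead of computing the window over every row and then discarding duplicates with DISTINCT ON. The following is only a sketch of that alternative, not what was benchmarked above; the names summary_foobar_grouped and summary_foobar_grouped_bar_idx are hypothetical.

```sql
-- Alternative formulation (assumption: only count(foo) per bar is needed).
CREATE MATERIALIZED VIEW summary_foobar_grouped AS
SELECT bar, count(foo) AS cfoo
FROM foobar
GROUP BY bar;

-- Optional index on the join key of the summary.
CREATE INDEX summary_foobar_grouped_bar_idx
    ON summary_foobar_grouped (bar);

-- Recompute the summary whenever foobar has changed.
REFRESH MATERIALIZED VIEW summary_foobar_grouped;

-- Same lookup as the final query above, written with an explicit JOIN.
SELECT s.cfoo
FROM subfoobar AS f
JOIN summary_foobar_grouped AS s ON s.bar = f.bar;
```

Note that on PostgreSQL 9.3, REFRESH MATERIALIZED VIEW locks the view against reads while it runs, so it is best scheduled outside the data-mining runs.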
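You probably need to increase work_mem. Your query is sorting on disk. The sort uses 27MB, so try setting work_mem to around 64MB and see how it performs.
You can set it per session or per transaction, as well as in postgresql.conf.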
SET work_mem TO '64MB';
will set it for your current session.
Obviously, a sensible value depends on how much RAM your machine has and how many concurrent connections you expect.
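To limit the larger setting to a single transaction rather than the whole session, SET LOCAL can be used. A minimal sketch, reusing the query from the question and the 64MB figure suggested above:

```sql
-- Check the value currently in effect.
SHOW work_mem;

BEGIN;
-- Applies only until COMMIT or ROLLBACK of this transaction.
SET LOCAL work_mem = '64MB';
SELECT count(foo) OVER (PARTITION BY bar) FROM foobar;
COMMIT;

-- After a session-level SET, this returns to the configured default.
RESET work_mem;
```

Keeping the increase local to the heavy query avoids granting the same memory budget to every sort in every concurrent connection.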