Sco*_*all 11 postgresql aggregate window-functions
I have a table with two columns of integer arrays holding permutations and combinations, and a third column holding a value, like so:
CREATE TABLE foo
(
perm integer[] NOT NULL,
combo integer[] NOT NULL,
value numeric NOT NULL DEFAULT 0
);
INSERT INTO foo
VALUES
( '{3,1,2}', '{1,2,3}', '1.1400' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0.9280' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0.8000' );
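I want to find the average and the standard deviation for each permutation, as well as for each combination. I can do that with this query: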
SELECT
    f1.perm,
    f2.combo,
    f1.perm_average_value,
    f2.combo_average_value,
    f1.perm_stddev,
    f2.combo_stddev,
    f1.perm_count,
    f2.combo_count
FROM
(
    SELECT
        perm,
        combo,
        avg( value ) AS perm_average_value,
        stddev_pop( value ) AS perm_stddev,
        count( * ) AS perm_count
    FROM foo
    GROUP BY perm, combo
) AS f1
JOIN
(
    SELECT
        combo,
        avg( value ) AS combo_average_value,
        stddev_pop( value ) AS combo_stddev,
        count( * ) AS combo_count
    FROM foo
    GROUP BY combo
) AS f2 ON ( f1.combo = f2.combo );
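However, that query gets pretty slow when I have a lot of data, because the "foo" table (which in reality consists of 14 partitions, each with roughly 4 million rows) has to be scanned twice.
Recently, I learned that Postgres supports "window functions", which are basically like a GROUP BY for a particular column. I modified my query to use these: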
SELECT
    perm,
    combo,
    avg( value ) AS perm_average_value,
    avg( avg( value ) ) OVER w_combo AS combo_average_value,
    stddev_pop( value ) AS perm_stddev,
    stddev_pop( avg( value ) ) OVER w_combo AS combo_stddev,
    count( * ) AS perm_count,
    sum( count( * ) ) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS ( PARTITION BY combo );
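While this works for the "combo_count" column, the "combo_average_value" and "combo_stddev" columns are no longer accurate. It appears that the average is taken for each permutation and then averaged a second time for each combination, which is incorrect.
How can I fix this? Can window functions even be used as an optimization here?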
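You can use window functions on the results of aggregate functions within a single query level.
After a few modifications this would all work nicely, except that it fails for the standard deviation on mathematical principle. The calculation involved is not linear, so you cannot simply combine the standard deviations of sub-populations.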
SELECT perm
,combo
,avg(value) AS perm_average_value
,sum(avg(value) * count(*)) OVER w_combo /
sum(count(*)) OVER w_combo AS combo_average_value
,stddev_pop(value) AS perm_stddev
,0 AS combo_stddev -- doesn't work!
,count(*) AS perm_count
,sum(count(*)) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
For combo_average_value you need this expression:
sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo
That is because you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
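To see the difference on a tiny made-up data set (the inline foo below is a stand-in for the real table, and the column name naive_combo_avg is only for illustration): one permutation has four rows of 10, the other a single row of 0. The plain average of the two group averages is 5, while the weighted expression gives 8, the average over all five rows. Here sum(sum(value)) is used as an equivalent of sum(avg(value) * count(*)). The last column goes beyond the query above and is only a sketch: the per-group standard deviations alone are not enough, but the population standard deviation can be recombined from per-group sums, since the population variance equals the mean of the squares minus the square of the mean. With inexact types such as double precision the subtraction can lose precision, hence the greatest(..., 0) guard.

```sql
WITH foo(perm, combo, value) AS (          -- stand-in data, not the real table
   VALUES ('{2,1}'::int[], '{1,2}'::int[], 10::numeric)
        , ('{2,1}', '{1,2}', 10)
        , ('{2,1}', '{1,2}', 10)
        , ('{2,1}', '{1,2}', 10)
        , ('{1,2}', '{1,2}', 0)
   )
SELECT perm
     , combo
       -- average of group averages: ignores group sizes
     , avg(avg(value)) OVER w_combo AS naive_combo_avg
       -- weighted average: total sum divided by total row count
     , sum(sum(value)) OVER w_combo
     / sum(count(*))   OVER w_combo AS combo_average_value
       -- sqrt( mean of squares - square of mean ), from per-group sums
     , sqrt(greatest(
          sum(sum(value * value)) OVER w_combo / sum(count(*)) OVER w_combo
        - (sum(sum(value)) OVER w_combo / sum(count(*)) OVER w_combo) ^ 2
       , 0)) AS combo_stddev
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
```

On this sample, combo_stddev comes out as 4, the same as stddev_pop(value) over all five rows. Whether this is faster than the two-scan query on the real table would need to be measured.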
This works:
SELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);
I am using two different windows here, and reduce the rows with DISTINCT ON, which is applied even after the window functions.
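A small sketch of that order of evaluation, using made-up stand-in data rather than the real table: window functions do not collapse rows, so without DISTINCT ON the query below would return three rows, one per input row. Every row of the same (perm, combo) group carries identical window results, so it does not matter which of them DISTINCT ON keeps.

```sql
WITH foo(perm, combo, value) AS (          -- stand-in data, not the real table
   VALUES ('{2,1}'::int[], '{1,2}'::int[], 10::numeric)
        , ('{2,1}', '{1,2}', 20)
        , ('{1,2}', '{1,2}', 0)
   )
SELECT DISTINCT ON (perm, combo)           -- applied after the window functions
       perm
     , combo
     , count(*)   OVER wpc AS perm_count   -- rows per (perm, combo)
     , count(*)   OVER wc  AS combo_count  -- rows per combo
     , avg(value) OVER wc  AS combo_average_value
FROM   foo
WINDOW wc  AS (PARTITION BY combo)
     , wpc AS (PARTITION BY perm, combo);
```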
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
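Arrays have an overhead of 24 bytes (with slight variations depending on the type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout: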
CREATE TABLE combo (
combo_id serial PRIMARY KEY
,combo int[] NOT NULL
);
CREATE TABLE perm (
perm_id serial PRIMARY KEY
,perm int[] NOT NULL
);
CREATE TABLE value (
perm_id int REFERENCES perm(perm_id)
,combo_id int REFERENCES combo(combo_id)
,value numeric NOT NULL DEFAULT 0
);
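If you don't need referential integrity, you can omit the foreign key constraints.
The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int (8 bytes), no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)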
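One way to get a rough feel for the difference is pg_column_size(), which reports the number of bytes used to store a value. The literals below are only examples; values stored in a real table can come out slightly different, for instance because of short headers or compression, so treat this as an indication rather than an exact measurement.

```sql
SELECT pg_column_size('{3,1,2}'::int[]) AS array_bytes  -- array header plus 3 elements
     , pg_column_size(1::int)           AS int_bytes;   -- a plain integer surrogate key
```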
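More on physical storage in this related answer on SO, or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.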
You would first aggregate in a subquery and then join to perm and combo for best performance.
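A minimal sketch of that query shape, assuming the example layout above. The three inline CTEs only stand in for the real perm, combo and value tables so that the snippet runs on its own; their contents are made up. Only the per-permutation aggregates are shown; combination-level figures could be added in the same subquery.

```sql
WITH perm(perm_id, perm) AS (              -- stand-in for table perm
   VALUES (1, '{3,1,2}'::int[]), (2, '{3,2,1}')
   )
, combo(combo_id, combo) AS (              -- stand-in for table combo
   VALUES (1, '{1,2,3}'::int[])
   )
, value(perm_id, combo_id, value) AS (     -- stand-in for table value
   VALUES (1, 1, 1.14::numeric), (1, 1, 0), (1, 1, 1.268)
        , (2, 1, 0), (2, 1, 0.8)
   )
SELECT p.perm
     , c.combo
     , a.perm_average_value
     , a.perm_stddev
     , a.perm_count
FROM  (                                    -- aggregate on plain integers first
   SELECT v.perm_id
        , v.combo_id
        , avg(v.value)        AS perm_average_value
        , stddev_pop(v.value) AS perm_stddev
        , count(*)            AS perm_count
   FROM   value v
   GROUP  BY v.perm_id, v.combo_id
   ) a
JOIN   perm  p USING (perm_id)             -- look up the arrays only afterwards
JOIN   combo c USING (combo_id);
```

The arrays are joined only after the row count has been reduced, so the expensive grouping works on two integers per row.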