How can I get an aggregate over a window function in Postgres?

Sco*_*all 11 postgresql aggregate window-functions

I have a table with two columns holding permutations/combinations of integer arrays, and a third column holding a value, like so:

CREATE TABLE foo
(
  perm integer[] NOT NULL,
  combo integer[] NOT NULL,
  value numeric NOT NULL DEFAULT 0
);
INSERT INTO foo
VALUES
( '{3,1,2}', '{1,2,3}', '1.1400' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0.9280' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0.8000' );

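I want to find the average and the standard deviation for each permutation, as well as for each combination. I can do that with this query: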

SELECT
  f1.perm,
  f2.combo,
  f1.perm_average_value,
  f2.combo_average_value,
  f1.perm_stddev,
  f2.combo_stddev,
  f1.perm_count,
  f2.combo_count
FROM
(
  SELECT
    perm,
    combo,
    avg( value ) AS perm_average_value,
    stddev_pop( value ) AS perm_stddev,
    count( * ) AS perm_count
  FROM foo
  GROUP BY perm, combo
) AS f1
JOIN
(
  SELECT
    combo,
    avg( value ) AS combo_average_value,
    stddev_pop( value ) AS combo_stddev,
    count( * ) AS combo_count
  FROM foo
  GROUP BY combo
) AS f2 ON ( f1.combo = f2.combo );

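However, this query gets very slow when I have a lot of data, because the "foo" table (which actually consists of 14 partitions of roughly 4 million rows each) has to be scanned twice.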

Recently, I learned that Postgres supports "window functions", which are basically like a GROUP BY for a particular column. I modified my query to use them:

SELECT
  perm,
  combo,
  avg( value ) as perm_average_value,
  avg( avg( value ) ) over w_combo AS combo_average_value,
  stddev_pop( value ) as perm_stddev,
  stddev_pop( avg( value ) ) over w_combo as combo_stddev,
  count( * ) as perm_count,
  sum( count( * ) ) over w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS ( PARTITION BY combo );

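While this works for the "combo_count" column, the "combo_average_value" and "combo_stddev" columns are no longer accurate. It appears that the average is taken for each permutation and then averaged a second time for each combination, which is incorrect.

How can I fix this? Can window functions even be used as an optimization here?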

Erw*_*ter 9

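You can use window functions on the result of aggregate functions in a single query level.

After a few modifications this would all work nicely, except that it fails for the standard deviation on mathematical principle. The calculations involved are not linear, so you cannot simply combine the standard deviations of sub-populations.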

SELECT perm
      ,combo
      ,avg(value)                 AS perm_average_value
      ,sum(avg(value) * count(*)) OVER w_combo /
       sum(count(*)) OVER w_combo AS combo_average_value
      ,stddev_pop(value)          AS perm_stddev
      ,0                          AS combo_stddev  -- doesn't work!
      ,count(*)                   AS perm_count
      ,sum(count(*)) OVER w_combo AS combo_count
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo  AS (PARTITION BY combo);
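The standard deviations themselves cannot be averaged, but the underlying sums can be combined, because the population variance equals the mean of the squares minus the square of the mean. The following is only a sketch of that idea and not part of the answer as originally given; the subquery alias s and the column names sum_sq, sum_v and n are made up for illustration, and it has not been timed against the partitioned table:

```sql
SELECT perm
      ,combo
       -- population stddev per combo = sqrt( E[x^2] - (E[x])^2 )
      ,sqrt(greatest(sum_sq / n - (sum_v / n) ^ 2, 0)) AS combo_stddev
FROM  (
   SELECT perm
         ,combo
         ,sum(sum(value * value)) OVER w_combo AS sum_sq  -- sum of squares per combo
         ,sum(sum(value))         OVER w_combo AS sum_v   -- sum of values per combo
         ,sum(count(*))           OVER w_combo AS n       -- row count per combo
   FROM   foo
   GROUP  BY perm, combo
   WINDOW w_combo AS (PARTITION BY combo)
   ) s;
```

The greatest(..., 0) wrapper only guards against a tiny negative value caused by rounding in the division.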

For combo_average_value you would need this expression

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
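To see the difference with the sample data from the question: perm {3,1,2} has 15 rows summing to 8.408, and perm {3,2,1} has 2 rows summing to 0.8. A small standalone check (the alias g and its column names are made up for illustration):

```sql
-- One row per perm group: its average and its row count
SELECT avg(group_avg)              AS naive_avg_of_avgs  -- ignores group sizes
      ,sum(group_avg * n) / sum(n) AS weighted_avg       -- weights each average by its count
FROM  (VALUES (8.408 / 15, 15)
             ,(0.8   / 2 ,  2)) AS g(group_avg, n);
```

The naive average of averages comes out at about 0.48, while the weighted form gives about 0.54, which matches avg(value) over all 17 rows (9.208 / 17).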

This works:

SELECT DISTINCT ON (perm, combo)
       perm
      ,combo
      ,avg(value)        OVER wpc AS perm_average_value
      ,avg(value)        OVER wc  AS combo_average_value
      ,stddev_pop(value) OVER wpc AS perm_stddev
      ,stddev_pop(value) OVER wc  AS combo_stddev
      ,count(*)          OVER wpc AS perm_count
      ,count(*)          OVER wc  AS combo_count
FROM   foo
WINDOW wc  AS (PARTITION BY combo)
      ,wpc AS (PARTITION BY perm, combo);

I am using two different windows here, and reduce the rows with DISTINCT, which is applied even after window functions.
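The order of evaluation can be checked on toy data. In this minimal sketch (the alias t and the column grp are made up), the window count is computed over all three input rows first, and only then are duplicates removed:

```sql
-- 'a' appears twice, 'b' once
SELECT DISTINCT ON (grp)
       grp
      ,count(*) OVER (PARTITION BY grp) AS grp_count
FROM  (VALUES ('a'), ('a'), ('b')) AS t(grp);
```

This returns one row per grp, with a count of 2 for 'a' and 1 for 'b'.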

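But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.

Better performance with altered table layout

Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout: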

CREATE TABLE combo ( 
  combo_id serial PRIMARY KEY
 ,combo    int[] NOT NULL
);

CREATE TABLE perm ( 
  perm_id  serial PRIMARY KEY
 ,perm     int[] NOT NULL
);

CREATE TABLE value (
  perm_id  int REFERENCES perm(perm_id)
 ,combo_id int REFERENCES combo(combo_id)
 ,value numeric NOT NULL DEFAULT 0
);

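If you don't need referential integrity, you can omit the foreign key constraints.

The link to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.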
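Filling the new tables from the existing one could look roughly like this. It is only a sketch, assuming foo still exists and the three new tables are empty; the aliases f, p and c are made up:

```sql
-- One row per distinct array in each lookup table
INSERT INTO combo (combo) SELECT DISTINCT combo FROM foo;
INSERT INTO perm  (perm)  SELECT DISTINCT perm  FROM foo;

-- Replace each array with its integer key
INSERT INTO value (perm_id, combo_id, value)
SELECT p.perm_id, c.combo_id, f.value
FROM   foo   f
JOIN   perm  p USING (perm)
JOIN   combo c USING (combo);
```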

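This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int (8 bytes), no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)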
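To get a feel for the sizes involved, pg_column_size() reports the number of bytes used to store a single value. A quick check on literals taken from the sample data (the output column names are made up; values stored on disk can come out slightly smaller than these, so treat the numbers as a rough guide):

```sql
SELECT pg_column_size('{3,1,2}'::int[]) AS int_array_bytes
      ,pg_column_size(1::int)           AS int_bytes
      ,pg_column_size(1.2680::numeric)  AS numeric_bytes;
```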

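More on physical storage in this related answer on SO, or here:
Configuring PostgreSQL for read performance

Either way, that is only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.

You would first aggregate in a subquery and then join to perm and combo for best performance.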
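A sketch of what that could look like with the layout above, reusing the two-window query on the integer keys. The aliases v, a, p and c are made up, and this has not been benchmarked:

```sql
SELECT p.perm
      ,c.combo
      ,a.perm_average_value
      ,a.combo_average_value
      ,a.perm_stddev
      ,a.combo_stddev
      ,a.perm_count
      ,a.combo_count
FROM  (
   -- Aggregate on the narrow integer keys first ...
   SELECT DISTINCT ON (v.perm_id, v.combo_id)
          v.perm_id
         ,v.combo_id
         ,avg(v.value)        OVER wpc AS perm_average_value
         ,avg(v.value)        OVER wc  AS combo_average_value
         ,stddev_pop(v.value) OVER wpc AS perm_stddev
         ,stddev_pop(v.value) OVER wc  AS combo_stddev
         ,count(*)            OVER wpc AS perm_count
         ,count(*)            OVER wc  AS combo_count
   FROM   value v
   WINDOW wc  AS (PARTITION BY v.combo_id)
         ,wpc AS (PARTITION BY v.perm_id, v.combo_id)
   ) a
-- ... then look up the arrays once per result row
JOIN   perm  p USING (perm_id)
JOIN   combo c USING (combo_id);
```

The arrays are only touched for the few aggregated rows, not for every row of the big table.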