Whe*_*hee 5 postgresql aggregate postgresql-10
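I am trying to write SQL that computes weighted, continuous values for a given set of percentile levels (25%, 50% and 75% are used below, but the solution should allow arbitrary levels as parameters). In other words, I want to find the interpolated "raw" values (weighted by "cnt") at the 25%, 50% and 75% cumulative percentiles of the test data in the "source" table below.
Note: cnt is the number of times the raw value occurred during the sampling period, and the expected output weights raw by cnt to arrive at the percentiles (similar to quantiles, the median and similar statistics).
Test data: (table: source)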
| site | dateval | raw | cnt |
+--------+------------+-------+---------+
| A | 2019-01-05 | 45 | 14 |
| A | 2019-01-05 | 52 | 178 |
| A | 2019-01-05 | 45 | 9 |
| A | 2019-01-05 | 37 | 75 |
| A | 2019-01-05 | 23 | 98 |
| A | 2019-01-05 | 78 | 102 |
| A | 2019-01-05 | 56 | 9 |
| A | 2019-01-05 | 17 | 54 |
| A | 2019-01-05 | 56 | 8 |
| A | 2019-01-06 | 33 | 35 |
| A | 2019-01-06 | 67 | 45 |
| A | 2019-01-06 | 65 | 93 |
| A | 2019-01-06 | 89 | 113 |
| A | 2019-01-06 | 52 | 64 |
| A | 2019-01-06 | 101 | 12 |
| B | 2019-01-05 | 5 | 25 |
| B | 2019-01-05 | 16 | 48 |
| B | 2019-01-05 | 12 | 107 |
| B | 2019-01-05 | 25 | 78 |
| B | 2019-01-05 | 44 | 53 |
| B | 2019-01-05 | 8 | 12 |
| B | 2019-01-05 | 31 | 32 |
| B | 2019-01-06 | 34 | 87 |
| B | 2019-01-06 | 18 | 35 |
| B | 2019-01-06 | 51 | 17 |
| B | 2019-01-06 | 22 | 23 |
| B | 2019-01-06 | 14 | 52 |
| B | 2019-01-06 | 6 | 34 |
+--------+------------+-------+---------+
Expected output (rounded to the nearest 1/100):
| site | dateval | p00 | p25 | p50 | p75 | p100 |
+--------+------------+---------+---------+---------+---------+---------+
| A | 2019-01-05 | 17.00 | 22.07 | 45.92 | 51.30 | 78.00 |
| A | 2019-01-06 | 33.00 | 49.48 | 63.46 | 73.72 | 101.00 |
| B | 2019-01-05 | 5.00 | 9.93 | 14.79 | 24.57 | 44.00 |
| B | 2019-01-06 | 6.00 | 10.31 | 18.52 | 27.79 | 51.00 |
+--------+------------+---------+---------+---------+---------+---------+
Note: the results above assume linear smoothing between raw values. For example, the p25 value 22.07 = [ (25.00% - 54/547) / ((98+54)/547 - 54/547) ] * (23-17) + 17, where 547 = sum(cnt) for site='A' and dateval='2019-01-05'.
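As a quick sanity check of that arithmetic, the example can be evaluated directly. This is only a one-off sketch with the literals copied from the formula above, not a general solution:

```sql
-- Literals copied from the p25 example for site 'A', dateval '2019-01-05'.
-- Dividing by 547.0 forces numeric (not integer) division.
SELECT round(
          17                                        -- lower raw value
        + (0.25 - 54/547.0)                         -- distance of the level past the lower cumulative share
        / ((98 + 54)/547.0 - 54/547.0)              -- width of the cumulative-share interval
        * (23 - 17)                                 -- distance between the two raw values
       , 2) AS p25;                                 -- returns 22.07
```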
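Current SQL
The query below computes percentile values at the discrete points given by the "raw" values present in table "source". The desired output, however, is the continuous "raw" value corresponding to a given percentile (for simplicity, interpolation between discrete "raw" levels is linear rather than spline or anything else). Frankly, I am not sure the approach below is the most suitable path: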
WITH raw_lvl AS (
SELECT "site", "dateval", "raw", sum("cnt") AS "sumcnt"
FROM source
GROUP BY "site", "dateval", "raw"
), cum_raw AS (
SELECT tlr.*, sum(tlr."sumcnt") OVER "win_cr" AS "cumsumcnt"
FROM raw_lvl AS "tlr"
WINDOW "win_cr" AS (PARTITION BY tlr."site", tlr."dateval" ORDER BY tlr."raw" ASC)
)
SELECT cr.*, cr."cumsumcnt"/(sum(cr."sumcnt") OVER "win_pr") AS "percentile"
FROM cum_raw AS cr
WINDOW "win_pr" AS (PARTITION BY cr."site", cr."dateval");
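Postgres version 10.3
Postgres has ordered-set aggregate functions for your purpose.
The particular difficulty: you want rows "weighted" by cnt. If that means each row stands for cnt identical rows, you can multiply the input rows by joining to generate_series(1, cnt):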
SELECT site, dateval
, percentile_cont('{0,.25,.5,.75,1}'::float8[]) WITHIN GROUP (ORDER BY raw)
FROM source s, generate_series(1, s.cnt)
GROUP BY 1, 2;
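db<>fiddle here
But the results differ from your expected output (except for the 0 and 100 percentiles). So your "weighting" must be something different ...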
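To compare against the expected table column by column, the array returned by percentile_cont() can be split into one column per level. A minimal sketch, assuming raw is an integer or numeric column; the alias names p and sub are only illustrative:

```sql
SELECT site, dateval
     , round(p[1]::numeric, 2) AS p00   -- array elements follow the order of the requested levels
     , round(p[2]::numeric, 2) AS p25
     , round(p[3]::numeric, 2) AS p50
     , round(p[4]::numeric, 2) AS p75
     , round(p[5]::numeric, 2) AS p100
FROM  (
   SELECT site, dateval
        , percentile_cont('{0,.25,.5,.75,1}'::float8[]) WITHIN GROUP (ORDER BY raw) AS p
   FROM   source s, generate_series(1, s.cnt)   -- each row repeated cnt times
   GROUP  BY 1, 2
   ) sub
ORDER  BY 1, 2;
```

The cast to numeric is needed because round() with a precision argument is defined for numeric, not for float8.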
Aside from that, your original query can be simplified to this equivalent:
SELECT site, dateval, raw, sum(cnt) AS sumcnt
, sum(sum(cnt)) OVER w AS cumsumcnt
, sum(sum(cnt)) OVER w / sum(sum(cnt)) OVER (PARTITION BY site, dateval) AS percentile
FROM source
GROUP BY site, dateval, raw
WINDOW w AS (PARTITION BY site, dateval ORDER BY raw);
You can run a window function over the result of an aggregate function in the same SELECT (but not vice versa).
I added a demo to the fiddle above.
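The nesting rule can be seen on a toy input. A minimal sketch with made-up values (the names t, grp and val are hypothetical):

```sql
-- The inner sum(val) is the aggregate computed per group;
-- the outer sum(...) OVER is a window function running over the grouped rows.
SELECT grp
     , sum(val)                         AS grp_sum        -- a: 3, b: 10
     , sum(sum(val)) OVER (ORDER BY grp) AS running_total  -- a: 3, b: 13
FROM  (VALUES ('a', 1), ('a', 2), ('b', 10)) AS t(grp, val)
GROUP  BY grp
ORDER  BY grp;
```

The reverse nesting, an aggregate wrapped around a window function call, is rejected, because window functions are evaluated after aggregation.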
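But neither explains the odd numbers in your "expected results". Those strike me as incorrect, no matter how you interpolate. Example: 22.07 in the first row for p25 does not seem to make sense: after factoring in cnt according to your own query, the value 23 occupies all rows up to percentile 27.7879 ...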
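For reference, the linear smoothing described in the question's note (interpolating between the cumulative shares of adjacent distinct raw values) could be sketched as follows. This is only one reading of the question, not a built-in percentile definition. It assumes cnt > 0 everywhere, and the CTE and column names (cum, seg, pct, level, val) are made up for illustration:

```sql
WITH cum AS (   -- cumulative share of cnt per distinct raw value
   SELECT site, dateval, raw
        , sum(sum(cnt)) OVER w
        / sum(sum(cnt)) OVER (PARTITION BY site, dateval) AS pct
   FROM   source
   GROUP  BY site, dateval, raw
   WINDOW w AS (PARTITION BY site, dateval ORDER BY raw)
), seg AS (     -- one segment per pair of adjacent raw values
   SELECT site, dateval
        , lag(raw) OVER w AS raw_lo, raw AS raw_hi
        , lag(pct) OVER w AS pct_lo, pct AS pct_hi
   FROM   cum
   WINDOW w AS (PARTITION BY site, dateval ORDER BY raw)
)
SELECT s.site, s.dateval, p.level
     , round((CASE WHEN s.pct_lo IS NULL THEN s.raw_hi   -- below the first share: clamp to the lowest raw
                   ELSE s.raw_lo + (p.level - s.pct_lo) / (s.pct_hi - s.pct_lo)
                                 * (s.raw_hi - s.raw_lo)
              END)::numeric, 2) AS val
FROM   seg s
JOIN   unnest('{0,.25,.5,.75,1}'::numeric[]) AS p(level)  -- arbitrary levels go here
       ON p.level >  COALESCE(s.pct_lo, -1)
      AND p.level <= s.pct_hi
ORDER  BY 1, 2, 3;
```

This returns one row per site, dateval and level rather than one column per level. Worked by hand for site 'A' on 2019-01-05, the 0.25 level falls between raw 17 and 23 and comes out as 22.07, the same arithmetic as in the question's note.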