tuf*_*der 24 postgresql aggregate
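In a database of transactions spanning more than 1,000 entities over 18 months, I would like to run a query that groups every possible 30-day period by entity_id, with a SUM of the transaction amounts and a COUNT of the transactions in that 30-day period, and returns the data in a form I can then query against. After a lot of testing, this code accomplishes much of what I want: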
SELECT id, trans_ref_no, amount, trans_date, entity_id,
SUM(amount) OVER(PARTITION BY entity_id, date_trunc('month',trans_date) ORDER BY entity_id, trans_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS trans_total,
COUNT(id) OVER(PARTITION BY entity_id, date_trunc('month',trans_date) ORDER BY entity_id, trans_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS trans_count
FROM transactiondb;
I will use a similar construct in a larger query:
SELECT * FROM (
SELECT id, trans_ref_no, amount, trans_date, entity_id,
SUM(amount) OVER(PARTITION BY entity_id, date_trunc('month',trans_date) ORDER BY entity_id, trans_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS trans_total,
COUNT(id) OVER(PARTITION BY entity_id, date_trunc('month',trans_date) ORDER BY entity_id, trans_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS trans_count
FROM transactiondb ) q
WHERE trans_count >= 4
AND trans_total >= 50000;
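The case this query does not cover is when the transactions span multiple months but are still within 30 days of each other. Is this type of query possible with Postgres? If so, I welcome any input. Many of the other topics discuss "running" aggregates, not rolling ones.
The CREATE TABLE script: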
CREATE TABLE transactiondb (
id integer NOT NULL,
trans_ref_no character varying(255),
amount numeric(18,2),
trans_date date,
entity_id integer
);
Sample data can be found here. I'm running PostgreSQL 9.1.16.
Ideal output would include SUM(amount) and COUNT() of all transactions over a rolling 30-day period. See this image, for example:

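The green date highlighting indicates what my query currently includes. The yellow row highlighting indicates the records I would like to become part of the set.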
Erw*_*ter 32
You can simplify your query with a WINDOW clause, but that only shortens the syntax; it does not change the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
ORDER BY trans_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
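Also using count(*), since id is certainly defined NOT NULL? And you don't need ORDER BY entity_id, since you already PARTITION BY entity_id. You can simplify further, though:
Don't add ORDER BY to the window definition at all; it is not relevant to your query. Then you don't need to define a custom window frame either: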
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date));
Simpler and faster, but still just a better version of what you have, with static calendar months.
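To make the limitation of static months concrete, here is a minimal sketch with four hypothetical rows (invented for illustration, not taken from the asker's sample data). All four transactions fall within 30 days of the first one, yet partitioning by calendar month splits them into two groups:

```sql
-- Hypothetical rows for illustration only.
WITH sample(id, amount, trans_date, entity_id) AS (
   VALUES (1, 20000.00, date '2015-01-25', 1)
        , (2, 20000.00, date '2015-01-30', 1)
        , (3, 20000.00, date '2015-02-03', 1)
        , (4, 20000.00, date '2015-02-10', 1)
   )
SELECT id, trans_date
     , sum(amount) OVER w AS trans_total   -- total per calendar month
     , count(*)    OVER w AS trans_count   -- count per calendar month
FROM   sample
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date))
ORDER  BY id;
```

Each row reports only the two transactions of its own month, so a filter such as trans_count >= 4 would miss this entity even though a 30-day window starting on 2015-01-25 contains all four rows.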
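... is not clearly defined, so I'll build on these assumptions:
Count transactions and amounts for every 30-day period between the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.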
SELECT entity_id, trans_date
, COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
, COALESCE(sum(daily_count) OVER w, 0) AS trans_count
FROM (
SELECT entity_id
, generate_series (min(trans_date)::timestamp
, GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
, interval '1 day')::date AS trans_date
FROM transactiondb
GROUP BY 1
) x
LEFT JOIN (
SELECT entity_id, trans_date
, sum(amount) AS daily_amount, count(*) AS daily_count
FROM transactiondb
GROUP BY 1, 2
) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
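This lists every 30-day period for each entity_id with your aggregates, with trans_date being the first day (inclusive) of the period. To get values for each individual row, join to the base table once more ...
The basic difficulty is the same as discussed here:
The frame definition of a window cannot depend on values of the current row.
And rather call generate_series() with timestamp input:
After the question update and discussion: accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.
Since your data is sparsely distributed, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet: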
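The frame restriction above applies to the asker's PostgreSQL 9.1. As a side note for readers on newer releases: PostgreSQL 11 added RANGE frames with an offset, which lets the frame follow the ordering value directly. The following is only a sketch under that version assumption and would not run on 9.1:

```sql
-- Assumes PostgreSQL 11 or later (RANGE with offset FOLLOWING).
SELECT id, amount, trans_date, entity_id
     , sum(amount) OVER w AS trans_total
     , count(*)    OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             -- current day plus the 29 following days = 30-day window
             RANGE BETWEEN CURRENT ROW AND interval '29 days' FOLLOWING);
```

In RANGE mode, CURRENT ROW covers all rows with the same trans_date, so every transaction of the starting day is part of its own window.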
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
, sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM transactiondb t0
JOIN transactiondb t1 USING (entity_id)
WHERE t1.trans_date >= t0.trans_date
AND t1.trans_date < t0.trans_date + 30 -- exclude upper bound
-- AND t0.entity_id = 114284 -- or pick a single entity ...
GROUP BY t0.id -- is PK!
ORDER BY t0.trans_date, t0.id
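A rolling window would only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit: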
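The thresholds from the question (at least 4 transactions and a total of at least 50000) can be applied to the self-join directly with a HAVING clause. This is a sketch, not a tested solution; it groups by all selected t0 columns so that it does not rely on id being declared as primary key, which the posted CREATE TABLE script does not show:

```sql
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30   -- exclude upper bound
GROUP  BY t0.id, t0.amount, t0.trans_date, t0.entity_id
HAVING count(*) >= 4                         -- thresholds from the question
AND    sum(t1.amount) >= 50000
ORDER  BY t0.trans_date, t0.id;
```

Each returned row is a transaction that starts a 30-day window meeting both thresholds.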
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column, amount, is only useful if you get index-only scans out of it. Otherwise, drop it.
But the index is not going to be used while you select the whole table anyway. It would support queries for a small subset.
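One way to check whether the index is actually picked up is to restrict the self-join to a single entity and inspect the plan. The sketch below reuses entity 114284 from the commented-out line in the query above; the plan you see depends on your data and settings. Note that index-only scans were introduced in PostgreSQL 9.2, so on 9.1 the trailing amount column brings no benefit.

```sql
-- EXPLAIN ANALYZE executes the query and reports the actual plan.
EXPLAIN ANALYZE
SELECT t0.id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30
AND    t0.entity_id = 114284   -- small subset: one entity
GROUP  BY t0.id;
```

If the plan shows a sequential scan on transactiondb even for a single entity, the index is not helping this query in your setup.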