Efficiently computing aggregate functions over subqueries with incremental data

ank*_*shg 6 postgresql performance postgresql-9.3 query-performance

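I have a PostgreSQL database (version 9.3.6) that contains a large number of orders. As orders are processed, scan_events are triggered and stored, with multiple events per order. Each scan event has a boolean indicating whether the order can be marked as completed after that event, but more than one completed scan can occur. In general, though, the only scans I really care about are the first scan and the first completed scan.

I want to know the mean and standard deviation of the percentage of orders created on a given date that received their first scan within x days of being created.

Schema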

CREATE TABLE orders (
    id character varying(40) NOT NULL,
    date_created timestamp with time zone NOT NULL
);

ALTER TABLE ONLY orders
    ADD CONSTRAINT orders_pkey PRIMARY KEY (id);

CREATE TABLE scan_events (
    id character varying(100) NOT NULL,
    order_id character varying(40) NOT NULL,
    "time" timestamp with time zone NOT NULL
);

CREATE INDEX scan_events_order_id_idx ON scan_events USING btree (order_id);

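Desired computation

For values of days_elapsed ranging from 1 to 14 days, I need the mean and standard deviation of the following:

  1. The percentage of orders that received any scan within days_elapsed days of orders.date_created, for each day in the past 30 days (that is, grouped by DATE(orders.date_created))

  2. The percentage of orders that received a scan with completed = TRUE within days_elapsed days of orders.date_created, for each day in the past 30 days (that is, grouped by DATE(orders.date_created))

Ideally the output would look like this, but honestly, anything that performs well is fine.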

output
----------------
days_elapsed
mean_scanned
stddev_scanned
mean_completed
stddev_completed

Progress so far

I have a query that gets me the counts for each day (optionally using WHERE scan_events.completed IS TRUE to get the completed scans):

SELECT DATE(orders.date_created),
    COUNT(DISTINCT orders.id) AS total, 
    COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL AND DATE_PART('day', scan_events.time - orders.date_created) <= 1 THEN orders.id ELSE NULL END) AS scanned,
    COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL AND scan_events.completed AND DATE_PART('day', scan_events.time - orders.date_created) <= 1 THEN orders.id ELSE NULL END) AS completed
FROM orders
LEFT JOIN scan_events ON orders.id = scan_events.order_id
WHERE orders.date_created BETWEEN '2015-07-01' AND '2015-07-31'
GROUP BY DATE(orders.date_created)
ORDER BY DATE(orders.date_created) ASC;

For days_elapsed = 1, this query is roughly how I imagine it should work:

SELECT AVG(counts.scanned * 1.0 / counts.total) AS mean_scanned,
    STDDEV(counts.scanned * 1.0 / counts.total) AS stddev_scanned,
    AVG(counts.completed * 1.0 / counts.total) AS mean_completed,
    STDDEV(counts.completed * 1.0 / counts.total) AS stddev_completed
FROM ( 
    SELECT DATE(orders.date_created),
        COUNT(DISTINCT orders.id) AS total, 
        COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL AND DATE_PART('day', scan_events.time - orders.date_created) <= 1 THEN orders.id ELSE NULL END) AS scanned,
        COUNT(DISTINCT CASE WHEN scan_events.id IS NOT NULL AND scan_events.completed AND DATE_PART('day', scan_events.time - orders.date_created) <= 1 THEN orders.id ELSE NULL END) AS completed
    FROM orders
    LEFT JOIN scan_events ON orders.id = scan_events.order_id
    WHERE orders.date_created BETWEEN '2015-07-01' AND '2015-07-31'
    GROUP BY DATE(orders.date_created)
) counts

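The problem with this is that it is definitely redoing work it doesn't need to: run once per days_elapsed value, it repeats the same join and grouping every time.

Things I think we can take advantage of:

  • AVG and STDDEV ignore null values, so we can do some CASE WHEN ... THEN ... END trickery
  • The set that AVG and STDDEV are computed over might be built up incrementally as we increase days_elapsed by one

Any help would be greatly appreciated; my SQL-fu is not up to snuff :(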

Mic*_*een 1

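Standard deviation can be calculated from the number of values, the sum of the values, and the sum of the squares of the values. Each of these can be maintained incrementally as new values arrive and stored in a working table. The working table would look like: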

DailyTotals (
  OrderDate,
  NumberOfValues,
  SumOfValues,
  SumOfSquareOfValues);

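Since the working table is keyed by date, your fourteen-day sliding window can be achieved. Because each of the stored values is a sum, summing them again across dates is mathematically sound. Yes, there is still a calculation at run time, but it is much lighter than computing the full standard deviation from the raw rows.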
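
As a rough sketch of that run-time calculation: the query below assumes a hypothetical daily_totals table shaped like the working table described above (the table name, column names, types, and date range are illustrative assumptions, not taken from the question). It derives the mean and the sample standard deviation, which is what PostgreSQL's STDDEV returns, from the three stored sums alone. What counts as one "value" is left open here.

```sql
-- Hypothetical working table; names and types are assumptions.
CREATE TABLE IF NOT EXISTS daily_totals (
    order_date              date    PRIMARY KEY,
    number_of_values        bigint  NOT NULL,
    sum_of_values           numeric NOT NULL,
    sum_of_square_of_values numeric NOT NULL
);

-- Combine the per-date sums over the window, then derive
--   mean   = sum(x) / n
--   stddev = sqrt((sum(x^2) - sum(x)^2 / n) / (n - 1))
-- NULLIF yields NULL instead of a division-by-zero error when n is 0 or 1;
-- GREATEST guards against a tiny negative value caused by rounding.
SELECT t.n,
       t.sx / NULLIF(t.n, 0) AS mean,
       sqrt(GREATEST(t.sxx - t.sx * t.sx / NULLIF(t.n, 0), 0)
            / NULLIF(t.n - 1, 0)) AS stddev
FROM (
    SELECT sum(number_of_values)        AS n,
           sum(sum_of_values)           AS sx,
           sum(sum_of_square_of_values) AS sxx
    FROM daily_totals
    WHERE order_date BETWEEN DATE '2015-07-01' AND DATE '2015-07-31'
) AS t;
```

Storing the sums as numeric rather than double precision is a deliberate choice in this sketch: the sum-of-squares form subtracts two large, nearly equal numbers, which can lose precision in floating point.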

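The working table can be updated synchronously as new values arrive (it is a one-row update), or asynchronously or in batch, depending on your application's appetite for stale data.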
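
A minimal sketch of the synchronous one-row update, again assuming a hypothetical daily_totals table (its name and columns, the date, and the value 0.85 are all made up for illustration). PostgreSQL 9.3 has no INSERT ... ON CONFLICT (that arrived in 9.5), so this version updates first and inserts only if no row exists for that date yet:

```sql
-- Hypothetical working table; names and types are assumptions.
CREATE TABLE IF NOT EXISTS daily_totals (
    order_date              date    PRIMARY KEY,
    number_of_values        bigint  NOT NULL,
    sum_of_values           numeric NOT NULL,
    sum_of_square_of_values numeric NOT NULL
);

BEGIN;

-- Fold one new value (0.85, an example) into the row for its date.
UPDATE daily_totals
SET number_of_values        = number_of_values + 1,
    sum_of_values           = sum_of_values + 0.85,
    sum_of_square_of_values = sum_of_square_of_values + 0.85 * 0.85
WHERE order_date = DATE '2015-07-15';

-- First value for that date: create the row if none exists yet.
-- When the UPDATE above matched a row, this inserts nothing.
INSERT INTO daily_totals (order_date, number_of_values,
                          sum_of_values, sum_of_square_of_values)
SELECT DATE '2015-07-15', 1, 0.85, 0.85 * 0.85
WHERE NOT EXISTS (
    SELECT 1 FROM daily_totals WHERE order_date = DATE '2015-07-15'
);

COMMIT;
```

Two sessions inserting the first value for the same date at the same time can still collide on the primary key, so a real implementation would need to retry on that error or serialize the writers.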