Calculate a 40-day moving average with respect to a field

Question

lpr*_*ani 6 postgresql window-functions postgresql-9.4

I have a table that stores information about user calls in a call center. The table has a call_id, the date the call was made, the actual date and time of the call, the call type, and a score associated with the call.

I need to calculate a 40-day moving average of the score with respect to the call date. The 40-day window should start from the day before the call date. If there are no calls in the previous 40 days, the average should instead be computed over the rows for the call date itself.

Below is sample data:

select * from test_aes;

Output:

call_id | call_dt_key | call_type_id |       call_dt_tm       | aes_raw
   1    | 2016-01-01  | CT1          | 2016-01-01 00:00:10-08 |      10
   2    | 2016-01-01  | CT1          | 2016-01-01 00:00:20-08 |      20
   3    | 2016-01-01  | CT1          | 2016-01-01 00:00:30-08 |      10
   4    | 2016-01-01  | CT1          | 2016-01-01 00:00:40-08 |      20
   5    | 2016-01-01  | CT1          | 2016-01-01 00:00:50-08 |      10
   6    | 2016-01-01  | CT1          | 2016-01-01 00:01:00-08 |      20
   7    | 2016-01-01  | CT1          | 2016-01-01 00:02:00-08 |      10
   8    | 2016-01-01  | CT1          | 2016-01-01 00:03:00-08 |      20
   9    | 2016-01-01  | CT1          | 2016-01-01 00:04:00-08 |      10
  10    | 2016-01-01  | CT1          | 2016-01-01 00:05:00-08 |      20
  11    | 2016-01-05  | CT1          | 2016-01-05 00:00:10-08 |      10
  12    | 2016-01-05  | CT1          | 2016-01-05 00:00:20-08 |      10
  13    | 2016-01-05  | CT1          | 2016-01-05 00:00:30-08 |      20
  14    | 2016-01-05  | CT1          | 2016-01-05 00:00:40-08 |      20
  15    | 2016-01-05  | CT1          | 2016-01-05 00:00:50-08 |      20
  16    | 2016-01-10  | CT1          | 2016-01-10 00:00:10-08 |      10
  17    | 2016-01-10  | CT1          | 2016-01-10 00:00:20-08 |      20
  18    | 2016-01-15  | CT1          | 2016-01-15 00:00:10-08 |      10
  19    | 2016-01-15  | CT1          | 2016-01-15 00:00:20-08 |      20
  20    | 2016-01-15  | CT1          | 2016-01-15 00:00:30-08 |      20
  21    | 2016-01-16  | CT1          | 2016-01-16 00:00:10-08 |      20
  22    | 2016-01-16  | CT1          | 2016-01-16 00:00:20-08 |      10
  23    | 2016-01-16  | CT1          | 2016-01-16 00:00:30-08 |      20
  24    | 2016-01-20  | CT1          | 2016-01-20 00:00:10-08 |      20
  25    | 2016-01-20  | CT1          | 2016-01-20 00:00:20-08 |      10
  26    | 2016-01-21  | CT1          | 2016-01-21 00:00:10-08 |      10
  27    | 2016-01-21  | CT1          | 2016-01-21 00:00:20-08 |      20
  28    | 2016-01-31  | CT1          | 2016-01-31 00:00:10-08 |      10
  29    | 2016-01-31  | CT1          | 2016-01-31 00:00:20-08 |      20
  30    | 2016-02-01  | CT1          | 2016-02-01 00:00:10-08 |      10
  31    | 2016-02-01  | CT1          | 2016-02-01 00:00:20-08 |      20
  32    | 2016-02-10  | CT1          | 2016-02-10 00:00:10-08 |      10
  33    | 2016-02-10  | CT1          | 2016-02-10 00:00:20-08 |      20
  34    | 2016-02-15  | CT1          | 2016-02-15 00:00:15-08 |      10
  35    | 2016-02-15  | CT1          | 2016-02-15 00:00:20-08 |      20
  36    | 2016-02-26  | CT1          | 2016-02-26 00:00:15-08 |      10
  37    | 2016-02-26  | CT1          | 2016-02-26 00:00:20-08 |      20
  38    | 2016-03-04  | CT1          | 2016-03-04 00:00:15-08 |      10
  39    | 2016-03-04  | CT1          | 2016-03-04 00:00:20-08 |      20
  40    | 2016-03-18  | CT1          | 2016-03-18 00:00:15-07 |      10
  41    | 2016-03-18  | CT1          | 2016-03-18 00:00:20-07 |      20

Thus the output should be:

select * from test_aes;

The schema and test data are available at the link below: SQL Fiddle

I cannot use ROWS in an AVG window definition because test_aes has thousands of rows for a given day.

Answer 1

Vla*_*nov 7

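It is not clear from the question what the role of the call_type_id column is. I'll ignore it until you clarify.

Without window functions

Here is a simple variant that doesn't use window functions at all.

Make sure there is an index on (call_dt_key, aes_raw).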
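A minimal sketch of such an index follows. The index name is hypothetical; only the column order comes from the recommendation above.

```sql
-- Hypothetical index name; the columns and their order are as recommended above.
CREATE INDEX idx_test_aes_dt_raw ON test_aes (call_dt_key, aes_raw);
```

With call_dt_key as the leading column, the range filter on dates can use the index, and aes_raw is available from the same index entries.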

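CTE_Dates returns a list of all dates in the table and calculates the average for each day. This average_current_day will be needed for the first day. The server will scan the whole index anyway, so calculating this average is cheap.

Then, for each distinct day, I use a self-join to calculate the average over the previous 40 days. This returns NULL for the first day, which is replaced with average_current_day in the main query.

You don't have to use a CTE here; it just makes the query easier to read.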

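WITH
CTE_Dates
AS
(
    -- One row per call date: the bounds of the look-back window
    -- and the average of that day's own calls (used as a fallback).
    SELECT
        call_dt_key
        ,call_dt_key - INTERVAL '41 day' AS dt_from
        ,call_dt_key - INTERVAL '1 day' AS dt_to
        ,AVG(test_aes.aes_raw) AS average_current_day
    FROM test_aes
    GROUP BY call_dt_key
)
SELECT
    CTE_Dates.call_dt_key
    -- Fall back to the current day's average when the window has no rows.
    ,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40
FROM
    CTE_Dates
    LEFT JOIN LATERAL
    (
        -- Average over all calls between dt_from and dt_to, both bounds inclusive.
        SELECT AVG(test_aes.aes_raw) AS average_40
        FROM test_aes
        WHERE
                test_aes.call_dt_key >= CTE_Dates.dt_from
            AND test_aes.call_dt_key <= CTE_Dates.dt_to
    ) AS prev40 ON true
ORDER BY call_dt_key;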

Result

|                call_dt_key |         average_40 |
|----------------------------|--------------------|
|  January, 01 2016 00:00:00 |                 15 |
|  January, 05 2016 00:00:00 |                 15 |
|  January, 10 2016 00:00:00 | 15.333333333333334 |
|  January, 15 2016 00:00:00 | 15.294117647058824 |
|  January, 16 2016 00:00:00 |               15.5 |
|  January, 20 2016 00:00:00 | 15.652173913043478 |
|  January, 21 2016 00:00:00 |               15.6 |
|  January, 31 2016 00:00:00 | 15.555555555555555 |
| February, 01 2016 00:00:00 | 15.517241379310345 |
| February, 10 2016 00:00:00 | 15.483870967741936 |
| February, 15 2016 00:00:00 | 15.652173913043478 |
| February, 26 2016 00:00:00 | 15.333333333333334 |
|    March, 04 2016 00:00:00 |                 15 |
|    March, 18 2016 00:00:00 |                 15 |

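Here is the SQL Fiddle.

With the recommended index, this solution should not be too bad.

There is a similar question, but for SQL Server (Date range rolling sum using window functions). Postgres seems to support RANGE with a window of a specified size, while SQL Server doesn't at the moment. So the solution for Postgres is likely to be a bit simpler.

The key part would be: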

AVG(...) OVER (ORDER BY call_dt_key RANGE BETWEEN 41 PRECEDING AND 1 PRECEDING)

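To calculate the moving average using these window functions, you will likely have to fill the gaps in the dates first, so that the table has at least one row for each day (with NULL values for aes_raw in these dummy rows).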
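Filling the gaps could look like the sketch below. It builds one row per calendar day between the first and last call date with generate_series and left-joins the calls onto it. The aliases d, t, sum_daily and cnt_daily are illustrative, and call_dt_key is assumed to be of type date.

```sql
-- One row per calendar day, including days on which no call was made.
SELECT
    d.dt::date AS call_dt_key
    ,SUM(t.aes_raw) AS sum_daily      -- NULL on days without calls
    ,COUNT(t.aes_raw) AS cnt_daily    -- 0 on days without calls
FROM
    generate_series(
        (SELECT MIN(call_dt_key) FROM test_aes)::timestamp
        ,(SELECT MAX(call_dt_key) FROM test_aes)::timestamp
        ,INTERVAL '1 day'
    ) AS d(dt)
    LEFT JOIN test_aes AS t ON t.call_dt_key = d.dt
GROUP BY d.dt
ORDER BY d.dt;
```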

...

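As Erwin Brandstetter correctly pointed out in his answer, at the moment (as of Postgres 9.5) the RANGE clause in Postgres still has limitations similar to those in SQL Server. The docs say:

The value PRECEDING and value FOLLOWING cases are currently only allowed in ROWS mode.

So the RANGE approach above would not work for you even if you used Postgres 9.5.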
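For readers on a newer server: PostgreSQL 11 and later do accept offsets in RANGE mode, so the key part shown earlier becomes usable there. The following is a minimal sketch, not part of the original answer. It assumes call_dt_key is of type date (which requires interval offsets) and keeps the 41/1 bounds of the key part. It does not run on the 9.4 and 9.5 versions discussed here.

```sql
-- Requires PostgreSQL 11 or later; 9.4 and 9.5 reject offsets in RANGE mode.
SELECT DISTINCT
    call_dt_key
    ,COALESCE(
        AVG(aes_raw) OVER w_prev                        -- calls on the preceding days
        ,AVG(aes_raw) OVER (PARTITION BY call_dt_key)   -- fallback: the day itself
    ) AS average_40
FROM test_aes
WINDOW w_prev AS (
    ORDER BY call_dt_key
    RANGE BETWEEN INTERVAL '41 day' PRECEDING AND INTERVAL '1 day' PRECEDING
)
ORDER BY call_dt_key;
```

All rows of one call date share the same frame, so DISTINCT collapses them into a single row per date.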

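Using window functions

You can use the approach outlined in the SQL Server question mentioned above. For example, group your data into daily sums, add rows for the missing days, calculate the moving SUM and COUNT using OVER with ROWS, and then calculate the moving average.

Something along these lines: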

WITH
CTE_Dates
AS
(
    -- Daily totals, plus the day before the next call date (NULL for the last date).
    SELECT
        call_dt_key
        ,SUM(test_aes.aes_raw) AS sum_daily
        ,COUNT(*) AS cnt_daily
        ,AVG(test_aes.aes_raw) AS avg_daily
        ,LEAD(call_dt_key) OVER(ORDER BY call_dt_key) - INTERVAL '1 day' AS next_date
    FROM test_aes
    GROUP BY call_dt_key
)
,CTE_AllDates
AS
(
    -- Expand each call date into one row per calendar day up to next_date,
    -- so that a ROWS frame of N rows corresponds to N days.
    -- Filler days contribute NULL to both moving sums.
    SELECT
        CASE WHEN call_dt_key = dt THEN call_dt_key ELSE NULL END AS final_dt
        ,avg_daily
        ,SUM(CASE WHEN call_dt_key = dt THEN sum_daily ELSE NULL END)
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING)
        /SUM(CASE WHEN call_dt_key = dt THEN cnt_daily ELSE NULL END)
            OVER (ORDER BY dt ROWS BETWEEN 41 PRECEDING AND 1 PRECEDING) AS avg_40
    FROM
        CTE_Dates
        INNER JOIN LATERAL
            generate_series(call_dt_key, COALESCE(next_date, call_dt_key), '1 day')
            AS all_dates(dt) ON true
)
-- Keep only real call dates; use the day's own average when the window is empty.
SELECT
    final_dt
    ,COALESCE(avg_40, avg_daily) AS final_avg
FROM CTE_AllDates
WHERE final_dt IS NOT NULL
ORDER BY final_dt;

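The result is the same as in the first variant. See the SQL Fiddle.

Again, this can be written with in-lined sub-queries instead of CTEs.

It is worth checking the performance of the different variants on real data.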
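One way to compare the variants is to prefix each full query with EXPLAIN (ANALYZE, BUFFERS) and compare the reported timings and buffer counts. As a sketch, it is shown here on the daily aggregation only, to keep the example short:

```sql
-- ANALYZE actually executes the statement; BUFFERS adds block read/hit counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT
    call_dt_key
    ,AVG(aes_raw) AS avg_daily
FROM test_aes
GROUP BY call_dt_key;
```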
