Sum of sales over the 12 months preceding each row's date

ant*_*t1j 5 postgresql performance join optimization postgresql-9.6 query-performance

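For each row (one row per month), I need to compute the sum of sales over the previous 12 months for the row's client_id.

Here is the initial table of sales aggregated per client and per month (filtered here on one specific client, 511656A75):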

CREATE TEMP TABLE foo AS
SELECT idclient, month_transac, sales
FROM ( VALUES
  ( '511656A75', '2010-06-01',  68.57 ),
  ( '511656A75', '2010-07-01',  88.63 ),
  ( '511656A75', '2010-08-01',  94.91 ),
  ( '511656A75', '2010-09-01',  70.66 ),
  ( '511656A75', '2010-10-01',  28.84 ),
  ( '511656A75', '2015-10-01',  85.00 ),
  ( '511656A75', '2015-12-01', 114.42 ),
  ( '511656A75', '2016-01-01', 137.08 ),
  ( '511656A75', '2016-03-01', 172.92 ),
  ( '511656A75', '2016-04-01', 125.00 ),
  ( '511656A75', '2016-05-01', 127.08 ),
  ( '511656A75', '2016-06-01', 104.17 ),
  ( '511656A75', '2016-07-01',  98.22 ),
  ( '511656A75', '2016-08-01',  37.08 ),
  ( '511656A75', '2016-10-01', 108.33 ),
  ( '511656A75', '2016-11-01', 104.17 ),
  ( '511656A75', '2017-01-01', 201.67 )
) AS t(idclient, month_transac, sales);

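Note that some months have no sales at all (no row), so I assume I cannot use a WINDOW function (e.g. 12 preceding rows).

Using this great answer to a similar question (Rolling sum / count / average over date interval), I came up with this query: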

SELECT t1.idclient
    , t1.month_transac
    , t1.sales
    , SUM(t2.sales) as sales_ttm 
FROM temp_sales_sample_month_aggr t1
LEFT JOIN  temp_sales_sample_month_aggr t2 USING (idclient)
    WHERE 
        t1.idclient = '511656A75' -- for example only
        AND t2.month_transac >= (t1.month_transac - interval '12 months') 
        AND t2.month_transac < t1.month_transac 
GROUP BY 1, 2, 3
ORDER BY 2
;

The result is OK: sales_ttm is the sum of sales over the past 12 months, whether or not every month has a row (i.e. the last row, January 2017, sums all 2016 sales).

 idclient  | month_transac | sales  | sales_ttm
-----------+---------------+--------+---------
 511656A75 | 2010-07-01    |  88.63 |   68.57
 511656A75 | 2010-08-01    |  94.91 |  157.20
 [...]
 511656A75 | 2015-12-01    | 114.42 |  824.83
 511656A75 | 2016-01-01    | 137.08 |  892.17
 511656A75 | 2016-03-01    | 172.92 |  752.75
 511656A75 | 2016-04-01    | 125.00 |  925.67
 511656A75 | 2016-05-01    | 127.08 | 1028.17
 511656A75 | 2016-06-01    | 104.17 | 1155.25
 511656A75 | 2016-07-01    |  98.22 | 1073.59
 511656A75 | 2016-08-01    |  37.08 | 1171.81
 511656A75 | 2016-10-01    | 108.33 | 1000.97
 511656A75 | 2016-11-01    | 104.17 | 1024.30
 511656A75 | 2017-01-01    | 201.67 | 1014.05

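The problem is that the first month (here June 2010, see the first VALUES row in the initial table) is missing from the result set: it has no past sales, so no joined row satisfies the range conditions in WHERE and the LEFT JOIN behaves like an inner join.

Expected / wanted: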

 idclient  | month_transac | sales  | sales_ttm
-----------+---------------+--------+---------
 511656A75 | 2010-06-01    |  68.57 |    0.00
 511656A75 | 2010-07-01    |  88.63 |   68.57
 511656A75 | 2010-08-01    |  94.91 |  157.20
 511656A75 | 2010-09-01    |  70.66 |  252.11
[...]

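I could include the row's own sales in the sum (using t2.month_transac <= t1.month_transac) and then subtract them, but I suppose there is a more elegant way.

I also tried a LATERAL join, as suggested by Erwin in his answer ("running a self-join with a range condition should be more efficient, as Postgres 9.1 doesn't have LATERAL joins yet"), but I guess I haven't grasped how it works, because I only managed to get errors.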

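  • Do you confirm that WINDOW functions should be ruled out?
  • Is there a way to use a "simple" LEFT JOIN and still get all rows of t1?
  • Could LATERAL be useful in this case, and how?
  • What optimizations are possible?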

Using PostgreSQL 9.6.2, on Windows 10 or Ubuntu 16.04


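Performance evaluation

So far we have 3 possible solutions; let's see which one performs best. I checked that the result tables are identical (they are). The tests were run on a 270k-row table, which is itself the result for a 1% sample of all clients.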

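Preliminary approach - LEFT JOIN with GROUP BY

This is a corrected version of the query proposed in the question: it includes the current month in the sum and then subtracts that month's value, so that all rows are returned.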

SELECT t1.idclient
    , t1.month_transac
    , t1.sales
    , SUM(t2.sales) - t1.sales as sales_ttm 
FROM temp_sales_sample_month_aggr t1
LEFT JOIN  temp_sales_sample_month_aggr t2 USING (idclient)
    WHERE 
        t2.month_transac >= (t1.month_transac - interval '12 months') AND
        t2.month_transac <= t1.month_transac 
GROUP BY 1, 2, 3
ORDER BY 2
;

Query performance:

Planning time:     3.615 ms
Execution time: 1315.636 ms

@joanolo's approach - subquery

SELECT 
      t1.idclient
    , t1.month_transac
    , t1.sales
    , (SELECT 
            coalesce(SUM(t2.sales), 0) 
       FROM 
            temp_sales_sample_month_aggr t2
       WHERE 
            t2.idclient = t1.idclient 
            AND t2.month_transac >= (t1.month_transac - interval '12 months') 
            AND t2.month_transac < t1.month_transac
      ) AS sales_ttm 
FROM 
    temp_sales_sample_month_aggr t1
GROUP BY 
    t1.idclient, t1.month_transac, t1.sales
ORDER BY 
    t1.month_transac ;

Query performance:

Planning time:     0.350 ms
Execution time: 3163.354 ms

I guess it has more rows to process because the subquery runs once per row of t1.
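On the optimization question, one thing that may be worth trying is an index on the two columns used in the correlated predicate. The following is only a sketch: the index name is hypothetical, it assumes no equivalent index already exists on temp_sales_sample_month_aggr, and its effect on the timings above has not been measured.

```sql
-- Hypothetical index name; covers the equality on idclient and the
-- range condition on month_transac used by the subquery.
CREATE INDEX temp_sales_idclient_month_idx
    ON temp_sales_sample_month_aggr (idclient, month_transac);

-- Refresh planner statistics after creating the index.
ANALYZE temp_sales_sample_month_aggr;

-- Re-run the subquery approach with timing and buffer details.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.idclient
     , t1.month_transac
     , t1.sales
     , (SELECT COALESCE(SUM(t2.sales), 0)
        FROM temp_sales_sample_month_aggr t2
        WHERE t2.idclient = t1.idclient
          AND t2.month_transac >= (t1.month_transac - interval '12 months')
          AND t2.month_transac <  t1.month_transac
       ) AS sales_ttm
FROM temp_sales_sample_month_aggr t1
ORDER BY t1.month_transac;
```

EXPLAIN (ANALYZE) reports the same "Planning time" and "Execution time" figures quoted in this section, so the before and after numbers are directly comparable.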

LEFT JOIN LATERAL approach

I finally managed to get it to work.

SELECT t1.idclient
    , t1.month_transac
    , t1.sales
    , COALESCE(lat.sales_ttm, 0.0)
FROM temp_sales_sample_month_aggr t1
LEFT JOIN LATERAL (
    SELECT SUM(t2.sales) as sales_ttm
    FROM temp_sales_sample_month_aggr t2
    WHERE 
        t1.idclient = t2.idclient AND
        t2.month_transac >= (t1.month_transac - interval '12 months') AND
        t2.month_transac < t1.month_transac 
) lat ON TRUE
ORDER BY 2
;

Query performance:

Planning time:     0.468 ms
Execution time: 2773.754 ms

So I guess LATERAL does not help here, compared with the simpler LEFT JOIN.
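One further variant, not timed above, is the plain LEFT JOIN with the range conditions moved from WHERE into ON. With the conditions in ON, a t1 row that has no matching t2 row survives the join with NULLs, so the first month is kept and no subtraction is needed. This is a minimal sketch that assumes the same temp_sales_sample_month_aggr table, with month_transac of type date.

```sql
-- Sketch: range predicates live in ON, so unmatched t1 rows are preserved.
SELECT t1.idclient
     , t1.month_transac
     , t1.sales
     , COALESCE(SUM(t2.sales), 0) AS sales_ttm  -- NULL (no match) becomes 0
FROM temp_sales_sample_month_aggr t1
LEFT JOIN temp_sales_sample_month_aggr t2
       ON  t2.idclient = t1.idclient
       AND t2.month_transac >= (t1.month_transac - interval '12 months')
       AND t2.month_transac <  t1.month_transac
GROUP BY 1, 2, 3
ORDER BY 2;
```

Any filter on t1 alone (such as t1.idclient = '511656A75') can still go in WHERE; only the conditions that mention t2 need to stay in ON.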

Eva*_*oll 3

Something like this should work.

-- IN A CTE
-- Grab the idclient, and the monthly range needed
-- We need the range because you can't sum over NULL (yet, afaik).
WITH idclient_month AS (
  SELECT idclient, month_transac
  FROM (
    SELECT idclient, min(month_transac), max(month_transac)
    FROM foo
    GROUP BY idclient
  ) AS t
  CROSS JOIN LATERAL generate_series(min::date, max::date, '1 month')
    AS gs(month_transac)
)
-- If we move this where clause down the rows get filtered /before/ the window function
SELECT *
FROM (

  SELECT
    idclient,
    month_transac,
    monthly_sales,
    sum(monthly_sales) OVER (
      PARTITION BY idclient
      ORDER BY month_transac
      ROWS 12 PRECEDING
    )
      - monthly_sales
      AS sales_ttm

  -- Here, we sum up the sales by idclient, and month
  -- We coalesce to 0 so we can use this in a window function
  FROM (
    SELECT idclient, month_transac, coalesce(sum(sales), 0) AS monthly_sales
    FROM foo
    RIGHT OUTER JOIN idclient_month
      USING (idclient,month_transac)
    GROUP BY idclient, month_transac
    ORDER BY idclient, month_transac
  ) AS t

) AS g
WHERE g.monthly_sales > 0;

Here, we

  1. Compute the date range for each idclient in a CTE.

    SELECT idclient, month_transac
    FROM (
      SELECT idclient, min(month_transac), max(month_transac)
      FROM foo
      GROUP BY idclient
    ) AS t
    CROSS JOIN LATERAL generate_series(min::date, max::date, '1 month')
      AS gs(month_transac)
     idclient  |     month_transac      
    -----------+------------------------
     511656A75 | 2010-06-01 00:00:00-05
     511656A75 | 2010-07-01 00:00:00-05
     511656A75 | 2010-08-01 00:00:00-05
     511656A75 | 2010-09-01 00:00:00-05
     511656A75 | 2010-10-01 00:00:00-05
     511656A75 | 2010-11-01 00:00:00-05
     511656A75 | 2010-12-01 00:00:00-06
     511656A75 | 2011-01-01 00:00:00-06
     [....]
    
  2. RIGHT OUTER JOIN that CTE to our sample dataset. We do this to pad the sample dataset, so that every month without sales gets an entry with monthly_sales = 0.

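  3. Use a window function with a frame of ROWS 12 PRECEDING. This is the key. The frame holds the current row plus the 12 rows before it, so subtracting monthly_sales leaves the previous 12 months. The window function cannot operate over rows that do not exist, so we fill the missing months with 0 before this step.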

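  4. Select only the rows where monthly_sales > 0. We have to do this after the window function, so that the filter does not reduce the rows available to the window calculation.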
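The frame arithmetic in step 3 can be checked in isolation. This small sketch uses generate_series as stand-in data (the alias gs and the column names are illustrative only) to show how many rows a ROWS 12 PRECEDING frame contains and what remains after subtracting the current row.

```sql
-- Each frame holds the current row plus up to 12 earlier rows.
SELECT n
     , count(*) OVER (ORDER BY n ROWS 12 PRECEDING)     AS rows_in_frame
     , sum(n)   OVER (ORDER BY n ROWS 12 PRECEDING) - n AS sum_of_previous_12
FROM generate_series(1, 15) AS gs(n);
```

rows_in_frame grows from 1 on the first row to 13 on the thirteenth row and stays at 13 afterwards, which is why the current row has to be subtracted to get a 12-row total.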

Output of the full query:

 idclient  |     month_transac      | monthly_sales | sales_ttm 
-----------+------------------------+---------------+-----------
 511656A75 | 2010-06-01 00:00:00-05 |         68.57 |      0.00
 511656A75 | 2010-07-01 00:00:00-05 |         88.63 |     68.57
 511656A75 | 2010-08-01 00:00:00-05 |         94.91 |    157.20
 511656A75 | 2010-09-01 00:00:00-05 |         70.66 |    252.11
 511656A75 | 2010-10-01 00:00:00-05 |         28.84 |    322.77
 511656A75 | 2015-10-01 00:00:00-05 |         85.00 |      0.00
 511656A75 | 2015-12-01 00:00:00-06 |        114.42 |     85.00
 511656A75 | 2016-01-01 00:00:00-06 |        137.08 |    199.42
 511656A75 | 2016-03-01 00:00:00-06 |        172.92 |    336.50
 511656A75 | 2016-04-01 00:00:00-05 |        125.00 |    509.42
 511656A75 | 2016-05-01 00:00:00-05 |        127.08 |    634.42
 511656A75 | 2016-06-01 00:00:00-05 |        104.17 |    761.50
 511656A75 | 2016-07-01 00:00:00-05 |         98.22 |    865.67
 511656A75 | 2016-08-01 00:00:00-05 |         37.08 |    963.89
 511656A75 | 2016-10-01 00:00:00-05 |        108.33 |   1000.97
 511656A75 | 2016-11-01 00:00:00-05 |        104.17 |   1024.30
 511656A75 | 2017-01-01 00:00:00-06 |        201.67 |   1014.05
(17 rows)
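As an aside for readers on newer versions: PostgreSQL 11 added RANGE frames with an offset, which let a window function work on date distance instead of row count, so the gaps do not need to be filled first. This is not available in the 9.6.2 used in the question. The sketch below assumes the foo table from the question, casts month_transac to date for the interval arithmetic, and relies on every date being the first day of a month.

```sql
-- Requires PostgreSQL 11 or later; not available in 9.6.
SELECT idclient
     , month_transac::date AS month_transac
     , sales
     , COALESCE(
         sum(sales) OVER (
           PARTITION BY idclient
           ORDER BY month_transac::date
           -- rows dated from 12 months back up to 1 month back, inclusive
           RANGE BETWEEN interval '12 months' PRECEDING
                     AND interval '1 month'  PRECEDING
         ), 0) AS sales_ttm  -- an empty frame sums to NULL, shown as 0
FROM foo
ORDER BY idclient, month_transac;
```

Because the frame ends one month before the current row, the current month is never included and nothing has to be subtracted.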