ant*_*t1j 5 postgresql performance join optimization postgresql-9.6 query-performance
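For each row (one row per month), I need to compute the sum of sales over the previous 12 months for the given client.
Here is the initial table of sales aggregated per client and per month (filtered here on one specific client, 511656A75):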
CREATE TEMP TABLE foo AS
SELECT idclient, month_transac, sales
FROM ( VALUES
( '511656A75', '2010-06-01'::date, 68.57 ),
( '511656A75', '2010-07-01', 88.63 ),
( '511656A75', '2010-08-01', 94.91 ),
( '511656A75', '2010-09-01', 70.66 ),
( '511656A75', '2010-10-01', 28.84 ),
( '511656A75', '2015-10-01', 85.00 ),
( '511656A75', '2015-12-01', 114.42 ),
( '511656A75', '2016-01-01', 137.08 ),
( '511656A75', '2016-03-01', 172.92 ),
( '511656A75', '2016-04-01', 125.00 ),
( '511656A75', '2016-05-01', 127.08 ),
( '511656A75', '2016-06-01', 104.17 ),
( '511656A75', '2016-07-01', 98.22 ),
( '511656A75', '2016-08-01', 37.08 ),
( '511656A75', '2016-10-01', 108.33 ),
( '511656A75', '2016-11-01', 104.17 ),
( '511656A75', '2017-01-01', 201.67 )
) AS t(idclient, month_transac, sales);
Note that some months have no sales at all (no row), so I think I can't use a WINDOW function (e.g. the 12 preceding rows).
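To make the concern concrete, here is a minimal sketch on a hypothetical three-row table (`sales_gap_demo` and its rows are illustrative, taken from the sample values above). A `ROWS` frame counts rows, not calendar months, so with gaps in the data it can reach back several years.

```sql
-- Hypothetical demo table; the name is an assumption for illustration only.
CREATE TEMP TABLE sales_gap_demo (idclient text, month_transac date, sales numeric);
INSERT INTO sales_gap_demo VALUES
  ('511656A75', '2010-09-01', 70.66),
  ('511656A75', '2010-10-01', 28.84),
  ('511656A75', '2015-10-01', 85.00);

-- The frame covers up to 12 previous ROWS, whatever their dates are.
SELECT month_transac
     , sales
     , sum(sales) OVER (
         PARTITION BY idclient
         ORDER BY month_transac
         ROWS BETWEEN 12 PRECEDING AND 1 PRECEDING
       ) AS sum_prev_rows
FROM sales_gap_demo
ORDER BY month_transac;
-- For 2015-10-01, sum_prev_rows is 99.50 (70.66 + 28.84),
-- even though there were no sales in the 12 months before October 2015.
```

The first row gets NULL because its frame is empty, which is a separate point from the gap problem.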
Using this great answer to a similar question (rolling sum / count / average over a date interval), I came up with this query:
SELECT t1.idclient
, t1.month_transac
, t1.sales
, SUM(t2.sales) as sales_ttm
FROM temp_sales_sample_month_aggr t1
LEFT JOIN temp_sales_sample_month_aggr t2 USING (idclient)
WHERE
t1.idclient = '511656A75' -- for example only
AND t2.month_transac >= (t1.month_transac - interval '12 months')
AND t2.month_transac < t1.month_transac
GROUP BY 1, 2, 3
ORDER BY 2
;
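The result is OK: sales_ttm is the sum of sales over the past 12 months, where months without a row contribute nothing (e.g. the last row, January 2017, sums all 2016 sales).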
idclient | month_transac | sales | sales_ttm
-----------+---------------+--------+---------
511656A75 | 2010-07-01 | 88.63 | 68.57
511656A75 | 2010-08-01 | 94.91 | 157.20
[...]
511656A75 | 2015-12-01 | 114.42 | 824.83
511656A75 | 2016-01-01 | 137.08 | 892.17
511656A75 | 2016-03-01 | 172.92 | 752.75
511656A75 | 2016-04-01 | 125.00 | 925.67
511656A75 | 2016-05-01 | 127.08 | 1028.17
511656A75 | 2016-06-01 | 104.17 | 1155.25
511656A75 | 2016-07-01 | 98.22 | 1073.59
511656A75 | 2016-08-01 | 37.08 | 1171.81
511656A75 | 2016-10-01 | 108.33 | 1000.97
511656A75 | 2016-11-01 | 104.17 | 1024.30
511656A75 | 2017-01-01 | 201.67 | 1014.05
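The problem is that the first month (here June 2010, see the first row of values in the initial table) is missing from the result set: it has no past sales, so the LEFT JOIN has no matching t2 row for it.
Expected/wanted: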
idclient | month_transac | sales | sales_ttm
-----------+---------------+--------+---------
511656A75 | 2010-06-01 | 68.57 | 0.00
511656A75 | 2010-07-01 | 88.63 | 68.57
511656A75 | 2010-08-01 | 94.91 | 157.20
511656A75 | 2010-09-01 | 70.66 | 252.11
[...]
I could include the current row's sales in the sum (using t2.month_transac <= t1.month_transac) and then subtract them, but I think there must be a more elegant way.
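One possible alternative, sketched here under assumptions and not tested on the real table: the conditions on t2 in the WHERE clause reject the NULL-extended row that the LEFT JOIN produces for a month with no history. Moving them into the ON clause keeps that row, and COALESCE turns the resulting NULL sum into 0. The table `sales_demo` and its three rows are hypothetical.

```sql
-- Hypothetical demo table; the name and rows are for illustration only.
CREATE TEMP TABLE sales_demo (idclient text, month_transac date, sales numeric);
INSERT INTO sales_demo VALUES
  ('511656A75', '2010-06-01', 68.57),
  ('511656A75', '2010-07-01', 88.63),
  ('511656A75', '2010-08-01', 94.91);

SELECT t1.idclient
     , t1.month_transac
     , t1.sales
     , COALESCE(SUM(t2.sales), 0) AS sales_ttm
FROM sales_demo t1
LEFT JOIN sales_demo t2
       ON t2.idclient = t1.idclient
      -- Range conditions live in ON, so unmatched t1 rows are preserved.
      AND t2.month_transac >= t1.month_transac - interval '12 months'
      AND t2.month_transac <  t1.month_transac
GROUP BY 1, 2, 3
ORDER BY 2;
-- 2010-06-01 is kept with sales_ttm = 0; 2010-07-01 gets 68.57; 2010-08-01 gets 157.20.
```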
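I also tried a LATERAL join (as Erwin suggests in his answer: "Running a self-join with a range condition should be more efficient, since Postgres 9.1 does not have LATERAL joins yet"), but I guess I haven't grasped how it works, because I only managed to get errors.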
Should WINDOW functions be ruled out?
[...] t1?
Could LATERAL be useful in this case, and if so, how?
Using PostgreSQL 9.6.2, on Windows 10 or Ubuntu 16.04.
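So far we have 3 possible solutions; let's see which one performs better. I checked that the result tables are identical (they are). The tests were run on a 270k-row table, which is itself the result for a 1% sample of all clients.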
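LEFT JOIN and GROUP BY
This is the corrected version of the query proposed in the question: it includes the current month in the sum and then subtracts the current month's value, so that all rows are returned.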
SELECT t1.idclient
, t1.month_transac
, t1.sales
, SUM(t2.sales) - t1.sales as sales_ttm
FROM temp_sales_sample_month_aggr t1
LEFT JOIN temp_sales_sample_month_aggr t2 USING (idclient)
WHERE
t2.month_transac >= (t1.month_transac - interval '12 months') AND
t2.month_transac <= t1.month_transac
GROUP BY 1, 2, 3
ORDER BY 2
;
Query performance:
Planning time: 3.615 ms
Execution time: 1315.636 ms
Correlated subquery in the SELECT list
SELECT
t1.idclient
, t1.month_transac
, t1.sales
, (SELECT
coalesce(SUM(t2.sales), 0)
FROM
temp_sales_sample_month_aggr t2
WHERE
t2.idclient = t1.idclient
AND t2.month_transac >= (t1.month_transac - interval '12 months')
AND t2.month_transac < t1.month_transac
) AS sales_ttm
FROM
temp_sales_sample_month_aggr t1
GROUP BY
t1.idclient, t1.month_transac, t1.sales
ORDER BY
t1.month_transac ;
Query performance:
Planning time: 0.350 ms
Execution time: 3163.354 ms
I guess the subquery approach simply has more rows to process.
LEFT JOIN LATERAL approach
I finally managed to get it working.
SELECT t1.idclient
, t1.month_transac
, t1.sales
, COALESCE(lat.sales_ttm, 0.0) AS sales_ttm
FROM temp_sales_sample_month_aggr t1
LEFT JOIN LATERAL (
SELECT SUM(t2.sales) as sales_ttm
FROM temp_sales_sample_month_aggr t2
WHERE
t1.idclient = t2.idclient AND
t2.month_transac >= (t1.month_transac - interval '12 months') AND
t2.month_transac < t1.month_transac
) lat ON TRUE
ORDER BY 2
;
Query performance:
Planning time: 0.468 ms
Execution time: 2773.754 ms
So I guess LATERAL doesn't help here, compared with the simpler LEFT JOIN.
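Figures like the "Planning time" and "Execution time" lines above are what EXPLAIN ANALYZE prints. As a rough sketch of how such a comparison could be reproduced, the block below builds a hypothetical table (`perf_demo`, with synthetic rows) and times the LATERAL variant. The index on (idclient, month_transac) is an assumption that may or may not change the ranking of the three queries; it was not part of the tests reported here.

```sql
-- Hypothetical table, index name and data, for illustration only.
CREATE TEMP TABLE perf_demo (idclient text, month_transac date, sales numeric);

-- 100 synthetic clients with one row per month from 2010-01 to 2016-12.
INSERT INTO perf_demo
SELECT 'client_' || c::text, m::date, 100
FROM generate_series(1, 100) AS c
CROSS JOIN generate_series(timestamp '2010-01-01',
                           timestamp '2016-12-01',
                           interval '1 month') AS m;

-- Assumption: an index matching the per-client range lookup might help.
CREATE INDEX perf_demo_client_month_idx ON perf_demo (idclient, month_transac);
ANALYZE perf_demo;

-- Prints the plan together with planning and execution times.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t1.idclient
     , t1.month_transac
     , t1.sales
     , COALESCE(lat.sales_ttm, 0) AS sales_ttm
FROM perf_demo t1
LEFT JOIN LATERAL (
    SELECT SUM(t2.sales) AS sales_ttm
    FROM perf_demo t2
    WHERE t2.idclient = t1.idclient
      AND t2.month_transac >= t1.month_transac - interval '12 months'
      AND t2.month_transac <  t1.month_transac
) lat ON TRUE
ORDER BY 2;
```

Running the same EXPLAIN over each of the three queries, several times each, gives a fairer comparison than a single run.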
Something like this should work:
-- IN A CTE
-- Grab the idclient, and the monthly range needed
-- We need the range because you can't sum over NULL (yet, afaik).
WITH idclient_month AS (
SELECT idclient, month_transac
FROM (
SELECT idclient, min(month_transac), max(month_transac)
FROM foo
GROUP BY idclient
) AS t
CROSS JOIN LATERAL generate_series(min::date, max::date, '1 month')
AS gs(month_transac)
)
-- If we move this where clause down the rows get filtered /before/ the window function
SELECT *
FROM (
SELECT
idclient,
month_transac,
monthly_sales,
sum(monthly_sales) OVER (
PARTITION BY idclient
ORDER BY month_transac
ROWS 12 PRECEDING
)
- monthly_sales
AS sales_ttm
-- Here, we sum up the sales by idclient, and month
-- We coalesce to 0 so we can use this in a window function
FROM (
SELECT idclient, month_transac, coalesce(sum(sales), 0) AS monthly_sales
FROM foo
RIGHT OUTER JOIN idclient_month
USING (idclient,month_transac)
GROUP BY idclient, month_transac
ORDER BY idclient, month_transac
) AS t
) AS g
WHERE g.monthly_sales > 0;
Here, we compute the date range for each idclient in the CTE.
SELECT idclient, month_transac
FROM (
SELECT idclient, min(month_transac), max(month_transac)
FROM foo
GROUP BY idclient
) AS t
CROSS JOIN LATERAL generate_series(min::date, max::date, '1 month')
AS gs(month_transac)
idclient | month_transac
-----------+------------------------
511656A75 | 2010-06-01 00:00:00-05
511656A75 | 2010-07-01 00:00:00-05
511656A75 | 2010-08-01 00:00:00-05
511656A75 | 2010-09-01 00:00:00-05
511656A75 | 2010-10-01 00:00:00-05
511656A75 | 2010-11-01 00:00:00-05
511656A75 | 2010-12-01 00:00:00-06
511656A75 | 2011-01-01 00:00:00-06
[....]
RIGHT OUTER join that CTE to our sample dataset. We do this to pad the sample dataset, so that it has entries with monthly_sales = 0 wherever they are needed.
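Use a window function over ROWS 12 PRECEDING. This is the key: that frame is the past 12 months. A window function can't operate on rows that don't exist, so we fill them in with 0 before this step.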
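Select only the rows where monthly_sales > 0. We have to do this after the window function, so that it doesn't change what is available to the calculation (the window).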
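The padding step described above can be looked at in isolation. The sketch below uses a hypothetical two-row table (`pad_demo`) with one missing month, and writes the join as a LEFT JOIN from the generated calendar, which should be equivalent to the RIGHT OUTER JOIN in the query. The explicit `::timestamp` casts are an assumption made here to keep time zones out of the output.

```sql
-- Hypothetical demo table; name and rows are for illustration only.
CREATE TEMP TABLE pad_demo (idclient text, month_transac date, sales numeric);
INSERT INTO pad_demo VALUES
  ('511656A75', '2016-08-01', 37.08),
  ('511656A75', '2016-10-01', 108.33);

SELECT c.idclient
     , gs.month_transac::date AS month_transac
     , coalesce(sum(p.sales), 0) AS monthly_sales
FROM (
    -- First and last month seen for each client
    SELECT idclient
         , min(month_transac) AS first_month
         , max(month_transac) AS last_month
    FROM pad_demo
    GROUP BY idclient
) AS c
-- One row per calendar month in that range
CROSS JOIN LATERAL generate_series(c.first_month::timestamp,
                                   c.last_month::timestamp,
                                   interval '1 month') AS gs(month_transac)
-- Months without sales find no match and are coalesced to 0
LEFT JOIN pad_demo p
       ON p.idclient = c.idclient
      AND p.month_transac = gs.month_transac
GROUP BY c.idclient, gs.month_transac
ORDER BY 1, 2;
-- Returns three rows: 2016-08-01 (37.08), 2016-09-01 (0), 2016-10-01 (108.33).
```

With every month present, a frame of 12 preceding rows lines up with 12 calendar months.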
Output:
idclient | month_transac | monthly_sales | sales_ttm
-----------+------------------------+---------------+-----------
511656A75 | 2010-06-01 00:00:00-05 | 68.57 | 0.00
511656A75 | 2010-07-01 00:00:00-05 | 88.63 | 68.57
511656A75 | 2010-08-01 00:00:00-05 | 94.91 | 157.20
511656A75 | 2010-09-01 00:00:00-05 | 70.66 | 252.11
511656A75 | 2010-10-01 00:00:00-05 | 28.84 | 322.77
511656A75 | 2015-10-01 00:00:00-05 | 85.00 | 0.00
511656A75 | 2015-12-01 00:00:00-06 | 114.42 | 85.00
511656A75 | 2016-01-01 00:00:00-06 | 137.08 | 199.42
511656A75 | 2016-03-01 00:00:00-06 | 172.92 | 336.50
511656A75 | 2016-04-01 00:00:00-05 | 125.00 | 509.42
511656A75 | 2016-05-01 00:00:00-05 | 127.08 | 634.42
511656A75 | 2016-06-01 00:00:00-05 | 104.17 | 761.50
511656A75 | 2016-07-01 00:00:00-05 | 98.22 | 865.67
511656A75 | 2016-08-01 00:00:00-05 | 37.08 | 963.89
511656A75 | 2016-10-01 00:00:00-05 | 108.33 | 1000.97
511656A75 | 2016-11-01 00:00:00-05 | 104.17 | 1024.30
511656A75 | 2017-01-01 00:00:00-06 | 201.67 | 1014.05
(17 rows)