sal*_*the 11 sql postgresql optimization join
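I'm trying to find a way to speed up a particularly troublesome query that aggregates some data by date across a couple of tables. The full (ugly) query is below, along with an EXPLAIN ANALYZE that shows just how horrible it is.
If anyone could take a peek and see whether they can spot any major issues (quite likely, as I'm not a Postgres guy), that would be superb.
So here goes. The query is: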
SELECT
to_char(p.period, 'DD/MM/YY') as period,
coalesce(o.value, 0) AS outbound,
coalesce(i.value, 0) AS inbound
FROM (
SELECT
date '2009-10-01' + s.day
AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
LEFT OUTER JOIN(
SELECT
SUM(b.body_size) AS value,
b.body_time::date AS period
FROM body AS b
LEFT JOIN
envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command = 1
AND b.body_time BETWEEN '2009-10-01'
AND (date '2009-10-31' + INTERVAL '1 DAY')
GROUP BY period
ORDER BY period
) AS o ON p.period = o.period
LEFT OUTER JOIN(
SELECT
SUM(b.body_size) AS value,
b.body_time::date AS period
FROM body AS b
LEFT JOIN
envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command = 2
AND b.body_time BETWEEN '2009-10-01'
AND (date '2009-10-31' + INTERVAL '1 DAY')
GROUP BY period
ORDER BY period
) AS i ON p.period = i.period
The EXPLAIN ANALYZE output can be found on explain.depesz.com.
Any comments or questions are appreciated.
Cheers
Dis*_*ned 18
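There are always two things to consider when optimising a query: which indexes can be used, and how the query is written.
A few observations: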
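You are performing date manipulations on the date columns before joining on them. As a general rule, this prevents the query optimiser from using an index even if one exists. Try to write your expressions so that the indexed column stands unaltered on one side of the comparison.
Your subqueries filter to the same date range as generate_series. This is a duplication, and it limits the optimiser's ability to choose the most efficient plan. I suspect it may have been written in to improve performance because the optimiser was unable to use an index on the date column (body_time)?
Note: we really do want an index on body.body_time to be used.
ORDER BY within the subqueries is at best redundant. At worst, it could force the query optimiser to sort the result set before joining, which is not necessarily good for the query plan. Instead, apply the ordering only at the very end, for final display.
Using LEFT JOIN in your subqueries is inappropriate. Assuming you are using ANSI conventions for NULL behaviour (and you should be), any outer-joined row with no match in envelope comes back with envelope_command = NULL, and such rows are then excluded by the condition envelope_command = ?. The LEFT JOIN therefore behaves like an INNER JOIN.
Subqueries o and i are almost identical, apart from the envelope_command value. This forces the optimiser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once and split the values into two columns.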
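The LEFT JOIN observation above can be checked on a toy data set. This is a minimal sketch, not the real schema: the two CTEs merely stand in for body and envelope, and every value in them is invented for illustration.

```sql
-- Toy rows standing in for body and envelope; all values are invented.
WITH body(message_id, body_size) AS (
    VALUES (1, 100), (2, 40), (3, 60)
),
envelope(message_id, envelope_command) AS (
    VALUES (1, 1), (2, 2)          -- message 3 has no envelope row
)
SELECT b.message_id, b.body_size, e.envelope_command
FROM body AS b
LEFT JOIN envelope AS e ON e.message_id = b.message_id
WHERE e.envelope_command = 1;
-- Only message 1 is returned. The LEFT JOIN does produce a row for
-- message 3, but with envelope_command = NULL; NULL = 1 is not true,
-- so the WHERE clause discards it, just as an INNER JOIN would.
```

Since the outer-joined rows never survive the filter, writing INNER JOIN states the intent directly and leaves the planner free to pick the join order.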
Try the following, which uses the pivot technique:
SELECT p.period,
/*The pivot technique in action...*/
SUM(
CASE WHEN envelope_command = 1 THEN body_size
ELSE 0
END) AS Outbound,
SUM(
CASE WHEN envelope_command = 2 THEN body_size
ELSE 0
END) AS Inbound
FROM (
SELECT date '2009-10-01' + s.day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
/*The left JOIN is justified to ensure ALL generated dates are returned
Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
LEFT OUTER JOIN (
SELECT b.body_size,
b.body_time,
e.envelope_command
FROM body AS b
INNER JOIN envelope e
ON e.message_id = b.message_id
WHERE envelope_command IN (1, 2)
) d
/*The expressions below allow the optimiser to use an index on body_time if
the statistics indicate it would be beneficial*/
ON d.body_time >= p.period
AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.Period
ORDER BY p.Period
Edit: added the filter suggested by Tom H.
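To see the pivot technique in isolation, here is a small self-contained sketch. The VALUES list is invented toy data standing in for the joined body/envelope rows; only the column names are borrowed from the query above.

```sql
-- Toy data standing in for the body/envelope join; values are invented.
SELECT day,
       SUM(CASE WHEN envelope_command = 1 THEN body_size ELSE 0 END) AS outbound,
       SUM(CASE WHEN envelope_command = 2 THEN body_size ELSE 0 END) AS inbound
FROM (VALUES
        (date '2009-10-01', 1, 100),
        (date '2009-10-01', 2, 40),
        (date '2009-10-01', 1, 60),
        (date '2009-10-02', 2, 25)
     ) AS t(day, envelope_command, body_size)
GROUP BY day
ORDER BY day;
-- 2009-10-01 | 160 | 40
-- 2009-10-02 |   0 | 25
```

Each row is read once and contributes to exactly one of the two sums, which is why a single pass over the joined data can replace the two near-identical subqueries.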
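Based on Craig Young's suggestions, here is the amended query, which runs in about 1.8 s against the data set I'm working with. That is a slight improvement over the original's ~2.0 s, and a huge improvement over Craig's version, which took ~22 s.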
SELECT
p.period,
/* The pivot technique... */
SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
FROM
(
/* Get days range */
SELECT date '2009-10-01' + day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
/* Join message information */
LEFT OUTER JOIN
(
SELECT b.body_size, b.body_time::date, e.envelope_command
FROM body AS b
INNER JOIN envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command IN (2, 1)
AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
) d ON d.body_time = p.period
GROUP BY p.period
ORDER BY p.period
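The query above still filters on b.body_time::date, which a plain index on body_time generally cannot serve. A possible variant, offered only as a sketch, keeps the equality join on the date but moves the range filter onto the raw column. The index name body_body_time_idx and the alias body_date are made up for this example, and no timings were measured for it.

```sql
-- Hypothetical index; skip this if body.body_time is already indexed.
-- CREATE INDEX body_body_time_idx ON body (body_time);

SELECT
    p.period,
    SUM(CASE d.envelope_command WHEN 1 THEN d.body_size ELSE 0 END) AS outbound,
    SUM(CASE d.envelope_command WHEN 2 THEN d.body_size ELSE 0 END) AS inbound
FROM (
    /* Get days range */
    SELECT date '2009-10-01' + day AS period
    FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
LEFT OUTER JOIN (
    SELECT b.body_size, b.body_time::date AS body_date, e.envelope_command
    FROM body AS b
    INNER JOIN envelope e ON e.message_id = b.message_id
    WHERE e.envelope_command IN (1, 2)
      /* Half-open range on the unmodified column, so an index on
         body_time remains usable for this filter */
      AND b.body_time >= date '2009-10-01'
      AND b.body_time <  date '2009-10-31' + 1
) d ON d.body_date = p.period
GROUP BY p.period
ORDER BY p.period;
```

Whether the planner actually picks the index depends on the table statistics, so it is worth comparing the EXPLAIN ANALYZE output of both versions on the real data.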