max*_*ire 9 postgresql performance optimization greatest-n-per-group postgresql-performance
I am using Postgres 9.5. I have a table that records page hits from several websites. The table contains about 32 million rows spanning January 1, 2016 to June 30, 2016.
CREATE TABLE event_pg (
timestamp_ timestamp without time zone NOT NULL,
person_id character(24),
location_host varchar(256),
location_path varchar(256),
location_query varchar(256),
location_fragment varchar(256)
);
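I am trying to tune a query that counts the number of people who performed a given sequence of page hits. The query is meant to answer questions like "How many people viewed the home page, then went to the help site, then viewed the thank-you page?" The result looks like this: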
+-----------+-----------+----------+
| home-page | help site | thankyou |
+-----------+-----------+----------+
| 10000     | 9800      | 1500     |
+-----------+-----------+----------+
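Note that the numbers decrease, which makes sense: of the 10000 people who viewed the home page, 9800 went on to the help site, and 1500 of those went on to hit the thank-you page.
The SQL for a 3-step sequence uses lateral joins, as follows: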
SELECT
sum(view_homepage) AS view_homepage,
sum(use_help) AS use_help,
sum(thank_you) AS thank_you
FROM (
-- Get the first time each user viewed the homepage.
SELECT X.person_id,
1 AS view_homepage,
min(timestamp_) AS view_homepage_time
FROM event_pg X
WHERE X.timestamp_ between '2016-04-23 00:00:00.0' and timestamp '2016-04-30 23:59:59.999'
AND X.location_host like '2015.testonline.ca'
GROUP BY X.person_id
) e1
LEFT JOIN LATERAL (
SELECT
Y.person_id,
1 AS use_help,
timestamp_ AS use_help_time
FROM event_pg Y
WHERE
Y.person_id = e1.person_id AND
location_host = 'helpcentre.testonline.ca' AND
timestamp_ BETWEEN view_homepage_time AND timestamp '2016-04-30 23:59:59.999'
ORDER BY timestamp_
LIMIT 1
) e2 ON true
LEFT JOIN LATERAL (
SELECT
1 AS thank_you,
timestamp_ AS thank_you_time
FROM event_pg Z
WHERE Z.person_id = e2.person_id AND
location_fragment = '/file/thank-you' AND
timestamp_ BETWEEN use_help_time AND timestamp '2016-04-30 23:59:59.999'
ORDER BY timestamp_
LIMIT 1
) e3 ON true;
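I have indexes on the timestamp_, person_id, and location columns. Queries over date ranges of a few days or weeks are quite fast (1 to 10 seconds). When I run the query for everything between January 1 and July 30, it becomes slow and takes over a minute. Comparing the EXPLAIN output for the two cases shows that the planner no longer uses the timestamp_ index and does a Seq Scan instead: the index buys us nothing because we are querying "all time", which touches almost every row in the table.
I realize that the nested-loop nature of the lateral joins gets slower the more rows it has to loop over, but is there any way to speed up this query for large date ranges so that it scales better?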
Erw*_*ter 11
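You are using odd data types. character(24)? char(n) is an outdated type and almost always the wrong choice. You have an index on person_id and join on it repeatedly. integer would be much more efficient for multiple reasons. (Or bigint, if you plan to burn more than 2 billion rows over the lifetime of the table.)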
LIKE without wildcards is pointless. Use = instead. It is faster.
x.location_host LIKE '2015.testonline.ca'  -- instead of this ...
x.location_host = '2015.testonline.ca'     -- ... use this
Use count(e1.*) or count(*) instead of adding a dummy column with the value 1 to each subquery. (Except for the last one (e3), where you don't need any actual data.)
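To illustrate why the dummy columns are unnecessary, here is a minimal sketch on made-up inline data (the aliases s1, s2 and the column id are hypothetical, not from the question). count(*) counts every row of the left side, while count(column) skips the NULLs that a LEFT JOIN produces for rows without a match:

```sql
-- Three people reached step 1; only two of them have a matching step-2 row.
SELECT count(*)     AS step1   -- every row from the left side
     , count(s2.id) AS step2   -- only rows where the join found a match
FROM  (VALUES (1), (2), (3)) AS s1(id)
LEFT  JOIN (VALUES (1), (3)) AS s2(id) ON s2.id = s1.id;
-- step1 = 3, step2 = 2
```

The rewritten query further down relies on the same behavior with count(hlp.hlp_ts) and count(tnk.yes).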
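Sometimes you cast the string literal to timestamp (timestamp '2016-04-30 23:59:59.999'), sometimes you don't ('2016-04-23 00:00:00.0'). Either it makes sense, then do it all the time, or it doesn't, then don't do it at all.
It doesn't. When compared to a timestamp column, a string literal is coerced to timestamp anyway, so you don't need an explicit cast.
The Postgres data type timestamp has up to 6 fractional digits. Your BETWEEN expressions leave corner cases. I replaced them with less error-prone expressions.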
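A small sketch of the corner case, using a made-up sample value: a hit recorded half a millisecond before midnight falls outside the BETWEEN range from the question but inside a half-open range with an exclusive upper bound.

```sql
-- The sample timestamp is hypothetical; it lies after 23:59:59.999 but before midnight.
SELECT ts
     , ts BETWEEN '2016-04-23' AND '2016-04-30 23:59:59.999' AS in_between
     , ts >= '2016-04-23' AND ts < '2016-05-01'              AS in_half_open
FROM  (VALUES (timestamp '2016-04-30 23:59:59.9995')) AS t(ts);
-- in_between = false, in_half_open = true
```

The half-open form also lets adjacent periods line up without gaps or overlaps.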
Important: to optimize performance, create multicolumn indexes.
For the first subquery hp:
CREATE INDEX event_pg_location_host_timestamp__idx
ON event_pg (location_host, timestamp_);
Or, if you can get index-only scans out of it, append person_id to the index:
CREATE INDEX event_pg_location_host_timestamp__person_id_idx
ON event_pg (location_host, timestamp_, person_id);
For very large time ranges spanning most or all of your table, this index should be preferable. It also supports the hlp subquery, so create it either way:
CREATE INDEX event_pg_location_host_person_id_timestamp__idx
ON event_pg (location_host, person_id, timestamp_);
For tnk:
CREATE INDEX event_pg_location_fragment_timestamp__idx
ON event_pg (location_fragment, person_id, timestamp_);
If your predicates on location_host and location_fragment are constant, we can use much cheaper partial indexes instead, especially since your location_* columns seem big:
CREATE INDEX event_pg_hp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_host = '2015.testonline.ca';
CREATE INDEX event_pg_hlp_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_host = 'helpcentre.testonline.ca';
CREATE INDEX event_pg_tnk_person_id_ts_idx ON event_pg (person_id, timestamp_)
WHERE location_fragment = '/file/thank-you';
Run Code Online (Sandbox Code Playgroud)
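To check what the different index variants actually cost, something along these lines can help. It is a sketch using the standard statistics view pg_stat_user_indexes; the output aliases are my own choice:

```sql
-- List every index on event_pg with its on-disk size and how often it has been scanned.
SELECT indexrelname                                 AS index_name
     , pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
     , idx_scan                                     AS scans_so_far
FROM   pg_stat_user_indexes
WHERE  relname = 'event_pg'
ORDER  BY pg_relation_size(indexrelid) DESC;
```

Indexes that stay at zero scans after a representative workload are candidates for dropping.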
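Again, all of these indexes are substantially smaller and faster with integer or bigint for person_id.
Typically, you need to ANALYZE the table after creating a new index, or wait until autovacuum does it for you.
To get index-only scans, your table must be VACUUMed enough. Test immediately after VACUUM as a proof of concept. If you are not familiar with index-only scans, read the linked Postgres Wiki page for details.
Implementing what I discussed. Query for small ranges (few rows per person_id):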
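A possible proof-of-concept run, assuming the three-column index on (location_host, timestamp_, person_id) from above exists. The test query is only an example shaped to match that index:

```sql
-- Refresh the visibility map and the planner statistics in one pass.
VACUUM (ANALYZE) event_pg;

-- Look for "Index Only Scan" in the plan and a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT person_id, timestamp_
FROM   event_pg
WHERE  location_host = '2015.testonline.ca'
AND    timestamp_ >= '2016-04-23'
AND    timestamp_ <  '2016-05-01';
```

If the plan still shows many heap fetches right after VACUUM, concurrent writes are probably dirtying pages faster than the visibility map can keep up.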
SELECT count(*)::int AS view_homepage
, count(hlp.hlp_ts)::int AS use_help
, count(tnk.yes)::int AS thank_you
FROM (
SELECT DISTINCT ON (person_id)
person_id, timestamp_ AS hp_ts
FROM event_pg
WHERE timestamp_ >= '2016-04-23'
AND timestamp_ < '2016-05-01'
AND location_host = '2015.testonline.ca'
ORDER BY person_id, timestamp_
) hp
LEFT JOIN LATERAL (
SELECT timestamp_ AS hlp_ts
FROM event_pg y
WHERE y.person_id = hp.person_id
AND timestamp_ >= hp.hp_ts
AND timestamp_ < '2016-05-01'
AND location_host = 'helpcentre.testonline.ca'
ORDER BY timestamp_
LIMIT 1
) hlp ON true
LEFT JOIN LATERAL (
SELECT true AS yes -- we only need existence
FROM event_pg z
WHERE z.person_id = hp.person_id -- we can use hp here
AND location_fragment = '/file/thank-you'
AND timestamp_ >= hlp.hlp_ts -- this introduces dependency on hlp anyways.
AND timestamp_ < '2016-05-01'
ORDER BY timestamp_
LIMIT 1
) tnk ON true;
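DISTINCT ON is typically cheaper for few rows per person_id.
If you have many rows per person_id (more likely for larger time ranges), a recursive CTE (rCTE), as discussed in chapter 1a of the linked answer, can be (much) faster.
See it integrated below.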
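For readers who have not used DISTINCT ON before, a minimal sketch on made-up inline data (the alias e, the column ts, and the values are hypothetical). It keeps the first row per person_id according to the ORDER BY, which is how the hp subquery finds each person's first homepage hit:

```sql
SELECT DISTINCT ON (person_id)
       person_id, ts AS first_ts
FROM  (VALUES ('a', timestamp '2016-04-24 10:00')
            , ('a', timestamp '2016-04-23 09:00')
            , ('b', timestamp '2016-04-25 12:00')) AS e(person_id, ts)
ORDER  BY person_id, ts;   -- leading ORDER BY columns must match DISTINCT ON
-- a | 2016-04-23 09:00:00
-- b | 2016-04-25 12:00:00
```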
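It's the old conundrum: one query technique is best for a smaller set, another for a bigger set. In your particular case, we have a very good indicator from the start, the length of the given time period, which we can use to decide.
We wrap it all in a PL/pgSQL function. My implementation switches from DISTINCT ON to the rCTE when the given time period is longer than a set threshold: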
CREATE OR REPLACE FUNCTION f_my_counts(_ts_low_inc timestamp, _ts_hi_excl timestamp)
RETURNS TABLE (view_homepage int, use_help int, thank_you int) AS
$func$
BEGIN
CASE
WHEN _ts_hi_excl <= _ts_low_inc THEN
RAISE EXCEPTION 'Timestamp _ts_hi_excl (2nd param) must be later than _ts_low_inc!';
WHEN _ts_hi_excl - _ts_low_inc < interval '10 days' THEN -- example value !!!
-- DISTINCT ON for few rows per person_id
RETURN QUERY
WITH hp AS (
SELECT DISTINCT ON (person_id)
person_id, timestamp_ AS hp_ts
FROM event_pg
WHERE timestamp_ >= _ts_low_inc
AND timestamp_ < _ts_hi_excl
AND location_host = '2015.testonline.ca'
ORDER BY person_id, timestamp_
)
, hlp AS (
SELECT hp.person_id, hlp.hlp_ts
FROM hp
CROSS JOIN LATERAL (
SELECT timestamp_ AS hlp_ts
FROM event_pg
WHERE person_id = hp.person_id
AND timestamp_ >= hp.hp_ts
AND timestamp_ < _ts_hi_excl
AND location_host = 'helpcentre.testonline.ca' -- match partial idx
ORDER BY timestamp_
LIMIT 1
) hlp
)
SELECT (SELECT count(*)::int FROM hp) -- AS view_homepage
, (SELECT count(*)::int FROM hlp) -- AS use_help
, (SELECT count(*)::int -- AS thank_you
FROM hlp
CROSS JOIN LATERAL (
SELECT 1 -- we only care for existence
FROM event_pg
WHERE person_id = hlp.person_id
AND location_fragment = '/file/thank-you'
AND timestamp_ >= hlp.hlp_ts
AND timestamp_ < _ts_hi_excl
ORDER BY timestamp_
LIMIT 1
) tnk
);
ELSE
-- rCTE for many rows per person_id
RETURN QUERY
WITH RECURSIVE hp AS (
( -- parentheses required
SELECT person_id, timestamp_ AS hp_ts
FROM event_pg
WHERE timestamp_ >= _ts_low_inc
AND timestamp_ < _ts_hi_excl
AND location_host = '2015.testonline.ca' -- match partial idx
ORDER BY person_id, timestamp_
LIMIT 1
)
UNION ALL
SELECT x.*
FROM hp, LATERAL (
SELECT person_id, timestamp_ AS hp_ts
FROM event_pg
WHERE person_id > hp.person_id -- lateral reference
AND timestamp_ >= _ts_low_inc -- repeat conditions
AND timestamp_ < _ts_hi_excl
AND location_host = '2015.testonline.ca' -- match partial idx
ORDER BY person_id, timestamp_
LIMIT 1
) x
)
, hlp AS (
SELECT hp.person_id, hlp.hlp_ts
FROM hp
CROSS JOIN LATERAL (
SELECT timestamp_ AS hlp_ts
FROM event_pg y
WHERE y.person_id = hp.person_id
AND location_host = 'helpcentre.testonline.ca' -- match partial idx
AND timestamp_ >= hp.hp_ts
AND timestamp_ < _ts_hi_excl
ORDER BY timestamp_
LIMIT 1
) hlp
)
SELECT (SELECT count(*)::int FROM hp) -- AS view_homepage
, (SELECT count(*)::int FROM hlp) -- AS use_help
, (SELECT count(*)::int -- AS thank_you
FROM hlp
CROSS JOIN LATERAL (
SELECT 1 -- we only care for existence
FROM event_pg
WHERE person_id = hlp.person_id
AND location_fragment = '/file/thank-you'
AND timestamp_ >= hlp.hlp_ts
AND timestamp_ < _ts_hi_excl
ORDER BY timestamp_
LIMIT 1
) tnk
);
END CASE;
END
$func$ LANGUAGE plpgsql STABLE STRICT;
Call:
SELECT * FROM f_my_counts('2016-01-23', '2016-05-01');
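To exercise both branches and the guard clause, calls like the following could be used. The date values are examples only, and which branch runs depends on the 10-day example threshold in the function body:

```sql
-- 8-day period, shorter than the threshold: takes the DISTINCT ON branch.
SELECT * FROM f_my_counts('2016-04-23', '2016-05-01');

-- Roughly six months, longer than the threshold: takes the rCTE branch.
SELECT * FROM f_my_counts('2016-01-01', '2016-07-01');

-- Upper bound not later than lower bound: raises the exception from the first WHEN.
SELECT * FROM f_my_counts('2016-05-01', '2016-04-23');
```

Timing the first two calls with different threshold values is one way to find a sensible switch-over point for your data.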
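An rCTE works with a CTE by definition. I also added CTEs to the DISTINCT ON query (as I discussed with @Lennart in the comments), which allows us to use CROSS JOIN instead of LEFT JOIN to reduce the set with each step, since we can count each CTE separately. This has effects working in opposite directions: each later step processes fewer rows, but the CTEs add overhead of their own (Postgres 9.5 materializes every CTE).
You'll have to test which outweighs the other.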