Tags: postgresql, performance, index
Schema:
CREATE TABLE traffic_hit (
id SERIAL NOT NULL PRIMARY KEY,
country VARCHAR(2) NOT NULL,
created TIMESTAMP WITH TIME ZONE NOT NULL,
"unique" BOOLEAN NOT NULL,
user_agent_id INTEGER NULL
);
CREATE TABLE utils_useragent (
id SERIAL NOT NULL PRIMARY KEY,
user_agent_string TEXT NOT NULL UNIQUE,
is_robot BOOLEAN NOT NULL
);
Initial query:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
utils_useragent.is_robot = FALSE
GROUP BY 1
Indexes:
CREATE INDEX traffic_hit_user_agent_id ON traffic_hit (user_agent_id);
CREATE INDEX traffic_hit_created_country_user_agent_id_unique_idx
    ON traffic_hit (created, country, user_agent_id, "unique");
CREATE INDEX only_robots ON utils_useragent (id) WHERE is_robot = TRUE;
Query plan:
HashAggregate (cost=582436.93..603769.28 rows=1706588 width=20) (actual time=2514.233..2515.597 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit.""unique""), count(*)"
Group Key: (traffic_hit.created)::date
-> Hash Join (cost=15732.00..545234.80 rows=4960285 width=5) (actual time=83.141..2157.453 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit.""unique"""
Hash Cond: (traffic_hit.user_agent_id = utils_useragent.id)
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=0.56..448722.21 rows=5007358 width=13) (actual time=0.066..1278.475 rows=4618870 loops=1)
Output: traffic_hit.created, traffic_hit.country, traffic_hit.user_agent_id, traffic_hit."unique"
Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
Heap Fetches: 40448
-> Hash (cost=10806.55..10806.55 rows=393991 width=4) (actual time=77.531..77.531 rows=393896 loops=1)
Output: utils_useragent.id
Buckets: 524288 Batches: 1 Memory Usage: 17944kB
-> Index Only Scan using utils_useragent_id_idx on public.utils_useragent (cost=0.42..10806.55 rows=393991 width=4) (actual time=0.071..29.285 rows=393896 loops=1)
Output: utils_useragent.id
Heap Fetches: 5932
Planning time: 0.918 ms
Execution time: 2531.195 ms
Data:
The utils_useragent table contains about 4,000 records with is_robot = TRUE and about 395,000 records with is_robot = FALSE. The traffic_hit table contains roughly 12 million records for 2016.
Goal:
Improve read performance, since the query drives a reporting application and matters to its users.
My approach:
Since there are very few "robots" in the utils_useragent table, a partial index should be faster. The other thing I want to exploit is a multicolumn index that allows index-only scans:
SELECT
traffic_hit.created::DATE AS group_by,
COUNT(*) FILTER(WHERE traffic_hit.unique) AS unique_visits,
COUNT(*) AS non_unique_visits
FROM
traffic_hit
WHERE
traffic_hit.created >= '2016-01-01' AND
traffic_hit.created < '2017-01-01' AND
traffic_hit.country = 'CZ' AND
user_agent_id NOT IN (select id from utils_useragent where is_robot = TRUE)
GROUP BY 1
New query plan:
HashAggregate (cost=486612.46..503842.68 rows=1378418 width=20) (actual time=2281.282..2282.627 rows=366 loops=1)
Output: ((traffic_hit.created)::date), count(*) FILTER (WHERE traffic_hit.""unique""), count(*)"
Group Key: (traffic_hit.created)::date
-> Index Only Scan using traffic_hit_created_country_user_agent_id_unique_idx on public.traffic_hit (cost=275.23..467834.60 rows=2503714 width=5) (actual time=2.223..1922.960 rows=2430245 loops=1)
Output: (traffic_hit.created)::date, traffic_hit.""unique"""
Index Cond: ((traffic_hit.created >= '2016-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.created < '2017-01-01 00:00:00+01'::timestamp with time zone) AND (traffic_hit.country = 'CZ'::text))
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 2188625
Heap Fetches: 40448
SubPlan 1
-> Index Only Scan using only_robots on public.utils_useragent (cost=0.28..265.32 rows=3739 width=4) (actual time=0.031..0.682 rows=3763 loops=1)
Output: utils_useragent.id
Heap Fetches: 0
Planning time: 0.510 ms
Execution time: 2297.849 ms
The new query is faster, but the plan contains a Filter: (NOT (hashed SubPlan 1)) step, which confuses me.
Question:
Why isn't the index used to filter user_agent_id? Can it be used to improve query performance, or would some other approach work better?
PostgreSQL version: 9.6.3
It does use the index. It uses the index to build a hash table, and then uses that hash table in the filter. Probing an in-memory, backend-private hash table is faster than repeatedly probing the shared, on-disk index.
But if performance matters, why re-aggregate millions of rows of data that hasn't changed in six months, over and over? Aggregate it once and store the result. You can use a materialized view for this, or just do it by hand.
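A minimal sketch of the materialized-view route, reusing the question's own query shape; the view name daily_visits and the idea of refreshing it on a schedule are my own assumptions, not from the answer:

```sql
-- Hypothetical materialized view pre-aggregating non-robot hits per day and country.
CREATE MATERIALIZED VIEW daily_visits AS
SELECT
    traffic_hit.created::DATE AS day,
    traffic_hit.country,
    COUNT(*) FILTER (WHERE traffic_hit."unique") AS unique_visits,
    COUNT(*) AS non_unique_visits
FROM traffic_hit
LEFT JOIN utils_useragent ON traffic_hit.user_agent_id = utils_useragent.id
WHERE utils_useragent.is_robot = FALSE
GROUP BY 1, 2;

-- The report then reads a few hundred pre-aggregated rows instead of millions:
SELECT day, unique_visits, non_unique_visits
FROM daily_visits
WHERE day >= '2016-01-01' AND day < '2017-01-01'
  AND country = 'CZ';

-- Refresh periodically (e.g. nightly from cron) to pick up new hits:
REFRESH MATERIALIZED VIEW daily_visits;
```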
You can do a partial aggregation, for example aggregating the data grouped by date(created) plus whatever other columns you need. People can then re-aggregate this reduced data set over a specific date range, as long as they are happy with whole-day boundaries, and can filter on the other columns, aggregate over them, or group by them. If they want counts, you have to be careful to sum the counts, not count the counts. If you want averages, you have to be careful to weight the averages by the counts, rather than take an unweighted average of averages.
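The partial-aggregation idea could look like the sketch below; the table name traffic_hit_daily and its columns are illustrative, not from the answer. Keeping user_agent_id in the grouping keys is what lets the robot filter be applied later:

```sql
-- Hypothetical pre-aggregated table: one row per (day, country, user_agent_id).
CREATE TABLE traffic_hit_daily AS
SELECT
    created::DATE AS day,
    country,
    user_agent_id,
    COUNT(*) FILTER (WHERE "unique") AS unique_visits,
    COUNT(*) AS non_unique_visits
FROM traffic_hit
GROUP BY 1, 2, 3;

-- Re-aggregating: SUM the stored counts, never COUNT the rows.
SELECT
    day,
    SUM(unique_visits) AS unique_visits,
    SUM(non_unique_visits) AS non_unique_visits
FROM traffic_hit_daily
WHERE day >= '2016-01-01' AND day < '2017-01-01'
  AND country = 'CZ'
  AND user_agent_id NOT IN (SELECT id FROM utils_useragent WHERE is_robot = TRUE)
GROUP BY 1;
```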
Of course, if you change your mind about what is and isn't a robot, you will have to rebuild the partial-aggregation table.
Either way, the bottleneck is not the NOT IN clause but the sheer volume of raw data you want to process.
Parallel query in the upcoming PostgreSQL v10 release could help this query.
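For illustration only: parallelism is controlled by planner settings such as max_parallel_workers_per_gather; the values below are arbitrary examples, not tuning advice, and whether this particular plan parallelizes depends on the version and the plan shape:

```sql
-- Allow up to 4 worker processes per Gather node (session-level example).
SET max_parallel_workers_per_gather = 4;

-- Inspect the plan to see whether a Gather / Parallel scan appears.
EXPLAIN (ANALYZE, VERBOSE)
SELECT created::DATE, COUNT(*)
FROM traffic_hit
WHERE created >= '2016-01-01' AND created < '2017-01-01'
GROUP BY 1;
```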