uld*_*all 15 postgresql performance index optimization postgresql-9.3 query-performance
I'm trying to determine which indexes to use for an SQL query with a WHERE condition and a GROUP BY, which is currently running very slowly.
My query:
SELECT group_id
FROM counter
WHERE ts between timestamp '2014-03-02 00:00:00.0' and timestamp '2014-03-05 12:00:00.0'
GROUP BY group_id
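The table currently has 32,000,000 rows. The execution time of the query increases a lot when I widen the time frame.
The table in question looks like this: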
CREATE TABLE counter (
id bigserial PRIMARY KEY
, ts timestamp NOT NULL
, group_id bigint NOT NULL
);
I currently have the following indexes, but performance is still slow:
CREATE INDEX ts_index
ON counter
USING btree
(ts);
CREATE INDEX group_id_index
ON counter
USING btree
(group_id);
CREATE INDEX comp_1_index
ON counter
USING btree
(ts, group_id);
CREATE INDEX comp_2_index
ON counter
USING btree
(group_id, ts);
Running EXPLAIN on the query gives the following result:
"QUERY PLAN"
"HashAggregate (cost=467958.16..467958.17 rows=1 width=4)"
" -> Index Scan using ts_index on counter (cost=0.56..467470.93 rows=194892 width=4)"
" Index Cond: ((ts >= '2014-02-26 00:00:00'::timestamp without time zone) AND (ts <= '2014-02-27 23:59:00'::timestamp without time zone))"
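SQL Fiddle with sample data: http://sqlfiddle.com/#!15/7492b/1
Can the performance of this query be improved by adding better indexes, or do I have to increase the processing power?
Using PostgreSQL version 9.3.2.
I tried @Erwin's suggestion with EXISTS: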
SELECT group_id
FROM groups g
WHERE EXISTS (
SELECT 1
FROM counter c
WHERE c.group_id = g.group_id
AND ts BETWEEN timestamp '2014-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
);
Unfortunately, this did not seem to improve performance. The query plan:
"QUERY PLAN"
"Nested Loop Semi Join (cost=1607.18..371680.60 rows=113 width=4)"
" -> Seq Scan on groups g (cost=0.00..2.33 rows=133 width=4)"
" -> Bitmap Heap Scan on counter c (cost=1607.18..158895.53 rows=60641 width=4)"
" Recheck Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
" -> Bitmap Index Scan on comp_2_index (cost=0.00..1592.02 rows=60641 width=0)"
" Index Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
The query plan for the LATERAL query from ypercube:
"QUERY PLAN"
"Nested Loop (cost=8.98..1200.42 rows=133 width=20)"
" -> Seq Scan on groups g (cost=0.00..2.33 rows=133 width=4)"
" -> Result (cost=8.98..8.99 rows=1 width=0)"
" One-Time Filter: ($1 IS NOT NULL)"
" InitPlan 1 (returns $1)"
" -> Limit (cost=0.56..4.49 rows=1 width=8)"
" -> Index Only Scan using comp_2_index on counter c (cost=0.56..1098691.21 rows=279808 width=8)"
" Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
" InitPlan 2 (returns $2)"
" -> Limit (cost=0.56..4.49 rows=1 width=8)"
" -> Index Only Scan Backward using comp_2_index on counter c_1 (cost=0.56..1098691.21 rows=279808 width=8)"
" Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
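Another idea that also uses the groups table, together with a construct called a LATERAL join (for SQL Server fans, this is almost identical to OUTER APPLY). It has the advantage that aggregates can be calculated in the subquery: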
SELECT group_id, min_ts, max_ts
FROM groups g, -- notice the comma here, is required
LATERAL
( SELECT MIN(ts) AS min_ts,
MAX(ts) AS max_ts
FROM counter c
WHERE c.group_id = g.group_id
AND c.ts BETWEEN timestamp '2011-03-02 00:00:00'
AND timestamp '2013-03-05 12:00:00'
) x
WHERE min_ts IS NOT NULL ;
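A test on SQL Fiddle shows that the query does index scans on the (group_id, ts) index.
Similar plans are produced using 2 lateral joins, one for the min and one for the max, and also with 2 inline correlated subqueries. These variants can also be used if you need to show the whole counter rows in addition to the min and max dates: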
SELECT group_id,
min_ts, min_ts_id,
max_ts, max_ts_id
FROM groups g
, LATERAL
( SELECT ts AS min_ts, c.id AS min_ts_id
FROM counter c
WHERE c.group_id = g.group_id
AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
ORDER BY ts ASC
LIMIT 1
) xmin
, LATERAL
( SELECT ts AS max_ts, c.id AS max_ts_id
FROM counter c
WHERE c.group_id = g.group_id
AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
ORDER BY ts DESC
LIMIT 1
) xmax
WHERE min_ts IS NOT NULL ;
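The variant with 2 inline correlated subqueries mentioned above is not spelled out. A minimal sketch of what it might look like, assuming the same groups lookup table and reusing the time range from the query above (the derived-table alias sub is made up for illustration):

```sql
-- Sketch only: each scalar subquery fetches one row per group,
-- which the planner can typically answer from the (group_id, ts) index.
SELECT group_id, min_ts, max_ts
FROM  (
   SELECT g.group_id
        , (SELECT c.ts
           FROM   counter c
           WHERE  c.group_id = g.group_id
           AND    c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                           AND timestamp '2014-03-05 12:00:00'
           ORDER  BY c.ts ASC
           LIMIT  1) AS min_ts
        , (SELECT c.ts
           FROM   counter c
           WHERE  c.group_id = g.group_id
           AND    c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                           AND timestamp '2014-03-05 12:00:00'
           ORDER  BY c.ts DESC
           LIMIT  1) AS max_ts
   FROM   groups g
   ) sub
WHERE  min_ts IS NOT NULL;   -- drop groups without rows in the range
```

A scalar subquery can return only one column, so this form suits the min/max timestamps; fetching the matching id as well is where the LATERAL form is more convenient.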
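With only "133 different group_ids", you could use integer (or even smallint) for group_id. It won't buy you much, though, because padding to 8 bytes will eat up the savings in your table and in possible multicolumn indexes. Processing plain integer should be a bit faster, though.
CREATE TABLE counter (
id bigserial PRIMARY KEY
, ts timestamp NOT NULL
, group_id int NOT NULL
);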
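To see the per-value sizes behind the padding argument, one could run a quick check like the following. This is only an illustrative sketch; the column aliases are made up, and it measures individual values, not whole rows:

```sql
-- pg_column_size() reports the number of bytes used to store a value.
SELECT pg_column_size(1::smallint) AS smallint_bytes  -- 2
     , pg_column_size(1::int)      AS int_bytes       -- 4
     , pg_column_size(1::bigint)   AS bigint_bytes;   -- 8
```

Because row data is aligned to 8-byte boundaries on typical 64-bit builds, the 4 bytes saved by int after two 8-byte columns are usually lost to padding, which is why the table itself barely shrinks.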
@Leo: Timestamps are stored as 8-byte integers in modern Postgres and can be processed very fast.
@ypercube: The index on (group_id, ts) can't help, since there is no condition on group_id in the query.
Your main problem is the massive amount of data that has to be processed:
Index Scan using ts_index on counter (cost=0.56..467470.93 rows=194892 width=4)
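You are only interested in the existence of a group_id, not in an actual count. Also, there are only 133 different group_ids, so your query can be satisfied with the first hit per group_id in the time frame. Hence this suggestion for an alternative query with an EXISTS semi-join.
Assuming a lookup table for groups: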
SELECT group_id
FROM groups g
WHERE EXISTS (
SELECT 1
FROM counter c
WHERE c.group_id = g.group_id
AND ts BETWEEN timestamp '2014-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
);
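The query above assumes a groups lookup table, which the schema in the question does not define. A minimal sketch of how it might be built once from existing data (the one-off approach and the primary key are assumptions, and the initial SELECT DISTINCT is itself a full pass over counter):

```sql
-- Hypothetical one-time setup for the assumed lookup table.
CREATE TABLE groups AS
SELECT DISTINCT group_id
FROM   counter;

ALTER TABLE groups ADD PRIMARY KEY (group_id);

-- Collect statistics so the planner knows the table is tiny.
ANALYZE groups;
```

New group_ids would have to be added to groups as they appear, for example by the application or a trigger; that part is left out here.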
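Your index comp_2_index on (group_id, ts) becomes instrumental now.
SQL Fiddle, building on the fiddle ypercube provided in the comments (old sqlfiddle).
In that fiddle, the query prefers the index on (ts, group_id), but I think that is because of the test setup with "clustered" timestamps. If you remove the indexes with leading ts, the planner will happily use the index on (group_id, ts) as well, notably in an index-only scan.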
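One way to try this without permanently losing the indexes is to drop them inside a transaction and roll back afterwards. The following is a sketch under assumptions: DROP INDEX locks the table exclusively until the transaction ends, so it is only suitable on a test copy or during a quiet period.

```sql
-- Index-only scans depend on the visibility map, which VACUUM maintains.
-- VACUUM cannot run inside a transaction block, so do it first.
VACUUM ANALYZE counter;

BEGIN;
-- Temporarily hide the indexes with leading ts from the planner.
DROP INDEX ts_index;
DROP INDEX comp_1_index;

EXPLAIN (ANALYZE, BUFFERS)
SELECT g.group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    c.ts BETWEEN timestamp '2014-03-02 00:00:00'
                   AND timestamp '2014-03-05 12:00:00'
   );

ROLLBACK;  -- both indexes are back, nothing was changed
```

If the plan then shows an Index Only Scan on comp_2_index and the runtime drops, that supports removing the indexes with leading ts for good.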
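If that works, you might not need this other possible improvement: pre-aggregate the data in a materialized view to drastically reduce the number of rows. This makes sense in particular if you also need actual counts. You then pay the cost of processing many rows once, when refreshing the materialized view. You could even combine daily and hourly aggregates (two separate tables) and adapt your query to that.
Are the time frames in your queries arbitrary? Or mostly on full minutes / hours / days?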
CREATE MATERIALIZED VIEW counter_mv AS
SELECT date_trunc('hour', ts) AS hour
, group_id
, count(*)::int AS ct
FROM counter
GROUP BY 1,2
ORDER BY 1,2;
Create the necessary index(es) on counter_mv and adapt your query to work with it. Like:
CREATE INDEX foo ON counter_mv (hour, group_id, ct); -- once
SELECT group_id, sum(ct) AS total_ct
FROM counter_mv
WHERE hour BETWEEN timestamp '2014-03-02 00:00:00'
AND timestamp '2014-03-05 12:00:00'
GROUP BY 1
ORDER BY 2;
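A materialized view is a snapshot, so it has to be refreshed before it reflects new rows in counter. A minimal sketch of the refresh step plus a half-open variant of the range filter; when and how often to refresh is an assumption left to the reader:

```sql
-- Recomputes the whole view. In PostgreSQL 9.3 this blocks reads on
-- counter_mv while it runs (REFRESH ... CONCURRENTLY arrived in 9.4).
REFRESH MATERIALIZED VIEW counter_mv;

-- Each row of counter_mv covers one full hour starting at "hour".
-- With BETWEEN, the upper bound '12:00:00' pulls in the whole
-- 12:00 to 12:59 bucket; a half-open range stops before it.
SELECT group_id, sum(ct) AS total_ct
FROM   counter_mv
WHERE  hour >= timestamp '2014-03-02 00:00:00'
AND    hour <  timestamp '2014-03-05 12:00:00'
GROUP  BY 1
ORDER  BY 2;
```

Either way, the view only gives exact answers for time frames aligned to full hours, which is why the question about arbitrary time frames matters.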
Since you have no aggregate in the select list, the GROUP BY is pretty much the same as putting a DISTINCT in the select list, right?
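For reference, the DISTINCT form of the query from the question would look like this sketch. It returns the same set of group_ids as the GROUP BY version, though the planner is free to choose a different plan for it:

```sql
-- Same result set as GROUP BY group_id without aggregates.
SELECT DISTINCT group_id
FROM   counter
WHERE  ts BETWEEN timestamp '2014-03-02 00:00:00'
              AND timestamp '2014-03-05 12:00:00';
```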
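If that is what you want, you might be able to get a fast index lookup on comp_2_index by rewriting the query to use a recursive query, as described in the PostgreSQL wiki.
Create a view that efficiently returns the distinct group_ids: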
create or replace view groups as
WITH RECURSIVE t AS (
SELECT min(counter.group_id) AS group_id
FROM counter
UNION ALL
SELECT ( SELECT min(counter.group_id) AS min
FROM counter
WHERE counter.group_id > t.group_id) AS min
FROM t
WHERE t.group_id IS NOT NULL
)
SELECT t.group_id
FROM t
WHERE t.group_id IS NOT NULL
UNION ALL
SELECT NULL::bigint AS col
WHERE (EXISTS ( SELECT counter.id,
counter.ts,
counter.group_id
FROM counter
WHERE counter.group_id IS NULL));
Then use that view in place of the lookup table in Erwin's EXISTS semi-join.
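Put together, that might look like the following sketch, which assumes the view was created under the name groups as shown above and that no table of the same name exists:

```sql
-- groups is the recursive view here, not a physical lookup table.
SELECT g.group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    c.ts BETWEEN timestamp '2014-03-02 00:00:00'
                   AND timestamp '2014-03-05 12:00:00'
   );
```

Since counter.group_id is declared NOT NULL, the NULL branch of the view never contributes a row in this schema.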