Index for SQL query with WHERE condition and GROUP BY

uld*_*all 15 postgresql performance index optimization postgresql-9.3 query-performance

I am trying to determine which indexes to use for an SQL query with a WHERE condition and a GROUP BY that currently runs very slowly.

My query:

SELECT group_id
FROM counter
WHERE ts between timestamp '2014-03-02 00:00:00.0' and timestamp '2014-03-05 12:00:00.0'
GROUP BY group_id

The table currently has 32,000,000 rows. The execution time of the query grows a lot when I widen the time range.

The table in question looks like this:

CREATE TABLE counter (
    id bigserial PRIMARY KEY
  , ts timestamp NOT NULL
  , group_id bigint NOT NULL
);

I currently have the following indexes, but performance is still slow:

CREATE INDEX ts_index
  ON counter
  USING btree
  (ts);

CREATE INDEX group_id_index
  ON counter
  USING btree
  (group_id);

CREATE INDEX comp_1_index
  ON counter
  USING btree
  (ts, group_id);

CREATE INDEX comp_2_index
  ON counter
  USING btree
  (group_id, ts);

Running EXPLAIN on the query gives the following result:

"QUERY PLAN"
"HashAggregate  (cost=467958.16..467958.17 rows=1 width=4)"
"  ->  Index Scan using ts_index on counter  (cost=0.56..467470.93 rows=194892 width=4)"
"        Index Cond: ((ts >= '2014-02-26 00:00:00'::timestamp without time zone) AND (ts <= '2014-02-27 23:59:00'::timestamp without time zone))"

SQL Fiddle with sample data: http://sqlfiddle.com/#!15/7492b/1

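The question

Can the performance of this query be improved by adding better indexes, or do I have to increase processing power?

Edit 1

PostgreSQL version 9.3.2 is used.

Edit 2

I tried @Erwin's proposal with EXISTS: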

SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    ts BETWEEN timestamp '2014-03-02 00:00:00'
                 AND timestamp '2014-03-05 12:00:00'
   );

Unfortunately, this did not seem to improve performance. The query plan:

"QUERY PLAN"
"Nested Loop Semi Join  (cost=1607.18..371680.60 rows=113 width=4)"
"  ->  Seq Scan on groups g  (cost=0.00..2.33 rows=133 width=4)"
"  ->  Bitmap Heap Scan on counter c  (cost=1607.18..158895.53 rows=60641 width=4)"
"        Recheck Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
"        ->  Bitmap Index Scan on comp_2_index  (cost=0.00..1592.02 rows=60641 width=0)"
"              Index Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"

Edit 3

The query plan for the LATERAL query from ypercube:

"QUERY PLAN"
"Nested Loop  (cost=8.98..1200.42 rows=133 width=20)"
"  ->  Seq Scan on groups g  (cost=0.00..2.33 rows=133 width=4)"
"  ->  Result  (cost=8.98..8.99 rows=1 width=0)"
"        One-Time Filter: ($1 IS NOT NULL)"
"        InitPlan 1 (returns $1)"
"          ->  Limit  (cost=0.56..4.49 rows=1 width=8)"
"                ->  Index Only Scan using comp_2_index on counter c  (cost=0.56..1098691.21 rows=279808 width=8)"
"                      Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
"        InitPlan 2 (returns $2)"
"          ->  Limit  (cost=0.56..4.49 rows=1 width=8)"
"                ->  Index Only Scan Backward using comp_2_index on counter c_1  (cost=0.56..1098691.21 rows=279808 width=8)"
"                      Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"

ype*_*eᵀᴹ 6

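Another idea, which also uses the groups table and a construct called a LATERAL join (for SQL Server fans, this is almost identical to OUTER APPLY). It has the advantage that aggregates can be computed in the subquery: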

SELECT group_id, min_ts, max_ts
FROM   groups g,                    -- notice the comma here, is required
  LATERAL 
       ( SELECT MIN(ts) AS min_ts,
                MAX(ts) AS max_ts
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2011-03-02 00:00:00'
                        AND timestamp '2013-03-05 12:00:00'
       ) x 
WHERE min_ts IS NOT NULL ;

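A test at SQL-Fiddle shows that the query does index scans on the (group_id, ts) index.

Similar plans are produced using 2 lateral joins, one for the min and one for the max, and also with 2 inline correlated subqueries. They can also be used if you need to show whole counter rows in addition to the min and max dates: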

SELECT group_id, 
       min_ts, min_ts_id, 
       max_ts, max_ts_id 
FROM   groups g
  , LATERAL 
       ( SELECT ts AS min_ts, c.id AS min_ts_id
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                        AND timestamp '2014-03-05 12:00:00'
         ORDER BY ts ASC
         LIMIT 1
       ) xmin
  , LATERAL 
       ( SELECT ts AS max_ts, c.id AS max_ts_id
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                        AND timestamp '2014-03-05 12:00:00'
         ORDER BY ts DESC 
         LIMIT 1
       ) xmax
WHERE min_ts IS NOT NULL ;


Erw*_*ter 5

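With only 133 different group_ids, you could use integer (or even smallint) for group_id. It won't buy you much, though, because padding to 8 bytes will eat up the savings in the table and possibly in the indexes. Processing plain integer should be a bit faster, though. More on int4 vs. int2.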

CREATE TABLE counter (
    id bigserial PRIMARY KEY
  , ts timestamp NOT NULL
  , group_id int NOT NULL
);

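@Leo: Timestamps are stored as 8-byte integers in modern Postgres and can be processed very fast.

@ypercube: The index on (group_id, ts) can't help, because there is no condition on group_id in the query.

Your main problem is the massive amount of data that has to be processed:

Index Scan using ts_index on counter  (cost=0.56..467470.93 rows=194892 width=4)

You are only interested in the existence of a group_id, not in the actual count. There are only 133 different group_ids, so your query can be satisfied with the first hit per group_id in the time frame. Hence my suggestion for an alternative query with an EXISTS expression:

Assuming a look-up table for groups: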

SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    ts BETWEEN timestamp '2014-03-02 00:00:00'
                 AND timestamp '2014-03-05 12:00:00'
   );

Your index comp_2_index on (group_id, ts) becomes useful now.

sqlfiddle, built on ypercube's fiddle in the comments.

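Here, the query prefers the index on (ts, group_id), but I think that is because of the test setup with "clustered" timestamps. If you remove the indexes with leading ts, the planner will happily use the index on (group_id, ts) as well, notably in an index-only scan.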
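
One way to check which index the planner falls back to, without permanently losing the others, is to drop them inside a transaction and roll back. A minimal sketch, assuming the index names from the question and a groups look-up table with a group_id column. DROP INDEX locks counter until the transaction ends, so this is meant for a test system, not for production. Whether the plan shows an index-only scan also depends on the table having been vacuumed recently.

```sql
BEGIN;

-- Temporarily hide both indexes that have ts as the leading column.
DROP INDEX ts_index;
DROP INDEX comp_1_index;

-- Plan for the EXISTS query with only the group_id-leading indexes left.
-- Plain EXPLAIN does not execute the query.
EXPLAIN
SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    c.ts BETWEEN timestamp '2014-03-02 00:00:00'
                   AND timestamp '2014-03-05 12:00:00'
   );

ROLLBACK;  -- the dropped indexes are available again after this
```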

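If that works, you may not need this other possible improvement: pre-aggregate the data in a materialized view to drastically reduce the number of rows. This makes sense especially if you also need actual counts. You then pay the cost of processing many rows once, when updating the materialized view. You could even combine daily and hourly aggregates (two separate tables) and adapt your query to that.

Are the time frames in your queries arbitrary? Or mostly full minutes / hours / days?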

CREATE MATERIALIZED VIEW counter_mv AS
SELECT date_trunc('hour', ts) AS hour
     , group_id
     , count(*)::int AS ct
FROM  counter
GROUP BY 1,2
ORDER BY 1,2;

Create the necessary index(es) on counter_mv and adapt your query to work with it. Like:

CREATE INDEX foo ON counter_mv (hour, group_id, ct);  -- once

SELECT group_id, sum(ct) AS total_ct
FROM   counter_mv
WHERE  hour BETWEEN timestamp '2014-03-02 00:00:00'
                AND timestamp '2014-03-05 12:00:00'
GROUP  BY 1
ORDER  BY 2;
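
A materialized view is a snapshot, so it has to be refreshed to pick up new rows in counter. A minimal sketch, assuming the counter_mv definition above; how often to refresh is an assumption that depends on how stale the counts may be.

```sql
-- Recompute the whole view from counter. In PostgreSQL 9.3 this takes an
-- exclusive lock on counter_mv, so readers wait until the refresh finishes.
REFRESH MATERIALIZED VIEW counter_mv;

-- The existence-only question from the original post, answered from the
-- much smaller view instead of the 32-million-row table.
SELECT DISTINCT group_id
FROM   counter_mv
WHERE  hour >= timestamp '2014-03-02 00:00:00'
AND    hour <= timestamp '2014-03-05 12:00:00';
```

Because rows are bucketed by hour, a range that does not start and end on hour boundaries picks up whole buckets at its edges. In this example the 12:00 bucket covers rows up to 12:59:59, so the result can differ slightly from the query against counter.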


jja*_*nes 5

Since you have no aggregates in the select list, the group by is pretty much the same as putting a distinct into the select list, right?
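
Spelled out, the equivalent form would look like this. It is only a sketch that reuses the time range from the question, and it should return the same set of group_id values as the GROUP BY version:

```sql
SELECT DISTINCT group_id
FROM   counter
WHERE  ts BETWEEN timestamp '2014-03-02 00:00:00'
              AND timestamp '2014-03-05 12:00:00';
```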

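If that is what you want, you can get fast index lookups on comp_2_index by rewriting the query to use a recursive query, as described on the PostgreSQL wiki.

Create a view to efficiently return the distinct group_ids: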

create or replace view groups as
WITH RECURSIVE t AS (
             SELECT min(counter.group_id) AS group_id
               FROM counter
    UNION ALL
             SELECT ( SELECT min(counter.group_id) AS min
                       FROM counter
                      WHERE counter.group_id > t.group_id) AS min
               FROM t
              WHERE t.group_id IS NOT NULL
    )
     SELECT t.group_id
       FROM t
      WHERE t.group_id IS NOT NULL
UNION ALL
     SELECT NULL::bigint AS col
      WHERE (EXISTS ( SELECT counter.id,
                counter.ts,
                counter.group_id
               FROM counter
              WHERE counter.group_id IS NULL));

Then use that view in place of the lookup table in Erwin's exists semi-join.
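
Put together, that could look like the following sketch. It assumes the groups view defined above and reuses the time range from the question; the inner part is Erwin's EXISTS test. If a table named groups already exists, the view needs a different name, because tables and views share one namespace.

```sql
SELECT g.group_id
FROM   groups g              -- the recursive view, not a physical table
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    c.ts BETWEEN timestamp '2014-03-02 00:00:00'
                   AND timestamp '2014-03-05 12:00:00'
   );
```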