Indexes for SQL query with WHERE condition and GROUP BY

Question

uld*_*all 15 postgresql performance index optimization postgresql-9.3 query-performance

I'm trying to determine which indexes to use for an SQL query with a WHERE condition and a GROUP BY that is currently running very slowly.

My query:

SELECT group_id
FROM counter
WHERE ts between timestamp '2014-03-02 00:00:00.0' and timestamp '2014-03-05 12:00:00.0'
GROUP BY group_id

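The table currently has 32,000,000 rows. The execution time of the query increases a lot when I widen the time frame.

The table in question looks like this: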

CREATE TABLE counter (
    id bigserial PRIMARY KEY
  , ts timestamp NOT NULL
  , group_id bigint NOT NULL
);

I currently have the following indexes, but performance is still slow:

CREATE INDEX ts_index
  ON counter
  USING btree
  (ts);

CREATE INDEX group_id_index
  ON counter
  USING btree
  (group_id);

CREATE INDEX comp_1_index
  ON counter
  USING btree
  (ts, group_id);

CREATE INDEX comp_2_index
  ON counter
  USING btree
  (group_id, ts);

Running EXPLAIN on the query gives the following result:

"QUERY PLAN"
"HashAggregate  (cost=467958.16..467958.17 rows=1 width=4)"
"  ->  Index Scan using ts_index on counter  (cost=0.56..467470.93 rows=194892 width=4)"
"        Index Cond: ((ts >= '2014-02-26 00:00:00'::timestamp without time zone) AND (ts <= '2014-02-27 23:59:00'::timestamp without time zone))"

SQL Fiddle with example data: http://sqlfiddle.com/#!15/7492b/1

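The Question

Can the performance of this query be improved by adding better indexes, or do I have to increase the processing power?

Edit 1

PostgreSQL version 9.3.2 is used.

Edit 2

I tried @Erwin's proposal with EXISTS: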

SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    ts BETWEEN timestamp '2014-03-02 00:00:00'
                 AND timestamp '2014-03-05 12:00:00'
   );

Unfortunately, this did not seem to improve performance. The query plan:

"QUERY PLAN"
"Nested Loop Semi Join  (cost=1607.18..371680.60 rows=113 width=4)"
"  ->  Seq Scan on groups g  (cost=0.00..2.33 rows=133 width=4)"
"  ->  Bitmap Heap Scan on counter c  (cost=1607.18..158895.53 rows=60641 width=4)"
"        Recheck Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
"        ->  Bitmap Index Scan on comp_2_index  (cost=0.00..1592.02 rows=60641 width=0)"
"              Index Cond: ((group_id = g.id) AND (ts >= '2014-01-01 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"

Edit 3

The query plan for the LATERAL query from ypercube:

"QUERY PLAN"
"Nested Loop  (cost=8.98..1200.42 rows=133 width=20)"
"  ->  Seq Scan on groups g  (cost=0.00..2.33 rows=133 width=4)"
"  ->  Result  (cost=8.98..8.99 rows=1 width=0)"
"        One-Time Filter: ($1 IS NOT NULL)"
"        InitPlan 1 (returns $1)"
"          ->  Limit  (cost=0.56..4.49 rows=1 width=8)"
"                ->  Index Only Scan using comp_2_index on counter c  (cost=0.56..1098691.21 rows=279808 width=8)"
"                      Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"
"        InitPlan 2 (returns $2)"
"          ->  Limit  (cost=0.56..4.49 rows=1 width=8)"
"                ->  Index Only Scan Backward using comp_2_index on counter c_1  (cost=0.56..1098691.21 rows=279808 width=8)"
"                      Index Cond: ((group_id = $0) AND (ts IS NOT NULL) AND (ts >= '2010-03-02 00:00:00'::timestamp without time zone) AND (ts <= '2014-03-05 12:00:00'::timestamp without time zone))"

Answer 1

ype*_*eᵀᴹ 6

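Another idea, which also uses the groups table and a construct called a LATERAL join (for SQL Server fans, this is almost identical to OUTER APPLY). It has the advantage that aggregates can be calculated in the subquery: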

SELECT group_id, min_ts, max_ts
FROM   groups g,                    -- note the comma here; it is required
  LATERAL 
       ( SELECT MIN(ts) AS min_ts,
                MAX(ts) AS max_ts
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2011-03-02 00:00:00'
                        AND timestamp '2013-03-05 12:00:00'
       ) x 
WHERE min_ts IS NOT NULL ;

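A test at SQL-Fiddle shows that the query does index scans on the (group_id, ts) index.

Similar plans are produced using 2 lateral joins, one for the min and one for the max, and also with 2 inline correlated subqueries. They can also be used if you need to show the whole counter rows besides the min and max dates: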

SELECT group_id, 
       min_ts, min_ts_id, 
       max_ts, max_ts_id 
FROM   groups g
  , LATERAL 
       ( SELECT ts AS min_ts, c.id AS min_ts_id
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                        AND timestamp '2014-03-05 12:00:00'
         ORDER BY ts ASC
         LIMIT 1
       ) xmin
  , LATERAL 
       ( SELECT ts AS max_ts, c.id AS max_ts_id
         FROM counter c
         WHERE c.group_id = g.group_id
           AND c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                        AND timestamp '2014-03-05 12:00:00'
         ORDER BY ts DESC 
         LIMIT 1
       ) xmax
WHERE min_ts IS NOT NULL ;
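
The variant with 2 inline correlated subqueries mentioned above is not spelled out, so here is a minimal sketch of how it might look. It assumes the same groups lookup table with a group_id column, and it returns only the min and max timestamps (a scalar subquery can return a single column, so the ids are left out). The derived-table alias x is just for illustration.

```sql
-- Sketch only: two scalar correlated subqueries instead of LATERAL joins.
-- Assumes a lookup table groups(group_id), as in the queries above.
SELECT group_id, min_ts, max_ts
FROM  (
   SELECT g.group_id
        , (SELECT c.ts
           FROM   counter c
           WHERE  c.group_id = g.group_id
           AND    c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                           AND timestamp '2014-03-05 12:00:00'
           ORDER  BY c.ts ASC      -- earliest hit for this group
           LIMIT  1) AS min_ts
        , (SELECT c.ts
           FROM   counter c
           WHERE  c.group_id = g.group_id
           AND    c.ts BETWEEN timestamp '2012-03-02 00:00:00'
                           AND timestamp '2014-03-05 12:00:00'
           ORDER  BY c.ts DESC     -- latest hit for this group
           LIMIT  1) AS max_ts
   FROM   groups g
   ) x
WHERE  min_ts IS NOT NULL;         -- drop groups with no row in the range
```

Each subquery should be answerable with a single probe into the (group_id, ts) index per group, which is the same access pattern as the LATERAL versions.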

Answer 2

Erw*_*ter 5

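Since you have "only 133 different group_id's", you could use integer (or even smallint) for group_id. It won't buy you much, though, because padding to 8 bytes will eat up the rest in your table and possible multi-column indexes. Processing of plain integer should be a bit faster, though. More on int4 vs. int2:

Use integer instead of interval (a type)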

CREATE TABLE counter (
    id bigserial PRIMARY KEY
  , ts timestamp NOT NULL
  , group_id int NOT NULL
);

@Leo: timestamps are stored as 8-byte integers in modern Postgres and can be processed very fast. See:

Ignoring time zones altogether in Rails and PostgreSQL

@ypercube: the index on (group_id, ts) can't help, since there is no condition on group_id in the query.

Your main problem is the massive amount of data that has to be processed:

Index Scan using ts_index on counter  (cost=0.56..467470.93 rows=194892 width=4)

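You are only interested in the existence of a group_id, not in the actual count. And there are only 133 different group_ids, so your query can be satisfied with the first hit per group_id in the time frame. Hence my suggestion for an alternative query with an EXISTS expression:

Assuming a lookup table for groups: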

SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    ts BETWEEN timestamp '2014-03-02 00:00:00'
                 AND timestamp '2014-03-05 12:00:00'
   );

Your index comp_2_index on (group_id, ts) becomes useful now.
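
The lookup table itself is not shown anywhere in the thread, so here is a minimal sketch of one way it might be built from the existing data. The table and column names follow the queries above; the primary key, the foreign key, and its constraint name are my assumptions, not something the question confirms.

```sql
-- Sketch only: derive the lookup table from the existing rows.
-- This reads the big table once, so run it at a quiet time.
CREATE TABLE groups AS
SELECT DISTINCT group_id
FROM   counter;

ALTER TABLE groups ADD PRIMARY KEY (group_id);
ANALYZE groups;   -- give the planner row estimates for the new table

-- Optional assumption: a foreign key keeps the lookup table complete,
-- so the EXISTS query cannot silently miss a new group_id.
-- Validating it scans counter once more.
ALTER TABLE counter
   ADD CONSTRAINT counter_group_id_fkey
   FOREIGN KEY (group_id) REFERENCES groups (group_id);
```

Without the foreign key, any new group_id has to be inserted into groups by the application, otherwise the semi-join will not report it.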

Fiddle, building on ypercube's fiddle from the comments (old sqlfiddle)

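Here, the query prefers the index on (ts, group_id), but I think that is because of the test setup with "clustered" timestamps. If you remove the indexes with leading ts (more about that), the planner will happily use the index on (group_id, ts) as well, notably in an index-only scan.

If that works, you may not need this other possible improvement: pre-aggregate the data in a materialized view to drastically reduce the number of rows. This makes sense in particular if you also need actual counts. You then pay the cost of processing many rows once, when updating the mv. You could even combine daily and hourly aggregates (two separate tables) and adapt your query to that.

Are the time frames in your queries arbitrary? Or mostly on full minutes / hours / days?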
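
One way to check that claim without permanently losing the indexes is to drop them inside a transaction and roll back afterwards. This is only a sketch: the index names come from the question, and the VACUUM step is my addition, based on the fact that index-only scans depend on an up-to-date visibility map. Note that DROP INDEX holds an exclusive lock on counter until the ROLLBACK, so try this on a test copy rather than on a busy production table.

```sql
-- Sketch only. VACUUM cannot run inside a transaction block,
-- so it goes first; it also refreshes the planner statistics.
VACUUM (ANALYZE) counter;

BEGIN;

-- Temporarily hide the indexes with leading ts from the planner.
DROP INDEX ts_index;
DROP INDEX comp_1_index;

-- ANALYZE executes the query and reports actual times and row counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT group_id
FROM   groups g
WHERE  EXISTS (
   SELECT 1
   FROM   counter c
   WHERE  c.group_id = g.group_id
   AND    c.ts BETWEEN timestamp '2014-03-02 00:00:00'
                   AND timestamp '2014-03-05 12:00:00'
   );

ROLLBACK;   -- both indexes are back, nothing was changed
```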

CREATE MATERIALIZED VIEW counter_mv AS
SELECT date_trunc('hour', ts) AS hour
     , group_id
     , count(*)::int AS ct
FROM  counter
GROUP BY 1,2
ORDER BY 1,2;

Create the necessary index(es) on counter_mv and adapt your query to work with it. Like:

CREATE INDEX foo ON counter_mv (hour, group_id, ct);  -- once

SELECT group_id, sum(ct) AS total_ct
FROM   counter_mv
WHERE  hour BETWEEN timestamp '2014-03-02 00:00:00'
                AND timestamp '2014-03-05 12:00:00'
GROUP  BY 1
ORDER  BY 2;
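
Two details are worth a sketch here. First, a materialized view is a snapshot and has to be refreshed after new rows arrive; in 9.3 a plain REFRESH locks the view against reads while it runs. Second, the hour column holds the start of each bucket, so BETWEEN with an upper bound of 12:00 pulls in the whole 12:00 to 12:59 bucket. The half-open range below is my suggestion, not part of the original query, and it only approximates a range that ends exactly at 12:00:00.

```sql
-- Sketch only: counter_mv and hour come from the definition above.

-- Rebuild the snapshot, for example from a scheduled job.
REFRESH MATERIALIZED VIEW counter_mv;

-- Exclusive upper bound: the bucket starting at 12:00 is left out.
SELECT group_id, sum(ct) AS total_ct
FROM   counter_mv
WHERE  hour >= timestamp '2014-03-02 00:00:00'
AND    hour <  timestamp '2014-03-05 12:00:00'
GROUP  BY 1
ORDER  BY 2;
```

If the time frames are not aligned to full hours, the partial hours at both ends would still have to come from the counter table itself.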

Answer 3

jja*_*nes 5

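Since you have no aggregates in the select list, the group by is pretty much the same as putting a distinct into the select list, right?

If that is what you want, you might be able to get fast index lookups on comp_2_index by rewriting this to use a recursive query, as described on the PostgreSQL wiki.

Create a view to efficiently return the distinct group_ids: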
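
To make the comparison concrete, this is the question's query written with DISTINCT, as a sketch. It should return the same set of group_ids as the GROUP BY version, and on its own it still has to read every row in the time range.

```sql
-- Same result set as the GROUP BY query from the question.
SELECT DISTINCT group_id
FROM   counter
WHERE  ts BETWEEN timestamp '2014-03-02 00:00:00.0'
              AND timestamp '2014-03-05 12:00:00.0';
```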

create or replace view groups as
WITH RECURSIVE t AS (
             SELECT min(counter.group_id) AS group_id
               FROM counter
    UNION ALL
             SELECT ( SELECT min(counter.group_id) AS min
                       FROM counter
                      WHERE counter.group_id > t.group_id) AS min
               FROM t
              WHERE t.group_id IS NOT NULL
    )
     SELECT t.group_id
       FROM t
      WHERE t.group_id IS NOT NULL
UNION ALL
     SELECT NULL::bigint AS col
      WHERE (EXISTS ( SELECT counter.id,
                counter.ts,
                counter.group_id
               FROM counter
              WHERE counter.group_id IS NULL));

Then use that view in place of the lookup table in Erwin's exists semi-join.
