Why doesn't Postgres use an index for a simple GROUP BY?

Question

Den*_*dov 10 postgresql indexing group-by

I created a table with 36M rows and an index on the type column:

CREATE TABLE items AS
  SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s
  FROM
    generate_series(1,36000000);
CREATE INDEX items_type_idx ON items USING btree ("type");

I run this simple query and expect postgresql to use my index:

explain analyze select count(*) from "items" group by "type";

But the query planner decides to use a Seq Scan instead:

HashAggregate  (cost=734592.00..734627.90 rows=3590 width=12) (actual time=6477.913..6478.344 rows=3601 loops=1)
  Group Key: type
  ->  Seq Scan on items  (cost=0.00..554593.00 rows=35999800 width=4) (actual time=0.044..1820.522 rows=36000000 loops=1)
Planning time: 0.107 ms
Execution time: 6478.525 ms

Time without EXPLAIN: 5s 979ms

I tried several solutions from here and here:

Running VACUUM ANALYZE
Tuning default_statistics_target, random_page_cost, work_mem

But nothing helped except setting enable_seqscan = OFF:

SET enable_seqscan = OFF;
explain analyze select count(*) from "items" group by "type";

GroupAggregate  (cost=0.56..1114880.46 rows=3590 width=12) (actual time=5.637..5256.406 rows=3601 loops=1)
  Group Key: type
  ->  Index Only Scan using items_type_idx on items  (cost=0.56..934845.56 rows=35999800 width=4) (actual time=0.074..2783.896 rows=36000000 loops=1)
        Heap Fetches: 0
Planning time: 0.103 ms
Execution time: 5256.667 ms

Time without EXPLAIN: 659ms

On my machine the query is about 10 times faster with the index scan.

Is there a better solution than setting enable_seqscan?

UPD1

My postgresql version is 9.6.3, work_mem = 4MB (also tried 64MB), random_page_cost = 4 (also tried 1.1), max_parallel_workers_per_gather = 0 (also tried 4).
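For anyone who wants to repeat these experiments, here is a minimal sketch of how the settings above can be changed for a single session only. The values are the ones listed in UPD1. The BUFFERS option is my own addition and not part of the original test.

```sql
-- Session-level changes only; nothing is written to postgresql.conf.
SET work_mem = '64MB';
SET random_page_cost = 1.1;
SET max_parallel_workers_per_gather = 4;

-- BUFFERS (an addition, not used in the original post) shows how many
-- pages were read, which helps compare the I/O of the two plans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM items GROUP BY "type";

-- Return to the server defaults for the rest of the session.
RESET work_mem;
RESET random_page_cost;
RESET max_parallel_workers_per_gather;
```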

UPD2

I tried filling the type column with i / 10000 instead of random numbers, to make pg_stats.correlation = 1, but it is still a seq scan.
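The correlation mentioned in UPD2 can be checked with a query along these lines. This is only a sketch and assumes the table is still named items and has been analyzed recently.

```sql
-- Refresh the planner statistics first, otherwise pg_stats may be stale.
ANALYZE items;

-- correlation close to 1 or -1 means the physical row order follows the
-- column order; n_distinct is the estimated number of distinct values.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE tablename = 'items'
  AND attname = 'type';
```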

UPD3

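@jgh is 100% right:

This typically only happens when the table's row width is much wider than some indexes'

I added a large data column, and now postgres uses the index. Thanks everyone!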
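The post does not show how the wide column was created, so the following is only a sketch of one possible way to reproduce it. The table name items_wide, the index name items_wide_type_idx and the filler size are assumptions chosen for illustration. Only the column name data comes from the update above.

```sql
-- Hypothetical wide variant of the original table (names are illustrative).
CREATE TABLE items_wide AS
  SELECT
    (random()*36000000)::integer AS id,
    (random()*10000)::integer AS type,
    md5(random()::text) AS s,
    repeat(md5(random()::text), 8) AS data  -- 256 characters of filler per row
  FROM
    generate_series(1,36000000);
CREATE INDEX items_wide_type_idx ON items_wide USING btree ("type");

-- VACUUM sets the visibility map that index-only scans rely on;
-- ANALYZE refreshes the statistics the planner uses.
VACUUM ANALYZE items_wide;

-- With much wider rows the index is far smaller than the table, so the
-- planner may now prefer an index-only scan without any enable_* changes.
EXPLAIN SELECT count(*) FROM items_wide GROUP BY "type";
```

Note that this roughly multiplies the table size, so a smaller row count may be more practical for a quick test.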

Answer 1

JGH*_*JGH 6

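The Index-only scans wiki says:

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

and

Index-only scans are only used when the planner surmises that that will reduce the total amount of I/O required, according to its imperfect cost-based modelling. This all heavily depends on visibility of tuples, if an index would be used anyway (i.e. how selective a predicate is, etc), and if there is actually an index available that could be used by an index-only scan in principle

So your index is not considered "significantly smaller" than the table, and the entire dataset has to be read either way, which leads the planner to choose a seq scan.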
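To see how much smaller the index actually is, the sizes can be compared directly. This is a minimal sketch using the table and index names from the question.

```sql
-- On-disk size of the table heap versus the index.
SELECT
  pg_size_pretty(pg_relation_size('items'))          AS table_size,
  pg_size_pretty(pg_relation_size('items_type_idx')) AS index_size;

-- Page and tuple counts the planner works from (updated by VACUUM / ANALYZE).
-- relallvisible is the number of pages marked all-visible, which matters
-- for index-only scans.
SELECT relname, relpages, reltuples, relallvisible
FROM pg_class
WHERE relname IN ('items', 'items_type_idx');
```

If the two sizes are of the same order, reading the whole index gives little I/O saving over reading the whole table sequentially.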

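Good to know, but still a bit confusing. I have a 56GB table and a 1.5GB index on a single column. When I run a GROUP BY using only the indexed column, the query planner still bypasses the index and uses a seq scan. (4 upvotes)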
