SELECT very slow when there are no results and a LIMIT is specified

Question

Tim*_*tin 7 postgresql performance index statistics postgresql-9.4 postgresql-performance

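I'm running into a problem where a SELECT query is slow because it does not use an index when the final result is 0 rows and a LIMIT clause is specified.

If the number of results is greater than 0, Postgres uses the index and returns results in ~ 1 ms. As far as I can tell, this always holds.

If the number of results is 0 and no LIMIT is used, Postgres uses the index and results are returned in ~ 1 ms.

If the number of results is 0 and a LIMIT is specified, Postgres does a sequential scan and results take ~ 13,000 ms.

Why does PostgreSQL not use the index in the last case?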

Cardinalities:

~ 21 million rows in total.
~ 3 million rows WHERE related_id=1
~ 3 million rows WHERE related_id=1 AND platform=p1
2 rows WHERE related_id=1 AND platform=p2
0 rows WHERE related_id=1 AND platform=p3
~ 8 million rows WHERE platform=p2

Postgres version: 9.4.6

Table schema:

CREATE TYPE platforms AS ENUM ('p1', 'p2', 'p3');

CREATE TABLE mytable (
    id bigint NOT NULL DEFAULT nextval('mytable_sq'::regclass),
    related_id integer NOT NULL,
    platform platforms NOT NULL DEFAULT 'default'::platforms,
    name character varying(200) NOT NULL,
    CONSTRAINT mytable_pkey PRIMARY KEY (id),
    CONSTRAINT mytable_related_id_fkey FOREIGN KEY (related_id)
         REFERENCES related (id)
);

CREATE INDEX related_id__platform__index ON mytable (related_id, platform);
CREATE UNIQUE INDEX some_other_index ON mytable (related_id, lower(name::text));

Queries and plans:

This query returns 0 rows:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms
LIMIT 20;

 Limit  (cost=0.00..14.07 rows=20 width=737) (actual time=12863.465..12863.465 rows=0 loops=1)
    ->  Seq Scan on mytable  (cost=0.00..1492790.47 rows=2122653 width=737) (actual time=12863.464..12863.464 rows=0 loops=1)
          Filter: ((related_id = 1) AND (platform = 'p2'::platforms))
          Rows Removed by Filter: 21076656
 Planning time: 3.540 ms
 Execution time: 12868.190 ms

This query also returns 0 rows:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p2'::platforms;

 Bitmap Heap Scan on mytable  (cost=60533.63..1295799.94 rows=2122653 width=737) (actual time=0.890..0.890 rows=0 loops=1)
   Recheck Cond: ((related_id = 1) AND (platform = 'p2'::platforms))
   ->  Bitmap Index Scan on related_id__platform__index  (cost=0.00..60002.97 rows=2122653 width=0) (actual time=0.888..0.888 rows=0 loops=1)
         Index Cond: ((related_id = 1) AND (platform = 'p2'::platforms))
 Planning time: 0.827 ms
 Execution time: 1.104 ms

This query returns 20 rows (without the LIMIT it would be over 2 million rows):

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p1'::platforms
LIMIT 20;

 Limit  (cost=0.44..70.95 rows=20 width=737) (actual time=0.759..0.995 rows=20 loops=1)
   ->  Index Scan using related_id__platform__index on mytable  (cost=0.44..1217669.26 rows=345388 width=737) (actual time=0.759..0.993 rows=20 loops=1)
         Index Cond: ((related_id = 1) AND (platform = 'p1'::platforms))
 Planning time: 5.776 ms
 Execution time: 2.476 ms

This query returns 2 rows:

EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE related_id=1 AND platform='p3'::platforms LIMIT 20;

 Limit  (cost=0.44..80.37 rows=20 width=737) (actual time=0.014..0.016 rows=2 loops=1)
   ->  Index Scan using related_id__platform__index on mytable  (cost=0.44..99497.62 rows=24894 width=737) (actual time=0.014..0.015 rows=2 loops=1)
         Index Cond: ((related_id = 1) AND (platform = 'p3'::platforms))
 Planning time: 0.972 ms
 Execution time: 0.123 ms

Answer 1

Erw*_*ter 10

Postgres does a poor job of estimating the frequency of the combination of predicates in your query:

SELECT * FROM tbl
WHERE  related_id = 1 AND platform = 'p2'::platforms
LIMIT  20;

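Each of your predicates on its own is not very selective, and Postgres has that information available (the "most common values"), assuming your statistics are up to date:

~ 21 million rows in total.
~ 3 million rows WHERE related_id=1
~ 8 million rows WHERE platform=p2

IOW, ~ every 7th row passes the first filter and ~ every 3rd passes the second. Postgres does the (naive) math and expects around every 20th row to qualify. Since there is no ORDER BY, any 20 qualifying rows will do. The fastest way should be to scan the table sequentially and be done after ~ 400 rows: just a few data pages, very cheap.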
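
The naive arithmetic can be reproduced as a rough sketch. The inputs are the approximate row counts quoted in the question (in millions), the column aliases are made up for illustration, and the real planner works from the statistics in pg_stats rather than from exact counts:

```sql
-- Back-of-the-envelope version of the independence assumption:
-- multiply the per-column selectivities as if the columns were unrelated.
SELECT sel_related_id
     , sel_platform
     , sel_related_id * sel_platform        AS combined_selectivity
     , 20 / (sel_related_id * sel_platform) AS rows_scanned_for_20_hits
FROM  (
   SELECT 3.0 / 21 AS sel_related_id   -- ~ 3 of 21 million rows have related_id = 1
        , 8.0 / 21 AS sel_platform     -- ~ 8 of 21 million rows have platform = 'p2'
   ) AS s;
```

With these inputs the combined selectivity comes out near 0.054, roughly one row in 18, so 20 hits would be expected after about 370 rows. That is the same ballpark as the "every 20th row" and "~ 400 rows" figures.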

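Using any index adds some extra cost, because Postgres then has to scan the index and the table. (Exception: index-only scans, which are not possible with your SELECT *.) That only pays off when Postgres would otherwise have to read so many additional pages that the sequential scan is estimated to cost more. This is how I would explain that you see a sequential scan for a small LIMIT, but a bitmap index scan for a big (or no) LIMIT.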
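
The cost figures in the question's plans fit this explanation. As a sketch, assuming the planner prorates a scan's cost by the fraction of its estimated rows that the LIMIT needs (numbers copied from the EXPLAIN output in the question; the column aliases are illustrative):

```sql
-- Limit cost = startup + (total - startup) * limit_rows / estimated_rows
SELECT round(0.00 + (1492790.47 - 0.00) * 20 / 2122653, 2) AS limit_over_seq_scan    -- 14.07, as in the p2 plan
     , round(0.44 + (1217669.26 - 0.44) * 20 / 345388, 2)  AS limit_over_index_scan; -- 70.95, as in the p1 plan
```

Because the planner expects about 2.1 million matches for p2, it prices the first 20 of them at roughly 14 cost units via a sequential scan. The estimate is only as good as the expected row count, which is where this query goes wrong.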

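Unfortunately, your combination of predicates is unexpectedly rare. Postgres has to scan the whole table to find the 2 qualifying rows. (Using the index would actually be a lot cheaper in any case.)

2 rows WHERE related_id=1 AND platform=p2

Postgres (as of the 9.4 release used here) cannot use combined frequencies of values in multiple columns. Think about it: collecting such statistics would quickly get out of hand.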
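
This limitation applies to the 9.4 release in the question. PostgreSQL 10 added CREATE STATISTICS for multi-column statistics, and version 12 added a multi-column most-common-values list. A minimal sketch for readers on a newer release; the statistics name is hypothetical, and whether it changes the plan for this data set is an assumption, not something tested here:

```sql
-- PostgreSQL 12 or later only (the mcv kind does not exist before 12).
-- tbl_related_id_platform_stx is a made-up name.
CREATE STATISTICS tbl_related_id_platform_stx (mcv)
    ON related_id, platform FROM tbl;

ANALYZE tbl;  -- extended statistics are only populated by ANALYZE
```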

For this particular case, there is a very simple and efficient solution: create a partial index:

CREATE INDEX related_id_1_platform_2_idx ON tbl (id) WHERE related_id = 1 AND platform = 'p2'::platforms;
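
This tiny index (2 rows) is not only a perfect match for your query, it also provides a count estimate for your particular combination (the entry in pg_class.reltuples). The actual index column is not relevant here; pick a small column, and typically it is best to use the PK.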
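
To see the count estimate the partial index provides, the pg_class entry can be inspected. A small sketch, assuming the index name from the statement above and an unqualified lookup by name (add a schema filter if the name is not unique):

```sql
ANALYZE tbl;  -- refresh planner statistics for the table and its indexes

SELECT relname
     , reltuples   -- estimated number of index entries, used by the planner
     , relpages    -- size of the index in pages
FROM   pg_class
WHERE  relname = 'related_id_1_platform_2_idx';
```

A reltuples value in the single digits would tell the planner that the predicate combination is rare.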

If one of the two predicates can change, there is a more generic approach. Assuming related_id = 1 is the stable condition, create:

CREATE INDEX related_id_1_idx ON tbl (platform) WHERE related_id = 1;
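
Here the index column is relevant again. This may not be enough to tip the scales, since Postgres only collects complete statistics for the index columns of functional (expression) indexes; otherwise it relies on the statistics of the underlying table. I propose: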

CREATE INDEX related_id_1_func_idx ON tbl ((platform::text::platforms))  -- double parens!
WHERE  related_id = 1;

Note the extra pair of parentheses, a syntactical necessity for the cast shorthand.
The expression platform::text::platforms does not actually change anything: it casts your enum to text and back. But it makes Postgres collect complete statistics on the (assumed) new values.

Now (after ANALYZE tbl) we have complete statistics, including the most common values of platform for related_id = 1.

Check:

SELECT * FROM pg_stats
WHERE  schemaname = 'public'                 -- actual schema
AND    tablename = 'related_id_1_func_idx';  -- actual idx name

And Postgres should pick the index for your case, provided you repeat the same expression in your query. So:

SELECT ... WHERE related_id = 1 AND platform::text::platforms = 'p2'::platforms;
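
To confirm that the planner picks up the expression index, the query can be run under EXPLAIN. A sketch, assuming the table and index names used in this answer and the LIMIT from the question; the resulting plan depends on your data and statistics:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   tbl
WHERE  related_id = 1
AND    platform::text::platforms = 'p2'::platforms  -- must match the index expression
LIMIT  20;
```

If the plan still shows a Seq Scan, compare the estimated row count with the actual one to see whether the new statistics were used.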

Related:

Index that is not used, yet influences query

About most common values in Postgres statistics:

Speed up a GROUP BY, HAVING COUNT query

Asked:	9 years, 7 months ago
Viewed:	3075 times
Active:	7 years, 4 months ago