当 LIMIT 存在时，Postgres 12.1 使用仅索引向后扫描而不是索引

Question

当 LIMIT 存在时，Postgres 12.1 使用仅索引向后扫描而不是索引

Sjo*_*jon 3 postgresql performance postgresql-performance

我有一个中等大小的“functionCalls”表（约 4M 行），由 2 列组成，input并且function（另一个表的两个 id）：

\n\n

  Column  |  Type   | Collation | Nullable | Default \n----------+---------+-----------+----------+---------\n input    | integer |           | not null | \n function | integer |           | not null | \nIndexes:\n    "functionCall_pkey" PRIMARY KEY, btree (input, function) CLUSTER\n    "functionCallSearch" btree (function, input)\nForeign-key constraints:\n    "fkey1" FOREIGN KEY (function) REFERENCES function(id) ON UPDATE CASCADE ON DELETE CASCADE\n    "fkey2" FOREIGN KEY (input) REFERENCES input(id)\n\n

Run Code Online (Sandbox Code Playgroud)\n\n

我想找到与某个函数匹配的所有行，这就是我添加索引的原因functionCallSearch。这是我的查询：

\n\n

  Column  |  Type   | Collation | Nullable | Default \n----------+---------+-----------+----------+---------\n input    | integer |           | not null | \n function | integer |           | not null | \nIndexes:\n    "functionCall_pkey" PRIMARY KEY, btree (input, function) CLUSTER\n    "functionCallSearch" btree (function, input)\nForeign-key constraints:\n    "fkey1" FOREIGN KEY (function) REFERENCES function(id) ON UPDATE CASCADE ON DELETE CASCADE\n    "fkey2" FOREIGN KEY (input) REFERENCES input(id)\n\n

Run Code Online (Sandbox Code Playgroud)\n\n

这需要永远（当前〜20秒），因为pg拒绝使用索引，并决定做一个Index Only Scan Backward对主键执行 a 操作：

\n\n

 Limit  (cost=0.71..2178.97 rows=25 width=4) (actual time=12903.294..19142.568 rows=8 loops=1)\n   Output: c.input\n   Buffers: shared hit=59914 read=26193 written=54\n   ->  Nested Loop  (cost=0.71..135662.48 rows=1557 width=4) (actual time=12903.292..19142.561 rows=8 loops=1)\n         Output: c.input\n         Inner Unique: true\n         Join Filter: (c.function = function.id)\n         Rows Removed by Join Filter: 3649900\n         Buffers: shared hit=59914 read=26193 written=54\n         ->  Index Only Scan Backward using "functionCall_pkey" on public."functionCall" c  (cost=0.43..80906.80 rows=3650225 width=8) (actual time=0.040..17083.489 rows=3649908 loops=1)\n               Output: c.input, c.function\n               Heap Fetches: 3649909\n               Buffers: shared hit=59911 read=26193 written=54\n         ->  Materialize  (cost=0.28..2.30 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=3649908)\n               Output: function.id\n               Buffers: shared hit=3\n               ->  Index Scan using function_text on public.function  (cost=0.28..2.30 rows=1 width=4) (actual time=0.023..0.026 rows=1 loops=1)\n                     Output: function.id\n                     Index Cond: ((function.text)::text = \'getmyinode\'::text)\n                     Buffers: shared hit=3\n Planning Time: 0.392 ms\n Execution Time: 19143.967 ms\n

Run Code Online (Sandbox Code Playgroud)\n\n

当我删除LIMIT这个查询时，速度非常快：

\n\n

 Sort  (cost=5247.53..5251.42 rows=1557 width=4) (actual time=3.762..3.763 rows=8 loops=1)\n   Output: c.input\n   Sort Key: c.input DESC\n   Sort Method: quicksort  Memory: 25kB\n   Buffers: shared hit=6 read=4\n   ->  Nested Loop  (cost=0.71..5164.97 rows=1557 width=4) (actual time=0.099..3.739 rows=8 loops=1)\n         Output: c.input\n         Buffers: shared hit=6 read=4\n         ->  Index Scan using function_text on public.function  (cost=0.28..2.30 rows=1 width=4) (actual time=0.054..0.056 rows=1 loops=1)\n               Output: function.id\n               Index Cond: ((function.text)::text = \'getmyinode\'::text)\n               Buffers: shared hit=2 read=1\n         ->  Index Only Scan using "functionCallSearch" on public."functionCall" c  (cost=0.43..5103.71 rows=5897 width=8) (actual time=0.039..3.670 rows=8 loops=1)\n               Output: c.function, c.input\n               Index Cond: (c.function = function.id)\n               Heap Fetches: 8\n               Buffers: shared hit=4 read=3\n Planning Time: 0.514 ms\n Execution Time: 3.819 ms\n

Run Code Online (Sandbox Code Playgroud)\n\n

到目前为止我尝试过的事情：

\n\n

在阅读Postgres 有时使用劣质索引 WHERE a IN (\xe2\x80\xa6) ORDER BY b LIMIT N我已经检查过n_distinct- 但并不是那么遥远， pg_stats 说623while SELECT COUNT(*) FROM (SELECT DISTINCT function FROM "functionCall")returns 1065。
我已将表增加到SET STATISTICS10k 并运行ANALYZE。这将时间减少了一半（9 秒），但仍然不会使用我创建的索引。

\n\n

为什么是这样？我该如何解决这个问题？

\n

Answer 1

Lau*_*lbe 6

PostgreSQL 估计将有 1557 行满足条件，因此它认为如果避免显式排序，而是按顺序扫描行ORDER BY并执行嵌套循环连接，直到找到足够的行，速度会更快。

不幸的是，这根本行不通，PostgreSQL 必须以这种方式扫描整个表，因为总共只有 8 个匹配项。

问题似乎是估计值相当偏离：它认为public."functionCall"for的索引扫描function将产生 5897 行，而实际上只有 8 行。

作为第一个措施，尝试计算表上的新分布统计信息：

ANALYZE public."functionCall";

Run Code Online (Sandbox Code Playgroud)

如果仅此不能改善估计，请增加粒度：

ALTER TABLE public."functionCall" ALTER function SET STATISTICS 1000;
ANALYZE public."functionCall";

Run Code Online (Sandbox Code Playgroud)

这应该会改善估计。您也可以尝试使用高于 1000 的值。

如果全部失败，请明确告诉 PostgreSQL 不要使用该策略

ORDER BY c.input + 0 DESC

Run Code Online (Sandbox Code Playgroud)

仅供参考 - “分析”和“设置统计”（到 10k）都没有帮助。‘ORDER BY + 0’确实“有效” (4认同)
@Sjon它必须对匹配 `c.function = function.id` 的行数进行一般估计，而不知道 function.id 的具体值是什么。如果不同的“id”值有很大不同的计数，那么估计这将始终是一件有风险的事情。因此，ANALYZE 和 SET STATISTICS 没有多大帮助也就不足为奇了，因为问题在于跨表关联，而不是原始统计数据。 (2认同)

归档时间：	5 年，11 月前
查看次数：	2029 次
最近记录：	5 年，11 月前