Creating a multicolumn index for WHERE and ORDER BY

Question


Iva*_*anD 6 postgresql index execution-plan

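I'm trying to create an index that is used by both the WHERE and the ORDER BY clause. Reading the Postgres 14 documentation (11.4. Indexes and ORDER BY - https://www.postgresql.org/docs/14/indexes-ordering.html) led me to believe:

"In addition to simply finding the rows to be returned by a query, an index may be able to deliver them in a specific sorted order. This allows a query's ORDER BY specification to be honored without a separate sorting step."

Wow, sounds great, let's try it! I created a test table, an index containing the WHERE and ORDER BY columns, and filled the table with data: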

DROP TABLE IF EXISTS testdata;
CREATE TABLE testdata
(
    question_id   TEXT        NOT NULL UNIQUE PRIMARY KEY,
    answerer_id   TEXT        NOT NULL,
    question_date TIMESTAMPTZ NOT NULL,
    answer_date   TIMESTAMPTZ NOT NULL
);

DROP INDEX IF EXISTS idx1;
CREATE INDEX idx1 ON testdata (answerer_id, answer_date, question_date);

TRUNCATE testdata;
INSERT INTO testdata(question_id, answerer_id, question_date, answer_date)
SELECT CONCAT('question_', LPAD(i::TEXT, 4, '0')),
       CONCAT('answerer_', LPAD(FLOOR(RANDOM() * (99 - 1 + 1) + 1)::TEXT, 2, '0')),
       TIMESTAMPTZ '2021-01-01' + RANDOM() * INTERVAL '365 days',
       TIMESTAMPTZ '2022-01-01' + RANDOM() * INTERVAL '365 days'
FROM GENERATE_SERIES(1, 9999) AS t(i);

VACUUM (FULL, ANALYZE) testdata;

EXPLAIN ANALYSE
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date,
         question_date;


Because answerer_id is a random number between 1 and 99, the query should return about 100 of the 10K rows (about 1% of all rows).
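The expected share can be checked directly instead of estimated: 9999 rows spread over 99 distinct answerer_id values gives roughly 101 rows per value. A minimal sketch against the testdata table from the setup script (the column aliases are only illustrative):

```sql
-- Count the rows matching the predicate and express them as a share of the table.
SELECT count(*) AS matching_rows,
       round(100.0 * count(*) / (SELECT count(*) FROM testdata), 2) AS pct_of_table
FROM testdata
WHERE answerer_id = 'answerer_09';
```

Since the data is random, the exact count varies between runs.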

The EXPLAIN ANALYSE output for the query is:

Sort  (cost=108.49..108.75 rows=106 width=42) (actual time=2.194..3.555 rows=106 loops=1)
  Sort Key: answer_date, question_date
  Sort Method: quicksort  Memory: 33kB
  ->  Bitmap Heap Scan on testdata  (cost=5.11..104.92 rows=106 width=42) (actual time=0.057..1.188 rows=106 loops=1)
        Recheck Cond: (answerer_id = 'answerer_09'::text)
        Heap Blocks: exact=67
        ->  Bitmap Index Scan on idx1  (cost=0.00..5.08 rows=106 width=0) (actual time=0.032..0.040 rows=106 loops=1)
              Index Cond: (answerer_id = 'answerer_09'::text)
Planning Time: 0.154 ms
Execution Time: 4.856 ms


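So the database uses the index to find the rows that satisfy the WHERE clause, and then... sorts them with quicksort? Why doesn't it return the rows exactly as they are already ordered in the index?

Am I missing something? Maybe I need to create the index some other way for it to be used by both WHERE and ORDER BY?

Update:

Changing the query to: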

EXPLAIN ANALYSE
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date,
         question_date
LIMIT 30;


completely changes the result:

Limit  (cost=0.29..83.88 rows=30 width=42) (actual time=0.064..1.599 rows=30 loops=1)
  ->  Index Scan using idx1 on testdata  (cost=0.29..253.87 rows=91 width=42) (actual time=0.044..0.676 rows=30 loops=1)
        Index Cond: (answerer_id = 'answerer_09'::text)
Planning Time: 0.125 ms
Execution Time: 1.967 ms


If I raise the limit to 40 or more, it goes back to sorting (though with a different method, top-N heapsort):

Limit  (cost=105.95..106.05 rows=40 width=42) (actual time=1.853..3.205 rows=40 loops=1)
  ->  Sort  (cost=105.95..106.17 rows=91 width=42) (actual time=1.837..2.321 rows=40 loops=1)
        Sort Key: answer_date, question_date
        Sort Method: top-N heapsort  Memory: 30kB
        ->  Bitmap Heap Scan on testdata  (cost=4.99..103.07 rows=91 width=42) (actual time=0.054..1.037 rows=91 loops=1)
              Recheck Cond: (answerer_id = 'answerer_09'::text)
              Heap Blocks: exact=57
              ->  Bitmap Index Scan on idx1  (cost=0.00..4.97 rows=91 width=0) (actual time=0.034..0.042 rows=91 loops=1)
                    Index Cond: (answerer_id = 'answerer_09'::text)
Planning Time: 0.093 ms
Execution Time: 3.618 ms


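So the index is correct and the database knows it, but ignores it when it expects no limit or a fairly large one.

What is the reason for this? Is sorting without relying on the index order somehow faster?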

Answer 1

Erw*_*ter 7

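For a predicate matching about 1% of all rows, scattered randomly across the table, a plain index scan is often not the cheapest choice. (There are many factors at play here...) What you see is a bitmap index scan. Why? See:

Postgres not using index when index scan is much better option

A bitmap index scan cannot preserve the index sort order in its result, so a final sort step is required.

You can "disable" alternative query plans to "force" an index scan (for testing purposes only!):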

SET enable_bitmapscan = off;
SET enable_seqscan = off;

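These settings stay in effect for the rest of the session, so one way to keep such an experiment contained is to scope it to a transaction with SET LOCAL. A minimal sketch, assuming the testdata table and the idx1 index from the question:

```sql
BEGIN;
-- SET LOCAL lasts only until the end of the current transaction.
SET LOCAL enable_bitmapscan = off;
SET LOCAL enable_seqscan = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date;

ROLLBACK;  -- both settings revert to their previous values
```

With both alternatives discouraged, the plan would be expected to show an Index Scan on idx1 and no Sort node, though the exact output depends on the Postgres version and the table statistics.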

Or you can lower the expected cost of random access with:

SET random_page_cost = 1;  -- or similar

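The default random_page_cost is 4. A value close to 1 tells the planner that a random page read costs about as much as a sequential one, which tends to fit SSDs or fully cached data. A hedged sketch for trying this out without changing the session permanently, where 1.1 is an illustrative value and not a recommendation:

```sql
SHOW random_page_cost;  -- 4 unless it has been changed

BEGIN;
-- Applies only inside this transaction.
SET LOCAL random_page_cost = 1.1;

EXPLAIN ANALYZE
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date;

ROLLBACK;
```

Whether the plan actually changes depends on the statistics and on the other cost settings.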

Or you can add a LIMIT that returns only a few result rows.

Any of these can convince the query planner to switch to an index scan without the extra sort step.

db<>fiddle here

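For a test case with only a few rows and a mildly selective predicate, it is hard to tell whether a sequential scan, a bitmap index scan, or an index scan will be fastest. Tests with bigger tables are more revealing.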
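One way to set up such a larger test is to reuse the question's generator with more rows. This is only a sketch: the row count of 1,000,000 and the wider zero-padding of question_id are assumptions chosen for illustration.

```sql
TRUNCATE testdata;

-- Same shape as the question's data, but 1,000,000 rows instead of 9,999.
INSERT INTO testdata (question_id, answerer_id, question_date, answer_date)
SELECT 'question_' || lpad(i::text, 7, '0'),
       'answerer_' || lpad((floor(random() * 99) + 1)::int::text, 2, '0'),
       timestamptz '2021-01-01' + random() * interval '365 days',
       timestamptz '2022-01-01' + random() * interval '365 days'
FROM generate_series(1, 1000000) AS t(i);

-- Refresh the statistics so the planner's estimates reflect the new data.
ANALYZE testdata;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date;
```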

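Either way, the query planner decides strictly on estimated cost. (SET enable_seqscan = off only makes sequential scans look very expensive.) The plan with the lowest estimated cost wins. Table and column statistics, server configuration, and cost settings should be as accurate as possible to get valid estimates and good query plans.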
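The cost figures in the question's plans show this at work. For a Limit on top of an index scan, the estimate is roughly the scan's startup cost plus the fraction of its run cost needed to fetch the requested rows. A rough worked check, using only numbers copied from the question's plans (the aliases are illustrative):

```sql
-- Index scan on idx1 in the question: startup cost 0.29, total cost 253.87,
-- estimated 91 rows. Scale the run cost by the share of rows the LIMIT needs.
SELECT n AS limit_rows,
       round(0.29 + (253.87 - 0.29) * n / 91.0, 2) AS est_index_scan_cost
FROM (VALUES (30), (40)) AS v(n);
-- limit_rows 30 -> 83.89  (the plan shows 83.88; the planner works with unrounded numbers)
-- limit_rows 40 -> 111.75 (above the 106.05 estimated for the bitmap scan plus top-N sort)
```

At 30 rows the index scan is estimated to be cheaper than the sort-based plan, and at 40 rows it is estimated to be more expensive, which is consistent with the switch the question observes.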
