Creating a multicolumn index for WHERE and ORDER BY

Iva*_*anD 6 postgresql index execution-plan

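I am trying to create an index that is used by both the WHERE and the ORDER BY clause. Reading the Postgres 14 documentation (11.4. Indexes and ORDER BY - https://www.postgresql.org/docs/14/indexes-ordering.html) led me to believe this is possible: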

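In addition to simply finding the rows to be returned by a query, an index may be able to deliver them in a specific sorted order. This allows a query's ORDER BY specification to be honored without a separate sorting step.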

Wow, sounds great, let's try it! I created a test table and an index covering the WHERE and ORDER BY columns, then populated the table with data:

DROP TABLE IF EXISTS testdata;
CREATE TABLE testdata
(
    question_id   TEXT        NOT NULL UNIQUE PRIMARY KEY,
    answerer_id   TEXT        NOT NULL,
    question_date TIMESTAMPTZ NOT NULL,
    answer_date   TIMESTAMPTZ NOT NULL
);

DROP INDEX IF EXISTS idx1;
CREATE INDEX idx1 ON testdata (answerer_id, answer_date, question_date);

TRUNCATE testdata;
INSERT INTO testdata(question_id, answerer_id, question_date, answer_date)
SELECT CONCAT('question_', LPAD(i::TEXT, 4, '0')),
       CONCAT('answerer_', LPAD(FLOOR(RANDOM() * (99 - 1 + 1) + 1)::TEXT, 2, '0')),
       TIMESTAMPTZ '2021-01-01' + RANDOM() * INTERVAL '365 days',
       TIMESTAMPTZ '2022-01-01' + RANDOM() * INTERVAL '365 days'
FROM GENERATE_SERIES(1, 9999) AS t(i);

VACUUM (FULL, ANALYZE) testdata;

EXPLAIN ANALYSE
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date,
         question_date;

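Here is a sample of the data (screenshot not preserved). Since answerer_id is a random number between 1 and 99, this query should return about 100 of the 10K rows (about 1% of all rows).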

The result of the EXPLAIN ANALYSE query is:

Sort  (cost=108.49..108.75 rows=106 width=42) (actual time=2.194..3.555 rows=106 loops=1)
  Sort Key: answer_date, question_date
  Sort Method: quicksort  Memory: 33kB
  ->  Bitmap Heap Scan on testdata  (cost=5.11..104.92 rows=106 width=42) (actual time=0.057..1.188 rows=106 loops=1)
        Recheck Cond: (answerer_id = 'answerer_09'::text)
        Heap Blocks: exact=67
        ->  Bitmap Index Scan on idx1  (cost=0.00..5.08 rows=106 width=0) (actual time=0.032..0.040 rows=106 loops=1)
              Index Cond: (answerer_id = 'answerer_09'::text)
Planning Time: 0.154 ms
Execution Time: 4.856 ms

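So the database uses the index to find the rows that satisfy the WHERE clause, and then... sorts them with quicksort? Why doesn't it return the rows exactly as they are already ordered in the index?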

Am I missing something? Do I need to create the index in some other way for it to be used by both WHERE and ORDER BY?

Update:

Adding LIMIT 30 to the query completely changes the result:

Limit  (cost=0.29..83.88 rows=30 width=42) (actual time=0.064..1.599 rows=30 loops=1)
  ->  Index Scan using idx1 on testdata  (cost=0.29..253.87 rows=91 width=42) (actual time=0.044..0.676 rows=30 loops=1)
        Index Cond: (answerer_id = 'answerer_09'::text)
Planning Time: 0.125 ms
Execution Time: 1.967 ms

If I change the limit to 40 or more, it goes back to sorting (although with a different method, top-N heapsort):

Limit  (cost=105.95..106.05 rows=40 width=42) (actual time=1.853..3.205 rows=40 loops=1)
  ->  Sort  (cost=105.95..106.17 rows=91 width=42) (actual time=1.837..2.321 rows=40 loops=1)
        Sort Key: answer_date, question_date
        Sort Method: top-N heapsort  Memory: 30kB
        ->  Bitmap Heap Scan on testdata  (cost=4.99..103.07 rows=91 width=42) (actual time=0.054..1.037 rows=91 loops=1)
              Recheck Cond: (answerer_id = 'answerer_09'::text)
              Heap Blocks: exact=57
              ->  Bitmap Index Scan on idx1  (cost=0.00..4.97 rows=91 width=0) (actual time=0.034..0.042 rows=91 loops=1)
                    Index Cond: (answerer_id = 'answerer_09'::text)
Planning Time: 0.093 ms
Execution Time: 3.618 ms

So the index is correct, and the database knows it, but ignores it when there is no LIMIT or the LIMIT is fairly large.

What is the reason for this? Is sorting without the index somehow faster?

Erw*_*ter 7

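For a query that fetches this many rows (about 1% of the table, scattered over 67 heap blocks according to your plan), running an index scan is typically not efficient. (There are many factors in play here...) What you are seeing is a bitmap index scan.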

A bitmap index scan cannot preserve the index sort order in its result. Hence the final sort step.
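To see the lost ordering directly, one can discourage the other scan types and look at the physical row locations. This is only a sketch against the question's testdata table. It assumes the planner then picks a bitmap heap scan, and keep in mind that the output order of a query without ORDER BY is never guaranteed.

```sql
BEGIN;
-- Discourage plain index scans and sequential scans for this transaction only,
-- so the planner most likely falls back to a bitmap heap scan (assumption).
SET LOCAL enable_indexscan = off;
SET LOCAL enable_seqscan = off;

-- ctid is the physical location (block, offset) of each row version.
-- A bitmap heap scan visits heap blocks in physical order, so ctid typically
-- comes out ascending while answer_date is not in index order.
SELECT ctid, answer_date, question_date
FROM testdata
WHERE answerer_id = 'answerer_09';

ROLLBACK;  -- SET LOCAL changes end with the transaction
```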

You can "disable" alternative query plans to "force" an index scan (for testing purposes only!):

SET enable_bitmapscan = off;
SET enable_seqscan = off;

Or you can lower the expected cost of random access with:

SET random_page_cost = 1;  -- or similar

Or you can add a LIMIT of only a few result rows.

Any of these can convince the query planner to switch to an index scan, without the additional sort step.
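The three options can be tried side by side without leaving changed settings behind. The following is a minimal sketch that reuses the question's testdata table and query. Wrapping each variant in a transaction with SET LOCAL is a choice made here, not something the original answer prescribes, and whether the plan actually changes depends on your data and server configuration.

```sql
-- Variant 1: discourage competing scan types (testing only).
BEGIN;
SET LOCAL enable_bitmapscan = off;
SET LOCAL enable_seqscan = off;
EXPLAIN
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date;
ROLLBACK;  -- SET LOCAL changes end with the transaction

-- Variant 2: make random page reads look as cheap as sequential ones.
BEGIN;
SET LOCAL random_page_cost = 1;
EXPLAIN
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date;
ROLLBACK;

-- Variant 3: ask for only the first few rows (LIMIT 30 as in the question).
EXPLAIN
SELECT *
FROM testdata
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date
LIMIT 30;
```

In each EXPLAIN output, look for an Index Scan using idx1 with no Sort node above it.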


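For a test case with only a few rows and a mildly selective predicate, it is hard to tell whether a sequential scan, a bitmap index scan, or an index scan will be faster. Tests with bigger tables are more revealing.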
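One way to run such a test is to build a larger copy of the table with the same shape. This is a sketch only: the names testdata_big and idx1_big and the size of 1,000,000 rows are arbitrary choices made for illustration, not part of the question or the answer.

```sql
-- Hypothetical larger test table: 1,000,000 rows spread over 99 answerers.
DROP TABLE IF EXISTS testdata_big;
CREATE TABLE testdata_big
(
    question_id   TEXT        PRIMARY KEY,
    answerer_id   TEXT        NOT NULL,
    question_date TIMESTAMPTZ NOT NULL,
    answer_date   TIMESTAMPTZ NOT NULL
);

INSERT INTO testdata_big (question_id, answerer_id, question_date, answer_date)
SELECT 'question_' || LPAD(i::TEXT, 7, '0'),
       'answerer_' || LPAD((FLOOR(RANDOM() * 99) + 1)::INT::TEXT, 2, '0'),
       TIMESTAMPTZ '2021-01-01' + RANDOM() * INTERVAL '365 days',
       TIMESTAMPTZ '2022-01-01' + RANDOM() * INTERVAL '365 days'
FROM GENERATE_SERIES(1, 1000000) AS t(i);

-- Same column order as idx1 in the question.
CREATE INDEX idx1_big ON testdata_big (answerer_id, answer_date, question_date);

-- Refresh planner statistics before looking at plans.
ANALYZE testdata_big;

-- BUFFERS adds page-access counts, which helps compare plan shapes.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM testdata_big
WHERE answerer_id = 'answerer_09'
ORDER BY answer_date, question_date;
```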

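Either way, the query planner decides strictly on estimated cost. (Setting enable_seqscan = off only makes sequential scans look very expensive.) The plan with the lowest estimated cost wins. Table and column statistics, server configuration, and cost settings should be as accurate as possible to get valid estimates and good query plans.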
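The switch between LIMIT 30 and LIMIT 40 in the question can be roughly reproduced from the numbers in its plans. A Limit node's estimated cost is interpolated linearly between the startup and total cost of its input, in proportion to the fraction of rows it expects to fetch. The sketch below applies that to the Index Scan figures copied from the question's LIMIT 30 plan. It is plain arithmetic, not a query against the planner.

```sql
-- Index Scan numbers copied from the question's LIMIT 30 plan:
-- startup cost 0.29, total cost 253.87, estimated rows 91.
SELECT n AS limit_rows,
       ROUND(0.29 + (253.87 - 0.29) * n / 91.0, 2) AS est_index_scan_with_limit
FROM (VALUES (30), (40)) AS v(n);
```

For n = 30 this gives about 83.89, which matches the 83.88 shown in the plan up to rounding of the displayed costs. For n = 40 it gives about 111.75, which is above the 106.05 estimated for the bitmap scan plus top-N sort, so the sorting plan is estimated to be cheaper and wins.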