How to get an index scan for OR'ed time range predicates?

Question

Dmi*_*tro 4 postgresql index range-types query-performance postgresql-performance

I have a table events with these fields:

id
user_id
time_start
time_end
...

It has a B-tree index on (time_start, time_end).

SELECT user_id
FROM events
WHERE ((time_start <= '2021-08-24T15:30:00+00:00' AND time_end >= '2021-08-24T15:30:00+00:00') OR
       (time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00'))
GROUP BY user_id;

Group  (cost=243735.42..243998.32 rows=1103 width=4) (actual time=186.533..188.244 rows=166 loops=1)
  Group Key: user_id
  Buffers: shared hit=224848
  ->  Gather Merge  (cost=243735.42..243992.80 rows=2206 width=4) (actual time=186.532..188.199 rows=176 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=224848
        ->  Sort  (cost=242735.39..242738.15 rows=1103 width=4) (actual time=184.121..184.126 rows=59 loops=3)
              Sort Key: user_id
              Sort Method: quicksort  Memory: 27kB
              Worker 0:  Sort Method: quicksort  Memory: 27kB
              Worker 1:  Sort Method: quicksort  Memory: 28kB
              Buffers: shared hit=224848
              ->  Partial HashAggregate  (cost=242668.62..242679.65 rows=1103 width=4) (actual time=184.065..184.085 rows=59 loops=3)
                    Group Key: user_id
                    Buffers: shared hit=224834
                    ->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4) (actual time=104.085..183.994 rows=64 loops=3)
                          Filter: (((time_start <= '2021-08-24 15:30:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start <= '2021-08-24 15:59:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:59:00+00'::timestamp with time zone)))
                          Rows Removed by Filter: 708728
                          Buffers: shared hit=224834
Planning Time: 0.169 ms
Execution Time: 188.294 ms

Postgres uses a Seq Scan with this filter:

Filter: (((time_start <= '2021-08-24 15:30:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start <= '2021-08-24 15:59:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:59:00+00'::timestamp with time zone)))

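But when I leave only one condition on time_start and time_end, it starts using an index scan.

How can I change the conditions so that Postgres uses an index scan instead of a sequential scan?

I do not want to use UNION, like: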

(SELECT user_id
 FROM events
 WHERE (time_start <= '2021-08-24T15:30:00+00:00' AND time_end >= '2021-08-24T15:30:00+00:00')
 GROUP BY user_id)
UNION
(SELECT user_id
 FROM events
 WHERE (time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00')
 GROUP BY user_id);

Answer 1

Erw*_*ter 6

Expression index

A GiST or (even better) SP-GiST expression index on a timestamp range should work wonders.

CREATE INDEX events_right_idx ON events USING spgist (tsrange(time_start, time_end, '[]'));

Rewrite your query with the "range contains element" operator @>, and make the range expression match the index expression exactly:

SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
    OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;

The query plan should then use the index events_right_idx instead of a sequential scan, and the query should be a lot faster.
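
One way to check whether the planner actually picks up the new index is to run the rewritten query under EXPLAIN. This is a minimal sketch, assuming the events_right_idx index from above exists and the table statistics are current; the exact plan shape depends on data distribution and Postgres version.

```sql
-- Show the actual plan, timings and buffer usage for the rewritten query.
-- The range expression in WHERE must match the index expression exactly,
-- including the '[]' bounds argument.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
    OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;
```

If the plan still shows a Seq Scan on events, comparing the WHERE expression against the index definition is a reasonable first step.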

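Unless specified otherwise, range types assume an inclusive lower bound and an exclusive upper bound: tsrange(time_start, time_end) is the same as tsrange(time_start, time_end, '[)').
Since you operate with >= and <=, use tsrange(time_start, time_end, '[]') to include both bounds.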
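
The effect of the bounds argument can be seen on a constant range. A small illustration with made-up literal timestamps (not taken from the table), testing a point that sits exactly on the upper bound:

```sql
-- Default bounds '[)' exclude the upper bound; '[]' includes it.
SELECT tsrange(timestamp '2021-08-24 15:00', timestamp '2021-08-24 15:30')
          @> timestamp '2021-08-24 15:30' AS default_bounds    -- false
     , tsrange(timestamp '2021-08-24 15:00', timestamp '2021-08-24 15:30', '[]')
          @> timestamp '2021-08-24 15:30' AS inclusive_bounds; -- true
```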

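Alternatively, store a range column in the table

A plain (non-expression) index should be a bit faster still.
You can add a timestamp range column to the table, like: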

ALTER TABLE events ADD COLUMN ts_range tsrange GENERATED ALWAYS AS (tsrange(time_start, time_end, '[]')) STORED;

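See:

Computed / calculated / virtual / derived columns in PostgreSQL

Or, more radically, replace time_start and time_end with the range column altogether. Then index and query get a bit simpler: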

CREATE INDEX events_right_idx ON events USING spgist (ts_range);

SELECT user_id
FROM   events
WHERE  ts_range @> timestamp '2021-08-24T15:30:00'
    OR ts_range @> timestamp '2021-08-24T15:59:00'
GROUP  BY user_id;

But one tsrange column occupies more space than two timestamp columns. Weigh costs and benefits.
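
To get a rough feel for the size difference, pg_column_size() can be applied to sample values. This is only a sketch on made-up literals; sizes measured this way can differ from on-disk row storage because of alignment padding and header handling.

```sql
-- Compare the size of one inclusive tsrange value with two plain timestamps.
SELECT pg_column_size(tsrange(timestamp '2021-08-24 15:00',
                              timestamp '2021-08-24 15:30', '[]')) AS range_bytes
     , 2 * pg_column_size(timestamp '2021-08-24 15:00')            AS two_timestamps_bytes;
```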

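Asides

Postgres 14 (currently in beta) even allows covering SP-GiST indexes. The release notes:

Allow SP-GiST to use INCLUDE'd columns (Pavel Borisov)

But I don't think you can get index-only scans for this particular query.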
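
For illustration only, a covering SP-GiST index might be declared as below. This assumes Postgres 14 or later and the ts_range column from the previous section; the index name events_range_incl_idx is hypothetical, and as noted above it may well not produce index-only scans for this query.

```sql
-- Requires Postgres 14+. events_range_incl_idx is a made-up name.
-- user_id is carried in the index as a non-key INCLUDE column.
CREATE INDEX events_range_incl_idx ON events USING spgist (ts_range) INCLUDE (user_id);
```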

If for some reason you have to stick with a B-tree index, this fixed UNION query shouldn't be too bad:

SELECT user_id
FROM   events
WHERE  '2021-08-24T15:30:00' BETWEEN time_start AND time_end
UNION
SELECT user_id
FROM   events
WHERE  '2021-08-24T15:59:00' BETWEEN time_start AND time_end

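Notably, there is no GROUP BY: UNION already does all the work.
BETWEEN just simplifies the syntax (no effect on performance).

Also, you seem to have a crazy mix of timestamp without time zone and timestamp with time zone, and naming the columns "time" adds to the confusion. Typically, timestamptz is the better choice. See:

Ignoring time zones altogether in Rails and PostgreSQL

Last but not least, this indicates inaccurate column statistics, leading to a suboptimal query plan: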

->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4)
                                 (actual time=104.085..183.994 rows=64 loops=3)

Run

ANALYZE events;

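and retry. Your original query can use the plain B-tree index; it is just not as efficient as the suggested SP-GiST index.
Then maybe tune your autovacuum statistics settings to avoid bad statistics in the future.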
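
As a hedged sketch of what such tuning could look like, the per-table and per-column settings below make autovacuum analyze the table more often and sample the timestamp columns more thoroughly. The values 0.01 and 1000 are illustrative assumptions, not recommendations from the answer; suitable values depend on table size and write volume.

```sql
-- Trigger auto-analyze after ~1 % of rows changed (the default factor is 0.1).
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.01);

-- Raise the statistics target for the two timestamp columns (default is 100).
ALTER TABLE events ALTER COLUMN time_start SET STATISTICS 1000;
ALTER TABLE events ALTER COLUMN time_end   SET STATISTICS 1000;

-- New statistics targets take effect at the next ANALYZE.
ANALYZE events;
```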

Asked:	4 years, 10 months ago
Viewed:	328 times
Last active:	3 years, 4 months ago