How to get an index scan for OR'ed time range predicates?

Dmi*_*tro 4 postgresql index range-types query-performance postgresql-performance

I have an events table with these fields:

id
user_id
time_start
time_end
...

and a B-tree index on (time_start, time_end).

SELECT user_id
FROM events
WHERE ((time_start <= '2021-08-24T15:30:00+00:00' AND time_end >= '2021-08-24T15:30:00+00:00') OR
       (time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00'))
GROUP BY user_id;
Group  (cost=243735.42..243998.32 rows=1103 width=4) (actual time=186.533..188.244 rows=166 loops=1)
  Group Key: user_id
  Buffers: shared hit=224848
  ->  Gather Merge  (cost=243735.42..243992.80 rows=2206 width=4) (actual time=186.532..188.199 rows=176 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        Buffers: shared hit=224848
        ->  Sort  (cost=242735.39..242738.15 rows=1103 width=4) (actual time=184.121..184.126 rows=59 loops=3)
              Sort Key: user_id
              Sort Method: quicksort  Memory: 27kB
              Worker 0:  Sort Method: quicksort  Memory: 27kB
              Worker 1:  Sort Method: quicksort  Memory: 28kB
              Buffers: shared hit=224848
              ->  Partial HashAggregate  (cost=242668.62..242679.65 rows=1103 width=4) (actual time=184.065..184.085 rows=59 loops=3)
                    Group Key: user_id
                    Buffers: shared hit=224834
                    ->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4) (actual time=104.085..183.994 rows=64 loops=3)
                          Filter: (((time_start <= '2021-08-24 15:30:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start <= '2021-08-24 15:59:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:59:00+00'::timestamp with time zone)))
                          Rows Removed by Filter: 708728
                          Buffers: shared hit=224834
Planning Time: 0.169 ms
Execution Time: 188.294 ms

Postgres uses a Seq Scan with the filter:

Filter: (((time_start <= '2021-08-24 15:30:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:30:00+00'::timestamp with time zone)) OR ((time_start <= '2021-08-24 15:59:00+00'::timestamp with time zone) AND (time_end >= '2021-08-24 15:59:00+00'::timestamp with time zone)))

But when I leave only one condition on time_start and time_end, it starts using an Index Scan.

How can I change the condition so that Postgres uses an index scan instead of a sequential scan?

I don't want to use UNION like:

SELECT user_id
FROM events
WHERE (time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00')
GROUP BY user_id
UNION
SELECT user_id
FROM events
WHERE (time_start <= '2021-08-24T15:59:00+00:00' AND time_end >= '2021-08-24T15:59:00+00:00')
GROUP BY user_id;

Erw*_*ter 6

Expression index

A GiST or (even better) SP-GiST expression index on a timestamp range should work wonders:

CREATE INDEX events_right_idx ON events USING spgist (tsrange(time_start, time_end, '[]'));

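Rewrite your query with the "range contains element" operator @> and match the index expression (the result is exactly equivalent to your original query):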

SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
    OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;

The resulting query plan should use the new index, and the query should be a lot faster.
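
To check whether the planner actually picks up the new index, one option is to run the rewritten query under EXPLAIN. This is a minimal sketch, assuming the events_right_idx index from above has been created; the plan depends on your data and statistics, so no output is shown here.

```sql
-- Show the actual plan and buffer usage for the rewritten query.
-- The WHERE expressions must match the index expression exactly,
-- including the '[]' bounds argument.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id
FROM   events
WHERE  tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:30:00'
    OR tsrange(time_start, time_end, '[]') @> timestamp '2021-08-24 15:59:00'
GROUP  BY user_id;
```

If the index is used, the Parallel Seq Scan node should give way to a scan on events_right_idx.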

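Unless specified otherwise, range types assume an inclusive lower bound and an exclusive upper bound. tsrange(time_start, time_end) is the same as tsrange(time_start, time_end, '[)').
Since you operate with >= and <=, include both bounds with tsrange(time_start, time_end, '[]').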
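
The difference only matters for a value that falls exactly on the upper bound. A small self-contained sketch with literal values (no table needed) illustrates it:

```sql
-- Test a timestamp that equals the upper bound of the range.
SELECT tsrange('2021-08-24 15:00', '2021-08-24 15:30')
           @> timestamp '2021-08-24 15:30' AS default_bounds    -- '[)' excludes the upper bound: false
     , tsrange('2021-08-24 15:00', '2021-08-24 15:30', '[]')
           @> timestamp '2021-08-24 15:30' AS inclusive_bounds; -- '[]' includes the upper bound: true
```

With the default bounds, an event ending exactly at the probed timestamp would be missed, unlike with time_end >= ... in the original query.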


Alternatively, store a range column in the table

A plain (non-expression) index should be a bit faster still.
You can add a timestamp range column to the table, like:

ALTER TABLE events ADD COLUMN ts_range tsrange GENERATED ALWAYS AS (tsrange(time_start, time_end, '[]')) STORED;


Or, more radically, replace time_start and time_end with the range column. Then index and query become a bit simpler:

CREATE INDEX events_right_idx ON events USING spgist (ts_range);

SELECT user_id
FROM   events
WHERE  ts_range @> timestamp '2021-08-24T15:30:00'
    OR ts_range @> timestamp '2021-08-24T15:59:00'
GROUP  BY user_id;

But one tsrange column occupies more space than two timestamp columns. Weigh costs and benefits.

Asides

Postgres 14 (currently in beta) even allows covering SP-GiST indexes. The release notes:

Allow SP-GiST to use INCLUDE'd columns (Pavel Borisov)

But I don't think you can get an index-only scan for this particular query.
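
For illustration, a covering SP-GiST index might look like the sketch below. It assumes Postgres 14 or later; the index name events_range_incl_idx and the choice of user_id as the included column are hypothetical, not something the question prescribes.

```sql
-- Requires Postgres 14+: SP-GiST index with an INCLUDE'd payload column.
-- events_range_incl_idx is a made-up name for this example.
CREATE INDEX events_range_incl_idx ON events
USING  spgist (tsrange(time_start, time_end, '[]'))
INCLUDE (user_id);
```

Whether the planner can then use an index-only scan for the query above is a separate matter, as noted.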

If for some reason you have to stick with a B-tree index, this fixed UNION query should not be too bad:

SELECT user_id
FROM   events
WHERE  '2021-08-24T15:30:00' BETWEEN time_start AND time_end
UNION
SELECT user_id
FROM   events
WHERE  '2021-08-24T15:59:00' BETWEEN time_start AND time_end

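Notably, there is no GROUP BY: UNION already does all the work, because it removes duplicate rows.
The predicates are also simplified with BETWEEN (no effect on performance).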
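
A tiny self-contained sketch with made-up values shows why GROUP BY is redundant here: UNION (unlike UNION ALL) returns each distinct row once.

```sql
-- Made-up sample values; no table needed.
SELECT user_id FROM (VALUES (1), (1), (2)) AS a(user_id)
UNION
SELECT user_id FROM (VALUES (2), (3)) AS b(user_id);
-- Returns the distinct user_ids 1, 2 and 3 (row order is not guaranteed).
```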

Also, you seem to have a crazy mix of timestamp without time zone and timestamp with time zone. And naming the columns "time" adds to the confusion. Typically, timestamptz is the better choice.
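
If time_start and time_end are in fact timestamptz (as the casts in the EXPLAIN output suggest, though that is an assumption), the matching range type is tstzrange rather than tsrange. A sketch under that assumption, with a hypothetical index name:

```sql
-- Assumes time_start and time_end are timestamptz columns.
-- events_tstz_range_idx is a made-up name for this example.
CREATE INDEX events_tstz_range_idx ON events
USING  spgist (tstzrange(time_start, time_end, '[]'));

SELECT user_id
FROM   events
WHERE  tstzrange(time_start, time_end, '[]') @> timestamptz '2021-08-24 15:30:00+00'
    OR tstzrange(time_start, time_end, '[]') @> timestamptz '2021-08-24 15:59:00+00'
GROUP  BY user_id;
```

Keeping the column type, the range type and the literals consistent avoids implicit casts that could otherwise stop the query from matching the index expression.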

Last but not least, this indicates inaccurate column statistics, leading to a suboptimal query plan:

->  Parallel Seq Scan on events  (cost=0.00..242553.74 rows=45952 width=4)
                                  (actual time=104.085..183.994 rows=64 loops=3)

Run

ANALYZE events;

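and retry. Your original query may then use the plain B-tree index. It's just not as efficient as the suggested SP-GiST index.
Then maybe tune your autovacuum statistics settings to avoid bad statistics in the future.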
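
As a rough sketch of what such tuning could look like: the numeric values below are illustrative assumptions, not recommendations, and the right settings depend on table size and write rate.

```sql
-- When were statistics last gathered for the table?
SELECT relname, last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  relname = 'events';

-- Trigger autoanalyze after about 2 % of the rows have changed
-- (illustrative value; the default scale factor is 0.1, i.e. 10 %).
ALTER TABLE events SET (autovacuum_analyze_scale_factor = 0.02);

-- Collect more detailed statistics for the time columns
-- (illustrative value; the default statistics target is 100).
ALTER TABLE events ALTER COLUMN time_start SET STATISTICS 1000;
ALTER TABLE events ALTER COLUMN time_end   SET STATISTICS 1000;

-- Refresh statistics so the new targets take effect right away.
ANALYZE events;
```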