Unused index in range of dates query

Question

and*_*oke 6 postgresql performance index order-by query-performance

I have a query that is not using an existing index, and I do not understand why.

The table:

mustang=# \d+ bss.amplifier_saturation
                                               Table "bss.amplifier_saturation"
 Column |           Type           |                             Modifiers                             | Storage | Description 
--------+--------------------------+-------------------------------------------------------------------+---------+-------------
 value  | integer                  | not null                                                          | plain   | 
 target | integer                  | not null                                                          | plain   | 
 start  | timestamp with time zone | not null                                                          | plain   | 
 end    | timestamp with time zone | not null                                                          | plain   | 
 id     | integer                  | not null default nextval('amplifier_saturation_id_seq'::regclass) | plain   | 
 lddate | timestamp with time zone | not null default now()                                            | plain   | 
Indexes:
    "amplifier_saturation_pkey" PRIMARY KEY, btree (id)
    "amplifier_saturation_target_start_end_key" UNIQUE CONSTRAINT, btree (target, start, "end")
    "amplifier_saturation_end" btree ("end")
    "amplifier_saturation_lddate" btree (lddate)
    "amplifier_saturation_start" btree (start)
    "amplifier_saturation_target" btree (target)
    "amplifier_saturation_value" btree (value)

The query and its plan:

mustang=# explain select max(lddate) from bss.amplifier_saturation
where start >= '1987-12-31 00:00:00'
and   start <= '1988-04-09 00:00:00';
                                                                        QUERY PLAN                                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=189.41..189.42 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..189.41 rows=1 width=8)
           ->  Index Scan Backward using amplifier_saturation_lddate on amplifier_saturation  (cost=0.00..2475815.50 rows=13071 width=8)
                 Index Cond: (lddate IS NOT NULL)
                 Filter: ((start >= '1987-12-31 00:00:00-08'::timestamp with time zone) AND (start <= '1988-04-09 00:00:00-07'::timestamp with time zone))

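Why does this not use the index amplifier_saturation_start? It seems to me that the database should scan that index to find the start date, continue through all entries up to the end date, and finally sort that (small subset of) data to find the maximum lddate (something like pp. 40-41 of SQL Performance Explained).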

In desperation I also tried an index on (start, start desc), but it didn't help.

Incidentally, select count(*) works just fine:

mustang=# explain select count(*) from bss.amplifier_saturation
where start >= '1987-12-31 00:00:00'
and   start <= '1988-04-09 00:00:00';
                                                                      QUERY PLAN                                                                       
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=38711.84..38711.85 rows=1 width=0)
   ->  Index Scan using amplifier_saturation_start on amplifier_saturation  (cost=0.00..38681.47 rows=12146 width=0)
         Index Cond: ((start >= '1987-12-31 00:00:00-08'::timestamp with time zone) AND (start <= '1988-04-09 00:00:00-07'::timestamp with time zone))

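Running ANALYZE doesn't help.
pg_stats shows a reasonable spread of values for start, which seems to argue for using the index.
Setting the statistics target to 10,000 on either column (start or lddate) doesn't help.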
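
The question does not show the commands used for that experiment. A minimal sketch of how a per-column statistics target is typically raised and then inspected, assuming the table and column names from the \d+ output above:

```sql
-- Assumption: this mirrors the experiment described in the question;
-- the exact commands the asker ran are not shown in the post.
ALTER TABLE bss.amplifier_saturation ALTER COLUMN start  SET STATISTICS 10000;
ALTER TABLE bss.amplifier_saturation ALTER COLUMN lddate SET STATISTICS 10000;

-- The new target only takes effect once statistics are gathered again.
ANALYZE bss.amplifier_saturation;

-- Inspect what the planner now knows about the two columns.
SELECT attname, n_distinct, correlation
FROM   pg_stats
WHERE  schemaname = 'bss'
AND    tablename  = 'amplifier_saturation'
AND    attname IN ('start', 'lddate');
```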

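Maybe I should explain why I think the plan is wrong. The table contains 30,000,000 rows. Only 3,500 are in the date range. But maybe that is still too many for them to be read individually?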

Adding an index on (lddate desc, start) works (I'm not sure the desc is required). The query can then use a pure index approach (IIUC), and it runs much faster:

mustang=# create index tmp_as on bss.amplifier_saturation (lddate desc, start);
CREATE INDEX
mustang=# explain select max(lddate) from bss.amplifier_saturation
where start >= '1987-12-31 00:00:00'
and   start <= '1988-04-09 00:00:00';
                                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=69.76..69.77 rows=1 width=0)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.00..69.76 rows=1 width=8)
           ->  Index Scan using tmp_as on amplifier_saturation  (cost=0.00..861900.22 rows=12356 width=8)
                 Index Cond: ((lddate IS NOT NULL) AND (start >= '1987-12-31 00:00:00-08'::timestamp with time zone) AND (start <= '1988-04-09 00:00:00-07'::timestamp with time zone))

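So, to answer my own question, it seems that fetching the data 3,500 times is estimated to cost more than scanning 30,000,000 index values (yay, spinning disks), while a pure index scan is clearly better.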

Maybe someone smarter than me can come up with a better answer?

Answer 1

Erw*_*ter 12

Explanation

My question is: why does this not use the index amplifier_saturation_start?

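Even with 30,000,000 rows and only 3,500 in the date range, it can be faster to read tuples from the top of the index amplifier_saturation_lddate on lddate. The first row that passes the filter on start can be returned as is. No sort step is needed. With a perfectly random distribution, a little under 9000 index tuples would have to be checked on average.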
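
The "a little under 9000" figure can be reproduced with a quick calculation. This is a sketch under an assumption that is not stated in the answer: if k qualifying rows are scattered uniformly at random among N rows, the expected position of the first qualifying row is (N + 1) / (k + 1).

```sql
-- N = 30,000,000 rows in the table, k = 3,500 rows in the date range.
-- Assumed model: uniform random placement, expected first hit at (N + 1) / (k + 1).
SELECT round((30000000 + 1) / (3500 + 1.0)) AS expected_index_tuples_checked;
```

The result is close to 8,569, in line with the estimate above. Real data is rarely uniform: if lddate and start are correlated, the backward scan may have to walk far more (or far fewer) tuples before it finds the first match.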

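Using amplifier_saturation_start, Postgres would still have to determine max(lddate) after fetching all 3500 qualifying rows. It is a close call. The decision depends on the gathered statistics and your cost settings. Depending on data distribution and other details, one or the other will actually be faster, and one or the other will be expected to be faster.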
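
To check which plan is actually faster on a given data set, both variants can be timed side by side. The following is a sketch, not something from the original answer: it relies on the commonly used assumption that OFFSET 0 in a subquery keeps the planner from flattening it, which in turn prevents the rewrite of max() into a backward index scan with LIMIT 1.

```sql
-- Plan the planner picks on its own (backward scan on the lddate index).
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(lddate)
FROM   bss.amplifier_saturation
WHERE  start >= '1987-12-31 00:00:00'
AND    start <= '1988-04-09 00:00:00';

-- Variant that collects the qualifying rows first, then aggregates.
EXPLAIN (ANALYZE, BUFFERS)
SELECT max(lddate)
FROM  (
   SELECT lddate
   FROM   bss.amplifier_saturation
   WHERE  start >= '1987-12-31 00:00:00'
   AND    start <= '1988-04-09 00:00:00'
   OFFSET 0  -- assumption: acts as an optimization fence
   ) sub;
```

Comparing the actual times and buffer counts of the two outputs shows whether the planner's estimate matched reality. Run each statement more than once, since the first run may be dominated by cold-cache reads.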

Better index

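This can be substantially faster with a multicolumn index on (lddate, start), as you have already found. This way Postgres can use an index-only scan and not touch the heap (the table) at all.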

But there is another minor thing you could improve. Did you wonder about this detail in your EXPLAIN output?

Index Cond: ((lddate IS NOT NULL) AND ...

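Why would Postgres have to exclude NULL values?
Because NULL sorts after the greatest value in ASCENDING order, or before it in DESCENDING order. The maximum non-null value, which is what the aggregate function max() returns, is not at the beginning or end of the index if there are NULL values. Adding NULLS LAST | FIRST adjusts the sort order to the characteristic of max() (and makes the opposite, min(), more expensive). Since we are mostly interested in the latest timestamp, DESC NULLS LAST is the better choice.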

CREATE INDEX tmp_as ON bss.amplifier_saturation (lddate DESC NULLS LAST, start);

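Now, your table column lddate obviously has no NULL values, being defined NOT NULL. The effect on performance will be negligible in this particular case. It is still worth mentioning for cases where the column can be NULL.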
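
The default placement of NULL values is easy to see on a tiny ad-hoc value list. This is a self-contained sketch; the alias t(x) is made up for illustration and does not touch the table from the question.

```sql
-- Ascending: NULL sorts after the greatest value.
SELECT x FROM (VALUES (1), (NULL), (2)) AS t(x) ORDER BY x;                   -- 1, 2, NULL

-- Descending: NULL comes first, ahead of the maximum.
SELECT x FROM (VALUES (1), (NULL), (2)) AS t(x) ORDER BY x DESC;              -- NULL, 2, 1

-- DESC NULLS LAST puts the value that max() returns at the very front.
SELECT x FROM (VALUES (1), (NULL), (2)) AS t(x) ORDER BY x DESC NULLS LAST;   -- 2, 1, NULL

-- max() itself ignores NULL.
SELECT max(x) FROM (VALUES (1), (NULL), (2)) AS t(x);                         -- 2
```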

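The other index option would be on (start, lddate), basically a souped-up amplifier_saturation_start index, which would also allow index-only scans. Depending on data distribution and the actual parameter values in your query, one or the other will be faster.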
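
A sketch of that alternative, for comparison with the (lddate, start) index. The index name tmp_as_start_lddate is hypothetical and chosen only for this example.

```sql
-- Hypothetical index name; columns as suggested above.
CREATE INDEX tmp_as_start_lddate ON bss.amplifier_saturation (start, lddate);

-- Re-check the plan: the range on "start" can now be an index condition,
-- and lddate can be read from the same index entries.
EXPLAIN
SELECT max(lddate)
FROM   bss.amplifier_saturation
WHERE  start >= '1987-12-31 00:00:00'
AND    start <= '1988-04-09 00:00:00';
```

With this column order the scan touches only the index entries inside the date range, but it has to visit all of them to find the maximum. With (lddate, start) the scan can stop at the first match, but it may walk many non-matching entries first.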

Two notes concerning `timestamp`

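Your table columns are timestamptz, but your query predicates use timestamp literals. Postgres derives the time zone from your current timezone setting and adjusts accordingly. This may or may not be as intended. It certainly makes the query volatile, since the result depends on a setting of your session. That would be a problem for calls that may come from different time zones (with differing session settings). In that case you would rather use an explicit offset or the AT TIME ZONE construct to make it stable. Details: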

Ignoring time zones altogether in Rails and PostgreSQL

You would typically want to exclude the upper bound for correctness: < instead of <=.

select max(lddate)
from   bss.amplifier_saturation
where  start >= '1987-12-31 00:00:00'::timestamp AT TIME ZONE 'PST'
and    start <  '1988-04-09 00:00:00 PST'::timestamptz; -- shorter

PST (Pacific Standard Time) is just a random example time zone.
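
The session dependence of a literal without an offset can be demonstrated directly. This is a minimal sketch; the two zone names are arbitrary examples and not taken from the question.

```sql
-- Under a Pacific setting the bare literal is read as 00:00 at offset -08.
SET timezone = 'America/Los_Angeles';
SELECT '1987-12-31 00:00:00'::timestamptz
     = '1987-12-31 00:00:00-08'::timestamptz AS same_instant;  -- true

-- Under UTC the very same text denotes a different point in time.
SET timezone = 'UTC';
SELECT '1987-12-31 00:00:00'::timestamptz
     = '1987-12-31 00:00:00-08'::timestamptz AS same_instant;  -- false

RESET timezone;
```

A literal with an explicit offset or zone compares the same way in both sessions, which is why the query above spells the zone out.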

Archived:	10 years, 7 months ago
Viewed:	3718 times
Last activity:	5 years, 2 months ago
