How to make DISTINCT ON faster in PostgreSQL?

Question

Kok*_*zzu 16 postgresql performance greatest-n-per-group postgresql-9.6 query-performance

I have a table station_logs in a PostgreSQL 9.6 database:

    Column     |            Type             |    
---------------+-----------------------------+
 id            | bigint                      | bigserial
 station_id    | integer                     | not null
 submitted_at  | timestamp without time zone | 
 level_sensor  | double precision            | 
Indexes:
    "station_logs_pkey" PRIMARY KEY, btree (id)
    "uniq_sid_sat" UNIQUE CONSTRAINT, btree (station_id, submitted_at)

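I'm trying to get the last level_sensor value based on submitted_at, for each station_id. There are around 400 unique station_id values, and around 20k rows per day per station_id.

Before creating the index: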

EXPLAIN ANALYZE
SELECT DISTINCT ON(station_id) station_id, submitted_at, level_sensor
FROM station_logs ORDER BY station_id, submitted_at DESC;

 Unique  (cost=4347852.14..4450301.72 rows=89 width=20) (actual time=22202.080..27619.167 rows=98 loops=1)
   ->  Sort  (cost=4347852.14..4399076.93 rows=20489916 width=20) (actual time=22202.077..26540.827 rows=20489812 loops=1)
         Sort Key: station_id, submitted_at DESC
         Sort Method: external merge  Disk: 681040kB
         ->  Seq Scan on station_logs  (cost=0.00..598895.16 rows=20489916 width=20) (actual time=0.023..3443.587 rows=20489812 loops=$
 Planning time: 0.072 ms
 Execution time: 27690.644 ms

Creating the index:

CREATE INDEX station_id__submitted_at ON station_logs(station_id, submitted_at DESC);

After creating the index, for the same query:

 Unique  (cost=0.56..2156367.51 rows=89 width=20) (actual time=0.184..16263.413 rows=98 loops=1)
   ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..2105142.98 rows=20489812 width=20) (actual time=0.181..1$
 Planning time: 0.206 ms
 Execution time: 16263.490 ms

Is there any way to make this query faster? Something like 1 second, for example; 16 seconds is still too much.

Answer 1

Erw*_*ter 21

For only 400 stations, this query will be massively faster:

SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC NULLS LAST
   LIMIT  1
   ) l;

dbfiddle here
(comparing plans for this query, Abelisto's alternative, and your original)

Resulting EXPLAIN ANALYZE as provided by the OP:

 Nested Loop  (cost=0.56..356.65 rows=102 width=20) (actual time=0.034..0.979 rows=98 loops=1)
   ->  Seq Scan on station s  (cost=0.00..3.02 rows=102 width=4) (actual time=0.009..0.016 rows=102 loops=1)
   ->  Limit  (cost=0.56..3.45 rows=1 width=16) (actual time=0.009..0.009 rows=1 loops=102)
         ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..664062.38 rows=230223 width=16) (actual time=0.009$
               Index Cond: (station_id = s.id)
 Planning time: 0.542 ms
 Execution time: 1.013 ms   -- !!
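
One detail in the plan above: the scan on the station table reads 102 rows, but the query returns only 98. Presumably four stations have no rows in station_logs, and CROSS JOIN LATERAL drops them. If those stations should still be listed, a variant along these lines might do (a sketch, not part of the original answer; the log columns come back as NULL for such stations):

```sql
-- Variant of the query above: LEFT JOIN LATERAL ... ON true keeps
-- stations that have no row in station_logs (log columns are NULL then).
SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
LEFT   JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC NULLS LAST
   LIMIT  1
   ) l ON true;
```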

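The only index you need is the one you created: station_id__submitted_at. The UNIQUE constraint uniq_sid_sat basically does the job as well. Maintaining both seems like a waste of disk space and write performance.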

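I added NULLS LAST to ORDER BY in the query because submitted_at isn't defined NOT NULL. Ideally, if applicable, add a NOT NULL constraint to the column submitted_at, drop the additional index, and remove NULLS LAST from the query.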
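
The NOT NULL route might look roughly like this. It is a sketch under the assumption that no existing row has a NULL submitted_at; note also that SET NOT NULL scans the whole table while holding an exclusive lock, which matters on a table of this size.

```sql
-- Sketch only. Assumes no existing row has submitted_at IS NULL;
-- otherwise SET NOT NULL fails and those rows must be fixed first.
BEGIN;

ALTER TABLE station_logs
   ALTER COLUMN submitted_at SET NOT NULL;

-- The UNIQUE constraint uniq_sid_sat already provides an index on
-- (station_id, submitted_at), so the additional index can go.
DROP INDEX station_id__submitted_at;

COMMIT;

-- The query without NULLS LAST. Postgres can scan the constraint's
-- index backwards to satisfy ORDER BY submitted_at DESC.
SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC
   LIMIT  1
   ) l;
```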

If submitted_at can be NULL, create this UNIQUE index to replace both your current index and the unique constraint:

CREATE UNIQUE INDEX station_logs_uni ON station_logs(station_id, submitted_at DESC NULLS LAST);
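
Put together with the removal of the two old objects, the swap could be sketched as follows. The object names are the ones from the question; the transaction wrapper is an assumption of this sketch. A plain CREATE INDEX blocks writes to the table while it builds, so on a busy table it may be preferable to build the new index with CREATE UNIQUE INDEX CONCURRENTLY outside a transaction first and drop the old objects afterwards.

```sql
-- Sketch only: swap the old constraint and index for the single
-- UNIQUE index shown above. Both old objects are named in the question.
BEGIN;

ALTER TABLE station_logs DROP CONSTRAINT uniq_sid_sat;
DROP INDEX station_id__submitted_at;

CREATE UNIQUE INDEX station_logs_uni
   ON station_logs (station_id, submitted_at DESC NULLS LAST);

COMMIT;
```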

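This assumes a separate table station with one row per relevant station_id (typically the PK), which you should have either way. If you don't have it, create it. Again, that is very fast with this rCTE technique: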

CREATE TABLE station AS
WITH RECURSIVE cte AS (
   (
   SELECT station_id
   FROM   station_logs
   ORDER  BY station_id
   LIMIT  1
   )
   UNION ALL
   SELECT l.station_id
   FROM   cte c
   ,      LATERAL (   
      SELECT station_id
      FROM   station_logs
      WHERE  station_id > c.station_id
      ORDER  BY station_id
      LIMIT  1
      ) l
   )
TABLE cte;

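I use that in the fiddle as well. You could use a similar query to solve your task directly, without a station table, if you can't be convinced to create it.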
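
Such a direct query could be sketched like this: the same recursive walk over distinct station_id values, but each step also carries the latest row for that station. This is an illustration, not the answer's own code. It assumes an index whose sort order matches the ORDER BY, such as (station_id, submitted_at DESC NULLS LAST); if submitted_at is NOT NULL, NULLS LAST can be dropped from both places.

```sql
-- Sketch: latest row per station_id without a station table.
-- Each recursion step jumps to the next higher station_id and picks
-- its newest row via the index.
WITH RECURSIVE cte AS (
   (
   SELECT station_id, submitted_at, level_sensor
   FROM   station_logs
   ORDER  BY station_id, submitted_at DESC NULLS LAST
   LIMIT  1
   )
   UNION ALL
   SELECT l.*
   FROM   cte c
   ,      LATERAL (
      SELECT station_id, submitted_at, level_sensor
      FROM   station_logs
      WHERE  station_id > c.station_id
      ORDER  BY station_id, submitted_at DESC NULLS LAST
      LIMIT  1
      ) l
   )
TABLE cte;
```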

Optimize index

Your query should be very fast now. Read on only if you still need to optimize read performance ...

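It might make sense to add level_sensor as the last column of the index to allow index-only scans, as joanolo commented.
Con: It makes the index bigger, which adds a little cost to all queries using it.
Pro: If you actually get index-only scans out of it, the query at hand does not have to visit heap pages at all, which makes it about twice as fast. But that may be an insubstantial gain for a query that is already very fast.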
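
For illustration, such an index and a way to check its effect might look like this. The index name is made up for this sketch. PostgreSQL 9.6 has no INCLUDE clause (that arrived in version 11), so the column is simply appended to the key.

```sql
-- Hypothetical index name. level_sensor is appended as a regular
-- trailing index column so the index alone can answer the query.
CREATE INDEX station_logs_covering_idx
   ON station_logs (station_id, submitted_at DESC NULLS LAST, level_sensor);

-- Check whether the planner chooses "Index Only Scan" and how many
-- "Heap Fetches" it reports (0 is the ideal case).
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.station_id, l.submitted_at, l.level_sensor
FROM   station s
CROSS  JOIN LATERAL (
   SELECT submitted_at, level_sensor
   FROM   station_logs
   WHERE  station_id = s.station_id
   ORDER  BY submitted_at DESC NULLS LAST
   LIMIT  1
   ) l;
```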

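However, I don't expect that to work for your case. You mentioned:

... around 20k rows per day per station_id.

Typically, that indicates an unceasing write load (1 row per station_id every 5 seconds). And you are interested in the latest row. Index-only scans only work for heap pages that are visible to all transactions (the bit in the visibility map is set). You would have to run extremely aggressive VACUUM settings for the table to keep up with the write load, and it would still not work most of the time. If my assumptions are correct, index-only scans are out, so don't add level_sensor to the index.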
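
To get a rough idea of how much of the table is currently marked all-visible, the catalog can be queried as sketched below. The numbers in pg_class are estimates that are refreshed by VACUUM and ANALYZE, so treat the result as an approximation.

```sql
-- relallvisible: pages marked all-visible in the visibility map;
-- relpages: total pages of the table (both are estimates).
SELECT relpages,
       relallvisible,
       round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM   pg_class
WHERE  oid = 'station_logs'::regclass;

-- A manual VACUUM sets the visibility-map bit for every page it can.
VACUUM (VERBOSE) station_logs;
```

With constant inserts, the newest pages (the ones this query needs) are exactly the ones least likely to be marked all-visible.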

OTOH, if my assumptions hold and your table is growing very big, a BRIN index might help. Related:

Speed up creation of Postgres partial index
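
A BRIN index for this table might be created as sketched here. The index name and the cutoff timestamp are assumptions for illustration. The example query only returns stations that have rows newer than the cutoff, so the cutoff has to be chosen with that in mind.

```sql
-- Hypothetical index name. BRIN stores min/max summaries per block
-- range, so it stays tiny. It works well when submitted_at correlates
-- with the physical row order, as is typical for append-only logs.
CREATE INDEX station_logs_submitted_at_brin
   ON station_logs USING brin (submitted_at);

-- Example use: restrict to a recent time window, then pick the
-- latest row per station from the much smaller remainder.
SELECT DISTINCT ON (station_id)
       station_id, submitted_at, level_sensor
FROM   station_logs
WHERE  submitted_at > '2017-06-24 00:00'
ORDER  BY station_id, submitted_at DESC;
```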

Or, even more specialized and more efficient: a partial index for only the latest additions, to cut off the bulk of irrelevant rows:

CREATE INDEX station_id__submitted_at_recent_idx ON station_logs(station_id, submitted_at DESC NULLS LAST)
WHERE submitted_at > '2017-06-24 00:00';

Choose a timestamp for which you know that newer rows must exist. You have to add a matching WHERE condition to all queries, like:

...
WHERE  station_id = s.station_id
AND    submitted_at > '2017-06-24 00:00'
...

You have to adapt the index and the query from time to time.
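
One way to do that periodic adjustment without blocking writes is sketched below. The new cutoff and the temporary index name are made up for this example. The CONCURRENTLY variants cannot run inside a transaction block, so each statement is issued on its own.

```sql
-- Hypothetical new cutoff and index name; pick a timestamp for which
-- every station is known to have newer rows.
CREATE INDEX CONCURRENTLY station_id__submitted_at_recent_idx2
   ON station_logs (station_id, submitted_at DESC NULLS LAST)
   WHERE submitted_at > '2017-07-24 00:00';

-- Once all queries use the new cutoff, drop the old partial index.
DROP INDEX CONCURRENTLY station_id__submitted_at_recent_idx;

-- Optional: reuse the original name.
ALTER INDEX station_id__submitted_at_recent_idx2
   RENAME TO station_id__submitted_at_recent_idx;
```

Queries should then filter with the new timestamp (or a later one) so the planner can match them to the new partial index.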

Answer 2

Abe*_*sto 6

Try the classic way:

create index idx_station_logs__station_id on station_logs(station_id);
create index idx_station_logs__submitted_at on station_logs(submitted_at);

analyse station_logs;

with t as (
  select station_id, max(submitted_at) submitted_at 
  from station_logs 
  group by station_id)
select * 
from t join station_logs l on (
  l.station_id = t.station_id and l.submitted_at = t.submitted_at);

dbfiddle

EXPLAIN ANALYZE by ThreadStarter

 Nested Loop  (cost=701344.63..702110.58 rows=4 width=155) (actual time=6253.062..6253.544 rows=98 loops=1)
   CTE t
     ->  HashAggregate  (cost=701343.18..701344.07 rows=89 width=12) (actual time=6253.042..6253.069 rows=98 loops=1)
           Group Key: station_logs.station_id
           ->  Seq Scan on station_logs  (cost=0.00..598894.12 rows=20489812 width=12) (actual time=0.034..1841.848 rows=20489812 loop$
   ->  CTE Scan on t  (cost=0.00..1.78 rows=89 width=12) (actual time=6253.047..6253.085 rows=98 loops=1)
   ->  Index Scan using station_id__submitted_at on station_logs l  (cost=0.56..8.58 rows=1 width=143) (actual time=0.004..0.004 rows=$
         Index Cond: ((station_id = t.station_id) AND (submitted_at = t.submitted_at))
 Planning time: 0.542 ms
 Execution time: 6253.701 ms

Asked:	8 years, 11 months ago
Viewed:	11904 times
Last active:	7 years, 4 months ago