Efficient query to get the greatest value per group from a big table

Fey*_*eyd 17 postgresql performance index greatest-n-per-group

Given the table:

    Column    |            Type             
 id           | integer                     
 latitude     | numeric(9,6)                
 longitude    | numeric(9,6)                
 speed        | integer                     
 equipment_id | integer                     
 created_at   | timestamp without time zone
Indexes:
    "geoposition_records_pkey" PRIMARY KEY, btree (id)

The table has 20 million records, which is not a large number, relatively speaking. But it makes sequential scans slow.

How can I get the last record (max(created_at)) for each equipment_id?

I have tried the following two queries, along with several variants, and I have read many answers on this topic:

select max(created_at),equipment_id from geoposition_records group by equipment_id;

select distinct on (equipment_id) equipment_id,created_at 
  from geoposition_records order by equipment_id, created_at desc;

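I have also tried creating a btree index on (equipment_id, created_at), but Postgres finds that a seq scan is faster. Forcing enable_seqscan = off does not help either, since reading the index is as slow as the seq scan, possibly worse.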

The query must run periodically and always return the latest record.

Using Postgres 9.3.

EXPLAIN ANALYZE output (with 1.7 million records):

set enable_seqscan=true;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"HashAggregate  (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
"  ->  Seq Scan on geoposition_records  (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
"Total runtime: 1935.632 ms"

set enable_seqscan=false;
explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
"GroupAggregate  (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
"  ->  Index Scan using geoposition_records_equipment_id_created_at_idx on geoposition_records  (cost=0.00..2987673.75 rows=1651851 width=12) (actual time=0.062..10248.703 rows=1651851 loops=1)"
"Total runtime: 11305.161 ms"

Erw*_*ter 11

Index

A plain multicolumn B-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

Why DESC NULLS LAST?
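
A short illustration of the reason, as a self-contained sketch (the inline sample values are made up for the demonstration). In Postgres, a descending sort places NULL values first by default. If created_at can be NULL, a plain ORDER BY created_at DESC LIMIT 1 would therefore return a NULL row as the "latest" one. Adding NULLS LAST avoids that, and declaring the index with the same sort order lets the query's ORDER BY match the index directly.

```sql
-- Default descending order: the NULL value sorts first,
-- so LIMIT 1 picks NULL instead of the newest timestamp.
SELECT created_at
FROM  (VALUES (timestamp '2014-01-01 10:00')
            , (NULL)
            , (timestamp '2014-01-02 10:00')) AS t(created_at)
ORDER  BY created_at DESC
LIMIT  1;

-- With NULLS LAST the newest non-NULL timestamp comes first.
SELECT created_at
FROM  (VALUES (timestamp '2014-01-01 10:00')
            , (NULL)
            , (timestamp '2014-01-02 10:00')) AS t(created_at)
ORDER  BY created_at DESC NULLS LAST
LIMIT  1;   -- returns 2014-01-02 10:00:00
```

If created_at is declared NOT NULL, the distinction does not matter and a plain DESC index serves just as well.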

Is it safe to assume you have an equipment table? Then performance won't be a problem:

Correlated subquery

Based on this equipment table, run a correlated subquery, which performs very well here:

SELECT equipment_id
     , (SELECT created_at
        FROM   geoposition_records
        WHERE  equipment_id = eq.equipment_id
        ORDER  BY created_at DESC NULLS LAST
        LIMIT  1) AS latest
FROM   equipment eq;

For a small number of rows in the equipment table (57 estimated, 58 actual, judging from your EXPLAIN ANALYZE output), this is very fast.
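
To confirm that each lookup really goes through the index, one can inspect the plan. This is only a sketch: it assumes the equipment table and the index foo_idx from above exist, and the plan text depends on the Postgres version and the data, so no output is shown.

```sql
-- Run the correlated subquery under EXPLAIN to see the chosen plan.
-- BUFFERS additionally reports how many pages were read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT equipment_id
     , (SELECT created_at
        FROM   geoposition_records
        WHERE  equipment_id = eq.equipment_id
        ORDER  BY created_at DESC NULLS LAST
        LIMIT  1) AS latest
FROM   equipment eq;
```

The thing to look for is a SubPlan consisting of a Limit over an Index Scan (or Index Only Scan) on foo_idx, executed once per equipment row, rather than a Seq Scan on geoposition_records.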

LATERAL join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;

Performance is similar to the correlated subquery.
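
The question asks for the last record, not only its timestamp. A subquery in the SELECT list can return just one value, while a LATERAL subquery can return several columns. A possible variant, assuming the same equipment table and using the column names from the table definition in the question:

```sql
-- Fetch the whole latest row per equipment_id.
-- LEFT JOIN ... ON true keeps equipment rows that have no records yet
-- (their r.* columns come back as NULL).
SELECT eq.equipment_id
     , r.id, r.created_at, r.latitude, r.longitude, r.speed
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT g.id, g.created_at, g.latitude, g.longitude, g.speed
   FROM   geoposition_records g
   WHERE  g.equipment_id = eq.equipment_id
   ORDER  BY g.created_at DESC NULLS LAST
   LIMIT  1
   ) r ON true;
```

Because the subquery now selects columns that are not part of the index, each lookup has to visit the table once for the winning row, which should still be cheap for a few dozen groups.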

Function

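If you can't talk sense into the query planner (which shouldn't happen), a function that loops through the equipment table is certain to do the trick. Looking up one equipment_id at a time uses the index.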

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   FOR equipment_id IN
      SELECT e.equipment_id FROM equipment e ORDER BY 1
   LOOP
      SELECT g.created_at
      FROM   geoposition_records g
      WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
      ORDER  BY g.created_at DESC NULLS LAST
      LIMIT  1
      INTO   latest;

      RETURN NEXT;
   END LOOP;
END  
$func$;

It makes for a nice call, too:

SELECT * FROM f_latest_equip();

Performance comparison:

db<>fiddle here
Old sqlfiddle


Col*_*art 4

Try 1

If

  1. I have a separate equipment table, and
  2. I have an index on geoposition_records(equipment_id, created_at desc)

then the following works for me:

select id as equipment_id, (select max(created_at)
                            from geoposition_records
                            where equipment_id = equipment.id
                           ) as max_created_at
from equipment;

I wasn't able to force PG to do a fast query that determines both the list of equipment_ids and the related max(created_at). But I'm going to try again tomorrow!

Try 2

I found this link: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values

Combining this technique with my query from try 1, I get:

WITH RECURSIVE equipment(id) AS (
    SELECT MIN(equipment_id) FROM geoposition_records
  UNION
    SELECT (
      SELECT equipment_id
      FROM geoposition_records
      WHERE equipment_id > equipment.id
      ORDER BY equipment_id
      LIMIT 1
    )
    FROM equipment WHERE id IS NOT NULL
)
SELECT id AS equipment_id, (SELECT MAX(created_at)
                            FROM geoposition_records
                            WHERE equipment_id = equipment.id
                           ) AS max_created_at
FROM equipment
WHERE id IS NOT NULL;  -- skip the NULL row the recursion emits after the last id

And this works fast! But you need

  1. this ultra-contorted query form, and
  2. an index on geoposition_records(equipment_id, created_at desc)
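
To see what the contorted part does, the recursive CTE can be run on its own. The sketch below reuses the CTE from try 2 unchanged and only lists the distinct equipment_id values. The approach is often described as emulating a loose index scan: instead of reading all rows, it jumps from one distinct value to the next through the index.

```sql
WITH RECURSIVE equipment(id) AS (
    -- Anchor: the smallest equipment_id, found at the start of the index.
    SELECT MIN(equipment_id) FROM geoposition_records
  UNION
    -- Step: the next larger equipment_id after the current one.
    -- The scalar subquery yields NULL once no larger value exists.
    SELECT (
      SELECT equipment_id
      FROM geoposition_records
      WHERE equipment_id > equipment.id
      ORDER BY equipment_id
      LIMIT 1
    )
    FROM equipment WHERE id IS NOT NULL  -- stop recursing after that NULL
)
SELECT id AS equipment_id
FROM equipment
WHERE id IS NOT NULL;  -- the final NULL is not a real equipment_id
```

Each step is a single index probe, so the cost grows with the number of distinct equipment_id values rather than with the number of rows in geoposition_records.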