Optimize groupwise maximum query

Question

nur*_*tin 6 sql postgresql query-optimization greatest-n-per-group groupwise-maximum

select * 
from records 
where id in ( select max(id) from records group by option_id )

This query works fine even on millions of rows. But as you can see from the output of the EXPLAIN statement:

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
->  HashAggregate  (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
     ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
           ->  Seq Scan on records records_1  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
->  Index Scan using records_pkey on records  (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Here it says it is scanning all rows, which is obviously inefficient.

I also tried reordering the query:

select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
     ->  Seq Scan on records  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
->  Index Scan using records_pkey on records r  (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Still scanning all rows.

I tried with and without indexes on (option_id), (option_id, id), and (option_id, id desc). None of them had any effect on the query plan.

Is there a way to execute a groupwise maximum query in Postgres without scanning all rows?

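What I am looking for, programmatically, is an index that stores the maximum id for each option_id as rows are inserted into the records table. That way, when I query for the maximum per option_id, I should only need to scan as many index entries as there are distinct option_ids.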

I have seen many answers from high-ranking users that suggest select distinct on (thanks to @Clodoaldo Neto for giving me keywords to search for). Here is why it does not work:

create index index_name on records(option_id, id desc);

select distinct on (option_id) *
from records
order by option_id, id desc;

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
  ->  Index Scan using records_option_id_id_idx on records  (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms

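That's great, it is using an index. But using an index to scan all ids does not make much sense. According to my runs, it is actually slower than a simple sequential scan.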

Interestingly, MySQL 5.5 is able to optimize the query simply by using an index on records(option_id, id):

mysql> select count(1) from records;

+----------+
| count(1) |
+----------+
|  1086342 |
+----------+

1 row in set (0.00 sec)

mysql> explain extended select * from records
       inner join ( select max(id) max_id from records group by option_id ) mr
                                                      on mr.max_id= records.id;

+------+----------+--------------------------+
| rows | filtered | Extra                    |
+------+----------+--------------------------+
| 1056 |   100.00 |                          |
|    1 |   100.00 |                          |
|  201 |   100.00 | Using index for group-by |
+------+----------+--------------------------+

3 rows in set, 1 warning (0.02 sec)

Answer 1

Erw*_*ter 10

Assuming relatively few rows in options for many rows in records.

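Typically, you would have a lookup table options that is referenced from records.option_id, ideally with a foreign key constraint. If you don't, I suggest creating one to enforce referential integrity: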

CREATE TABLE options (
  option_id int  PRIMARY KEY
, option    text UNIQUE NOT NULL
);

INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
FROM   records;
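
The foreign key constraint mentioned above is not part of the statements shown. A minimal sketch of adding it, assuming records.option_id has no such constraint yet and never holds a value missing from options (the constraint name is illustrative, not from the original):

```sql
-- Assumption: options was populated from records as shown above,
-- so every records.option_id already has a matching row in options.
-- The constraint name records_option_id_fkey is hypothetical.
ALTER TABLE records
  ADD CONSTRAINT records_option_id_fkey
  FOREIGN KEY (option_id) REFERENCES options (option_id);
```

Adding the constraint validates all existing rows, so on a large records table it may take a while and hold a lock for the duration.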

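Then there is no need to emulate a loose index scan any more, and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).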

SELECT option_id
      ,(SELECT max(id)
        FROM   records
        WHERE  option_id = o.option_id
       ) AS max_id
FROM   options o
ORDER  BY 1;

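This includes options with no match in the table records. You get NULL for max_id in that case, and you can easily remove such rows in an outer SELECT if needed.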
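
A minimal sketch of such an outer SELECT, assuming the options table defined earlier. The subquery alias sub is illustrative:

```sql
-- Wrap the query above and drop options that have no rows in records.
SELECT *
FROM  (
   SELECT option_id
        , (SELECT max(id)
           FROM   records
           WHERE  option_id = o.option_id
          ) AS max_id
   FROM   options o
   ) sub
WHERE  max_id IS NOT NULL
ORDER  BY option_id;
```

If options was populated from records and nothing has been added since, every option has a match and the filter removes nothing.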

Or (same result):

SELECT option_id
     , (SELECT id
        FROM   records
        WHERE  option_id = o.option_id
        ORDER  BY id DESC NULLS LAST
        LIMIT  1
       ) AS max_id
FROM   options o
ORDER  BY 1;

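This may be slightly faster. The subquery uses the sort order DESC NULLS LAST, which matches the aggregate function max(), since max() ignores NULL values. Sorting with just DESC would put NULL first: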

Why do NULL values come first when ordering DESC in a PostgreSQL query?

So, the perfect index:

CREATE INDEX on records (option_id, id DESC NULLS LAST);

The NULLS LAST clause does not matter much while the columns are defined NOT NULL.

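There can still be a sequential scan on the small table options; that is simply the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
The big table records is only accessed via a (bitmap) index scan or, if possible, an index-only scan.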
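
To check which scan types the planner actually picks on your own data, one option is to run the query under EXPLAIN. This is a sketch assuming the options table and the index on records from above. The resulting plan depends on table statistics and the Postgres version, so no expected output is shown:

```sql
-- ANALYZE executes the query and reports actual timings;
-- BUFFERS adds block-level I/O counts to the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT option_id
     , (SELECT max(id)
        FROM   records
        WHERE  option_id = o.option_id
       ) AS max_id
FROM   options o
ORDER  BY 1;
```

Index-only scans require Postgres 9.2 or later and an up-to-date visibility map, so running VACUUM on records beforehand may change the plan.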

SQL Fiddle showing two index-only scans for the simple case.

Or use LATERAL joins for a similar effect in Postgres 9.3+:

Optimize GROUP BY query to retrieve latest record per user
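
A minimal sketch of the LATERAL variant, assuming the options table and the index on records (option_id, id DESC NULLS LAST) from above. It is not taken from the linked answer. Unlike the correlated subqueries, it returns whole rows from records, which is what the query in the question selects:

```sql
-- For each option, fetch the single row of records with the highest id.
-- CROSS JOIN LATERAL drops options that have no rows in records.
SELECT r.*
FROM   options o
CROSS  JOIN LATERAL (
   SELECT *
   FROM   records
   WHERE  option_id = o.option_id
   ORDER  BY id DESC NULLS LAST
   LIMIT  1
   ) r
ORDER  BY r.option_id;
```

To keep options without matching rows, LEFT JOIN LATERAL ... ON true could be used instead, at the cost of NULL-filled rows in the result.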

Archived:	11 years, 5 months ago
Views:	5435
Last activity:	9 years ago