Why does LIMIT kill the performance of this Postgres query?

Yar*_*rin 5 postgresql performance query-performance

I have a Postgres materialized view:

       Column        |       Type        | Modifiers
---------------------+-------------------+-----------
 document_id         | character varying |
 recorded_date       | date              |
 parcels             | jsonb             |
Indexes:
    "index_my_view_on_document_id" btree (document_id)
    "index_my_view_on_recorded_date" btree (recorded_date)
    "index_my_view_on_parcels" gin (parcels)

I'm trying to run a paginated query that filters on the parcels jsonb array column, but my performance falls apart whenever I add a LIMIT:

Without a LIMIT:

EXPLAIN ANALYZE SELECT document_id FROM my_view WHERE (parcels @> '[3022890014]') ORDER BY recorded_date DESC;
                                                                       QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=24178.50..24194.79 rows=6518 width=21) (actual time=11.272..11.275 rows=22 loops=1)
   Sort Key: recorded_date DESC
   Sort Method: quicksort  Memory: 26kB
   ->  Bitmap Heap Scan on my_view  (cost=78.51..23765.58 rows=6518 width=21) (actual time=3.199..10.281 rows=22 loops=1)
         Recheck Cond: (parcels @> '[3022890014]'::jsonb)
         Heap Blocks: exact=12
         ->  Bitmap Index Scan on index_my_view_on_parcels  (cost=0.00..76.88 rows=6518 width=0) (actual time=3.166..3.166 rows=22 loops=1)
               Index Cond: (parcels @> '[3022890014]'::jsonb)
 Planning time: 2.201 ms
 Execution time: 11.395 ms
(10 rows)

With a LIMIT:

EXPLAIN ANALYZE SELECT document_id FROM my_view WHERE (parcels @> '[3022890014]') ORDER BY recorded_date DESC LIMIT 25;
                                                                                              QUERY PLAN                                                                                               
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..2514.14 rows=25 width=21) (actual time=10471.981..17971.454 rows=22 loops=1)
   ->  Index Scan Backward using index_my_view_on_recorded_date on my_view  (cost=0.43..655374.28 rows=6518 width=21) (actual time=10471.980..17971.446 rows=22 loops=1)
         Filter: (parcels @> '[3022890014]'::jsonb)
         Rows Removed by Filter: 6517780
 Planning time: 0.164 ms
 Execution time: 17972.229 ms
(6 rows)

Adding the LIMIT makes the query over 1000 times slower!

I was able to work around the problem with a nested query, as suggested here:

EXPLAIN ANALYZE SELECT * FROM (SELECT document_id, recorded_date FROM my_view WHERE (parcels @> '[3022890014]') ORDER BY recorded_date DESC) subq ORDER BY recorded_date DESC LIMIT 25;
                                                                          QUERY PLAN                                                                          
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=24178.50..24178.81 rows=25 width=21) (actual time=2.180..2.183 rows=22 loops=1)
   ->  Sort  (cost=24178.50..24194.79 rows=6518 width=21) (actual time=2.179..2.179 rows=22 loops=1)
         Sort Key: my_view.recorded_date DESC
         Sort Method: quicksort  Memory: 26kB
         ->  Bitmap Heap Scan on my_view  (cost=78.51..23765.58 rows=6518 width=21) (actual time=2.064..2.166 rows=22 loops=1)
               Recheck Cond: (parcels @> '[3022890014]'::jsonb)
               Heap Blocks: exact=12
               ->  Bitmap Index Scan on index_my_view_on_parcels  (cost=0.00..76.88 rows=6518 width=0) (actual time=2.030..2.030 rows=22 loops=1)
                     Index Cond: (parcels @> '[3022890014]'::jsonb)
 Planning time: 6.427 ms
 Execution time: 2.230 ms
(11 rows)

Still, I'd like to understand why adding a LIMIT causes such a dramatic change in performance, and whether there is a better way to address it.

jja*_*nes 10

PostgreSQL thinks it will find 6518 rows matching your condition. So when you tell it to stop at 25, it reasons that it would rather walk the rows already in date order and stop once it finds the 25th in-order match, which it expects to happen after scanning 25/6518, or about 0.4%, of the table. But in reality only 22 rows match, so it ends up scanning the entire table, over 250 times more work than it anticipated. The other plan, using the gin index, ends up being over 250 times *less* work than PostgreSQL thought, for the same reason: it expected to find and sort 6518 things, when really there were only 22.
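That arithmetic can be checked directly against the numbers in the EXPLAIN ANALYZE output above (a quick sketch; the table size is read off the "Rows Removed by Filter" line plus the 22 surviving rows):

```python
# Figures taken from the EXPLAIN ANALYZE output above.
estimated_matches = 6518          # planner's row estimate for parcels @> '[3022890014]'
actual_matches = 22               # rows actually matching
limit = 25
table_rows = 6_517_780 + 22       # "Rows Removed by Filter" + surviving rows

# Planner's reasoning: walking the date index backward, it expects to hit the
# 25th match after scanning roughly limit/estimated_matches of the table.
expected_scan = table_rows * limit / estimated_matches
print(f"expected rows scanned: {expected_scan:,.0f}")   # ~25,000 (~0.4% of the table)

# Reality: only 22 matches exist, fewer than LIMIT 25, so the backward index
# scan can never satisfy the limit early and reads every row in the table.
actual_scan = table_rows
print(f"actual rows scanned:   {actual_scan:,}")
print(f"misestimate factor:    {actual_scan / expected_scan:.0f}x")  # ~260x
```

This is why the same misestimate cuts both ways: the LIMIT plan does ~260x more work than budgeted, while the bitmap plan does ~260x less.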

If you used a more appropriate data structure, such as a regular PostgreSQL array rather than a degenerate JSONB object, the planner would know much more accurately how many rows meet the condition and would probably make better choices.
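A minimal sketch of that rework, assuming the parcel numbers are integers as in the query above (the view body and `source_table` name are hypothetical, since the view definition isn't shown):

```sql
-- Store parcel numbers as a typed bigint[] instead of a jsonb array.
CREATE MATERIALIZED VIEW my_view_arr AS
SELECT document_id,
       recorded_date,
       ARRAY(SELECT jsonb_array_elements_text(parcels))::bigint[] AS parcels
FROM source_table;

-- GIN supports the array containment operator @>, just as it does for jsonb.
CREATE INDEX index_my_view_arr_on_parcels ON my_view_arr USING gin (parcels);

-- Same filter, now against a real array; PostgreSQL keeps per-element
-- statistics for arrays, so the row estimate should land near 22, and the
-- planner should then prefer the gin index even with the LIMIT present.
SELECT document_id
FROM my_view_arr
WHERE parcels @> ARRAY[3022890014]
ORDER BY recorded_date DESC
LIMIT 25;
```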