Does Postgres optimize this JOIN with a subquery?

fla*_*vio 5 postgresql join subquery

In Postgres 12, I have a table purchase_orders and a table items. I'm running a query that returns the POs of a given shop together with the total quantity of items ordered on each PO:

SELECT po.id, 
       SUM(grouped_items.total_quantity) AS total_quantity
FROM purchase_orders po
LEFT JOIN (
  SELECT purchase_order_id, 
         SUM(quantity) AS total_quantity
  FROM items
  GROUP BY purchase_order_id
) grouped_items ON po.id = grouped_items.purchase_order_id

WHERE po.shop_id = 195
GROUP BY po.id

This query returns the desired result. The JOIN is done in a subquery because there will be further JOINs to other tables, so this produces an already-grouped table to join against.

I wrote another query that uses a correlated SELECT subquery instead of the JOIN. Both approaches run in roughly the same time, so it's hard to tell which one is faster. I ran EXPLAIN ANALYZE but can't interpret the output very well.

Question: in the example above, will Postgres process the entire items table in the subquery and only afterwards join it with purchase_orders? Or is it smart enough to filter the items set first?

The EXPLAIN output mentions "Seq Scan on items...", which appears to cover all rows of items, with the row count shrinking as it moves up the tree. But I'm not sure whether that means it actually SUMs the whole table in memory.

EXPLAIN:

GroupAggregate  (cost=6948.16..6973.00 rows=1242 width=40) (actual time=165.099..166.321 rows=1242 loops=1)
  Group Key: po.id
  Buffers: shared hit=4148
  ->  Sort  (cost=6948.16..6951.27 rows=1242 width=16) (actual time=165.090..165.406 rows=1242 loops=1)
        Sort Key: po.id
        Sort Method: quicksort  Memory: 107kB
        Buffers: shared hit=4148
        ->  Hash Right Join  (cost=6668.31..6884.34 rows=1242 width=16) (actual time=99.951..120.627 rows=1242 loops=1)
              Hash Cond: (items.purchase_order_id = po.id)
              Buffers: shared hit=4148
              ->  HashAggregate  (cost=5906.04..5993.80 rows=8776 width=16) (actual time=98.328..104.320 rows=14331 loops=1)
                    Group Key: items.purchase_order_id
                    Buffers: shared hit=3749
                    ->  Seq Scan on items  (cost=0.00..5187.03 rows=143803 width=12) (actual time=0.005..38.307 rows=143821 loops=1)
                          Buffers: shared hit=3749
              ->  Hash  (cost=746.74..746.74 rows=1242 width=8) (actual time=1.588..1.588 rows=1242 loops=1)
                    Buckets: 2048  Batches: 1  Memory Usage: 65kB
                    Buffers: shared hit=399
                    ->  Bitmap Heap Scan on purchase_orders po  (cost=33.91..746.74 rows=1242 width=8) (actual time=0.200..1.169 rows=1242 loops=1)
                          Recheck Cond: (shop_id = 195)
                          Heap Blocks: exact=392
                          Buffers: shared hit=399
                          ->  Bitmap Index Scan on index_purchase_orders_on_shop_id  (cost=0.00..33.60 rows=1242 width=0) (actual time=0.153..0.153 rows=1258 loops=1)
                                Index Cond: (shop_id = 195)
                                Buffers: shared hit=7
Planning time: 0.200 ms
Execution time: 166.665 ms

Second approach, using a correlated subquery:

SELECT po.id,
       (
           SELECT SUM(quantity)
           FROM items
           WHERE purchase_order_id = po.id
           GROUP BY purchase_order_id
       ) AS total_quantity
FROM purchase_orders po
WHERE shop_id = 195
GROUP BY po.id

EXPLAIN:

HashAggregate  (cost=749.84..25716.43 rows=1242 width=16) (actual time=1.667..9.488 rows=1243 loops=1)
  Group Key: po.id
  Buffers: shared hit=5603
  ->  Bitmap Heap Scan on purchase_orders po  (cost=33.91..746.74 rows=1242 width=8) (actual time=0.175..1.072 rows=1243 loops=1)
        Recheck Cond: (shop_id = 195)
        Heap Blocks: exact=390
        Buffers: shared hit=397
        ->  Bitmap Index Scan on index_purchase_orders_on_shop_id  (cost=0.00..33.60 rows=1242 width=0) (actual time=0.130..0.130 rows=1244 loops=1)
              Index Cond: (shop_id = 195)
              Buffers: shared hit=7
  SubPlan 1
    ->  GroupAggregate  (cost=0.42..20.09 rows=16 width=16) (actual time=0.005..0.005 rows=1 loops=1243)
          Group Key: items.purchase_order_id
          Buffers: shared hit=5206
          ->  Index Scan using index_items_on_purchase_order_id on items  (cost=0.42..19.85 rows=16 width=12) (actual time=0.003..0.004 rows=3 loops=1243)
                Index Cond: (purchase_order_id = po.id)
                Buffers: shared hit=5206
Planning time: 0.183 ms
Execution time: 9.831 ms

jja*_*nes 5

I recently looked into this myself, and my conclusion is that the planner is not smart enough to optimize this particular thing. The correlated subselect gets executed once per row even when the row count is large, and the uncorrelated subselect gets executed to completion even when only a few of its rows are needed.

It does know that one will be faster than the other (assuming the estimated row counts are reasonably correct), but it lacks the ability to recognize that the two formulations are equivalent, so it cannot choose between the execution plans based on estimated performance.

That said, in your case the queries would not be equivalent, because they handle rows missing from items differently. The correlated subselect behaves like a left join, not an inner join.
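
To make that difference concrete, here is a sketch against the tables from the question: with an inner join, POs that have no rows in items drop out of the result entirely, whereas the LEFT JOIN and the correlated subquery keep them and report a NULL total.

-- Inner-join variant (for comparison only): POs without any items vanish from the output.
SELECT po.id, SUM(i.quantity) AS total_quantity
FROM   purchase_orders po
JOIN   items i ON i.purchase_order_id = po.id
WHERE  po.shop_id = 195
GROUP  BY po.id;
-- The LEFT JOIN and correlated-subquery versions above instead return those POs with total_quantity = NULL.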


Erw*_*ter 5

For an actual performance gain, use the LEFT JOIN to the aggregating subquery, but repeat the (selective!) predicate of the outer query:

SELECT po.number
     , SUM(grouped_items.total_quantity) AS total_quantity
FROM   purchase_orders po
LEFT   JOIN (
   SELECT purchase_order_id AS id
        , SUM(quantity) AS total_quantity
   FROM   items
   WHERE  purchase_order_id IN (1, 2, 3)  -- repeat selective condition !
   GROUP  BY 1
   ) grouped_items USING (id)
WHERE  po.id IN (1, 2, 3)
GROUP  BY po.number;

Luckily, that's possible in your case: the predicate works for the subquery, too. It's a bit more verbose, but it typically gives the best performance, no matter what percentage of rows in items is involved. My rule of thumb: "aggregate first, join later". See:
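
Applied to the query from the question, one way to repeat the shop filter inside the aggregating subquery is as a semi-join on purchase_orders (just a sketch, using the columns shown above):

SELECT po.id
     , grouped_items.total_quantity
FROM   purchase_orders po
LEFT   JOIN (
   SELECT purchase_order_id AS id
        , SUM(quantity) AS total_quantity
   FROM   items
   WHERE  purchase_order_id IN (        -- repeat selective condition
            SELECT id FROM purchase_orders WHERE shop_id = 195)
   GROUP  BY 1
   ) grouped_items USING (id)
WHERE  po.shop_id = 195;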

Other cases are not as lucky, and then you have to decide which way to go. As jjanes explained, Postgres is not smart enough to optimize much here. If all or most rows of items are involved, aggregating in a subquery is typically (much) faster. If only a few rows are involved, a correlated subquery or an equivalent LATERAL subquery is typically (much) faster. See:
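
For the few-rows case, a LATERAL formulation equivalent to the correlated subquery from the question could look roughly like this (a sketch):

SELECT po.id
     , li.total_quantity
FROM   purchase_orders po
LEFT   JOIN LATERAL (
   SELECT SUM(i.quantity) AS total_quantity
   FROM   items i
   WHERE  i.purchase_order_id = po.id
   ) li ON true
WHERE  po.shop_id = 195;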

With only 3 rows coming from the outer query (WHERE po.id IN (1, 2, 3)), you can't go wrong with a correlated subquery. But I assume that's just a simplification for the demo.