How to optimize an IN query on an indexed column

Ant*_*ony 4 postgresql performance index optimization query-performance

I have a table with more than 50 million records. One of its fields is COLOR_CODE. I have created an index on the COLOR_CODE column, as shown below:

"mytable_colorcode_idx" btree (color_code)

I noticed that the query below takes longer to execute:

SELECT count(total_amount) FROM mytable 
WHERE color_code in ('red','green') and sale_date = '1970'

However, it runs faster when written with an OR clause:

SELECT count(total_amount) FROM mytable 
WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'

Query plan for IN

explain analyze SELECT count(total_amount) FROM mytable 
WHERE color_code in ('red','green') and sale_date = '1970'
                                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2074238.07..2074238.08 rows=1 width=8) (actual time=63520.150..63520.150 rows=1 loops=1)
   ->  Bitmap Heap Scan on mytable  (cost=53504.73..2069923.27 rows=1725919 width=6) (actual time=3509.920..63080.519 rows=1727037 loops=1)
         Recheck Cond: ((color_code)::text = ANY ('{red,green}'::text[]))
         Rows Removed by Index Recheck: 5067635
         Filter: (sale_date = 1970)
         Heap Blocks: exact=38679 lossy=496680
         ->  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..53073.26 rows=1725919 width=0) (actual time=3501.777..3501.777 rows=1727037 loops=1)
               Index Cond: ((color_code)::text = ANY ('{red,green}'::text[]))
 Planning time: 0.165 ms
 Execution time: 63524.100 ms
(10 rows)

Query plan for OR

explain analyze SELECT count(total_amount) FROM mytable 
    WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'

    QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2081265.36..2081265.37 rows=1 width=8) (actual time=18895.998..18895.998 rows=1 loops=1)
   ->  Bitmap Heap Scan on mytable  (cost=56223.06..2076956.39 rows=1723588 width=6) (actual time=161.335..18468.146 rows=1727037 loops=1)
         Recheck Cond: (((color_code)::text = 'red'::text) OR ((color_code)::text = 'green'::text))
         Rows Removed by Index Recheck: 5067635
         Filter: (((color_code)::text = 'red'::text) OR (((color_code)::text = 'green'::text) AND (sale_date = 1970)))
         Heap Blocks: exact=38679 lossy=496680
         ->  BitmapOr  (cost=56223.06..56223.06 rows=1725919 width=0) (actual time=153.683..153.684 rows=0 loops=1)
               ->  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..663.35 rows=20655 width=0) (actual time=3.935..3.935 rows=26768 loops=1)
                     Index Cond: ((color_code)::text = 'red'::text)
               ->  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..54697.91 rows=1705264 width=0) (actual time=149.745..149.746 rows=1700269 loops=1)
                     Index Cond: ((color_code)::text = 'green'::text)
 Planning time: 0.162 ms
 Execution time: 18896.785 ms
(13 rows)

Update

If I add an index on (color_code, total_count, sale_date), I notice that no index is used at all. Instead, the planner falls back to a parallel sequential scan.

"mytable_color_total_count_sale_Date_idx" btree (color_code, total_count, sale_date)  



                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=2099755.26..2099755.27 rows=1 width=8) (actual time=97066.585..97066.586 rows=1 loops=1)
   ->  Gather  (cost=2099755.04..2099755.25 rows=2 width=8) (actual time=97063.512..97069.838 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=2098755.04..2098755.05 rows=1 width=8) (actual time=97061.531..97061.532 rows=1 loops=3)
               ->  Parallel Seq Scan on mytable  (cost=0.00..2096119.69 rows=1054140 width=6) (actual time=27782.491..96730.232 rows=841604 loops=3)
                     Filter: ((sale_date = 1970) AND ((color_code)::text = ANY ('{red,green}'::text[])))
                     Rows Removed by Filter: 4196103
 Planning time: 0.161 ms
 Execution time: 97069.896 ms
(10 rows)

Is there a way to optimize the query with the IN clause, other than converting it to an OR clause?

Len*_*art 5

You cannot meaningfully compare the performance of:

WHERE color_code in ('red','green') and sale_date = '1970'

and:

WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'

because they are not logically equivalent (they return different results). A simple example:

 with T (color_code, sale_date) as ( 
     values ('red', '1970'), ('green','1969')
 ) 
 select * from T 
 where color_code in ('green', 'red') 
   and sale_date = '1970';

 color_code | sale_date 
------------+-----------
 red        | 1970
(1 row)

Whereas:

with T (color_code, sale_date) as ( 
    values ('red', '1970'), ('green','1969')
) 
select * from T 
where color_code = 'green' or color_code = 'red' 
  and sale_date = '1970';

color_code | sale_date 
------------+-----------
 red        | 1970
 green      | 1969
(2 rows)

In short, AND has higher precedence than OR, so your "optimized" expression A OR B AND C is evaluated as A OR (B AND C). Your original expression is evaluated as (A OR B) AND C.
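
The precedence rule can be checked directly with boolean literals. This is a minimal sketch; the column aliases are illustrative only:

```sql
-- AND binds tighter than OR, so the first expression is parsed as
-- true OR (false AND false).
SELECT (true OR false AND false)   AS without_parens,  -- true
       ((true OR false) AND false) AS with_parens;     -- false
```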

To make the comparison meaningful, you need to change the query to:

select * from T 
where (color_code = 'green' or color_code = 'red') 
  and sale_date = '1970';

My guess is that you will not see much difference in performance compared with your original expression.
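
One way to test that guess is to run the two logically equivalent forms against the table from the question and compare the plans. This is a sketch that reuses the question's table and column names; the timings will depend on your data, cache state, and settings:

```sql
-- Original IN form.
EXPLAIN ANALYZE
SELECT count(total_amount) FROM mytable
WHERE color_code IN ('red', 'green')
  AND sale_date = '1970';

-- Equivalent OR form: the parentheses make sale_date apply to both colors.
EXPLAIN ANALYZE
SELECT count(total_amount) FROM mytable
WHERE (color_code = 'red' OR color_code = 'green')
  AND sale_date = '1970';
```

Both statements should aggregate the same set of rows, so any remaining difference in execution time reflects the plan rather than the amount of work requested.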

That said, I would suggest an index like:

CREATE INDEX ... ON ... (sale_date, color_code)
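
Applied to the table from the question, this might look like the sketch below. The index name is made up for illustration. Putting sale_date first lets the single-value equality predicate narrow the search before the color_code values are looked up.

```sql
-- Hypothetical index name; table and columns are taken from the question.
CREATE INDEX mytable_saledate_colorcode_idx
    ON mytable (sale_date, color_code);

-- Refresh planner statistics, then check whether the planner picks the new index.
ANALYZE mytable;

EXPLAIN ANALYZE
SELECT count(total_amount) FROM mytable
WHERE color_code IN ('red', 'green')
  AND sale_date = '1970';
```

Whether the planner actually chooses this index depends on how selective the sale_date predicate is for your data.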

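  • I think you are right that the two queries in the OP's post are not logically equivalent, so the OP's question is somewhat invalid (it rests on results from two logically different queries). (2 upvotes)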