使用 Postgres 在日期范围内连接两个大表

nsy*_*mms 4 postgresql performance query-performance

我知道这个问题必须经常出现,但我似乎找不到最近的答案来解决更简单的问题,比如我正在尝试做的事情。我读过几个类似的问题。

我正在使用 PostgreSQL 9.4 并有 2 个表,每个表都有接近 200M 的行,包含两种类型的指标。列是 the_date、feature_A、feature_B 和 24 小时指标。两个表中的指标差异很大,因此无法合并这些表。或者至少我不这么认为。某些指标不适用于所有功能,这就是将它们分开的原因。两个表在日期列上都有索引。

当我尝试执行连接它们的查询时,如果我指定一个日期,查询将在大约 3 秒内返回,这对我来说似乎是合理的。但是,如果我只是简单地进行a.the_date between '8/1/2017' and '8/1/2017'查询,现在基本上需要永远。我在 10 分钟后杀死它,因为我不耐烦。显然 PostgreSQL 不再使用索引。

为什么会这样,我怎样才能让它再次使用索引?我想做像a.the_date between '7/1/2017' and '7/31/2017'or 之类的查询a.the_date >= '1/1/2017'。现在我只是有一个 python 程序,一次运行一个日期的查询,然后汇总结果,这似乎是一个黑客。但至少它是可预测的,并且每天 3 秒快得多。

查询:

select a.feature_a, a.feature_b, count(*) tot_ct,
    count(case when b.br01 > 0 and a.bc01 <= 10 then 1 end) cx01,
    count(case when b.br02 > 0 and a.bc02 <= 10 then 1 end) cx02,
    count(case when b.br03 > 0 and a.bc03 <= 10 then 1 end) cx03,
    count(case when b.br04 > 0 and a.bc04 <= 10 then 1 end) cx04,
    count(case when b.br05 > 0 and a.bc05 <= 10 then 1 end) cx05,
    count(case when b.br06 > 0 and a.bc06 <= 10 then 1 end) cx06,
    count(case when b.br07 > 0 and a.bc07 <= 10 then 1 end) cx07,
    count(case when b.br08 > 0 and a.bc08 <= 10 then 1 end) cx08,
    count(case when b.br09 > 0 and a.bc09 <= 10 then 1 end) cx09,
    count(case when b.br10 > 0 and a.bc10 <= 10 then 1 end) cx10,
    count(case when b.br11 > 0 and a.bc11 <= 10 then 1 end) cx11,
    count(case when b.br12 > 0 and a.bc12 <= 10 then 1 end) cx12,
    count(case when b.br13 > 0 and a.bc13 <= 10 then 1 end) cx13,
    count(case when b.br14 > 0 and a.bc14 <= 10 then 1 end) cx14,
    count(case when b.br15 > 0 and a.bc15 <= 10 then 1 end) cx15,
    count(case when b.br16 > 0 and a.bc16 <= 10 then 1 end) cx16,
    count(case when b.br17 > 0 and a.bc17 <= 10 then 1 end) cx17,
    count(case when b.br18 > 0 and a.bc18 <= 10 then 1 end) cx18,
    count(case when b.br19 > 0 and a.bc19 <= 10 then 1 end) cx19,
    count(case when b.br20 > 0 and a.bc20 <= 10 then 1 end) cx20,
    count(case when b.br21 > 0 and a.bc21 <= 10 then 1 end) cx21,
    count(case when b.br22 > 0 and a.bc22 <= 10 then 1 end) cx22,
    count(case when b.br23 > 0 and a.bc23 <= 10 then 1 end) cx23,
    count(case when b.br24 > 0 and a.bc24 <= 10 then 1 end) cx24,
    avg(b.br01) av01, avg(b.br02) av02, avg(b.br03) av03,
    avg(b.br04) av04, avg(b.br05) av05, avg(b.br06) av06,
    avg(b.br07) av07, avg(b.br08) av08, avg(b.br09) av09,
    avg(b.br10) av10, avg(b.br11) av11, avg(b.br12) av12,
    avg(b.br13) av13, avg(b.br14) av14, avg(b.br15) av15,
    avg(b.br16) av16, avg(b.br17) av17, avg(b.br18) av18,
    avg(b.br19) av19, avg(b.br20) av20, avg(b.br21) av21,
    avg(b.br22) av22, avg(b.br23) av23, avg(b.br24) av24
from table_a a, table_b b
where a.the_date = '8/1/2017'
    and a.the_date = b.the_date
    and a.feature_a = b.feature_a
    and a.feature_b = b.feature_b
group by a.feature_a, a.feature_b
Run Code Online (Sandbox Code Playgroud)

更新:

对上述查询的解释:

GroupAggregate  (cost=3123833.79..5157342.05 rows=201 width=680)                                                         
Group Key: a.feature_a, a.feature_b                                                                                    
->  Merge Join  (cost=3123833.79..3220488.32 rows=5270312 width=680)                                                   
        Merge Cond: ((a.feature_a = b.feature_a) AND (a.feature_b = b.feature_b))                                        
        ->  Sort  (cost=1438291.32..1439351.90 rows=424234 width=348)                                                    
            Sort Key: a.feature_a, a.feature_b                                                                         
            ->  Bitmap Heap Scan on table_a a  (cost=14960.38..1262333.10 rows=424234 width=348)               
                    Recheck Cond: (the_day = '2017-08-01'::date)                                                   
                    ->  Bitmap Index Scan on table_a_day_idx  (cost=0.00..14854.32 rows=424234 width=0)        
                        Index Cond: (the_day = '2017-08-01'::date)                                               
        ->  Materialize  (cost=1685542.47..1688027.10 rows=496925 width=348)
            ->  Sort  (cost=1685542.47..1686784.79 rows=496925 width=348)
                    Sort Key: b.feature_a, b.feature_b                                                                   
                    ->  Bitmap Heap Scan on table_b b  (cost=17759.74..1478863.74 rows=496925 width=348)
                        Recheck Cond: (the_day = '2017-08-01'::date)                                             
                        ->  Bitmap Index Scan on table_b_day_idx (cost=0.00..17635.51 rows=496925 width=0)
                                Index Cond: (the_day = '2017-08-01'::date)
Run Code Online (Sandbox Code Playgroud)

不使用 table_b 索引的日期范围查询(8/1 到 8/5)的解释:

GroupAggregate  (cost=84467702.82..87504872.78 rows=201 width=680)
Group Key: a.feature_a, a.feature_b
->  Merge Join  (cost=84467702.82..85568019.42 rows=5270311 width=680)
        Merge Cond: ((b.feature_a = a.feature_a) AND (b.feature_b = a.feature_b) AND (b.the_day = a.the_day))  
        ->  Sort  (cost=83027290.33..83275752.78 rows=99384980 width=348)                                                    
            Sort Key: b.feature_a, b.feature_b, b.the_day
            ->  Seq Scan on table_b b  (cost=0.00..5963098.80 rows=99384980 width=348)
        ->  Materialize  (cost=1440412.49..1442533.66 rows=424234 width=348)
            ->  Sort  (cost=1440412.49..1441473.07 rows=424234 width=348)
                    Sort Key: a.feature_a, a.feature_b, a.the_day
                    ->  Bitmap Heap Scan on table_a a  (cost=16020.97..1264454.27 rows=424234 width=348)
                        Recheck Cond: ((the_day >= '2017-08-01'::date) AND (the_day <= '2017-08-05'::date))    
                        ->  Bitmap Index Scan on table_a_day_idx  (cost=0.00..15914.91 rows=424234 width=0)      
                                Index Cond: ((the_day >= '2017-08-01'::date) AND (the_day <= '2017-08-05'::date))
Run Code Online (Sandbox Code Playgroud)

nsy*_*mms 5

解决了我的索引问题,对于可能会发现此问题的任何人...

就像将日期范围添加到两个表的 where 子句一样简单。现在两个索引都被使用了,不再进行全表扫描。

and a.the_date between '7/1/2017' and '7/31/2017'
and b.the_date between '7/1/2017' and '7/31/2017'
and a.the_date = b.the_date
Run Code Online (Sandbox Code Playgroud)

似乎没有必要,但我记得多年前优化 Oracle 查询时做过这样的事情。有时我什至不得不做这样的事情,a.col1 = b.col1 and a.col1 = b.col1因为=在那里有条款是不够的。