Hive table with multiple partitions

San*_*ver 2 hive hiveql

I have a table (data_table) with multiple partition columns: year/month/monthkey.

The directories look like year=2017/month=08/monthkey=2017-08/files.parquet
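To make the layout concrete, here is a minimal Python sketch that splits such a path into its partition key/value pairs. The helper name `parse_partition_path` is hypothetical, and the sketch ignores any escaping Hive applies to special characters in partition values.

```python
def parse_partition_path(path):
    """Split a Hive-style path into a partition spec (in path order) and a file name."""
    parts = path.strip("/").split("/")
    spec = {}
    for segment in parts:
        # Partition directories have the form key=value.
        if "=" in segment:
            key, value = segment.split("=", 1)
            spec[key] = value
    # The last segment is a data file only if it is not a key=value directory.
    file_name = parts[-1] if "=" not in parts[-1] else None
    return spec, file_name


spec, file_name = parse_partition_path(
    "year=2017/month=08/monthkey=2017-08/files.parquet"
)
print(spec)
print(file_name)
```

Each directory level corresponds to one partition column, in the order the columns were declared.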

Which of the following queries would be faster?

select count(*) from data_table where monthkey='2017-08'

or

select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'

I think that in the first case Hadoop would need more initial time to locate the required directories, but I would like to confirm.

Dav*_*itz 5

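Locating the relevant partitions is a metastore operation, not a file system operation.
It is done by querying the metastore rather than by scanning directories.
The metastore query for the first use case is most likely faster than the one for the second, but either way we are talking about fractions of a second.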
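As a rough illustration of that point, the sketch below treats the metastore as a plain in-memory list of partition specs and prunes it with equality predicates. The list `PARTITIONS` and the helpers `prune` and `location` are hypothetical stand-ins, not Hive APIs; the only claim is that pruning works on catalog entries rather than on directory listings.

```python
# Hypothetical in-memory stand-in for the metastore's list of partitions.
PARTITIONS = [
    {"year": "2017", "month": "07", "monthkey": "2017-07"},
    {"year": "2017", "month": "08", "monthkey": "2017-08"},
    {"year": "2017", "month": "09", "monthkey": "2017-09"},
]


def prune(partitions, **predicates):
    """Return the partitions whose key values match every equality predicate."""
    return [
        p for p in partitions
        if all(p.get(key) == value for key, value in predicates.items())
    ]


def location(partition):
    """Build the relative directory for a partition from its catalog entry."""
    return "/".join(f"{key}={value}" for key, value in partition.items())


one_key = prune(PARTITIONS, monthkey="2017-08")
three_keys = prune(PARTITIONS, monthkey="2017-08", year="2017", month="08")

print([location(p) for p in one_key])
print(one_key == three_keys)
```

Both predicates select the same catalog entry, so the data that is eventually read is identical; only the lookup itself differs.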

Demo

create external table t100k(i int)
partitioned by (x int,y int,xy string)
;
explain dependency select count(*) from t100k where xy='100-1000';

Query issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"  
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 't100k'   
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and "DBS"."NAME" = 'local_db' 
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2 
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))
 
explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';

Query issued against the metastore:

select "PARTITIONS"."PART_ID" 
from "PARTITIONS"  
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID"     and "TBLS"."TBL_NAME" = 't100k'   
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID"      and "DBS"."NAME" = 'local_db' 
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0 
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1 
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2 
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100) 
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))  
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )
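The two metastore queries can be tried out on a toy copy of the schema. The following is only a sketch using Python's built-in `sqlite3`: it keeps just the `PARTITIONS` and `PARTITION_KEY_VALS` tables with the columns that appear in the queries above, omits `TBLS` and `DBS`, uses made-up sample partitions, and casts to `integer` rather than `decimal(21,0)`.

```python
import sqlite3

DEFAULT = "__HIVE_DEFAULT_PARTITION__"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table PARTITIONS (PART_ID integer primary key, PART_NAME text);
    create table PARTITION_KEY_VALS (
        PART_ID integer, PART_KEY_VAL text, INTEGER_IDX integer
    );
""")

# Made-up sample partitions of t100k as (PART_ID, x, y, xy).
rows = [
    (1, "100", "1000", "100-1000"),
    (2, "100", "2000", "100-2000"),
    (3, DEFAULT, "1000", "null-1000"),
]
for part_id, x, y, xy in rows:
    conn.execute(
        "insert into PARTITIONS values (?, ?)",
        (part_id, f"x={x}/y={y}/xy={xy}"),
    )
    # Every key value is stored as a string; INTEGER_IDX is the column position.
    conn.executemany(
        "insert into PARTITION_KEY_VALS values (?, ?, ?)",
        [(part_id, value, idx) for idx, value in enumerate((x, y, xy))],
    )

# Filter on xy only: a single join against PARTITION_KEY_VALS.
one_key_ids = [r[0] for r in conn.execute("""
    select p.PART_ID from PARTITIONS p
    join PARTITION_KEY_VALS f2 on f2.PART_ID = p.PART_ID and f2.INTEGER_IDX = 2
    where f2.PART_KEY_VAL = '100-1000'
""")]

# Filter on x, y and xy: three joins, plus a cast for the int-typed keys.
three_key_ids = [r[0] for r in conn.execute("""
    select p.PART_ID from PARTITIONS p
    join PARTITION_KEY_VALS f0 on f0.PART_ID = p.PART_ID and f0.INTEGER_IDX = 0
    join PARTITION_KEY_VALS f1 on f1.PART_ID = p.PART_ID and f1.INTEGER_IDX = 1
    join PARTITION_KEY_VALS f2 on f2.PART_ID = p.PART_ID and f2.INTEGER_IDX = 2
    where (case when f0.PART_KEY_VAL <> :d
                then cast(f0.PART_KEY_VAL as integer) else null end) = 100
      and (case when f1.PART_KEY_VAL <> :d
                then cast(f1.PART_KEY_VAL as integer) else null end) = 1000
      and f2.PART_KEY_VAL = '100-1000'
""", {"d": DEFAULT})]

print(one_key_ids, three_key_ids)
```

Both lookups return the same partition id. The second one simply does more work to get there: two extra joins and a cast per int-typed key, with the `case` expression turning the default-partition marker into null so it never matches a numeric comparison.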