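I have a table (data_table) partitioned by several columns: year, month, and monthkey.
The directory layout looks like year=2017/month=08/monthkey=2017-08/files.parquet
Which of the following queries would be faster?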
select count(*) from data_table where monthkey='2017-08'
or
select count(*) from data_table where monthkey='2017-08' and year = '2017' and month = '08'
I think that in the first case Hadoop would need more initial time to locate the required directories, but I wanted to confirm.
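Locating the relevant partitions is a metastore operation, not a file system operation.
It is done by querying the metastore, not by scanning directories.
The metastore query for the first use case will most likely be faster than the one for the second, but either way we are talking about fractions of a second here.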
create external table t100k(i int)
partitioned by (x int,y int,xy string)
;
explain dependency select count(*) from t100k where xy='100-1000';
The query issued against the metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where (("FILTER2"."PART_KEY_VAL" = '100-1000'))
explain dependency select count(*) from t100k where x=100 and y=1000 and xy='100-1000';
The query issued against the metastore:
select "PARTITIONS"."PART_ID"
from "PARTITIONS"
inner join "TBLS" on "PARTITIONS"."TBL_ID" = "TBLS"."TBL_ID" and "TBLS"."TBL_NAME" = 't100k'
inner join "DBS" on "TBLS"."DB_ID" = "DBS"."DB_ID" and "DBS"."NAME" = 'local_db'
inner join "PARTITION_KEY_VALS" "FILTER0" on "FILTER0"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER0"."INTEGER_IDX" = 0
inner join "PARTITION_KEY_VALS" "FILTER1" on "FILTER1"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER1"."INTEGER_IDX" = 1
inner join "PARTITION_KEY_VALS" "FILTER2" on "FILTER2"."PART_ID" = "PARTITIONS"."PART_ID" and "FILTER2"."INTEGER_IDX" = 2
where ( ( (((case when "FILTER0"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER0"."PART_KEY_VAL" as decimal(21,0)) else null end) = 100)
and ((case when "FILTER1"."PART_KEY_VAL" <> '__HIVE_DEFAULT_PARTITION__' then cast("FILTER1"."PART_KEY_VAL" as decimal(21,0)) else null end) = 1000))
and ("FILTER2"."PART_KEY_VAL" = '100-1000')) )
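To get a feel for why both lookups resolve to the same partition, here is a rough toy model in Python using an in-memory SQLite database. It is only a sketch: the two tables are a heavily simplified stand-in for the metastore's "PARTITIONS" and "PARTITION_KEY_VALS" tables (the real schema has more tables and columns), the three sample partitions are made up for illustration, and SQLite's `cast(... as integer)` replaces the `decimal(21,0)` cast and the `__HIVE_DEFAULT_PARTITION__` check seen above.

```python
import sqlite3

# Toy stand-in for the metastore backing database (assumption: simplified schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table PARTITIONS (PART_ID integer primary key, PART_NAME text);
create table PARTITION_KEY_VALS (PART_ID integer, PART_KEY_VAL text, INTEGER_IDX integer);
""")

# Hypothetical partitions of t100k, keyed by (x, y, xy).
sample_partitions = [(100, 1000), (100, 2000), (200, 1000)]
for part_id, (x, y) in enumerate(sample_partitions, start=1):
    xy = f"{x}-{y}"
    cur.execute("insert into PARTITIONS values (?, ?)",
                (part_id, f"x={x}/y={y}/xy={xy}"))
    # One row per partition key; INTEGER_IDX is the key's position (0=x, 1=y, 2=xy).
    for idx, val in enumerate((str(x), str(y), xy)):
        cur.execute("insert into PARTITION_KEY_VALS values (?, ?, ?)",
                    (part_id, val, idx))

# Filter on xy only: a single join against PARTITION_KEY_VALS.
one_key_sql = """
select p.PART_ID from PARTITIONS p
inner join PARTITION_KEY_VALS f2 on f2.PART_ID = p.PART_ID and f2.INTEGER_IDX = 2
where f2.PART_KEY_VAL = '100-1000'
"""

# Filter on x, y and xy: three joins, plus casts for the integer keys.
three_keys_sql = """
select p.PART_ID from PARTITIONS p
inner join PARTITION_KEY_VALS f0 on f0.PART_ID = p.PART_ID and f0.INTEGER_IDX = 0
inner join PARTITION_KEY_VALS f1 on f1.PART_ID = p.PART_ID and f1.INTEGER_IDX = 1
inner join PARTITION_KEY_VALS f2 on f2.PART_ID = p.PART_ID and f2.INTEGER_IDX = 2
where cast(f0.PART_KEY_VAL as integer) = 100
  and cast(f1.PART_KEY_VAL as integer) = 1000
  and f2.PART_KEY_VAL = '100-1000'
"""

ids_one = [row[0] for row in cur.execute(one_key_sql)]
ids_three = [row[0] for row in cur.execute(three_keys_sql)]
print(ids_one, ids_three)
```

Both statements select the same single partition; the three-key version just does extra joins and casts to get there, which is consistent with the single-key lookup being no slower on the metastore side.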