Shi*_*kan 7 amazon-web-services presto amazon-athena
I have an Athena table partitioned by year, month, and day, defined as follows:
CREATE EXTERNAL TABLE `my_table`(
`price` double)
PARTITIONED BY (
`year` int,
`month` int,
`day` int)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
I need to query it between dates. As far as I can tell, the options are, for example:
SELECT avg(price)
FROM my_table
WHERE year = 2018 AND month = 1
Result: run time 4.89 seconds, data scanned 20.72MB
SELECT avg(price)
FROM my_table
WHERE cast(date_parse(concat(cast(year as varchar(4)),'-',
cast(month as varchar(2)),'-',
cast(day as varchar(2))
), '%Y-%m-%d') as date)
BETWEEN Date '2018-01-01' AND Date '2018-01-31'
Result: run time 8.64 seconds, data scanned 20.72MB
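Both queries scan the same amount of data, so I guess Athena is smart enough to use partitioning even when the partition columns are concatenated and cast. Why, then, does the second query take roughly twice as long? What exactly happens in the background?
Many thanks.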
In this case, Athena applies the condition as a filterPredicate. You can check this with an EXPLAIN ANALYZE statement:
EXPLAIN ANALYZE SELECT count(*) FROM "db"."table"
where year||month||day >= '20220629';
...
- ScanFilterProject[table = awsdatacatalog:HiveTableHandle{schemaName=db, tableName=table, analyzePartitionValues=Optional.empty}, grouped = false,
filterPredicate = ("concat"("concat"("year", "month"), "day") >= CAST('20220629' AS varchar))] => [[]]
CPU: 2.57s (99.04%), Output: 12424 rows (0B)
Input avg.: 49.11 rows, Input std.dev.: 54.32%
LAYOUT: db.table
month := month:string:-1:PARTITION_KEY
:: [[06], [07]]
year := year:string:-1:PARTITION_KEY
:: [[2022]]
day := day:string:-1:PARTITION_KEY
:: [[05], [06], [07], [11], [12], [13], [14], [15], [16], [17], [18], [19], [29], [30]]
Input: 12424 rows (5.68kB), Filtered: 0.00%
...
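For comparison, here is a minimal sketch of the same kind of date filter written as plain comparisons on the partition columns, so nothing has to be concatenated or parsed. It assumes the `my_table` definition from the question (integer `year`, `month`, `day`). The second date range is a made-up example, and neither query was timed here, so treat any speed-up as something to confirm on your own data.

```sql
-- All of January 2018, using only direct comparisons on partition columns.
SELECT avg(price)
FROM my_table
WHERE year = 2018
  AND month = 1
  AND day BETWEEN 1 AND 31;

-- Hypothetical range crossing a month boundary: 2018-01-15 through 2018-02-14.
SELECT avg(price)
FROM my_table
WHERE year = 2018
  AND ((month = 1 AND day >= 15)
    OR (month = 2 AND day <= 14));
```

Prefixing either statement with EXPLAIN ANALYZE lets you see whether the condition still appears as a filterPredicate and which partition values are listed under LAYOUT.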