Querying an Athena table partitioned by year, month, and day

Shi*_*kan 7 amazon-web-services presto amazon-athena

I have an Athena table partitioned by year, month, and day, defined as follows:

CREATE EXTERNAL TABLE `my_table`(
    `price` double) 
PARTITIONED BY ( 
    `year` int, 
    `month` int, 
    `day` int) 
ROW FORMAT SERDE 
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

I need to query it over a date range. As far as I can tell, the options are, for example:

SELECT avg(price) 
FROM my_table 
WHERE year = 2018 AND month = 1

Result: run time: 4.89 seconds, data scanned: 20.72MB

SELECT avg(price) 
FROM my_table 
WHERE cast(date_parse(concat(cast(year as varchar(4)),'-',
                             cast(month as varchar(2)),'-',
                             cast(day as varchar(2))
                             ), '%Y-%m-%d') as date) 
BETWEEN Date '2018-01-01' AND Date '2018-01-31'

Result: run time: 8.64 seconds, data scanned: 20.72MB

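So I guess Athena is smart enough to use partitioning even when the concatenated partition columns are cast. Why, then, does the second query take roughly twice as long? What exactly is happening in the background?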

Thanks a lot.

Ale*_*sov 0

In this case Athena will use a filterPredicate, which you can check with an EXPLAIN ANALYZE statement:

    EXPLAIN ANALYZE SELECT count(*) FROM "db"."table" 
    where year||month||day >= '20220629';
...
        - ScanFilterProject[table = awsdatacatalog:HiveTableHandle{schemaName=db, tableName=table, analyzePartitionValues=Optional.empty}, grouped = false, 
filterPredicate = ("concat"("concat"("year", "month"), "day") >= CAST('20220629' AS varchar))] => [[]]
                CPU: 2.57s (99.04%), Output: 12424 rows (0B)
                Input avg.: 49.11 rows, Input std.dev.: 54.32%
                LAYOUT: db.table
                month := month:string:-1:PARTITION_KEY
                    :: [[06], [07]]
                year := year:string:-1:PARTITION_KEY
                    :: [[2022]]
                day := day:string:-1:PARTITION_KEY
                    :: [[05], [06], [07], [11], [12], [13], [14], [15], [16], [17], [18], [19], [29], [30]]
                Input: 12424 rows (5.68kB), Filtered: 0.00%
...
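For comparison, here is a minimal sketch of a date-range filter that compares the partition columns directly instead of wrapping them in concat/date_parse. It reuses my_table and its int partition columns from the question; the range 2018-01-15 to 2018-02-10 is a made-up example, and whether this form actually runs faster on your data is an assumption you would need to verify yourself.

```sql
-- Hypothetical range: 2018-01-15 through 2018-02-10.
-- Each condition is a plain comparison on a partition column,
-- so no per-partition string concatenation or date parsing is needed.
SELECT avg(price)
FROM my_table
WHERE year = 2018
  AND (
        (month = 1 AND day >= 15)   -- tail of January
     OR (month = 2 AND day <= 10)   -- head of February
  );
```

Prefixing this query and the date_parse version with EXPLAIN ANALYZE, as above, should show how the filter appears in each plan and which partitions are listed, which is one way to see where the extra time goes.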