如何避免 AWS Athena CTAS 查询创建小文件？

Question

如何避免 AWS Athena CTAS 查询创建小文件？

jon*_*jon 4 amazon-web-services amazon-athena

我无法弄清楚我的 CTAS 查询出了什么问题，即使我没有提到任何分桶列，它也会在存储在分区内时将数据分解成更小的文件。有没有办法避免这些小文件并将每个分区存储为一个文件，因为小于 128 MB 的文件会导致额外的开销？

CREATE TABLE sampledb.yellow_trip_data_parquet
WITH(
    format = 'PARQUET'
    parquet_compression = 'GZIP',
    external_location='s3://mybucket/Athena/tables/parquet/'
    partitioned_by=ARRAY['year','month']
)
AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y')  AS year,
    date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c')  AS month
FROM sampleDB.yellow_trip_data_raw;

Run Code Online (Sandbox Code Playgroud)

来自我的分区的图像

Answer 1

jon*_*jon 5

我能够通过创建一个分桶列来克服这个问题month_a。下面是代码

CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
    format = 'AVRO',
    external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
    partitioned_by=ARRAY['year','month'],
    bucketed_by=ARRAY['month_a'],
    bucket_count=12
) AS SELECT
    VendorID,
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    passenger_count,
    trip_distance,
    RatecodeID,
    store_and_fwd_flag,
    PULocationID,
    DOLocationID,
    payment_type,
    fare_amount,
    extra,
    mta_tax,
    tip_amount,
    tolls_amount,
    improvement_surcharge,
    total_amount,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
    date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;

Run Code Online (Sandbox Code Playgroud)

Answer 2

The*_*heo 1

Athena 是一个分布式系统，它将通过某种不可观察的机制来扩展查询的执行。看起来它决定使用 5 个工作线程来执行 CTAS 查询，这将导致每个分区中产生 5 个文件。

您可以尝试显式指定存储桶大小为 1，但如果我没记错的话，您可能仍然会获得多个文件。

归档时间：	6 年，11 月前
查看次数：	1843 次
最近记录：	6 年，6 月前