Merging Parquet files with s3-dist-cp

Question


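I'm wondering whether the s3-dist-cp tool can be used to merge Parquet files (Snappy-compressed). I tried the "--groupBy" and "--targetSize" options, and it did merge the small files into larger ones. However, I can't read the merged files in Spark or AWS Athena. In AWS Athena I get the following error: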

HIVE_CURSOR_ERROR: Expected 246379 values in column chunk at s3://my_analytics/parquet/auctions/region=us/year=2017/month=1/day=1/output123 offset 4 but got 247604 values instead over 1 pages ending at file offset 39

This query ran against the "randomlogdatabase" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: 4ff77c55-3b69-414d-8fd9-a3d135f5ff2f.


Any help is appreciated.

Answer 1

Ste*_*Kay 5

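Parquet files have significant internal structure. This page describes it in detail, but the upshot is that the metadata is stored at the end of the file, much like a zip file, so concatenating Parquet files byte-for-byte corrupts them: the footer of the last file no longer describes the data in front of it, which is consistent with the value-count mismatch in the error above. To merge Parquet files you need to use something that understands the Parquet file format, such as Spark.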
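As a rough illustration of the Spark route, here is a minimal PySpark sketch that reads the small files and rewrites them as fewer, larger ones. It is not the only way to do this. The function names `estimate_num_files` and `compact_parquet`, the 128 MiB default target, and the destination path are assumptions made for this example; the source path is the partition from the error message.

```python
import math


def estimate_num_files(total_bytes: int, target_bytes: int = 128 * 1024 * 1024) -> int:
    """Rough number of output files so that each lands near target_bytes.

    Hypothetical helper: the 128 MiB default is an assumption, not a rule.
    """
    if total_bytes < 0 or target_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_bytes must be > 0")
    return max(1, math.ceil(total_bytes / target_bytes))


def compact_parquet(src: str, dst: str, num_files: int) -> None:
    """Read the Parquet files under src and rewrite them as num_files files under dst.

    Spark decodes every row group and writes new files with fresh footers,
    so the output is valid Parquet, unlike a byte-level concatenation.
    """
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-parquet").getOrCreate()
    df = spark.read.parquet(src)
    (
        df.repartition(num_files)          # shuffle rows into num_files partitions
        .write
        .option("compression", "snappy")   # keep the Snappy compression from the question
        .parquet(dst)                      # default save mode fails if dst already exists
    )


# Example call (dst is a hypothetical location; write somewhere other than src):
# compact_parquet(
#     "s3://my_analytics/parquet/auctions/region=us/year=2017/month=1/day=1/",
#     "s3://my_analytics/parquet_compacted/auctions/region=us/year=2017/month=1/day=1/",
#     num_files=estimate_num_files(total_bytes=1_000_000_000),
# )
```

Writing to a separate destination and swapping it in afterwards avoids reading and overwriting the same path in one job. The actual output sizes depend on how well the data compresses, so the file count is only an estimate.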

Archived:	8 years, 1 month ago
Viewed:	1830 times
Last active:	6 years, 3 months ago