Using AWS Lambda (Python 3) to read Parquet files stored in S3

Question

Pta*_*tah 6 python amazon-s3 parquet aws-lambda pyarrow

I am trying to load, process, and write Parquet files in S3 with AWS Lambda. My testing/deployment process is:

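https://github.com/lambci/docker-lambda as a container that mimics the Amazon environment, because native libraries (numpy among them) need to be installed.
This procedure to generate the zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
Add a test Python function to the zip, send it to S3, update the Lambda, and test it.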
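
Because the unzipped size of the package turns out to matter further down, it may help to check the archive before the upload step. The following is a minimal sketch using only the standard library. The helper names `unzipped_size` and `fits_in_lambda` are hypothetical, and the default limit is simply the 256 MB figure quoted in this question, so check the current AWS Lambda quotas before relying on it.

```python
import zipfile

# Assumption: limit taken from the figure quoted in this question.
# Check the current AWS Lambda quotas before relying on it.
MAX_UNZIPPED_BYTES = 256 * 1024 * 1024


def unzipped_size(zip_path):
    """Return the total uncompressed size, in bytes, of all files in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def fits_in_lambda(zip_path, limit=MAX_UNZIPPED_BYTES):
    """Return True if the archive's uncompressed size is within the limit."""
    return unzipped_size(zip_path) <= limit
```

Calling `fits_in_lambda("my_package.zip")` (a placeholder file name) before sending the zip to S3 would flag an oversized package before Lambda rejects it.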

There seem to be two possible approaches, both of which work locally in the Docker container:

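fastparquet with s3fs: unfortunately, the unzipped size of the package is larger than 256 MB, so I cannot update the Lambda code with it.
pyarrow with s3fs: I followed https://github.com/apache/arrow/pull/916 and, when executing with the Lambda function, I get one of the following:
- If I prefix the URI with S3 or S3N (as in the code example): in the Lambda environment, I get OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848. Locally, I get IndexError: list index out of range in pyarrow/parquet.py, line 714.
- If I don't prefix the URI with S3 or S3N: it works locally (I can read the Parquet data). In the Lambda environment, I get the same OSError: Passed non-file path: s3://mybucket/path/to/myfile in pyarrow/parquet.py, line 848.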
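
For reference, the un-prefixed variant that works locally can be sketched roughly as below. This is an illustration under assumptions, not the exact code from the linked pull request: `strip_s3_scheme` and `read_parquet_from_s3` are hypothetical helper names, credentials are assumed to come from the default AWS credential chain, and how pyarrow treats a scheme-prefixed path may differ between versions.

```python
def strip_s3_scheme(uri):
    """Remove a leading s3://, s3n:// or s3a:// scheme, returning 'bucket/key'."""
    for scheme in ("s3://", "s3n://", "s3a://"):
        if uri.lower().startswith(scheme):
            return uri[len(scheme):]
    return uri


def read_parquet_from_s3(uri):
    """Read a Parquet file or dataset from S3 into a pandas DataFrame."""
    # Imported lazily so the helper above works without these packages.
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # assumption: default AWS credential chain
    dataset = pq.ParquetDataset(strip_s3_scheme(uri), filesystem=fs)
    return dataset.read().to_pandas()
```

With this, `read_parquet_from_s3("s3://mybucket/path/to/myfile")` hands pyarrow the path `mybucket/path/to/myfile` together with the s3fs filesystem object.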

My questions are:

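Why do I get different results in the Docker container than in the Lambda environment?
What is the proper way to give the URI?
Is there an accepted way to read Parquet files in S3 through AWS Lambda?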

Thanks!

Answer 1

Igo*_*res 6

AWS has a project (AWS Data Wrangler) that makes this possible, with full Lambda layer support.

The docs include a step-by-step guide for setting it up.

Code example:

import awswrangler as wr

# Write (df is an existing pandas DataFrame)
wr.s3.to_parquet(
    df=df,
    path="s3://...",
    dataset=True,
    database="my_database",  # Optional, only if you want it available in the Athena/Glue Catalog
    table="my_table",
    partition_cols=["PARTITION_COL_NAME"])

# Read
df = wr.s3.read_parquet(path="s3://...")
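
Inside a Lambda function, the read call might be wired up roughly as follows. This is a sketch under assumptions: it supposes the function is triggered by S3 event notifications and that awswrangler is available through a Lambda layer. `s3_paths_from_event` is a hypothetical helper, and returning row counts is only for illustration.

```python
from urllib.parse import unquote_plus


def s3_paths_from_event(event):
    """Build s3:// paths from the records of an S3 event notification."""
    paths = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        paths.append(f"s3://{bucket}/{key}")
    return paths


def lambda_handler(event, context):
    """Hypothetical handler: read each Parquet object named in the event."""
    import awswrangler as wr  # assumption: provided by a Lambda layer

    row_counts = {}
    for path in s3_paths_from_event(event):
        df = wr.s3.read_parquet(path=path)
        row_counts[path] = len(df)
    return row_counts
```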

Answer 2

Pta*_*tah 2

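It was an environment issue (the Lambda in the VPC could not access the bucket). Pyarrow is now working.
Hopefully the question itself gives a good enough overview of how to make all of this work.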
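
For anyone hitting the same symptom, a quick connectivity check from inside the function can separate a network problem from a pyarrow problem. A Lambda attached to a VPC needs a route to S3, for example an S3 VPC endpoint or a NAT gateway. The sketch below is illustrative only: `can_reach_bucket` and `make_s3_client` are hypothetical helper names, and the timeout values are arbitrary.

```python
def can_reach_bucket(s3_client, bucket):
    """Return (True, None) if HeadBucket succeeds, else (False, error message)."""
    try:
        s3_client.head_bucket(Bucket=bucket)
        return True, None
    except Exception as exc:  # broad on purpose: timeouts, DNS and permission errors
        return False, f"{type(exc).__name__}: {exc}"


def make_s3_client():
    """Create an S3 client with short timeouts so a blocked network fails fast."""
    import boto3
    from botocore.config import Config

    # Assumption: 3-second timeouts and a single retry are arbitrary choices.
    config = Config(connect_timeout=3, read_timeout=3, retries={"max_attempts": 1})
    return boto3.client("s3", config=config)
```

Logging the result of `can_reach_bucket(make_s3_client(), "mybucket")` at the start of the handler would show a timeout or connection error when the VPC has no path to S3.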

Asked:	8 years, 4 months ago
Viewed:	4389 times
Last active:	7 years, 11 months ago