How do I read a gzip-compressed Parquet file from S3 into Python using Boto3?

Question

Cor*_*son 1 python amazon-s3 amazon-web-services parquet boto3

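I have a file called data.parquet.gzip in my S3 bucket, and I can't figure out what goes wrong when I try to read it. I have normally used StringIO for this kind of thing, but I don't know how to make it work here. I want to load the file from S3 into my Python Jupyter notebook session using Pandas and boto3.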

Answer 1

Cor*_*son 5

The solution is actually quite straightforward.

import boto3  # S3 client used to download the object
import pandas as pd  # pd.read_parquet builds the DataFrame
from io import BytesIO  # Wraps raw bytes in a seekable file-like object
import pyarrow  # Parquet engine used by pandas; importing it fails early if it is missing

# Set up your S3 client.
# Ideally your access key and secret access key are already stored in a
# credentials file, so you don't have to pass these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)

# Fetch the object (the *_HERE names are placeholders for your own values)
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)

# Convert the streaming body to bytes using .read()
parquet_bytes = s3_response_object['Body'].read()

# Parse the bytes as Parquet through an in-memory BytesIO buffer
df = pd.read_parquet(BytesIO(parquet_bytes))
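
The same steps can be wrapped in a small reusable helper. This is only a sketch: the function names `fetch_object_buffer` and `read_parquet_from_s3` are made up for illustration, and the client is passed in so that any object exposing boto3's `get_object(Bucket=..., Key=...)` will do. It also shows why BytesIO is used rather than StringIO: Parquet is a binary format, and the reader needs a seekable bytes buffer.

```python
from io import BytesIO


def fetch_object_buffer(s3_client, bucket, key):
    """Download one S3 object and return it as a seekable in-memory buffer.

    `s3_client` is anything exposing boto3's get_object(Bucket=..., Key=...).
    (Hypothetical helper name, not part of boto3.)
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # response["Body"] is a streaming object; .read() drains it into bytes.
    # Parquet readers seek around the file (the footer sits at the end),
    # so the bytes are wrapped in BytesIO instead of passing the raw stream.
    return BytesIO(response["Body"].read())


def read_parquet_from_s3(s3_client, bucket, key):
    """Load a Parquet object into a DataFrame (needs pandas plus pyarrow).

    (Hypothetical helper name, not part of pandas.)
    """
    # Imported lazily so fetch_object_buffer works without pandas installed.
    import pandas as pd

    # The gzip compression lives inside the Parquet file itself, so the
    # Parquet engine decompresses it; no separate gunzip step is needed.
    return pd.read_parquet(fetch_object_buffer(s3_client, bucket, key))
```

With real credentials configured, a call would look something like df = read_parquet_from_s3(boto3.client('s3'), 'bucket-name', 'data.parquet.gzip'), where the bucket and key are placeholders.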

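It has been a few years since I posted this. Nowadays you can pass an S3 path directly to "pandas.read_parquet", even when the file is a gzip-compressed Parquet, and that is my preferred approach. So you can simply run "df = pd.read_parquet('s3://bucket-name/path/to/df.parquet.gzip')" and it should work! (3 upvotes)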
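
A minimal sketch of the direct-path approach follows. The helper names `s3_url` and `read_parquet_by_url` are hypothetical, introduced only for illustration. It assumes the optional s3fs package is installed alongside pandas and pyarrow, since pandas relies on it to open s3:// URLs, and that AWS credentials can be found through the usual configuration (environment variables or a credentials file).

```python
def s3_url(bucket, key):
    """Build the s3://bucket/key URL form that pandas accepts.

    (Hypothetical helper name, used only for this example.)
    """
    # Strip any leading slash so the URL does not contain "//" after the bucket.
    return f"s3://{bucket}/{key.lstrip('/')}"


def read_parquet_by_url(bucket, key, **read_kwargs):
    """Read a Parquet object straight from S3 by URL.

    Assumes s3fs is installed; extra keyword arguments (for example
    columns=[...]) are forwarded unchanged to pandas.read_parquet.
    """
    # Imported lazily so s3_url can be used without pandas installed.
    import pandas as pd

    return pd.read_parquet(s3_url(bucket, key), **read_kwargs)
```

For the path in the comment above, s3_url('bucket-name', 'path/to/df.parquet.gzip') yields 's3://bucket-name/path/to/df.parquet.gzip', which is the string handed to pd.read_parquet.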
