Convert a pandas DataFrame to a Parquet file bytes object

Question

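I have a pandas DataFrame and want to write it to Azure file storage as a Parquet file.

So far I have not been able to convert the DataFrame directly into bytes that I can then upload to Azure. My current workaround is to save it as a Parquet file on the local drive, read that file back as a bytes object, and upload the bytes to Azure.

Can anyone tell me how to convert a pandas DataFrame directly into a "Parquet file" bytes object without writing it to disk? The I/O operations really slow things down, and it feels like very ugly code...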

# Write the data_frame to a Parquet file on the local drive
data_frame.to_parquet('temp_p.parquet', engine='auto', compression='snappy')

# Read the Parquet file back as bytes
with open('temp_p.parquet', mode='rb') as f:
    fileContent = f.read()

# Upload the bytes object to Azure
service.create_file_from_bytes(share_name, file_path, file_name, fileContent, index=0, count=len(fileContent))

I am looking for something like the following, where transform_functionality returns a bytes object:

my_bytes = data_frame.transform_functionality()
service.create_file_from_bytes(share_name, file_path, file_name, my_bytes, index=0, count=len(my_bytes))

Answer 1

Cri*_*ber 5

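I found a solution and am posting it here in case anyone else needs to do the same thing. After writing the DataFrame to a buffer with to_parquet, I call the buffer's getvalue() method to get the bytes object out of the buffer, as shown below: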

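    from io import BytesIO

    # Write the Parquet data to an in-memory buffer instead of a file on disk
    buffer = BytesIO()
    data_frame.to_parquet(buffer, engine='auto', compression='snappy')

    # getvalue() returns the buffer's entire contents as a bytes object
    service.create_file_from_bytes(share_name, file_path, file_name,
                buffer.getvalue(), index=0, count=buffer.getbuffer().nbytes)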

FWIW, when tested with Python 3.6.1, pandas 0.24.0, pyarrow 0.9.0 and fastparquet 0.2.1, this solution only works with the pyarrow engine. Using fastparquet raises TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO. (6 upvotes)
