Writing a Pandas DataFrame to a string buffer in chunks

met*_*rsk 3 python amazon-s3 pandas

I have a 10k-row CSV that I want to write to S3 in chunks of 1k rows.

from io import StringIO

import boto3
import pandas as pd

csv_buffer = StringIO()
df.to_csv(csv_buffer, chunksize=1000)
s3_resource = boto3.resource('s3')
s3_resource.Object(bucket, 'df.csv').put(Body=csv_buffer.getvalue())

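This gives me the first 1k rows in the string buffer, which I want to write to S3, but the CSV buffer does not seem to be an iterator I can loop over.

Does anyone know how to achieve this?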

Bra*_*mon 5

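It looks like StringIO doesn't really pay attention to the chunk size: chunksize only controls how many rows to_csv writes at a time, and they all end up in the same buffer. (Iterating over the buffer, or calling .readline(), always returns a single line, not a chunk of lines.)

I'm not very familiar with boto3, but itertools.islice might work for you, since it slices an iterable without creating an intermediate data structure.

If this looks like it fits your needs, I can add some explanation alongside the code: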
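
To make that concrete, here is a small stdlib-only sketch. The CSV text is made up for illustration and simply stands in for whatever df.to_csv wrote into the buffer; the point is that the buffer hands back one line per step, however the text got there:

```python
from io import StringIO

# Made-up CSV text standing in for the output of df.to_csv(csv_buffer).
buffer = StringIO("a,b\n1,2\n3,4\n5,6\n")

first = buffer.readline()  # reads exactly one line
rest = list(buffer)        # iterating also yields one line per step

print(repr(first))
print(rest)
```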

>>> from io import StringIO
... from itertools import islice
... import sys
... 
... import numpy as np
... import pandas as pd
... 
... df = pd.DataFrame(np.arange(300).reshape(100, -1))
... csv_buffer = StringIO()
... df.to_csv(csv_buffer)
... csv_buffer.seek(0)
... 
... # Account for indivisibility (scoop up a remainder on the final slice).
... chunksize = 33
... # Count chunks from the number of rows (shape[0]), not columns (shape[1]).
... n_chunks = max(df.shape[0] // chunksize, 1)
... slices = [(0, chunksize)] * (n_chunks - 1) + [(0, sys.maxsize)]
... chunks = (tuple(islice(csv_buffer, i, j)) for i, j in slices)
... 

>>> next(chunks)
(',0,1,2\n',
 '0,0,1,2\n',
 '1,3,4,5\n',
 '2,6,7,8\n',
 '3,9,10,11\n',
 '4,12,13,14\n',
 '5,15,16,17\n',
 '6,18,19,20\n',
 '7,21,22,23\n',
 '8,24,25,26\n',
 '9,27,28,29\n',
 '10,30,31,32\n',
 '11,33,34,35\n',
 '12,36,37,38\n',
 '13,39,40,41\n',
 '14,42,43,44\n',
 '15,45,46,47\n',
 '16,48,49,50\n',
 '17,51,52,53\n',
 '18,54,55,56\n',
 '19,57,58,59\n',
 '20,60,61,62\n',
 '21,63,64,65\n',
 '22,66,67,68\n',
 '23,69,70,71\n',
 '24,72,73,74\n',
 '25,75,76,77\n',
 '26,78,79,80\n',
 '27,81,82,83\n',
 '28,84,85,86\n',
 '29,87,88,89\n',
 '30,90,91,92\n',
 '31,93,94,95\n')
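
For the S3 side, one possible way to build on the islice idea is sketched below. It is only a sketch under a few assumptions: iter_chunks, upload_chunks, the put callback and the df_part_NNNN.csv key names are all hypothetical and not part of pandas or boto3, each chunk is written as its own object, and the header line is repeated in every chunk so each part parses as a CSV on its own. The demo uses a plain dict in place of a bucket so it runs without AWS credentials.

```python
from io import StringIO
from itertools import islice


def iter_chunks(buffer, chunksize, repeat_header=True):
    """Yield strings holding at most `chunksize` data lines from `buffer`.

    `buffer` must be positioned at the start (call buffer.seek(0) first).
    """
    if chunksize < 1:
        raise ValueError("chunksize must be >= 1")
    # Read the header once so it can be prepended to every chunk.
    header = buffer.readline() if repeat_header else ""
    while True:
        lines = list(islice(buffer, chunksize))  # next chunksize lines
        if not lines:
            break
        yield header + "".join(lines)


def upload_chunks(buffer, chunksize, put):
    """Call put(key, body) once per chunk and return the keys used."""
    keys = []
    for n, body in enumerate(iter_chunks(buffer, chunksize)):
        key = f"df_part_{n:04d}.csv"  # hypothetical naming scheme
        put(key, body)
        keys.append(key)
    return keys


# Demo: a dict stands in for the bucket; 5 data rows in chunks of 2.
demo = StringIO("x,y\n" + "".join(f"{i},{i * i}\n" for i in range(5)))
fake_bucket = {}
keys = upload_chunks(demo, 2, fake_bucket.__setitem__)
print(keys)
```

With boto3, put could be something like lambda key, body: s3_resource.Object(bucket, key).put(Body=body), which reuses the call from the question; remember to call csv_buffer.seek(0) after df.to_csv(csv_buffer) and before chunking.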