Reading a Django UploadedFile into a Pandas DataFrame

Sea*_*hon 6 django pandas

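I am trying to read a .csv file uploaded to Django into a DataFrame.

I am following the instructions on the Django REST Framework page for file uploads. When I PUT the .csv file to the defined endpoint, I end up with a Django UploadedFile object, specifically a TemporaryUploadedFile.

I am trying to read this object into a pandas DataFrame using read_csv, but the temporary uploaded file has additional formatting wrapped around the CSV data. I would like to know how to read the original .csv file that was uploaded.

Following the DRF documentation, I assigned: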

file_obj = request.data['file']

In the Python debug console, I see:

ipdb> file_obj                                                                                                                                                                            
<TemporaryUploadedFile: foobar.csv (multipart/form-data; boundary=--------------------------044608164241682586561733)>

Things I have tried so far:

Using the path of the original file on disk, I can read it into pandas like this.

dataframe = pd.read_csv(open("foobar.csv", "rb"))

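However, the uploaded temporary file contains extra metadata that appears to have been added during the upload, so reading it the same way fails.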

ipdb> pd.read_csv(open(file_obj.temporary_file_path(), "rb"))                                                                                                                             
*** pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 32

If I try to use the UploadedFile.read() method, I run into the following problem.

ipdb> dataframe = pd.read_csv(file_obj.read())                                                                                                                                            
*** OSError: Expected file path name or file-like object, got <class 'bytes'> type

Thanks!

PS: The first few lines of the original file look like this.

SPID,SA_ID,UOM,DIR,DATE,RS,NAICS,APCT,1:00,2:00,3:00,4:00,5:00,6:00,7:00,8:00,9:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00,0:00:00
(Blanked),123456789,KWH,R,5/2/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0.144,1.064,3.07,4.531,4.013,5.205,4.751,4.647,3.142,2.464,1.173,0.023,0,0,0,0,0
(Blanked),123456789,KWH,R,3/10/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0,0.007,0.622,0.179,0.003,0.274,0.167,0.014,0.004,0.028,0.139,0,0,0,0,0,0

When I look at the contents of the temporary file, I see this.

----------------------------789873173211443224653494
Content-Disposition: form-data; name="file"; filename="foobar.csv"
Content-Type: File

SPID,SA_ID,UOM,DIR,DATE,RS,NAICS,APCT,1:00,2:00,3:00,4:00,5:00,6:00,7:00,8:00,9:00,10:00,11:00,12:00,13:00,14:00,15:00,16:00,17:00,18:00,19:00,20:00,21:00,22:00,23:00,0:00:00
(Blanked),123456789,KWH,R,5/2/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0.144,1.064,3.07,4.531,4.013,5.205,4.751,4.647,3.142,2.464,1.173,0.023,0,0,0,0,0
(Blanked),123456789,KWH,R,3/10/18,H2ETOUAN,,100,0,0,0,0,0,0,0,0,0.007,0.622,0.179,0.003,0.274,0.167,0.014,0.004,0.028,0.139,0,0,0,0,0,0

小智 8

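UploadedFile.read() returns the file data as bytes, not as a file path or a file-like object. To use the pandas read_csv() function, you need to convert those bytes into a stream. Since your file is a CSV, the most direct way is to use bytes.decode() together with io.StringIO(), for example: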

dataframe = pd.read_csv(io.StringIO(file_obj.read().decode('utf-8')), delimiter=',')
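The same idea can be tried outside Django. The following is a minimal sketch, assuming the upload has already been reduced to plain CSV bytes; `sample_bytes` and both helper names are made up for illustration and stand in for whatever `file_obj.read()` returns. It also shows that wrapping the bytes in `io.BytesIO` should work equally well, since `read_csv` accepts binary file-like objects.

```python
import io

import pandas as pd


def dataframe_from_text_stream(data: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Decode the bytes first, then hand pandas a text stream (hypothetical helper)."""
    return pd.read_csv(io.StringIO(data.decode(encoding)))


def dataframe_from_binary_stream(data: bytes, encoding: str = "utf-8") -> pd.DataFrame:
    """Hand pandas a binary stream and let it do the decoding (hypothetical helper)."""
    return pd.read_csv(io.BytesIO(data), encoding=encoding)


# Made-up stand-in for the bytes returned by file_obj.read().
sample_bytes = (
    b"SPID,SA_ID,UOM,APCT\n"
    b"(Blanked),123456789,KWH,100\n"
    b"(Blanked),123456789,KWH,100\n"
)

df_text = dataframe_from_text_stream(sample_bytes)
df_binary = dataframe_from_binary_stream(sample_bytes)
print(df_text.equals(df_binary))
```

Passing the raw bytes straight to `read_csv` fails because pandas expects a path or an object with a `read()` method; either wrapper supplies that.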
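Note that the temporary file dump in the question still contains a multipart boundary line and part headers, so decoding the bytes alone may not be enough. That pattern usually suggests the request was sent as multipart/form-data while the view treated the body as a raw file (for example DRF's FileUploadParser rather than MultiPartParser); this is an assumption about the asker's setup, and switching the parser would likely be the cleaner fix. As a workaround, here is a rough sketch that strips the multipart envelope with the standard library `email` parser. The helper name `extract_uploaded_csv` and the shortened `raw_body` sample are made up for illustration.

```python
import io
from email import message_from_bytes

import pandas as pd


def extract_uploaded_csv(raw: bytes) -> bytes:
    """Return the file part of a raw multipart/form-data body (hypothetical helper)."""
    # The first line of the body is "--" followed by the boundary string.
    first_line = raw.split(b"\n", 1)[0].strip()
    if not first_line.startswith(b"--"):
        raise ValueError("body does not start with a multipart boundary line")
    boundary = first_line[2:]
    # Prepend the header the email parser needs in order to split the parts.
    header = b"Content-Type: multipart/form-data; boundary=" + boundary + b"\r\n\r\n"
    message = message_from_bytes(header + raw)
    for part in message.walk():
        if part.get_filename():  # the part that carries filename="..."
            return part.get_payload(decode=True)
    raise ValueError("no file part found in multipart body")


# Shortened stand-in for the temporary file contents shown in the question.
raw_body = (
    b"----------------------------789873173211443224653494\r\n"
    b'Content-Disposition: form-data; name="file"; filename="foobar.csv"\r\n'
    b"Content-Type: File\r\n"
    b"\r\n"
    b"SPID,SA_ID,UOM,DIR\r\n"
    b"(Blanked),123456789,KWH,R\r\n"
    b"(Blanked),123456789,KWH,R\r\n"
    b"----------------------------789873173211443224653494--\r\n"
)

csv_bytes = extract_uploaded_csv(raw_body)
dataframe = pd.read_csv(io.BytesIO(csv_bytes))
print(dataframe.columns.tolist())
```

In a view, `raw_body` would be replaced by `file_obj.read()`. If the request is parsed as multipart in the first place, the uploaded file should already contain only the CSV and this step becomes unnecessary.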