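Our Parquet files are stored in an AWS S3 bucket and are compressed with SNAPPY. I can read the uncompressed version of a Parquet file with the Python fastparquet module, but not the compressed version.
This is the code I use for the uncompressed file: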
import s3fs
from fastparquet import ParquetFile

# key/secret are placeholder credentials
s3 = s3fs.S3FileSystem(key='XESF', secret='dsfkljsf')
myopen = s3.open
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.parquet', open_with=myopen)
df = pf.to_pandas()
This returns without error, but when I try to read the snappy-compressed version of the file:
pf = ParquetFile('sample/py_test_snappy/part-r-12423423942834.snappy.parquet', open_with=myopen)
I get an error from to_pandas():
df=pf.to_pandas()
Error message:
KeyError                                  Traceback (most recent call last)
in ()
----> 1 df = pf.to_pandas()

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    293                          for (name, v) in views.items()}
    294                 self.read_row_group(rg, columns, categories, infile=f,
--> 295                                     index=index, assign=parts)
    296                 start += rg.num_rows
    297         else:

/opt/conda/lib/python3.5/site-packages/fastparquet/api.py in read_row_group(self, rg, columns, categories, infile, index, assign)
    151         core.read_row_group(
    152             infile, rg, columns, categories, self.helper, self.cats,
--> 153             self.selfmade, index=index, assign=assign)
    154         if ret:
    155             return df

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group(file, rg, columns, categories, schema_helper, cats, selfmade, index, assign)
    300         raise RuntimeError('Going with pre-allocation!')
    301     read_row_group_arrays(file, rg, columns, categories, schema_helper,
--> 302                           cats, selfmade, assign=assign)
    303
    304     for cat in cats:

/opt/conda/lib/python3.5/site-packages/fastparquet/core.py in read_row_group_arrays(file, rg, columns, categories, schema_helper, cats, selfmade, assign)
    289 …
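
The traceback above is cut off before the failing line, so the cause of the KeyError is not confirmed here. One thing that may be worth ruling out is whether a Snappy codec is importable in the same environment at all. The sketch below is only a diagnostic under that assumption: the helper name `missing_codec_modules` is made up for illustration, and the module names `snappy` (from the python-snappy package) and `cramjam` are my assumption about what different fastparquet versions rely on for Snappy support.

```python
import importlib.util


def missing_codec_modules(candidates=("snappy", "cramjam")):
    """Return the names in `candidates` that cannot be imported here.

    `candidates` holds top-level module names only. The defaults are an
    assumption about which packages provide Snappy support to fastparquet.
    """
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    missing = missing_codec_modules()
    if missing:
        print("Not importable in this environment:", ", ".join(missing))
    else:
        print("All candidate codec modules are importable.")
```

If both modules are reported as missing, installing a Snappy binding into the same conda environment and retrying `pf.to_pandas()` would be a reasonable next step. If they are present, the problem lies elsewhere and the full traceback is needed.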