Ser*_*kov 6 python pandas parquet pyarrow
I want to store the following pandas data frame in a Parquet file using PyArrow:
import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})
The field column holds a list of dicts:
field
0 [{}, {}]
I first define the corresponding PyArrow schema:
import pyarrow as pa
schema = pa.schema([pa.field('field', pa.list_(pa.struct([])))])
Then I call from_pandas():
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
This raises the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "table.pxi", line 930, in pyarrow.lib.Table.from_pandas
File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 371, in dataframe_to_arrays
convert_types)]
File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in <listcomp>
for c, t in zip(columns_to_convert,
File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 366, in convert_column
return pa.array(col, from_pandas=True, type=ty)
File "array.pxi", line 177, in pyarrow.lib.array
File "error.pxi", line 77, in pyarrow.lib.check_status
File "error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unknown list item type: struct<>
Am I doing something wrong, or is this not supported by PyArrow?
I am using pyarrow 0.9.0, pandas 0.23.4, and Python 3.6.
According to this Jira issue, reading and writing nested Parquet data with a mix of struct and list nesting levels was implemented in version 2.0.0.
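The following example demonstrates the implemented functionality with a round trip: pandas data frame -> Parquet file -> pandas data frame. The PyArrow version used is 3.0.0.
The initial pandas data frame has one column holding a list of dicts, and a single row: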
field
0 [{'a': 1}, {'a': 2}]
Example code:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet

df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})

# A list of structs, each struct with a single int64 field 'a'.
schema = pa.schema(
    [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])

# pandas -> Arrow table -> Parquet file
table_write = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pyarrow.parquet.write_table(table_write, 'test.parquet')

# Parquet file -> Arrow table -> pandas
table_read = pyarrow.parquet.read_table('test.parquet')
print(table_read.to_pandas())
The output data frame is the same as the input data frame, as expected:
field
0 [{'a': 1}, {'a': 2}]