whi*_*olf 5 python pandas parquet pyarrow
I have a dataframe with the following structure:
Column1 Column2
0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212, 0.0014970217598602176,...
1 (0.00015607803652528673, 0.0001314736582571640... (0.0022136708721518517, 0.0014974646037444472,...
2 (0.011317798867821693, 0.011339936405420303, 0... (0.004868391435593367, 0.004406007472425699, 0...
3 (3.94578673876822e-05, 3.075833956245333e-05, ... (0.0075020878575742245, 0.0096737677231431, 0....
4 (0.0004926157998852432, 0.0003811710048466921,... (0.010351942852139473, 0.008231297135353088, 0...
.. ... ...
130 (0.011190211400389671, 0.011337820440530777, 0... (0.010182800702750683, 0.011351295746862888, 0...
131 (0.006286659277975559, 0.007315031252801418, 0... (0.02104150503873825, 0.02531484328210354, 0.0...
132 (0.0022791570518165827, 0.0025983047671616077,... (0.008847278542816639, 0.009222050197422504, 0...
133 (0.0007059817435219884, 0.0009831463685259223,... (0.0028264704160392284, 0.0029402063228189945,...
134 (0.0018992726691067219, 0.002058899961411953, ... (0.0019639385864138603, 0.002009353833273053, ...
[135 rows x 2 columns]
where each cell holds a list/tuple of float values:
type(psd_res.data_frame['Column1'][0])
<class 'tuple'>
type(psd_res.data_frame['Column1'][0][0])
<class 'numpy.float64'>
(The tuple in every cell contains the same number of entries.)
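For anyone who wants to reproduce the structure described above, here is a minimal sketch of such a frame. The sizes, the seed, and the helper name `random_tuple` are made up for illustration; only the column names and the cell types (tuples of `numpy.float64`, all of equal length) follow the description.

```python
import numpy as np
import pandas as pd

# Hypothetical sizes and seed; the real frame has 135 rows.
rng = np.random.default_rng(seed=0)
n_rows, n_values = 5, 4


def random_tuple():
    """Return a tuple of n_values numpy.float64 numbers (illustrative helper)."""
    return tuple(rng.random(n_values))


sample = pd.DataFrame(
    {
        "Column1": [random_tuple() for _ in range(n_rows)],
        "Column2": [random_tuple() for _ in range(n_rows)],
    }
)

# Each cell is a tuple, so pandas stores both columns with dtype object.
cell = sample["Column1"][0]
```

Because the columns have dtype `object`, the parquet writer has to guess the element type from the Python objects in the cells.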
When I try to save the dataframe as parquet with fastparquet, I get an error.
Full stack trace: https://pastebin.com/8Myu8hNV
I also tried the other engine, pyarrow, and saving fails there as well.
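So I found this thread: https://github.com/dask/fastparquet/issues/458. It seems to be a bug in fastparquet, but it should work in pyarrow, which also fails for me.
Then I tried a few things I found, such as infer_objects() and astype(float), but nothing has worked so far.
Does anyone have a solution for saving my dataframe to parquet?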
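The cells of your dataframe contain tuples of floats. That is an unusual data type for a dataframe column.
So you need to give arrow some help in working out the type of the data. To do that, explicitly provide the schema of the table.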
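import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "column1": [(1.0, 2.0), (3.0, 4.0, 5.0)]
    }
)
# Declare column1 explicitly as a list of 64-bit floats
schema = pa.schema([pa.field('column1', pa.list_(pa.float64()))])
# The schema keyword is only understood by the pyarrow engine
df.to_parquet('/tmp/hello.pq', engine='pyarrow', schema=schema)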
Note that it works without a schema if you use lists of floats (instead of tuples):
import pandas as pd

df = pd.DataFrame(
    {
        "column1": [[1.0, 2.0], [3.0, 4.0, 5.0]]
    }
)
# Lists of floats are inferred correctly, so no schema is needed
df.to_parquet('/tmp/hello.pq')