I am trying to save a very large dataset using Pandas to_parquet, and it seems to fail once a certain size limit is crossed, with both the 'pyarrow' and 'fastparquet' engines. I reproduced the error I'm getting with the code below, and would be happy to hear ideas on how to overcome the problem:
Using pyarrow:
from time import time

import numpy as np
import pandas as pd

tmp_file = '/tmp/test.parquet'  # any writable path

low = 3
high = 8
for n in np.logspace(low, high, high - low + 1):
    t0 = time()
    df = pd.DataFrame.from_records([(f'ind_{x}', ''.join(['x'] * 50)) for x in range(int(n))], columns=['a', 'b']).set_index('a')
    df.to_parquet(tmp_file, engine='pyarrow', compression='gzip')
    pd.read_parquet(tmp_file, engine='pyarrow')
    print(f'10^{np.log10(int(n))} read-write took {time() - t0} seconds')
10^3.0 read-write took 0.012851715087890625 seconds
10^4.0 read-write took 0.05722832679748535 seconds
10^5.0 read-write took 0.46846866607666016 seconds
10^6.0 read-write took 4.4494054317474365 seconds
10^7.0 read-write took 43.0602171421051 seconds
---------------------------------------------------------------------------
ArrowIOError Traceback (most recent call last)
<ipython-input-51-cad917a26b91> in <module>()
      5 df = pd.DataFrame.from_records([(f'ind_{x}', …
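For context, one direction I've been considering is streaming the frame to disk in chunks through pyarrow.parquet.ParquetWriter, so that no single Arrow table has to hold all 10^8 rows at once. A minimal sketch of what I mean (the chunk size and file path are arbitrary placeholders, and I haven't verified this at the failing scale):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

tmp_file = '/tmp/chunked.parquet'  # placeholder path
n = 10 ** 8       # total rows, the size that fails in a single call
chunk = 10 ** 6   # rows per write; arbitrary, tune to your memory budget

writer = None
for start in range(0, n, chunk):
    stop = min(start + chunk, n)
    df = pd.DataFrame.from_records(
        [(f'ind_{x}', ''.join(['x'] * 50)) for x in range(start, stop)],
        columns=['a', 'b']).set_index('a')
    table = pa.Table.from_pandas(df)
    if writer is None:
        # open the writer lazily so the schema is taken from the first chunk
        writer = pq.ParquetWriter(tmp_file, table.schema, compression='gzip')
    writer.write_table(table)  # each call appends row groups to the same file
writer.close()

Since each write_table call flushes its chunk to the file, peak memory should stay around one chunk rather than the whole dataset; whether that actually sidesteps whatever limit to_parquet hits in one call is exactly what I'd like to confirm.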