How do I write a large CSV file to HDF5 in Python?

Question

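My dataset is too large to read into memory directly, and I don't want to upgrade my machine. From what I have read, HDF5 may be a suitable solution for my problem. However, I don't know how to write a DataFrame to an HDF5 file iteratively, since I can't load the CSV file as a single DataFrame object.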

So my question is: how do I use Python pandas to write a large CSV file to an HDF5 file?

Answer 1

You can read the CSV file in chunks using the chunksize parameter and append each chunk to the HDF file:

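import pandas as pd

csv_filename = 'data.csv'  # placeholder: path to the large CSV file
hdf_filename = 'data.h5'   # placeholder: path to the HDF5 file to create
hdf_key = 'hdf_key'
df_cols_to_index = [...] # list of columns (labels) that should be indexed
store = pd.HDFStore(hdf_filename)

# read the CSV in chunks so the whole file never has to fit in memory
for chunk in pd.read_csv(csv_filename, chunksize=500000):
    # don't index data columns in each iteration - we'll do it later ...
    store.append(hdf_key, chunk, data_columns=df_cols_to_index, index=False)

# index data columns in HDFStore once, after all chunks are written
store.create_table_index(hdf_key, columns=df_cols_to_index, optlevel=9, kind='full')
store.close()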
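
To check that the chunked write worked, it can help to run the same pattern end to end on a small file and query the result. The sketch below is only an illustration under assumptions: the helper names csv_to_hdf and demo, the column names id and value, and the tiny chunk size are all made up for the example, and it assumes PyTables is installed (pandas needs it for HDFStore). It uses numeric columns only, so it sidesteps string-width issues that can come up when appending text columns.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def csv_to_hdf(csv_path, hdf_path, key, data_columns, chunksize=500_000):
    """Append a CSV file to an HDF5 table chunk by chunk; return rows written."""
    rows = 0
    with pd.HDFStore(hdf_path, mode="w") as store:
        for chunk in pd.read_csv(csv_path, chunksize=chunksize):
            # index=False: skip index building per chunk, do it once at the end
            store.append(key, chunk, data_columns=data_columns, index=False)
            rows += len(chunk)
        # build the index on the queryable columns after all data is written
        store.create_table_index(key, columns=data_columns,
                                 optlevel=9, kind="full")
    return rows


def demo():
    """Write a small synthetic CSV, convert it, and query part of it back."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "data.csv")
        hdf_path = os.path.join(tmp, "data.h5")

        # hypothetical sample data: 1000 rows, two numeric columns
        sample = pd.DataFrame({"id": np.arange(1000),
                               "value": np.arange(1000) * 0.5})
        sample.to_csv(csv_path, index=False)

        # small chunksize so the loop runs several times
        rows = csv_to_hdf(csv_path, hdf_path, "data", ["id"], chunksize=300)

        # "id" is a data column, so it can be used in a where clause
        # without loading the whole table into memory
        with pd.HDFStore(hdf_path, mode="r") as store:
            subset = store.select("data", where="id >= 990")
        return rows, subset


if __name__ == "__main__":
    rows, subset = demo()
    print(rows, len(subset))
```

Only the columns passed as data_columns can be filtered on with where, which is why the answer indexes exactly those columns once the loop is done.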

@G_KOBELIEF Please describe how the failure shows up. Thanks! (2 upvotes)
