How do I import a gzip file larger than the RAM limit into a pandas DataFrame? The process gets "Kill 9"ed. Should I use HDF5?

Jia*_*ang 3 python gzip hdf5 dataframe pandas

I have a gzip file of about 90 GB. It fits comfortably within my disk space, but is far larger than RAM.

How can I import it into a pandas DataFrame? I tried the following on the command line:

# start with Python 3.4.5
import pandas as pd
filename = 'filename.gzip'   # size 90 GB
df = pd.read_table(filename, compression='gzip')

However, after a few minutes the Python process was terminated with a Kill 9.

Once the DataFrame df is defined, I plan to save it to HDF5.

What is the correct way to do this? Can I use pandas.read_table() for this at all?

Max*_*axU 9

I would do it this way:

import pandas as pd

filename = 'filename.gzip'      # size 90 GB
hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'
cols = ['colA','colB','colC','colZ']    # put here a list of all your columns
cols_to_index = ['colA','colZ']         # put here the list of YOUR columns that you want to index
chunksize = 10**6               # you may want to adjust it ...

store = pd.HDFStore(hdf_fn)

for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
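Once the data is in the HDF5 store you also never need to load the whole table back into RAM. As a minimal sketch (assuming the hdf_fn, hdf_key and column names above; the value 'some_value' and the assumption that colA holds strings are only for illustration), you can query just the rows you need via the indexed columns, or stream the table back in chunks:

import pandas as pd

hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'

# pull only the matching rows - the where clause can filter on the indexed data columns
subset = pd.read_hdf(hdf_fn, hdf_key, where="colA == 'some_value'")

# or iterate over the stored table in manageable pieces
total_rows = 0
for chunk in pd.read_hdf(hdf_fn, hdf_key, chunksize=10**6):
    total_rows += len(chunk)    # replace with your own per-chunk processing
print(total_rows)

Indexing only the columns you actually filter on (cols_to_index) keeps the append loop fast and the file smaller, while still letting the where queries above use the index.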