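I have the following questions about HDF5 performance and concurrency:
References:
I want to store a DataFrame with columns of different types in an HDF5 file (see the excerpt below, which lists the dtypes).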
In [1]: mydf
Out [1]:
endTime uint32
distance float16
signature category
anchorName category
stationList object
Before converting some columns to categories (signature and anchorName in the excerpt above), I stored it with code like the following, which worked fine:
path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', complevel=9, complib='bzip2')
But that does not work with categorical columns, so I tried the following:
path = 'tmp4.hdf5'
key = 'journeys'
mydf.to_hdf(path, key, mode='w', format='t', complevel=9, complib='bzip2')
This works if I drop the stationList column, in which each entry is a list of strings. With that column included, I get the following exception:
Cannot serialize the column [stationList] because
its data contents are [mixed] object dtype
How can I improve my code so that the DataFrame can be stored?
pandas version: 0.17.1
Python version: 2.7.6 (cannot be changed for compatibility reasons)
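One possible workaround, sketched here rather than taken from the question: as the exception above indicates, a column whose cells are Python lists cannot be serialized in table format, so each list could be joined into a single delimited string before writing and split again after reading. The helper names encode_lists and decode_lists and the "|" separator are hypothetical; the sketch assumes no station name contains the separator, and it has not been checked against pandas 0.17.1.

```python
import pandas as pd

SEP = "|"  # hypothetical separator; assumed never to occur in a station name


def encode_lists(df, column, sep=SEP):
    """Return a copy of df with each list in `column` joined into one string."""
    out = df.copy()
    out[column] = out[column].apply(sep.join)
    return out


def decode_lists(df, column, sep=SEP):
    """Return a copy of df with each string in `column` split back into a list."""
    out = df.copy()
    out[column] = out[column].apply(lambda s: s.split(sep))
    return out


# Small frame modelled on the sample data in the question.
mydf = pd.DataFrame({
    'signature': pd.Series(['ab', 'cd', 'ab']),
    'stationList': pd.Series([['t1', 't2', 't3'],
                              ['4', 't2', 't3'],
                              ['t3', 't2', 't4']]),
})

encoded = encode_lists(mydf, 'stationList')      # plain strings, no lists
restored = decode_lists(encoded, 'stationList')  # lists again
```

After encoding, stationList holds only plain strings, so the encoded frame may be a better candidate for to_hdf(..., format='t'). Whether the remaining columns then serialize cleanly is not verified here.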
edit1 (some sample code):
import pandas as pd
mydf = pd.DataFrame({'endTime' : pd.Series([1443525810,1443540836,1443609470]),
                     'distance' : pd.Series([454.75,477.25,242.12]),
                     'signature' : pd.Series(['ab','cd','ab']),
                     'anchorName' : pd.Series(['tec','ing','pol']),
                     'stationList' : pd.Series([['t1','t2','t3'],['4','t2','t3'],['t3','t2','t4']]) …

I am working with a system that currently operates on large (> 5 GB) .csv files. To improve performance, I am testing (A) different ways of creating DataFrames from disk (pandas vs. dask) and (B) different ways of storing results to disk (.csv vs. HDF5 files).
To measure performance, I did the following:
def dask_read_from_hdf():
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns = ['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()
    hdf.close()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns = ['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()
    hdf.close()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep = ",", usecols = [0], header = 1, names = ["Security"])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep = ",", usecols = [0], header = 1, names = ["Security"])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()

print "dask hdf performance"
%timeit …
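
Outside IPython, where the %timeit magic is not available, a comparable measurement can be sketched with the standard timeit module. The helper time_call and the stand-in function sample_workload are hypothetical; in practice one of the reader functions above (for example pandas_read_from_csv) would be passed instead.

```python
import timeit


def time_call(func, repeat=3, number=1):
    """Return the best observed time in seconds for a single call of func."""
    # timeit.repeat accepts a zero-argument callable as the statement.
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


def sample_workload():
    # Stand-in for a reader function such as pandas_read_from_csv.
    return sorted(set(i % 7 for i in range(10000)))


best = time_call(sample_workload)
print("sample workload: %.6f s per call" % best)
```

Taking the minimum over several repeats reduces the influence of unrelated system activity on the result. For multi-gigabyte files, a small repeat count keeps the total run time manageable.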