pandas HDFStore: omitting duplicates

Question

Cod*_*123 1 python hdf5 hdfs pandas hdfstore

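I have an HDFStore that I load data into every night. If something goes wrong, such as a system crash, I may have to rerun the process, so I want to make sure that rows which already exist in the store are not written again on the next run. Is there a way to detect duplicates and leave them out?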

Answer 1

If your HDFStore has a unique index, you can use the following approach:

Create a sample DataFrame:

# imports assumed by the session below
import numpy as np
import pandas as pd

In [34]: df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))

In [35]: df
Out[35]:
          a         b         c
0  0.407144  0.972121  0.462502
1  0.044768  0.165924  0.852705
2  0.703686  0.156382  0.066925
3  0.912794  0.362916  0.866779
4  0.156249  0.625272  0.360799

Save it to the HDFStore:

In [36]: store = pd.HDFStore(r'd:/temp/t.h5')

In [37]: store.append('test', df, format='t')

Add a new row to our DataFrame:

In [38]: df.loc[len(df)] = [-1, -1, -1]

In [39]: df
Out[39]:
          a         b         c
0  0.407144  0.972121  0.462502
1  0.044768  0.165924  0.852705
2  0.703686  0.156382  0.066925
3  0.912794  0.362916  0.866779
4  0.156249  0.625272  0.360799
5 -1.000000 -1.000000 -1.000000   # new row, which is NOT in the HDF file

Select the index values of the rows that are already stored (the duplicates):

In [40]: idx = store.select('test', where="index in df.index", columns=['index']).index

Check:

In [41]: df.query("index not in @idx")
Out[41]:
     a    b    c
5 -1.0 -1.0 -1.0
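
The same filter can also be written without `query`, using a boolean mask built with `Index.isin`. This is only a minimal sketch: the small frame and the `idx` values below are made-up stand-ins for the DataFrame and for the index labels read back from the store.

```python
import pandas as pd

# Frame with one row (label 5) that is assumed not to be stored yet.
df = pd.DataFrame({"a": [0.1, 0.2, -1.0]}, index=[3, 4, 5])

# Stand-in for the index labels already present in the HDF file.
idx = pd.Index([0, 1, 2, 3, 4])

# Boolean mask: True where the label is absent from the stored index.
new_rows = df.loc[~df.index.isin(idx)]
print(new_rows)
```

Both forms select by index label only, so they rely on the index being unique and stable between runs.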

Append only the rows that have not been saved yet to the HDFStore:

In [42]: store.append('test', df.query("index not in @idx"), format='t')

Check the result:

In [43]: store.select('test')
Out[43]:
          a         b         c
0  0.407144  0.972121  0.462502
1  0.044768  0.165924  0.852705
2  0.703686  0.156382  0.066925
3  0.912794  0.362916  0.866779
4  0.156249  0.625272  0.360799
5 -1.000000 -1.000000 -1.000000   # new row has been added
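
For a nightly job, the steps above could be wrapped in one helper. The following is a rough sketch rather than part of the original answer: the function name `append_new_rows` is hypothetical, and it reads the stored index with `HDFStore.select_column` instead of the `where=` query shown earlier, which means it loads the whole stored index into memory. It assumes the data is stored in table format and that the index is unique.

```python
import pandas as pd


def append_new_rows(store, key, df):
    """Append to store[key] only the rows of df whose index is not stored yet.

    `store` is expected to be an open pandas.HDFStore.
    Returns the number of rows that were actually appended.
    """
    if key in store:
        # Read only the index column of the stored table (table format required).
        stored_index = store.select_column(key, "index")
        new_rows = df.loc[~df.index.isin(stored_index)]
    else:
        # First run: nothing is stored under `key`, so every row is new.
        new_rows = df

    if not new_rows.empty:
        store.append(key, new_rows, format="table")
    return len(new_rows)


# Usage (the file name is a placeholder):
# with pd.HDFStore("t.h5") as store:
#     append_new_rows(store, "test", df)
```

Rerunning the helper with the same DataFrame after a crash should then append nothing, because every index label is already found in the store.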

Asked:	8 years, 9 months ago
Viewed:	726 times
Last active:	8 years, 2 months ago