Pandas analogue of the SQL MINUS / EXCEPT operator, using multiple columns

Max*_*axU 4 python sql dataframe pandas

I am looking for the fastest and most idiomatic analogue of the SQL MINUS (a.k.a. EXCEPT) operator.

Here is what I mean - given two Pandas DataFrames as follows:

In [77]: d1
Out[77]:
   a  b  c
0  0  0  1
1  0  1  2
2  1  0  3
3  1  1  4
4  0  0  5
5  1  1  6
6  2  2  7

In [78]: d2
Out[78]:
   a  b   c
0  1  1  10
1  0  0  11
2  1  1  12

How can I find the result of d1 MINUS d2, taking into account only columns "a" and "b", so as to get the following result:

In [62]: res
Out[62]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

MVCE:

import pandas as pd

d1 = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 1, 2],
    'b': [0, 1, 0, 1, 0, 1, 2],
    'c': [1, 2, 3, 4, 5, 6, 7]
})

d2 = pd.DataFrame({
    'a': [1, 0, 1],
    'b': [1, 0, 1],
    'c': [10, 11, 12]
})

What I have tried:

In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])

In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)

In [67]: res = d1.loc[tmp1.loc[idx, "index"]]

In [68]: res
Out[68]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7

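It gives me the correct result, but I have a feeling that there must be a more idiomatic and nicer/cleaner way to achieve this.

PS The DataFrame.isin() method won't help in this case, as it produces a wrong result set.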
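
The question does not show the isin() call that was tried, so the following is only a sketch of one plausible reading: testing each key column separately. The dict form of DataFrame.isin() checks "a" against the values of d2.a and "b" against the values of d2.b independently, so a pair such as (0, 1) counts as present even though no row of d2 holds that combination.

```python
import pandas as pd

d1 = pd.DataFrame({'a': [0, 0, 1, 1, 0, 1, 2],
                   'b': [0, 1, 0, 1, 0, 1, 2],
                   'c': [1, 2, 3, 4, 5, 6, 7]})
d2 = pd.DataFrame({'a': [1, 0, 1],
                   'b': [1, 0, 1],
                   'c': [10, 11, 12]})

# Assumed attempt: per-column membership, not membership of the (a, b) pair.
per_column = d1[['a', 'b']].isin({'a': d2['a'].tolist(),
                                  'b': d2['b'].tolist()})

# Drop every row where both columns "matched" something in d2.
naive = d1[~per_column.all(axis=1)]

# Only row 6 survives; rows 1 and 2 are wrongly removed because
# 0 and 1 each occur somewhere in d2.a and d2.b.
print(naive)
```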

Max*_*axU 6

Execution time comparison for larger data sets:

In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)

In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)

In [102]: df1.shape
Out[102]: (700000, 3)

In [103]: df2.shape
Out[103]: (300000, 3)

pd.concat().drop_duplicates() approach:

In [10]: %%timeit
    ...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
    ...:
    ...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
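
As a sketch of how this approach behaves on the small frames from the question: d2 is appended twice so that every key present in d2 is guaranteed to be duplicated, and keep=False then discards all rows sharing a duplicated key. A side effect, shown in the second half, is that a key repeated inside d1 itself is also discarded even when d2 does not contain it. The frame d1_dup is a hypothetical variant added only for illustration. Note too that the timed cell references d1 and d2 rather than df1 and df2, so its figure may not be directly comparable to the other timings.

```python
import pandas as pd

d1 = pd.DataFrame({'a': [0, 0, 1, 1, 0, 1, 2],
                   'b': [0, 1, 0, 1, 0, 1, 2],
                   'c': [1, 2, 3, 4, 5, 6, 7]})
d2 = pd.DataFrame({'a': [1, 0, 1],
                   'b': [1, 0, 1],
                   'c': [10, 11, 12]})
keys = ['a', 'b']

# d2 appears twice, so each of its keys occurs at least twice and is dropped.
res = pd.concat([d1, d2, d2]).drop_duplicates(subset=keys, keep=False)
print(res)

# Hypothetical variant: key (2, 2) now occurs twice in the left frame only.
extra = pd.DataFrame({'a': [2], 'b': [2], 'c': [8]})
d1_dup = pd.concat([d1, extra], ignore_index=True)

# Both (2, 2) rows vanish although d2 has no such key.
res_dup = pd.concat([d1_dup, d2, d2]).drop_duplicates(subset=keys, keep=False)
print(res_dup)
```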

MultiIndex NOT IS IN approach:

In [11]: %%timeit
    ...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
    ...:
    ...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
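
The same idea can be wrapped in a small reusable function. This is a minimal sketch: minus_on is a hypothetical helper name, and it builds the key index with pd.MultiIndex.from_frame instead of set_index, which should be equivalent for this purpose. Unlike the drop_duplicates approach, keys repeated within the left frame are kept.

```python
import pandas as pd


def minus_on(left, right, on):
    """Return rows of `left` whose `on` key tuple does not occur in `right`.

    Hypothetical helper, not part of pandas.
    """
    left_keys = pd.MultiIndex.from_frame(left[on])
    right_keys = pd.MultiIndex.from_frame(right[on])
    # isin() compares whole tuples, so (0, 1) only matches a (0, 1) row.
    return left[~left_keys.isin(right_keys)]


d1 = pd.DataFrame({'a': [0, 0, 1, 1, 0, 1, 2],
                   'b': [0, 1, 0, 1, 0, 1, 2],
                   'c': [1, 2, 3, 4, 5, 6, 7]})
d2 = pd.DataFrame({'a': [1, 0, 1],
                   'b': [1, 0, 1],
                   'c': [10, 11, 12]})

res = minus_on(d1, d2, ['a', 'b'])
print(res)
```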

MultiIndex difference approach:

In [12]: %%timeit
    ...: tmp1 = df1.reset_index().set_index(["a", "b"])
    ...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
    ...: res = df1.loc[tmp1.loc[idx, "index"]]
    ...:
    ...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

merge(how="outer") approach - gave me a MemoryError:

In [106]: %%timeit
     ...: res =  (df1.reset_index()
     ...:         .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
     ...:         .query('_merge == "left_only"')
     ...:         .set_index('index')
     ...:         .rename_axis(None)
     ...:         .reindex(df1.columns, axis=1))
     ...:
     ...:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
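
A likely cause of the MemoryError is that the key columns are heavily duplicated on both sides, so the merge is many-to-many and emits one row per matching pair. The sketch below is a variation, not the approach timed above: it reduces the right-hand side to its distinct keys and uses a left merge, so the output can never have more rows than the left frame. anti_join is a hypothetical helper name, and it assumes the left frame has an unnamed index and no column already called "index".

```python
import pandas as pd


def anti_join(left, right, on):
    """Rows of `left` with no matching `on` key in `right` (hypothetical helper)."""
    # Distinct keys only: each left row now matches at most one right row.
    right_keys = right[on].drop_duplicates()
    merged = left.reset_index().merge(right_keys, on=on,
                                      how='left', indicator=True)
    unmatched = merged.loc[merged['_merge'] == 'left_only']
    # Restore the original index labels and column layout.
    return unmatched.set_index('index').rename_axis(None)[list(left.columns)]


d1 = pd.DataFrame({'a': [0, 0, 1, 1, 0, 1, 2],
                   'b': [0, 1, 0, 1, 0, 1, 2],
                   'c': [1, 2, 3, 4, 5, 6, 7]})
d2 = pd.DataFrame({'a': [1, 0, 1],
                   'b': [1, 0, 1],
                   'c': [10, 11, 12]})

res = anti_join(d1, d2, ['a', 'b'])
print(res)
```

No timing is claimed for this variant; it would need to be measured on df1 and df2 alongside the others.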

Comparing concatenated strings approach:

In [13]: %%timeit
    ...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
    ...:
    ...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
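
Besides being the slowest of the timings that completed, joining the key columns as strings without a separator can make different keys collide: (1, 11) and (11, 1) both become "111". The sketch below illustrates this with made-up frames (left and right are not from the question) and a hypothetical helper joined_key. It builds the key with explicit string addition rather than sum(axis=1), and assumes the separator "|" never occurs in the values.

```python
import pandas as pd


def joined_key(df, cols, sep=''):
    """Concatenate the given columns as strings (hypothetical helper)."""
    key = df[cols[0]].astype(str)
    for col in cols[1:]:
        key = key + sep + df[col].astype(str)
    return key


# Made-up data chosen to trigger the collision.
left = pd.DataFrame({'a': [1, 11], 'b': [11, 1], 'c': [1, 2]})
right = pd.DataFrame({'a': [11], 'b': [1], 'c': [0]})
cols = ['a', 'b']

# Without a separator both left rows map to '111' and both are removed.
plain = left[~joined_key(left, cols).isin(joined_key(right, cols))]

# With a separator only the true match (11, 1) is removed.
safe = left[~joined_key(left, cols, sep='|').isin(joined_key(right, cols, sep='|'))]

print(plain)
print(safe)
```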