Max*_*axU 4 python sql dataframe pandas
I'm looking for the fastest and most idiomatic analog of the SQL MINUS (a.k.a. EXCEPT) operator.
Here is what I mean: given two Pandas DataFrames as follows:
In [77]: d1
Out[77]:
   a  b  c
0  0  0  1
1  0  1  2
2  1  0  3
3  1  1  4
4  0  0  5
5  1  1  6
6  2  2  7
In [78]: d2
Out[78]:
   a  b   c
0  1  1  10
1  0  0  11
2  1  1  12
How can I find the result of d1 MINUS d2, taking into account only columns "a" and "b", so as to get the following result:
In [62]: res
Out[62]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
MVCE:
import pandas as pd

d1 = pd.DataFrame({
    'a': [0, 0, 1, 1, 0, 1, 2],
    'b': [0, 1, 0, 1, 0, 1, 2],
    'c': [1, 2, 3, 4, 5, 6, 7]
})
d2 = pd.DataFrame({
    'a': [1, 0, 1],
    'b': [1, 0, 1],
    'c': [10, 11, 12]
})
What I have tried:
In [65]: tmp1 = d1.reset_index().set_index(["a", "b"])
In [66]: idx = tmp1.index.difference(d2.set_index(["a","b"]).index)
In [67]: res = d1.loc[tmp1.loc[idx, "index"]]
In [68]: res
Out[68]:
   a  b  c
1  0  1  2
2  1  0  3
6  2  2  7
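It gives me the correct result, but I have a feeling that there must be a more idiomatic and nicer/cleaner way to achieve this.
PS: the DataFrame.isin() method won't help in this case, as it would produce a wrong result set.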
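To illustrate the difference from DataFrame.isin(), here is a minimal sketch of a row-wise membership test on the ("a", "b") pairs. The helper name minus_on_keys is hypothetical, not a pandas API; the sketch assumes a pandas version that provides pd.MultiIndex.from_frame (0.24 or later).

```python
import pandas as pd

d1 = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1, 2],
    "b": [0, 1, 0, 1, 0, 1, 2],
    "c": [1, 2, 3, 4, 5, 6, 7],
})
d2 = pd.DataFrame({
    "a": [1, 0, 1],
    "b": [1, 0, 1],
    "c": [10, 11, 12],
})


def minus_on_keys(left, right, keys):
    """Hypothetical helper: rows of `left` whose `keys` tuple is absent from `right`."""
    left_keys = pd.MultiIndex.from_frame(left[keys])
    right_keys = pd.MultiIndex.from_frame(right[keys])
    # MultiIndex.isin compares whole (a, b) tuples, row by row,
    # and returns a boolean NumPy array aligned with the rows of `left`.
    return left[~left_keys.isin(right_keys)]


res = minus_on_keys(d1, d2, ["a", "b"])
print(res)
```

Because the mask is positional, the original index labels of d1 (1, 2 and 6 here) are preserved without a reset_index() round trip.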
In [100]: df1 = pd.concat([d1] * 10**5, ignore_index=True)

In [101]: df2 = pd.concat([d2] * 10**5, ignore_index=True)

In [102]: df1.shape
Out[102]: (700000, 3)

In [103]: df2.shape
Out[103]: (300000, 3)

pd.concat().drop_duplicates() approach:
In [10]: %%timeit
    ...: res = pd.concat([d1, pd.concat([d2]*2)]).drop_duplicates(['a', 'b'], keep=False)
    ...:
    ...:
2.59 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %%timeit
    ...: res = df1[~df1.set_index(["a", "b"]).index.isin(df2.set_index(["a","b"]).index)]
    ...:
    ...:
484 ms ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %%timeit
    ...: tmp1 = df1.reset_index().set_index(["a", "b"])
    ...: idx = tmp1.index.difference(df2.set_index(["a","b"]).index)
    ...: res = df1.loc[tmp1.loc[idx, "index"]]
    ...:
    ...:
1.04 s ± 20.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

merge(how="outer") approach - gives me a MemoryError:
In [106]: %%timeit
     ...: res = (df1.reset_index()
     ...:         .merge(df2, on=['a','b'], indicator=True, how='outer', suffixes=('','_'))
     ...:         .query('_merge == "left_only"')
     ...:         .set_index('index')
     ...:         .rename_axis(None)
     ...:         .reindex(df1.columns, axis=1))
     ...:
     ...:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)

In [13]: %%timeit
    ...: res = df1[~df1[['a','b']].astype(str).sum(axis=1).isin(df2[['a','b']].astype(str).sum(axis=1))]
    ...:
    ...:
2.05 s ± 65.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
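The MemoryError in the merge-based variant most likely comes from the many-to-many join: every duplicated ("a", "b") pair on the left is matched against every duplicate on the right. A minimal sketch of a merge-based anti-join that avoids this by deduplicating the right-hand keys first is shown below. The helper name anti_join is hypothetical, and the sketch relies on how="left" keeping the rows of the left frame in their original order.

```python
import pandas as pd

d1 = pd.DataFrame({
    "a": [0, 0, 1, 1, 0, 1, 2],
    "b": [0, 1, 0, 1, 0, 1, 2],
    "c": [1, 2, 3, 4, 5, 6, 7],
})
d2 = pd.DataFrame({
    "a": [1, 0, 1],
    "b": [1, 0, 1],
    "c": [10, 11, 12],
})


def anti_join(left, right, keys):
    """Hypothetical helper: rows of `left` with no match in `right` on `keys`."""
    # One row per key combination, so each left row matches at most once
    # and the merged frame has exactly len(left) rows.
    right_keys = right[keys].drop_duplicates()
    merged = left[keys].merge(right_keys, on=keys, how="left", indicator=True)
    # Assumption: the left merge preserves the row order of `left`,
    # so the mask can be applied positionally.
    mask = (merged["_merge"] == "left_only").to_numpy()
    return left[mask]


res = anti_join(d1, d2, ["a", "b"])
print(res)

# Scaled-up frames with heavily duplicated keys, in the spirit of the benchmark above.
big1 = pd.concat([d1] * 1000, ignore_index=True)
big2 = pd.concat([d2] * 1000, ignore_index=True)
big_res = anti_join(big1, big2, ["a", "b"])
```

Only the key columns take part in the merge, so the intermediate frame stays the same length as the left input regardless of how often the keys repeat on the right.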