Ada*_*dam 9 python dataframe pandas
我有一个包含两列的数据框,A和B.在这种情况下,顺序A和顺序B并不重要; 例如,我会考虑(0,50)并(50,0)重复.在pandas中,从数据框中删除这些重复项的有效方法是什么?
import pandas as pd
# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
A B
0 0 50
1 10 22
2 11 35
3 21 5
4 22 10
5 35 11
6 5 21
7 50 0
# Desired output with "duplicates" removed.
data2 = pd.DataFrame({'A': [0, 5, 10, 11],
'B': [50, 21, 22, 35]})
data2
A B
0 0 50
1 5 21
2 10 22
3 11 35
Run Code Online (Sandbox Code Playgroud)
理想情况下,输出将按列的值排序A.
Psi*_*dom 10
您可以在删除重复项之前对数据框的每一行进行排序:
data.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
# A B
#0 0 50
#1 10 22
#2 11 35
#3 5 21
Run Code Online (Sandbox Code Playgroud)
如果您希望按列对结果进行排序A:
data.apply(lambda r: sorted(r), axis = 1).drop_duplicates().sort_values('A')
# A B
#0 0 50
#3 5 21
#1 10 22
#2 11 35
Run Code Online (Sandbox Code Playgroud)
这有点丑陋,但更快的解决方案:
In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
A B
0 0 50
1 10 22
2 11 35
3 5 21
Run Code Online (Sandbox Code Playgroud)
时间:适用于8K行DF
In [50]: big = pd.concat([data] * 10**3, ignore_index=True)
In [51]: big.shape
Out[51]: (8000, 2)
In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop
In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop
In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
1288 次 |
| 最近记录: |