Pandas: speed up minimum value extraction

dci*_*llo 0 python sorting performance pandas

I have a huge DataFrame (~10,000,000 rows) that looks like this:

import pandas as pd
import numpy as np
col1 = ['A', 'C', 'D', 'D', 'D']
col2 = ['B', 'A', 'B', 'C', 'A']
col3 = [14, 36, 5, 12, 96]
df = pd.DataFrame(np.column_stack([col1, col2, col3]),
                  columns=['col1','col2','col3'])
df['col3'] = df['col3'].astype(int)


  col1 col2  col3
0    A    B    14
1    C    A    36
2    D    B     5
3    D    C    12
4    D    A    96

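I want to find the row with the minimum col3 value associated with each unique term (A, B, C, D), where a term may appear in either col1 or col2: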

A B 14
D B 5
D C 12
D B 5

I tried the following, but it is too slow:

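for i in ['A', 'B', 'C', 'D']:
    # Select every row where the term appears in either column
    dm = df.loc[(df['col1'] == i) | (df['col2'] == i)]
    # Print the row with the smallest col3 among them
    print(dm.loc[dm['col3'].idxmin()])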

Any suggestions?

ayh*_*han 5

You can use melt to reshape to long format and then use groupby.min:

pd.melt(df, id_vars=['col3']).groupby('value')['col3'].min()
Out: 
value
A    14
B     5
C    12
D     5
Name: col3, dtype: int64
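The question asks for the full rows, while the melt/groupby.min approach returns only the minimum values. One possible way to recover the rows is to carry the original row label through the melt and use idxmin instead of min. This is only a sketch on the sample data; the names long_df, best and result are illustrative and not part of the original answer, and it has not been timed on 10,000,000 rows.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'C', 'D', 'D', 'D'],
    'col2': ['B', 'A', 'B', 'C', 'A'],
    'col3': [14, 36, 5, 12, 96],
})

# Keep the original row label as a regular column so it survives the melt.
long_df = df.reset_index().melt(id_vars=['index', 'col3'], value_name='term')

# For each term, locate the melted row that holds the smallest col3.
best = long_df.loc[long_df.groupby('term')['col3'].idxmin()]

# Look up the full original rows and label them by term.
result = df.loc[best['index']].set_index(best['term'].to_numpy())
print(result)
```

Here result is indexed by term (A, B, C, D), and each row is the original row of df in which that term reaches its minimum col3. If a term has ties, idxmin keeps the first occurrence in the melted frame.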