I am working with a CSV file full of election data. My original sample can be represented as:
city party1 party2 party3
0 city1 50 107 114
1 city2 181 323 326
2 city3 26 28 75
3 city4 32 47 59
4 ciy5 8 21 21
I used the pandas idxmax() function to create a new column named "winner", as follows:
mydf['winner'] = mydf[['party1','party2','party3']].idxmax(axis=1)
My goal is to determine which party came first in each city. The result is:
city party1 party2 party3 winner
0 city1 50 107 114 party3
1 city2 181 323 326 party3
2 city3 26 28 75 party3
3 city4 32 47 59 party3
4 ciy5 8 21 21 party2
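The winner value in the last row is wrong, because party2 and party3 have the same score.
Is it possible to add an exception to idxmax so that it takes the equality of two values into account and returns "Equality"?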
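The behaviour described above can be reproduced with a minimal sketch. It rebuilds the sample as a DataFrame by hand (an assumption, since the original data comes from a CSV file) and shows that idxmax returns the first column holding the row maximum, which is why the tied row is reported as party2. The names `votes` and `first_max` are illustrative only.

```python
import pandas as pd

# Rebuild the sample from the question (the "ciy5" spelling is kept as posted).
mydf = pd.DataFrame({
    'city': ['city1', 'city2', 'city3', 'city4', 'ciy5'],
    'party1': [50, 181, 26, 32, 8],
    'party2': [107, 323, 28, 47, 21],
    'party3': [114, 326, 75, 59, 21],
})

votes = mydf[['party1', 'party2', 'party3']]

# idxmax returns the label of the first column that holds the row maximum,
# so the tie in the last row is silently resolved to party2.
first_max = votes.idxmax(axis=1)
print(first_max.tolist())
```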
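You can use DataFrame.eq to compare the subset with each row's maximum from DataFrame.max, then use sum to count how many columns hold that maximum. After idxmax, the values can then be overwritten with mask where s > 1: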
# Work on the vote columns only
a = mydf[['party1','party2','party3']]
mydf['winner'] = a.idxmax(axis=1)
# Count how many columns in each row equal that row's maximum
s = a.eq(a.max(axis=1), axis=0).sum(axis=1)
print(s)
0 1
1 1
2 1
3 1
4 2
dtype: int64
# Overwrite the winner with 'Equality' wherever the maximum is shared
mydf['winner'] = mydf['winner'].mask(s > 1, 'Equality')
print(mydf)
city party1 party2 party3 winner
0 city1 50 107 114 party3
1 city2 181 323 326 party3
2 city3 26 28 75 party3
3 city4 32 47 59 party3
4 ciy5 8 21 21 Equality
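For reuse, the eq / max / sum / mask steps can be collected into a small helper. This is only a sketch: the function name `winner_or_tie` and its `tie_label` parameter are hypothetical, and the sample frame is rebuilt by hand rather than read from the CSV file.

```python
import pandas as pd

def winner_or_tie(frame, columns, tie_label='Equality'):
    """Return the winning column per row, or tie_label when the maximum is shared.

    Hypothetical helper; it wraps the eq/max/sum/mask steps in one place.
    """
    votes = frame[columns]
    row_max = votes.max(axis=1)
    # Number of columns in each row that equal the row maximum.
    n_at_max = votes.eq(row_max, axis=0).sum(axis=1)
    # idxmax picks the first maximum; mask replaces it wherever there is a tie.
    return votes.idxmax(axis=1).mask(n_at_max > 1, tie_label)

mydf = pd.DataFrame({
    'city': ['city1', 'city2', 'city3', 'city4', 'ciy5'],
    'party1': [50, 181, 26, 32, 8],
    'party2': [107, 323, 28, 47, 21],
    'party3': [114, 326, 75, 59, 21],
})

mydf['winner'] = winner_or_tie(mydf, ['party1', 'party2', 'party3'])
print(mydf)
```

Passing a different `tie_label` changes the placeholder without touching the rest of the logic.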
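If you also need every tied party, multiply df by the column names with mul, then apply join, and finally remove the leading and trailing , with strip: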
a = mydf[['party1','party2','party3']]
# Boolean mask: True where a column equals its row's maximum
df = a.eq(a.max(axis=1), axis=0)
print(df)
party1 party2 party3
0 False False True
1 False False True
2 False False True
3 False False True
4 False True True
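# Multiply the mask by the column names (True keeps the name, False gives ''),
# join each row with commas, then strip the leftover leading and trailing commas.
# The parentheses are required for the chained calls to span several lines.
mydf['winner'] = (df.mul(df.columns.to_series())
                    .apply(','.join, axis=1)
                    .str.strip(','))
print(mydf)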
city party1 party2 party3 winner
0 city1 50 107 114 party3
1 city2 181 323 326 party3
2 city3 26 28 75 party3
3 city4 32 47 59 party3
4 ciy5 8 21 21 party2,party3
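As an alternative sketch, the tied names can also be joined with a plain list comprehension over the boolean mask. It does not rely on multiplying booleans by strings, and it avoids the doubled comma that strip leaves behind when the tied columns are not adjacent (for example party1 and party3). The names `cols`, `is_max` and `tied` are illustrative, and the sample frame is rebuilt by hand.

```python
import pandas as pd

mydf = pd.DataFrame({
    'city': ['city1', 'city2', 'city3', 'city4', 'ciy5'],
    'party1': [50, 181, 26, 32, 8],
    'party2': [107, 323, 28, 47, 21],
    'party3': [114, 326, 75, 59, 21],
})

cols = ['party1', 'party2', 'party3']
votes = mydf[cols]

# Boolean mask: True where a column equals its row's maximum.
is_max = votes.eq(votes.max(axis=1), axis=0)

# For each row, keep only the column names flagged True and join them.
tied = [','.join(name for name, flag in zip(cols, row) if flag)
        for row in is_max.to_numpy()]

mydf['winner'] = tied
print(mydf)
```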