Aya*_*lam 5 python dataframe pandas
I have a pandas DataFrame:
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
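I want to fill each NaN with the most frequent city for its state, provided that state already has a known city, so I group by State and apply the following code: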
df['City'] = df.groupby('State')['City'].transform(lambda x: x.fillna(x.value_counts().idxmax()))
The code above works when every state has at least one known city. The output is:
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
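However, I want to add a condition: if a state has no known city at all, its city should be the most frequent one in the whole City column. For example, if the DataFrame is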
City State
0 Cambridge MA
1 NaN DC
2 Boston MA
3 Washignton DC
4 NaN MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 NaN FL
11 Washington DC
12 NaN NY
NY has never occurred before, so I want the output to be
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
The code above raises ValueError: ('attempt to get argmax of an empty sequence') because "NY" has never occurred before, so its group contains no known city.
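A minimal sketch of why the error appears, assuming the NY group holds only a single NaN city (the variable names here are illustrative and not from the original post). value_counts() drops NaN by default, so the result is empty and idxmax() has nothing to pick from:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the cities of the "NY" group: only a missing value.
ny_cities = pd.Series([np.nan], dtype=object)

# NaN is excluded by default, so no counts are left.
counts = ny_cities.value_counts()

try:
    counts.idxmax()
    raised = False
except ValueError:
    # idxmax() on an empty Series is what triggers the error in the question.
    raised = True

print(counts.empty, raised)
```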
You can solve this with the following code:
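# Most frequent city overall, used as the fallback for states with no known city
mode = df['City'].mode()[0]
# Fill each state's NaNs with that state's most frequent city, or the overall mode if the state has none
df['City'] = df.groupby('State')['City'].transform(lambda x: x.fillna(x.value_counts().idxmax() if x.notna().any() else mode))
# Fill anything still missing with the most frequent city overall
df['City'] = df['City'].fillna(df['City'].value_counts().idxmax())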
Output:
City State
0 Cambridge MA
1 Washignton DC
2 Boston MA
3 Washignton DC
4 Cambridge MA
5 Tampa FL
6 Danvers MA
7 Miami FL
8 Cambridge MA
9 Miami FL
10 Miami FL
11 Washington DC
12 Cambridge NY
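For reference, the whole flow can be sketched end to end as below. The helper name fill_city, the variable overall_mode, and the way the DataFrame is rebuilt are assumptions made for illustration. The fallback idea is the one described in the answer (the state's most frequent city, otherwise the most frequent city overall), written with transform so the result stays aligned with the original index:

```python
import numpy as np
import pandas as pd

# Rebuild the example data from the question, including the unseen "NY" row.
df = pd.DataFrame({
    "City": ["Cambridge", np.nan, "Boston", "Washignton", np.nan, "Tampa",
             "Danvers", "Miami", "Cambridge", "Miami", np.nan, "Washington", np.nan],
    "State": ["MA", "DC", "MA", "DC", "MA", "FL",
              "MA", "FL", "MA", "FL", "FL", "DC", "NY"],
})

# Most frequent city across the whole column, computed before any filling.
overall_mode = df["City"].mode()[0]

def fill_city(cities: pd.Series) -> pd.Series:
    """Fill NaNs with the group's most frequent city, else the overall mode."""
    counts = cities.value_counts()  # NaN is excluded by default
    fill_value = counts.idxmax() if not counts.empty else overall_mode
    return cities.fillna(fill_value)

# transform returns a Series with the same index as df, so assignment is safe.
df["City"] = df.groupby("State")["City"].transform(fill_city)
print(df)
```

Checking counts.empty before calling idxmax() avoids the ValueError for groups such as NY. Ties within a state (for example the two spellings under DC) are resolved by whichever value value_counts() lists first.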