Geo*_*nko 5 python data-analysis dataframe pandas
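Problem: Let's take the Titanic dataset from Kaggle. I have a dataframe with 'Pclass', 'Sex' and 'Age' columns. I need to fill the NaNs in the 'Age' column with the median of a specific group. For a woman in 1st class, I want to fill her age with the median for 1st class women, not the median of the whole Age column.
The question is how to make this change within a slice.
I tried: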
data['Age'][(data['Sex'] == 'female')&(data['Pclass'] == 1)&(data['Age'].isnull())].fillna(median)
'median' is my value, but nothing changes. 'inplace=True' didn't help.
Thanks a lot!
I believe you need to filter by a mask and assign back:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a':list('aaaddd'),
                     'Sex':['female','female','male','female','female','male'],
                     'Pclass':[1,2,1,2,1,1],
                     'Age':[40,20,30,20,np.nan,np.nan]})
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 NaN 1 female d
5 NaN 1 male d
#boolean mask
mask1 = (data['Sex'] == 'female')&(data['Pclass'] == 1)
#get median by mask without NaNs
med = data.loc[mask1, 'Age'].median()
print (med)
40.0
#replace NaNs
data.loc[mask1, 'Age'] = data.loc[mask1, 'Age'].fillna(med)
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 40.0 1 female d
5 NaN 1 male d
This is equivalent to:
mask2 = mask1 &(data['Age'].isnull())
data.loc[mask2, 'Age'] = med
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 40.0 1 female d
5 NaN 1 male d
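For context, here is a small sketch of why the attempt in the question appears to do nothing. It assumes the same toy data as above; the name `filled` is only illustrative. `fillna` returns a new Series, and the chained selection `data['Age'][...]` is evaluated on a temporary object, so unless the result is assigned back the original frame stays as it was.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})

mask = (data['Sex'] == 'female') & (data['Pclass'] == 1)
median = data.loc[mask, 'Age'].median()

# fillna builds a new Series from the selected rows;
# nothing is written back into `data` here.
filled = data['Age'][mask & data['Age'].isnull()].fillna(median)

print(filled)                      # the filled values live only in `filled`
print(data['Age'].isnull().sum())  # the original column still has its NaNs
```

This is why the answer writes the result back through `data.loc[mask, 'Age'] = ...` in a single indexing step.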
Edit:
If you need to replace the NaNs in every group with that group's median:
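#group_keys=False keeps the original index, so the result aligns when assigned back (needed in pandas 2.x)
data['Age'] = data.groupby(["Sex","Pclass"], group_keys=False)["Age"].apply(lambda x: x.fillna(x.median()))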
print (data)
Age Pclass Sex a
0 40.0 1 female a
1 20.0 2 female a
2 30.0 1 male a
3 20.0 2 female d
4 40.0 1 female d
5 30.0 1 male d
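As an alternative sketch for the per-group fill (not part of the original answer), `groupby(...).transform('median')` returns a Series aligned to the original rows, which can be passed straight to `fillna`. The name `group_median` is only illustrative, and the toy data mirrors the example above without the extra 'a' column.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Sex': ['female', 'female', 'male', 'female', 'female', 'male'],
                     'Pclass': [1, 2, 1, 2, 1, 1],
                     'Age': [40, 20, 30, 20, np.nan, np.nan]})

# Median age of each (Sex, Pclass) group, broadcast back to every row.
# NaNs are skipped when the median is computed.
group_median = data.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Only the missing ages are replaced; existing values are left untouched.
data['Age'] = data['Age'].fillna(group_median)
print(data)
```

A group whose ages are all missing has a NaN median, so its rows would stay NaN with either approach.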