Sim*_*n H 1 python pandas scikit-learn sklearn-pandas imputation
我正在参加泰坦尼克号 Kaggle 比赛,目前正在尝试估算缺失Age值。
这个想法是计算训练集上每组的Age平均值,然后使用该信息来替换训练集和测试集。[Pclass, Sex]NaN
这是我到目前为止所拥有的:
meanAgeTrain = train.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
for df in [train, test]:
df['Age'] = df['Age'].fillna(meanAgeTrain)
Run Code Online (Sandbox Code Playgroud)
问题是,这仍然在测试集中留下了一些 NaN 值,同时消除了训练集中的所有 Nan。我认为这与指数有关。
我需要的是:
如何使用 Pandas 正确完成此操作?
编辑:
感谢您的建议。@Reza 的那个可以工作,但我不能 100% 理解它。所以我正在尝试提出自己的解决方案。
这是可行的,但我是 Pandas 新手,想知道是否有更简单的方法来实现它。
trainMeans = self.train.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
def f(x):
if x["Age"] == x["Age"]: # not NaN
return x["Age"]
return trainMeans.loc[(trainMeans["Pclass"] == x["Pclass"]) & (trainMeans["Sex"] == x["Sex"])]["Age"].values[0]
self.train['Age'] = self.train.apply(f, axis=1)
self.test['Age'] = self.test.apply(f, axis=1)
Run Code Online (Sandbox Code Playgroud)
尤其是函数中的 if 对我来说看起来不是最佳实践。我需要一种方法将该函数仅应用于 NaN 年龄。
编辑2:
事实证明,重置索引会使事情变得更加复杂和缓慢,因为在分组之后,索引已经正是我想要用作映射键的内容。这更快更容易:
trainMeans = self.train.groupby(['Pclass', 'Sex'])['Age'].mean()
def f(x):
if not np.isnan(x["Age"]): # not NaN
return x["Age"]
return trainMeans[x["Pclass"], x["Sex"]]
self.train['Age'] = self.train.apply(f, axis=1)
self.test['Age'] = self.test.apply(f, axis=1)
Run Code Online (Sandbox Code Playgroud)
这可以进一步简化吗?
import pandas as pd
import seaborn as sns
# load dataset
df = sns.load_dataset('titanic')
# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})
# Populate Age_Fill
df['Age_Fill'] = df['age'].groupby([df['pclass'], df['sex']]).apply(lambda x: x.fillna(x.mean()))
# series with filled ages
groupby_result = df.Age_Fill[df.age.isnull()]
# display(df[df.age.isnull()].head())
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone Age_Fill
0 3 male NaN 0 0 8.4583 Q Third man True NaN Queenstown no True 26.50759
1 2 male NaN 0 0 13.0000 S Second man True NaN Southampton yes True 30.74071
1 3 female NaN 0 0 7.2250 C Third woman False NaN Cherbourg yes True 21.75000
0 3 male NaN 0 0 7.2250 C Third man True NaN Cherbourg no True 26.50759
1 3 female NaN 0 0 7.8792 Q Third woman False NaN Queenstown yes True 21.75000
Run Code Online (Sandbox Code Playgroud)
sklearn.ensemble.RandomForestRegressorfrom sklearn.ensemble import RandomForestRegressor
import pandas as pd
import seaborn as sns
# load dataset
df = sns.load_dataset('titanic')
# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})
# split data
train = df.loc[(df.age.notnull())] # known age values
test = df.loc[(df.age.isnull())] # all nan age values
# select age column
y = train.values[:, 3]
# select pclass and sex
X = train.values[:, [1, 2]]
# create RandomForestRegressor model
rfr = RandomForestRegressor(n_estimators=2000, n_jobs=-1)
# Fit a model
rfr.fit(X, y)
# Use the fitted model to predict the missing values
predictedAges = rfr.predict(test.values[:, [1, 2]])
# create predicted age column
df['pred_age'] = df.age
# fill column
df.loc[(df.pred_age.isnull()), 'pred_age'] = predictedAges
# display(df[df.age.isnull()].head())
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone pred_age
0 3 1 NaN 0 0 8.4583 Q Third man True NaN Queenstown no True 26.49935
1 2 1 NaN 0 0 13.0000 S Second man True NaN Southampton yes True 30.73126
1 3 0 NaN 0 0 7.2250 C Third woman False NaN Cherbourg yes True 21.76513
0 3 1 NaN 0 0 7.2250 C Third man True NaN Cherbourg no True 26.49935
1 3 0 NaN 0 0 7.8792 Q Third woman False NaN Queenstown yes True 21.76513
Run Code Online (Sandbox Code Playgroud)
print(predictedAges - groupby_result).describe())
count 177.00000
mean 0.00362
std 0.01877
min -0.04167
25% 0.01121
50% 0.01121
75% 0.01131
max 0.02969
Name: Age_Fill, dtype: float64
# comparison dataframe
comp = pd.DataFrame({'rfr': predictedAges.tolist(), 'gb': groupby_result.tolist()})
comp['diff'] = comp.rfr - comp.gb
# display(comp)
rfr gb diff
26.51880 26.50759 0.01121
30.69903 30.74071 -0.04167
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
30.69903 30.74071 -0.04167
41.24592 41.28139 -0.03547
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
41.24592 41.28139 -0.03547
21.76131 21.75000 0.01131
30.69903 30.74071 -0.04167
41.24592 41.28139 -0.03547
41.24592 41.28139 -0.03547
41.24592 41.28139 -0.03547
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
28.75266 28.72297 0.02969
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
34.63090 34.61176 0.01913
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
34.63090 34.61176 0.01913
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
30.69903 30.74071 -0.04167
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
30.69903 30.74071 -0.04167
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
30.69903 30.74071 -0.04167
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
30.69903 30.74071 -0.04167
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
28.75266 28.72297 0.02969
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
30.69903 30.74071 -0.04167
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
30.69903 30.74071 -0.04167
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
41.24592 41.28139 -0.03547
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
41.24592 41.28139 -0.03547
26.51880 26.50759 0.01121
34.63090 34.61176 0.01913
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
26.51880 26.50759 0.01121
26.51880 26.50759 0.01121
21.76131 21.75000 0.01131
Run Code Online (Sandbox Code Playgroud)
nan训练集和测试集中的值pandas.DataFrame.fillna,当两个数据帧具有匹配的索引并且填充列相同时,它将从另一个数据帧填充数据帧列中的缺失值。
pclass而是sex设置为索引,这就是.fillna工作原理。train是数据的 67%,test是数据的 33%。
test_size并且train_size可以根据需要设置sklearn.model_selection.train_test_splitimport pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
# load dataset
df = sns.load_dataset('titanic')
# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})
# randomly split the dataframe into a train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# select columns for X and y
X = df[['pclass', 'sex']]
y = df['age']
# create a dataframe of train (X, y) and test (X, y)
train = pd.concat([X_train, y_train], axis=1).reset_index(drop=True)
test = pd.concat([X_test, y_test], axis=1).reset_index(drop=True)
# calculate means for train
train_means = train.groupby(['pclass', 'sex']).agg({'age': 'mean'})
# display train_means, a multi-index dataframe
age
pclass sex
1 0 34.66667
1 41.38710
2 0 27.90217
1 30.50000
3 0 21.56338
1 26.87163
# fill nan values in train
train = train.set_index(['pclass', 'sex']).age.fillna(train_means.age).reset_index()
# fill nan values in test
test = test.set_index(['pclass', 'sex']).age.fillna(train_means.age).reset_index()
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
4896 次 |
| 最近记录: |