如何在泰坦尼克时代列中通过插补填充 NaN 值

Sim*_*n H 1 python pandas scikit-learn sklearn-pandas imputation

我正在参加泰坦尼克号 Kaggle 比赛,目前正在尝试估算缺失Age值。

这个想法是计算训练集上每组的Age平均值,然后使用该信息来替换训练集和测试集。[Pclass, Sex]NaN

这是我到目前为止所拥有的:

meanAgeTrain = train.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
    
for df in [train, test]:
    df['Age'] = df['Age'].fillna(meanAgeTrain)
Run Code Online (Sandbox Code Playgroud)

问题是,这仍然在测试集中留下了一些 NaN 值,同时消除了训练集中的所有 Nan。我认为这与指数有关。

我需要的是:

  1. 计算训练集中每个 Pclass/Sex 组的平均值
  2. 将训练集中的所有 NaN 值映射到正确的均值
  3. 将测试集中的所有 NaN 值映射到正确的平均值(按 Pclass/Sex 查找,而不是基于索引)

如何使用 Pandas 正确完成此操作?

编辑:

感谢您的建议。@Reza 的那个可以工作,但我不能 100% 理解它。所以我正在尝试提出自己的解决方案。

这是可行的,但我是 Pandas 新手,想知道是否有更简单的方法来实现它。

trainMeans = self.train.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()

def f(x):
    if x["Age"] == x["Age"]:  # not NaN
        return x["Age"]
    return trainMeans.loc[(trainMeans["Pclass"] == x["Pclass"]) & (trainMeans["Sex"] == x["Sex"])]["Age"].values[0]

 self.train['Age'] = self.train.apply(f, axis=1)
 self.test['Age'] = self.test.apply(f, axis=1)
Run Code Online (Sandbox Code Playgroud)

尤其是函数中的 if 对我来说看起来不是最佳实践。我需要一种方法将该函数仅应用于 NaN 年龄。

编辑2

事实证明,重置索引会使事情变得更加复杂和缓慢,因为在分组之后,索引已经正是我想要用作映射键的内容。这更快更容易:

trainMeans = self.train.groupby(['Pclass', 'Sex'])['Age'].mean()

def f(x):
    if not np.isnan(x["Age"]):  # not NaN
        return x["Age"]
    return trainMeans[x["Pclass"], x["Sex"]]

self.train['Age'] = self.train.apply(f, axis=1)
self.test['Age'] = self.test.apply(f, axis=1)
Run Code Online (Sandbox Code Playgroud)

这可以进一步简化吗?

Tre*_*ney 5

  • 您将看到这两种填充方法(使用平均值的 groupby fillna随机森林回归器)彼此相差不到一年的 1/100
    • 请参阅答案底部的统计比较。

用平均值填充 nan 值

import pandas as pd
import seaborn as sns

# load dataset
df = sns.load_dataset('titanic')

# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})

# Populate Age_Fill
df['Age_Fill'] = df['age'].groupby([df['pclass'], df['sex']]).apply(lambda x: x.fillna(x.mean()))

# series with filled ages
groupby_result = df.Age_Fill[df.age.isnull()]

# display(df[df.age.isnull()].head())
 survived  pclass     sex  age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone  Age_Fill
        0       3    male  NaN      0      0   8.4583        Q   Third    man        True  NaN   Queenstown    no   True  26.50759
        1       2    male  NaN      0      0  13.0000        S  Second    man        True  NaN  Southampton   yes   True  30.74071
        1       3  female  NaN      0      0   7.2250        C   Third  woman       False  NaN    Cherbourg   yes   True  21.75000
        0       3    male  NaN      0      0   7.2250        C   Third    man        True  NaN    Cherbourg    no   True  26.50759
        1       3  female  NaN      0      0   7.8792        Q   Third  woman       False  NaN   Queenstown   yes   True  21.75000
Run Code Online (Sandbox Code Playgroud)

从 RandomForestRegressor 填充 nan 值

from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import seaborn as sns

# load dataset
df = sns.load_dataset('titanic')

# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})

# split data
train = df.loc[(df.age.notnull())]  # known age values
test = df.loc[(df.age.isnull())]  # all nan age values

# select age column
y = train.values[:, 3]

# select pclass and sex
X = train.values[:, [1, 2]]

# create RandomForestRegressor model
rfr = RandomForestRegressor(n_estimators=2000, n_jobs=-1)

# Fit a model
rfr.fit(X, y)

# Use the fitted model to predict the missing values
predictedAges = rfr.predict(test.values[:, [1, 2]])

# create predicted age column
df['pred_age'] = df.age

# fill column
df.loc[(df.pred_age.isnull()), 'pred_age'] = predictedAges 

# display(df[df.age.isnull()].head())
 survived  pclass  sex  age  sibsp  parch     fare embarked   class    who  adult_male deck  embark_town alive  alone  pred_age
        0       3    1  NaN      0      0   8.4583        Q   Third    man        True  NaN   Queenstown    no   True  26.49935
        1       2    1  NaN      0      0  13.0000        S  Second    man        True  NaN  Southampton   yes   True  30.73126
        1       3    0  NaN      0      0   7.2250        C   Third  woman       False  NaN    Cherbourg   yes   True  21.76513
        0       3    1  NaN      0      0   7.2250        C   Third    man        True  NaN    Cherbourg    no   True  26.49935
        1       3    0  NaN      0      0   7.8792        Q   Third  woman       False  NaN   Queenstown   yes   True  21.76513
Run Code Online (Sandbox Code Playgroud)

按 rfr 进行分组比较

print(predictedAges - groupby_result).describe())

count    177.00000
mean       0.00362
std        0.01877
min       -0.04167
25%        0.01121
50%        0.01121
75%        0.01131
max        0.02969
Name: Age_Fill, dtype: float64

# comparison dataframe
comp = pd.DataFrame({'rfr': predictedAges.tolist(), 'gb': groupby_result.tolist()})
comp['diff'] = comp.rfr - comp.gb

# display(comp)
      rfr        gb     diff
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 30.69903  30.74071 -0.04167
 41.24592  41.28139 -0.03547
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 21.76131  21.75000  0.01131
 30.69903  30.74071 -0.04167
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 28.75266  28.72297  0.02969
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 30.69903  30.74071 -0.04167
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 28.75266  28.72297  0.02969
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 30.69903  30.74071 -0.04167
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 30.69903  30.74071 -0.04167
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 41.24592  41.28139 -0.03547
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 41.24592  41.28139 -0.03547
 26.51880  26.50759  0.01121
 34.63090  34.61176  0.01913
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
 26.51880  26.50759  0.01121
 26.51880  26.50759  0.01121
 21.76131  21.75000  0.01131
Run Code Online (Sandbox Code Playgroud)

计算随机训练集的平均值

  • 此示例计算随机训练集的平均值,然后填充nan训练集和测试集中的值
  • 使用pandas.DataFrame.fillna,当两个数据帧具有匹配的索引并且填充列相同时,它将从另一个数据帧填充数据帧列中的缺失值。
    • Pclass/Sex 并不基于索引pclass而是sex设置为索引,这就是.fillna工作原理。
  • 在此示例中,train是数据的 67%,test是数据的 33%。
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# load dataset
df = sns.load_dataset('titanic')

# map sex to a numeric type
df.sex = df.sex.map({'male': 1, 'female': 0})

# randomly split the dataframe into a train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# select columns for X and y
X = df[['pclass', 'sex']]
y = df['age']

# create a dataframe of train (X, y) and test (X, y)
train = pd.concat([X_train, y_train], axis=1).reset_index(drop=True)
test = pd.concat([X_test, y_test], axis=1).reset_index(drop=True)

# calculate means for train
train_means = train.groupby(['pclass', 'sex']).agg({'age': 'mean'})

# display train_means, a multi-index dataframe
                 age
pclass sex          
1      0    34.66667
       1    41.38710
2      0    27.90217
       1    30.50000
3      0    21.56338
       1    26.87163

# fill nan values in train
train = train.set_index(['pclass', 'sex']).age.fillna(train_means.age).reset_index()

# fill nan values in test
test = test.set_index(['pclass', 'sex']).age.fillna(train_means.age).reset_index()
Run Code Online (Sandbox Code Playgroud)