"Could not convert string to float" error in the Titanic competition

Onu*_*bek 5 python numpy machine-learning pandas scikit-learn

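I'm trying to solve the Titanic survival problem from Kaggle. This is my first real step in learning machine learning. I have a problem where the gender column causes an error. The stack trace says could not convert string to float: 'female'. How have you dealt with this issue? I'm not looking for a finished solution. I just want a practical approach to the problem, because I do need the gender column to build my model.

Here is my code: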

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

train_path = "C:\\Users\\Omar\\Downloads\\Titanic Data\\train.csv"
train_data = pd.read_csv(train_path)
columns_of_interest = ['Survived','Pclass', 'Sex', 'Age']
filtered_titanic_data = train_data.dropna(axis=0)

x = filtered_titanic_data[columns_of_interest]
y = filtered_titanic_data.Survived

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

titanic_model = DecisionTreeRegressor()
titanic_model.fit(train_x, train_y)

val_predictions = titanic_model.predict(val_x)
print(filtered_titanic_data)

sac*_*cuL 8

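There are a couple of ways to deal with this, and which one fits depends on what you're looking for:

  1. You could encode your categories as numbers, i.e. map each level of the category to a distinct number,

or

  2. dummy code your category, i.e. turn each level of the category into a separate column that takes the value 0 or 1.

In many machine learning applications, factors are better handled as dummy codes.

Note that for a 2-level category, encoding to numeric with the methods outlined below is essentially equivalent to dummy coding: every value that is not level 0 must be level 1. In fact, the dummy coding example further down contains redundant information, because I've given each of the 2 classes its own column. That is only to illustrate the concept. Typically one would create only n-1 columns, where n is the number of levels, and the omitted level is implied (i.e. create a single column for Female, in which every 0 implies Male).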

Encoding your categories as numbers:

Method 1: pd.factorize

pd.factorize is a simple, fast way of encoding to numeric:

For example, if your gender column looks like this:

>>> df
   gender
0  Female
1    Male
2    Male
3    Male
4  Female
5  Female
6    Male
7  Female
8  Female
9  Female

df['gender_factor'] = pd.factorize(df.gender)[0]

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0

Method 2: category dtype

Another way is to use the category dtype:

df['gender_factor'] = df['gender'].astype('category').cat.codes

For the example column above, this produces the same output.
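The two methods agree on the example only because Female both appears first and sorts first. As a minimal sketch on a hypothetical column where Male appears first (the series `s` is made up for illustration), the codes can differ: pd.factorize numbers levels in order of first appearance, while the category dtype numbers them in sorted order.

```python
import pandas as pd

# Hypothetical column in which "Male" appears before "Female"
s = pd.Series(["Male", "Female", "Female", "Male"])

# factorize numbers the levels in order of first appearance
factor_codes, factor_levels = pd.factorize(s)

# the category dtype numbers the levels in sorted order
cat_codes = s.astype("category").cat.codes

print(factor_codes.tolist())  # [0, 1, 1, 0]
print(cat_codes.tolist())     # [1, 0, 0, 1]
```

If the sorted numbering is wanted from pd.factorize as well, passing sort=True should bring the two into line.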

Method 3: sklearn.preprocessing.LabelEncoder()

This method comes with some bonuses, such as easy back-transformation:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()

# Transform the gender column
df['gender_factor'] = le.fit_transform(df.gender)

>>> df
   gender  gender_factor
0  Female              0
1    Male              1
2    Male              1
3    Male              1
4  Female              0
5  Female              0
6    Male              1
7  Female              0
8  Female              0
9  Female              0

# Easy to back transform:

df['gender_factor'] = le.inverse_transform(df.gender_factor)

>>> df
   gender gender_factor
0  Female        Female
1    Male          Male
2    Male          Male
3    Male          Male
4  Female        Female
5  Female        Female
6    Male          Male
7  Female        Female
8  Female        Female
9  Female        Female
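To see which number stands for which level, the fitted encoder can be inspected. The following is a small sketch on a shortened, made-up version of the gender column; the names `mapping` and `gender_restored` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Shortened, hypothetical version of the example column
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

le = LabelEncoder()
df["gender_factor"] = le.fit_transform(df["gender"])

# classes_ holds the levels in sorted order; a level's position is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}

# Write the back-transformed values to a new column, keeping the codes
df["gender_restored"] = le.inverse_transform(df["gender_factor"])

# The fitted encoder can be reused on data with the same levels
new_codes = le.transform(["Male", "Female"]).tolist()
print(new_codes)  # [1, 0]
```

The scikit-learn documentation describes LabelEncoder as intended for target values rather than input features, so for feature columns OrdinalEncoder or one of the pandas methods may be the more conventional choice.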

Dummy coding:

Method 1: pd.get_dummies

df.join(pd.get_dummies(df.gender))

   gender  Female  Male
0  Female       1     0
1    Male       0     1
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8  Female       1     0
9  Female       1     0
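One version caveat: since pandas 2.0 the dummy columns default to a boolean dtype, so the same call may print True/False instead of 1/0. A minimal sketch, using a shortened made-up column, that requests integers explicitly:

```python
import pandas as pd

# Shortened, hypothetical version of the example column
df = pd.DataFrame({"gender": ["Female", "Male", "Male", "Female"]})

# dtype=int requests 0/1 integers regardless of the pandas default
dummies = pd.get_dummies(df["gender"], dtype=int)
result = df.join(dummies)
print(result)
```

Either dtype should work for fitting a scikit-learn tree; the integer form simply matches the output shown above.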

Note that if you want to omit one column to get a non-redundant dummy code (see my note at the beginning of this answer), you can use:

df.join(pd.get_dummies(df.gender, drop_first=True))

   gender  Male
0  Female     0
1    Male     1
2    Male     1
3    Male     1
4  Female     0
5  Female     0
6    Male     1
7  Female     0
8  Female     0
9  Female     0
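Applied to the question, the idea could look roughly like the sketch below. The small DataFrame is a made-up stand-in for train.csv that only borrows its column names, and leaving Survived out of the feature list is a choice of this sketch (it is also the target), not something taken from the question.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv with the same column names
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 3, 2, 1, 3, 2],
    "Sex":      ["male", "female", "female", "male",
                 "female", "male", "male", "female"],
    "Age":      [22.0, 38.0, 26.0, 35.0, 27.0, 54.0, 2.0, 14.0],
})

# Replace the string column with a single 0/1 column, "Sex_male"
features = titanic[["Pclass", "Sex", "Age"]]
x = pd.get_dummies(features, columns=["Sex"], drop_first=True, dtype=int)
y = titanic["Survived"]

train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=0)

# With only numeric columns left, fit no longer raises the conversion error
titanic_model = DecisionTreeRegressor(random_state=0)
titanic_model.fit(train_x, train_y)
val_predictions = titanic_model.predict(val_x)
```

Because Sex has only two levels, any of the numeric encodings above would give the model the same information as the single Sex_male column.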