ValueError: X has 10 features, but DecisionTreeClassifier is expecting 11 features as input

python machine-learning scikit-learn

The dataset used is the Kaggle Titanic dataset.

The error is raised in cell 9; the other cells run fine on their own.

In [1]:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

train_data   = pd.read_csv('train.csv')
test_data  = pd.read_csv('test.csv')

In [2]:

train_data.dtypes

In [3]:

train_data.isna().sum()

In [4]:

train_data = train_data.fillna(value = {'Age' :0, 'Embarked' :'u'})

In [5]:

train_data.isna().sum()

In [6]:

train_data.shape

In [7]:

test_data = test_data.fillna(value = {'Age' :0, 'Fare' :0})

In [8]:

test_data.shape

In [9]: In this cell I have specified which features to use, so why does it say the classifier is expecting 11 features?

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Embarked"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")

Error traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-11-a7ceba9b896f> in <module>
      7 model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)
      8 model.fit(X, y)
----> 9 predictions = model.predict(X_test)
     10 
     11 output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})

c:\python39\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
    628             The predicted classes.
    629         """
--> 630         proba = self.predict_proba(X)
    631 
    632         if self.n_outputs_ == 1:

c:\python39\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
    672         check_is_fitted(self)
    673         # Check data
--> 674         X = self._validate_X_predict(X)
    675 
    676         # Assign chunk of trees to jobs

c:\python39\lib\site-packages\sklearn\ensemble\_forest.py in _validate_X_predict(self, X)
    420         check_is_fitted(self)
    421 
--> 422         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    423 
    424     @property

c:\python39\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    405         """Validate the training data on predict (probabilities)."""
    406         if check_input:
--> 407             X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
    408                                     reset=False)
    409             if issparse(X) and (X.indices.dtype != np.intc or

c:\python39\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    435 
    436         if check_params.get('ensure_2d', True):
--> 437             self._check_n_features(X, reset=reset)
    438 
    439         return out

c:\python39\lib\site-packages\sklearn\base.py in _check_n_features(self, X, reset)
    363 
    364         if n_features != self.n_features_in_:
--> 365             raise ValueError(
    366                 f"X has {n_features} features, but {self.__class__.__name__} "
    367                 f"is expecting {self.n_features_in_} features as input.")

ValueError: X has 10 features, but DecisionTreeClassifier is expecting 11 features as input

小智 5

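Your training and test sets end up with different numbers of features because you call pd.get_dummies() on each of them separately, so each call only creates columns for the categories it sees in its own data. Any category that appears in one set but not the other produces a column mismatch. Judging by the error (11 features at fit time, 10 at predict time), the extra column is on the training side, most likely the 'u' placeholder that cell 4 fills into Embarked, which the test set never receives.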
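To make the mismatch concrete, here is a minimal sketch on toy frames (the values are made up, not taken from the Titanic files). It assumes the 'u' placeholder exists only in the training frame, and shows one possible workaround if you want to stay with get_dummies: reindex the test dummies to the training columns.

```python
import pandas as pd

# Toy stand-ins for the Titanic frames (values are made up for illustration).
train = pd.DataFrame({"Age": [22.0, 0.0, 35.0], "Embarked": ["S", "u", "C"]})
test = pd.DataFrame({"Age": [30.0, 41.0], "Embarked": ["S", "C"]})

# Each call only creates dummy columns for the categories it sees.
X = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Columns present at fit time but missing at predict time.
missing = sorted(set(X.columns) - set(X_test.columns))
print(missing)

# Align the test columns to the training columns; absent dummies become 0.
X_test_aligned = X_test.reindex(columns=X.columns, fill_value=0)
print(X.shape[1], X_test.shape[1], X_test_aligned.shape[1])
```

Reindexing also drops any dummy column that exists only in the test set, which is the same effect handle_unknown="ignore" has for categories unseen during fitting.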

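The best way to fix this is to use OneHotEncoder() from the sklearn.preprocessing module with the parameter handle_unknown="ignore". Fit it on the training set and reuse the fitted encoder on the test set: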

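from sklearn.preprocessing import OneHotEncoder

# Fit on the training set only, then apply the same encoder to both sets.
# Note: this encodes every column in `features`, numeric ones included.
oneh = OneHotEncoder(handle_unknown="ignore")
oneh.fit(train_data[features])
X = oneh.transform(train_data[features])
X_test = oneh.transform(test_data[features])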

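Also, using different preprocessing workflows for the training set and the test set is not a good idea. In your case the fillna() calls differ: Embarked is filled only in the training set and Fare only in the test set.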
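One way to keep the preprocessing identical for both sets is to put imputation and encoding inside a single scikit-learn Pipeline, so whatever is learned from the training data is replayed on the test data. The following is only a sketch under assumptions: the toy frames, the numeric/categorical column split, and the imputation strategies are illustrative choices, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for train.csv / test.csv (values are made up for illustration).
train_df = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.3, 8.05, 51.9],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
    "Survived": [0, 1, 1, 0],
})
test_df = pd.DataFrame({
    "Age": [30.0, np.nan],
    "Fare": [np.nan, 13.0],
    "Sex": ["female", "male"],
    "Embarked": ["Q", "S"],  # "Q" never appears in the toy training data
})

numeric = ["Age", "Fare"]          # assumed split, for illustration
categorical = ["Sex", "Embarked"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median learned from the training data.
    ("num", SimpleImputer(strategy="median"), numeric),
    # Categorical columns: fill gaps, then one-hot encode. Categories unseen
    # during fit are encoded as all zeros instead of raising an error.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)),
])

# fit() learns the imputation values and categories from the training data;
# predict() applies exactly the same transformations to the test data.
model.fit(train_df[numeric + categorical], train_df["Survived"])
predictions = model.predict(test_df[numeric + categorical])
print(len(predictions))
```

Because the encoder's categories are fixed at fit time, the number of features seen at predict time always matches what the forest was trained on.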