1 python machine-learning scikit-learn
The dataset used is Kaggle Titanic.
The error appears in cell 9; the other cells run fine on their own.
In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
In [2]:
train_data.dtypes
In [3]:
train_data.isna().sum()
In [4]:
train_data = train_data.fillna(value = {'Age' :0, 'Embarked' :'u'})
In [5]:
train_data.isna().sum()
In [6]:
train_data.shape
In [7]:
test_data = test_data.fillna(value = {'Age' :0, 'Fare' :0})
In [8]:
test_data.shape
In [9]: In this cell I have specified the features to use, so why does it say the classifier expects 11 features?
y = train_data["Survived"]
features = ["Pclass", "Sex", "SibSp", "Parch", "Age", "Fare", "Embarked"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")
Error traceback
ValueError Traceback (most recent call last) <ipython-input-11-a7ceba9b896f> in <module>
7 model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)
8 model.fit(X, y)
----> 9 predictions = model.predict(X_test)
10
11 output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
c:\python39\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
628 The predicted classes.
629 """
--> 630 proba = self.predict_proba(X)
631
632 if self.n_outputs_ == 1:
c:\python39\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
672 check_is_fitted(self)
673 # Check data
--> 674 X = self._validate_X_predict(X)
675
676 # Assign chunk of trees to jobs
c:\python39\lib\site-packages\sklearn\ensemble\_forest.py in
_validate_X_predict(self, X)
420 check_is_fitted(self)
421
--> 422 return self.estimators_[0]._validate_X_predict(X, check_input=True)
423
424 @property
c:\python39\lib\site-packages\sklearn\tree\_classes.py in
_validate_X_predict(self, X, check_input)
405 """Validate the training data on predict (probabilities)."""
406 if check_input:
--> 407 X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr",
408 reset=False)
409 if issparse(X) and (X.indices.dtype != np.intc or
c:\python39\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
435
436 if check_params.get('ensure_2d', True):
--> 437 self._check_n_features(X, reset=reset)
438
439 return out
c:\python39\lib\site-packages\sklearn\base.py in
_check_n_features(self, X, reset)
363
364 if n_features != self.n_features_in_:
--> 365 raise ValueError(
366 f"X has {n_features} features, but {self.__class__.__name__} "
367 f"is expecting {self.n_features_in_} features as input.")
ValueError: X has 10 features, but DecisionTreeClassifier is expecting 11 features as input
小智 5
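Your training and test sets do not end up with the same number of features, because you call pd.get_dummies() separately on each of them. One set contains a categorical value that the other lacks. Going by the traceback (11 features expected, 10 supplied), the extra column comes from the training set, most likely the Embarked_u column produced by filling missing Embarked values with 'u' in cell 4, which is never done for the test set.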
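To see the mismatch directly, the encoded column sets can be compared. The sketch below uses two tiny made-up frames in place of the Titanic data; the values are hypothetical and only chosen so that 'u' appears in the training frame alone.

```python
import pandas as pd

# Toy stand-ins for train_data / test_data (hypothetical values, not the Kaggle data)
train = pd.DataFrame({"Sex": ["male", "female", "male"], "Embarked": ["S", "C", "u"]})
test = pd.DataFrame({"Sex": ["female", "male"], "Embarked": ["S", "C"]})

# Encoding each frame on its own, as in cell 9
X = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Columns present in one encoded frame but not in the other
only_in_train = sorted(set(X.columns) - set(X_test.columns))
only_in_test = sorted(set(X_test.columns) - set(X.columns))
print("only in train:", only_in_train)
print("only in test:", only_in_test)
```

Running the same two set differences on the real X and X_test from cell 9 should name the column that causes the 10 vs. 11 mismatch.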
The best way to fix this is to use OneHotEncoder() from the sklearn.preprocessing module with the parameter handle_unknown="ignore":
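from sklearn.preprocessing import OneHotEncoder
# Fit the encoder on the training data only, then reuse it for both sets
oneh = OneHotEncoder(handle_unknown="ignore")
oneh.fit(train_data[features])
X = oneh.transform(train_data[features])
X_test = oneh.transform(test_data[features])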
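One caveat with the snippet above: OneHotEncoder treats every column it is given as categorical, so passing all of `features` would also one-hot encode numeric columns such as Age and Fare. A possible variation, sketched here on made-up data, wraps the encoder in a ColumnTransformer so only the categorical columns are encoded. The `categorical` / `numeric` split and the toy frames are assumptions for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for train_data / test_data (hypothetical values, not the Kaggle data)
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 0.0],
    "Embarked": ["S", "C", "u", "S"],
    "Survived": [0, 1, 1, 0],
})
test = pd.DataFrame({
    "Pclass": [3, 2],
    "Sex": ["male", "female"],
    "Age": [34.5, 47.0],
    "Embarked": ["Q", "S"],  # "Q" never appears in the toy training frame
})

categorical = ["Sex", "Embarked"]  # assumed split of the feature list
numeric = ["Pclass", "Age"]

# One-hot encode only the categorical columns; pass numeric ones through unchanged
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

# Fit on the training frame, then apply the same fitted encoder to the test frame
X = encoder.fit_transform(train[categorical + numeric])
X_test = encoder.transform(test[categorical + numeric])

model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)
model.fit(X, train["Survived"])
predictions = model.predict(X_test)
print(X.shape[1], X_test.shape[1], len(predictions))
```

Because the encoder is fitted once, both matrices always have the same width; a category seen only in the test frame is encoded as all zeros instead of adding a column.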
Also, using different preprocessing workflows for the training and test sets (the fillna() calls, in your case) is not a good choice.
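One way to keep the fill step identical is to route both frames through a single function. This is a minimal sketch: `FILL_VALUES` simply merges the values from cells 4 and 7 of the question, `fill_missing` is a hypothetical helper name, and the small frames are made-up stand-ins for the real data.

```python
import pandas as pd

# Fill values taken from the question's cells 4 and 7, merged into one mapping
FILL_VALUES = {"Age": 0, "Fare": 0, "Embarked": "u"}

def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the same fill values to any frame (hypothetical helper)."""
    return df.fillna(value=FILL_VALUES)

# Toy stand-ins with gaps in different columns (hypothetical values)
train = pd.DataFrame({"Age": [22.0, None], "Fare": [7.25, 71.3], "Embarked": ["S", None]})
test = pd.DataFrame({"Age": [None, 47.0], "Fare": [None, 7.0], "Embarked": ["Q", "S"]})

train_filled = fill_missing(train)
test_filled = fill_missing(test)
print(int(train_filled.isna().sum().sum()), int(test_filled.isna().sum().sum()))
```

Sharing the fill step keeps the two frames consistent, but it does not by itself guarantee matching dummy columns: a placeholder such as 'u' can still appear in only one frame, so the encoder still needs to be fitted on the training data and reused.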