ValueError: Found unknown categories ['?'] in column 7 during transform - breast cancer dataset

Peg*_*s18 5 python machine-learning scikit-learn

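I am trying to solve a classification machine-learning problem using the 1988 breast cancer dataset from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer). I keep getting the following error, though not consistently. Sometimes the code runs straight through to training the model and predicting test accuracy; other times it fails at the OneHotEncoder step with the following error: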

ohe = OneHotEncoder()
ohe.fit(X_train)
X_train_encoded = ohe.transform(X_train)
X_test_encoded = ohe.transform(X_test)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-2cfd638a5b4d> in <module>()
      2 ohe.fit(X_train)
      3 X_train_encoded = ohe.transform(X_train)
----> 4 X_test_encoded = ohe.transform(X_test)

1 frames
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown)
    122                     msg = ("Found unknown categories {0} in column {1}"
    123                            " during transform".format(diff, i))
--> 124                     raise ValueError(msg)
    125                 else:
    126                     # Set the problematic rows to an acceptable value and

ValueError: Found unknown categories ['?'] in column 7 during transform

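I have tried running it in both Colab and Spyder and hit the same problem, and I can't tell what is going wrong. I impute missing values before splitting the dataset and then encoding, but I still get the error even if I remove the SimpleImputer.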

import numpy as np
import pandas as pd

dataset = pd.read_csv('breast-cancer.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer.fit(X)
X_imputed = imputer.transform(X)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size = 0.25)

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
ohe = OneHotEncoder()
ohe.fit(X_train)
X_train_encoded = ohe.transform(X_train)
X_test_encoded = ohe.transform(X_test)  # <-- Code stops running here (ValueError)

le = LabelEncoder()
le.fit(y_train)
y_train_encoded = le.transform(y_train)
y_test_encoded = le.transform(y_test)

小智 10

The test data may contain new entries that are not present in the training data. Can you try this?


ohe = OneHotEncoder(handle_unknown = "ignore")


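About this parameter: it controls whether to raise an error or ignore when an unknown categorical feature is present during transform (the default is to raise). When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros.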
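As a small illustration of that behaviour, here is a minimal sketch on made-up toy data (the single-column arrays below are assumptions for the example, not the breast cancer dataset). The category "?" appears only in the test split, so its row should come out as all zeros:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration: one categorical column.
X_train = np.array([["yes"], ["no"], ["yes"]], dtype=object)
X_test = np.array([["no"], ["?"]], dtype=object)  # "?" never appears in X_train

# With handle_unknown="ignore", unseen categories do not raise at transform time.
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train)

# transform() returns a sparse matrix by default; toarray() makes it dense.
X_test_encoded = ohe.transform(X_test).toarray()

print(ohe.categories_)   # categories learned from X_train only
print(X_test_encoded)    # the row for "?" is all zeros
```

Note that an all-zero row means the model sees that sample as belonging to none of the known categories, which may or may not be what you want.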


More here:


https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
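To check whether unseen categories really are the cause, you could compare the two splits column by column before encoding. This is only a sketch: the helper name `unseen_categories` is hypothetical, and the small arrays stand in for your `X_train` and `X_test`.

```python
import numpy as np

def unseen_categories(X_train, X_test):
    """Return {column index: sorted values found in X_test but not in X_train}."""
    unseen = {}
    for col in range(X_train.shape[1]):
        diff = set(X_test[:, col]) - set(X_train[:, col])
        if diff:
            unseen[col] = sorted(diff)
    return unseen

# Made-up stand-ins for the real splits.
X_train = np.array([["a", "left"], ["b", "right"]], dtype=object)
X_test = np.array([["a", "?"], ["b", "left"]], dtype=object)

result = unseen_categories(X_train, X_test)
print(result)  # columns listed here would trigger the ValueError
```

If the '?' turns out to be the dataset's placeholder for a missing value, note that SimpleImputer(missing_values=np.nan) does not touch it, because '?' is an ordinary string rather than NaN. That would be consistent with the error showing up with or without the imputer, and only on some runs, depending on whether the random split puts every '?' row of that column in the test set.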
