Encoding ordinal values in Python

Anh*_*ora 2 python machine-learning neural-network scikit-learn deep-learning

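I am trying to encode the ordinal categorical values in the third column of my dataset, where "Tiny Mongra" has the lowest value and "1st Wand" has the highest. This is analogous to small, medium, and large sizes; in this dataset the values represent the size of a grain of rice.

When I run this snippet, I keep getting the following error: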

Traceback (most recent call last):

  File "<ipython-input-1-ae4501cc0ac1>", line 19, in <module>
    X[:, 2] = ordinalencoder_X_3.fit_transform(X[:, 2])

  File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/base.py", line 462, in fit_transform
    return self.fit(X, **fit_params).transform(X)

  File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 794, in fit
    self._fit(X)

  File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 61, in _fit
    X = self._check_X(X)

  File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/_encoders.py", line 47, in _check_X
    X_temp = check_array(X, dtype=None)

  File "/Users/anhad/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py", line 552, in check_array
    "if it contains a single sample.".format(array))

ValueError: Expected 2D array, got 1D array instead:
array=['1st Wand' '1st Wand' '1st Wand' ... '1st Wand' '1st Wand' '1st Wand'].

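On closer inspection, I found that the error is not about my list of categories but about the column I want to encode. For some reason it treats that column as a 1D array of the following form: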

array=['1st Wand' '1st Wand' '1st Wand' '1st Wand' '1st Wand' 'Dubar' '2nd Wand'
 'Tibar' 'Mongra' '1st Wand' '1st Wand' '1st Wand' '1st Wand' '1st Wand'
 '1st Wand' '2nd Wand' 'Super Dubar' 'Super Tibar' ... '1st Wand' '1st Wand'].

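This is strange, because I used LabelEncoder to fit the other categorical values in the dataset and they worked fine.

Here is a link to the data. See the "Data" sheet: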

https://docs.google.com/spreadsheets/d/12nAU5QztVnVroRYDsRDsZGUyBpBTwAD5yMmbMaAxnHQ/edit?usp=sharing

Here is the full code. See the last part:

import numpy as np
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Ryze Price NN Data.csv')
X = dataset.iloc[:, 1:7].values
y = dataset.iloc[:, 7].values

# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])

labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])

# SEE THIS PART
category_array = ["Tiny Mongra","Mini Mongra","Mongra","Super Mongra","Mini Dubar","Dubar","Super Dubar","Mini Tibar","Tibar","Super Tibar","2nd Wand","Super 2nd Wand","1st Wand"]
ordinalencoder_X_3 = OrdinalEncoder(categories=category_array)
X[:, 2] = ordinalencoder_X_3.fit_transform(np.array(X[:, 2]))

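I expect the categorical data to be encoded as follows: "Tiny Mongra" should be encoded as 0, and so on up to "1st Wand", which should be encoded as 12.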

Raf*_*faó 5

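The main difference between LabelEncoder and OrdinalEncoder is their purpose:

  • LabelEncoder should be used for the target variable,
  • OrdinalEncoder should be used for feature variables.

In general they work the same way, but:

  • LabelEncoder expects y: array-like of shape [n_samples]
  • OrdinalEncoder expects X: array-like of shape [n_samples, n_features]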
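
The shape difference is what triggers the "Expected 2D array, got 1D array" error in the question. A minimal sketch of it, using a small toy array as a stand-in for one column of X (the array and variable names here are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy stand-in for a single column such as X[:, 2]
column = np.array(["1st Wand", "Dubar", "Tibar", "Dubar"], dtype=object)

# LabelEncoder accepts the 1D column directly
label_codes = LabelEncoder().fit_transform(column)

# OrdinalEncoder needs shape (n_samples, n_features), so add a feature axis
column_2d = column.reshape(-1, 1)
ordinal_codes = OrdinalEncoder().fit_transform(column_2d)

print(column.shape, column_2d.shape)
print(label_codes)
print(ordinal_codes.ravel())
```

Passing `column` itself to OrdinalEncoder would raise the same ValueError as in the traceback; reshaping with `reshape(-1, 1)` is one way around it.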

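If you only want to encode the values of a categorical variable as the integers 0, 1, ..., n_classes - 1, use LabelEncoder the same way as you did for X1 and X2.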

labelencoder_X_3 = LabelEncoder()
X[:, 2] = labelencoder_X_3.fit_transform(X[:, 2])
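
One caveat: LabelEncoder assigns codes in sorted label order, so "1st Wand" would get 0 rather than 12. If the size order from the question matters, a possible approach is to keep OrdinalEncoder but pass the category list wrapped in an outer list (one list per feature) and give it a 2D column. The sketch below assumes a toy `sizes` array in place of X[:, 2]:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Smallest to largest grain size, as listed in the question
category_array = ["Tiny Mongra", "Mini Mongra", "Mongra", "Super Mongra",
                  "Mini Dubar", "Dubar", "Super Dubar", "Mini Tibar", "Tibar",
                  "Super Tibar", "2nd Wand", "Super 2nd Wand", "1st Wand"]

# Toy stand-in for X[:, 2]
sizes = np.array(["1st Wand", "Tiny Mongra", "Dubar", "2nd Wand"], dtype=object)

# categories takes one list per feature, hence the outer brackets
encoder = OrdinalEncoder(categories=[category_array])

# Reshape to (n_samples, 1) for fitting, then flatten back to 1D
encoded = encoder.fit_transform(sizes.reshape(-1, 1)).ravel()
print(encoded)  # codes follow the positions in category_array
```

On the real data, the flattened result could then be assigned back with `X[:, 2] = encoded`.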

But I would use OrdinalEncoder to transform all three variables at once:

ordinalencoder_X = OrdinalEncoder()
X[:, 0:3] = ordinalencoder_X.fit_transform(X[:, 0:3])
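
To check which integer each value received, the fitted encoder exposes `categories_`, one array per column. A small sketch with made-up data (the first two columns of `X_cat` are hypothetical placeholders, since the question does not show what those columns contain):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical three-column slice standing in for X[:, 0:3]
X_cat = np.array([
    ["north", "basmati", "Dubar"],
    ["south", "basmati", "1st Wand"],
    ["north", "sella", "Tibar"],
], dtype=object)

encoder = OrdinalEncoder()
X_enc = encoder.fit_transform(X_cat)

# One array of learned categories per column; a value's position is its code
for i, cats in enumerate(encoder.categories_):
    print(i, list(cats))
print(X_enc)
```

With the default `categories='auto'`, each column's categories are sorted, so the third column here is ordered alphabetically rather than by grain size.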