How to perform OneHotEncoding in Sklearn, getting ValueError

pyd*_*pyd 0 python preprocessor scikit-learn sklearn-pandas one-hot-encoding

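I have just started learning machine learning. While practicing one of the tasks I got a ValueError, even though I followed the same steps as the instructor.

I am getting a ValueError, please help.

dff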

     Country    Name
 0     AUS      Sri
 1     USA      Vignesh
 2     IND      Pechi
 3     USA      Raj

First I performed label encoding:

X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])

out:
X
array([[0, 'Sri'],
       [2, 'Vignesh'],
       [1, 'Pechi'],
       [2, 'Raj']], dtype=object)

Then I performed one-hot encoding on the same X:

onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()

I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
   1900         """
   1901         return _transform_selected(X, self._fit_transform,
-> 1902                                    self.categorical_features, copy=True)
   1903 
   1904     def _transform(self, X):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
   1695     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1696     """
-> 1697     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
   1698 
   1699     if isinstance(selected, six.string_types) and selected == "all":

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: 'Raj'

Please edit my question if anything is wrong with it. Thanks in advance!

Tho*_*ves 6

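You can now go straight to OneHotEncoder without using LabelEncoder first. As we move toward version 0.22, many people may want to do it this way to avoid warnings and potential errors (see the examples in the documentation).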
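Judging by the traceback in the question, the older encoder first converted the whole array to float through `check_array`, so the untouched Name column is what raised the error, not the label-encoded Country column. Below is a minimal sketch of that conversion step using only NumPy; the array literal is an assumption that mirrors the asker's `X` after label encoding.

```python
import numpy as np

# Assumed stand-in for the asker's X after LabelEncoder was applied to column 0
X = np.array([[0, "Sri"], [2, "Vignesh"], [1, "Pechi"], [2, "Raj"]], dtype=object)

message = None
try:
    # Same kind of conversion the traceback shows inside check_array
    np.array(X, dtype=np.float64)
except ValueError as exc:
    message = str(exc)

print(message)
```

The Name strings cannot be converted to float, which is why encoding the string columns directly with OneHotEncoder avoids the problem.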


Code example 1, where all columns are encoded and the categories are specified explicitly:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

# Sorted unique values of each column, passed as explicit categories
countries = np.unique(X[:, 0])
names = np.unique(X[:, 1])

ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()  # convert the sparse result to a dense array

print(X)

Output of code example 1:

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]

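Code example 2 shows the 'auto' option for specifying the categories:

The first 3 columns encode the country names, and the last 4 columns encode the personal names.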

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

# 'auto' lets the encoder determine the categories from the data
ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()

print(X)

Output of code example 2 (same as 1):

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]
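To check which output columns belong to which input column, one option is to inspect the fitted encoder's `categories_` attribute. This is only a sketch on the same toy data; the variable names `country_block` and `name_block` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array(
    [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]],
    dtype=object,
)

ohe = OneHotEncoder(categories="auto")
encoded = ohe.fit_transform(X).toarray()

# One array of learned categories per input column, in sorted order
country_cats, name_cats = ohe.categories_
print(country_cats)
print(name_cats)

# Split the encoded matrix back into per-feature blocks (hypothetical names)
n_country = len(country_cats)
country_block = encoded[:, :n_country]
name_block = encoded[:, n_country:]
print(country_block.shape, name_block.shape)
```

The order of the categories determines the order of the one-hot columns, so the first block corresponds to Country and the second to Name.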

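Code example 3, where only the first column is one-hot encoded:

Now, here is the interesting part. What if you only need to one-hot encode a specific column of your data?

(Note: for illustration purposes I have left the last column as strings. In practice this makes more sense when the last column is already numeric.)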

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

countries = np.unique(X[:, 0])

ohe = OneHotEncoder(categories=[countries])  # specify ONLY unique country names
tmp = ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()

# Append the original Name column so each name stays aligned with its row
X = np.append(tmp, X[:, 1].reshape(-1, 1), axis=1)

print(X)

Output of code example 3:

[[1.0 0.0 0.0 'Sri']
 [0.0 0.0 1.0 'Vignesh']
 [0.0 1.0 0.0 'Pechi']
 [0.0 0.0 1.0 'Raj']]
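An alternative to encoding one column by hand and appending the rest is scikit-learn's `ColumnTransformer`, which is a different technique from the one shown above. The sketch below assumes a scikit-learn version that provides `sklearn.compose` (0.20 or later); the transformer label `"country"` is an arbitrary name chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=["Country", "Name"])

ct = ColumnTransformer(
    transformers=[
        # One-hot encode only the Country column
        ("country", OneHotEncoder(categories="auto"), ["Country"]),
    ],
    remainder="passthrough",  # keep the Name column unchanged
    sparse_threshold=0,       # always return a dense array (Name holds strings)
)

X = ct.fit_transform(df)
print(X)
```

The passed-through column is placed after the encoded columns, so each row keeps its own name next to its country encoding.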