Using OrdinalEncoder to transform categorical values

Question

I have a dataset with many columns:

No  Name  Sex  Blood  Grade  Height  Study
1   Tom   M    O      56     160     Math
2   Harry M    A      76     192     Math
3   John  M    A      45     178     English
4   Nancy F    B      78     157     Biology
5   Mike  M    O      79     167     Math
6   Kate  F    AB     66     156     English
7   Mary  F    O      99     166     Science

I want to change it to this:

No  Name  Sex  Blood  Grade  Height  Study
1   Tom   0    0      56     160     0
2   Harry 0    1      76     192     0
3   John  0    1      45     178     1
4   Nancy 1    2      78     157     2
5   Mike  0    0      79     167     0
6   Kate  1    3      66     156     1
7   Mary  1    0      99     166     3

I know there is a library that can do this:

from sklearn.preprocessing import OrdinalEncoder

I tried this, but it didn't work:

enc = OrdinalEncoder()
enc.fit(df[["Sex","Blood", "Study"]])

Can anyone help me figure out what I'm doing wrong and how to fix it?

Thanks.

Answer 1

abc*_*ire 16

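You're almost there!

The fit method only prepares the encoder (it fits it to your data, i.e. builds the mapping); it does not transform the data.

You have to call transform to actually convert the data, or use fit_transform, which fits and transforms the same data in one step.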

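enc = OrdinalEncoder()
# fit only learns the category-to-integer mapping; df is left unchanged
enc.fit(df[["Sex","Blood", "Study"]])
# transform applies the learned mapping; assign the result back to the columns
df[["Sex","Blood", "Study"]] = enc.transform(df[["Sex","Blood", "Study"]])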

Or directly:

enc = OrdinalEncoder()
df[["Sex","Blood", "Study"]] = enc.fit_transform(df[["Sex","Blood", "Study"]])

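Note: the encoded values will not match the ones you listed, because fit internally uses numpy.unique, which returns the categories sorted alphabetically rather than in order of appearance.

You can see this in enc.categories_: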

[array(['F', 'M'], dtype=object),
 array(['A', 'AB', 'B', 'O'], dtype=object),
 array(['Biology', 'English', 'Math', 'Science'], dtype=object)]

Each value in an array is encoded as its position in that array (so F is encoded as 0 and M as 1).
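
If you need the exact codes from the question (M as 0, F as 1, and so on), one option is to pass the category order explicitly through the `categories` parameter instead of relying on the sorted default. A minimal sketch, assuming the small frame below stands in for your `df` and that order of first appearance is the order you want (the `order` and `cols` names are just for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in for the question's data (categorical columns only).
df = pd.DataFrame({
    "Sex":   ["M", "M", "M", "F", "M", "F", "F"],
    "Blood": ["O", "A", "A", "B", "O", "AB", "O"],
    "Study": ["Math", "Math", "English", "Biology", "Math", "English", "Science"],
})
cols = ["Sex", "Blood", "Study"]

# One list per column, in order of first appearance; list position = code.
order = [list(pd.unique(df[c])) for c in cols]

# dtype=int yields integer codes instead of the default float64.
enc = OrdinalEncoder(categories=order, dtype=int)
df[cols] = enc.fit_transform(df[cols])
print(df)
```

With this ordering, Blood is coded O=0, A=1, B=2, AB=3 and Study is coded Math=0, English=1, Biology=2, Science=3, which is the layout the question asks for.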

Answer 2

Cre*_*edd 11

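I think it is important to point out that this is not an example of ordinal encoding. None of Sex, Blood, or Study should be on an ordinal scale (and the asker did not suggest they should be). Ordinal data has a ranking (see e.g. https://en.wikipedia.org/wiki/Ordinal_data); the variables here have none.

If your variable is the target variable, you can use LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

Then you can do something like this: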

from sklearn.preprocessing import LabelEncoder

for col in ["Sex","Blood", "Study"]:
    df[col] = LabelEncoder().fit_transform(df[col])
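
For contrast, LabelEncoder works on a single 1-D array of labels, which is why it fits a target column. A small sketch, treating the Study values as a hypothetical target purely for illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column (values taken from the question's Study column).
y = ["Math", "Math", "English", "Biology", "Math", "English", "Science"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)           # 1-D in, 1-D out

print(le.classes_)                        # sorted labels; index = code
print(y_encoded)
print(le.inverse_transform(y_encoded))    # back to the original strings
```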

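If your variables are features, you should use OrdinalEncoder for this (see the comment on my answer).

OrdinalEncoder is rather unfortunately named: "ordinal" is meant in the mathematical sense, not the statistical one.

More on the difference between OrdinalEncoder and LabelEncoder in sklearn: https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder

You make a fair point, but it is also worth noting that, as far as I can tell, LabelEncoder does not work well in pipelines. From what I have gathered online, it is intended only for the target or response variable. So for what the OP wants to do, OrdinalEncoder is actually the recommended tool. (2 upvotes)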
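
To illustrate the pipeline point, here is a minimal sketch of OrdinalEncoder applied to selected feature columns through ColumnTransformer. The three-row frame and the step name "cat" are assumptions made up for the example, not part of the question:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

# Small made-up frame with two categorical features and one numeric one.
df = pd.DataFrame({
    "Sex":    ["M", "M", "F"],
    "Blood":  ["O", "A", "B"],
    "Height": [160, 192, 157],
})

# Encode the categorical columns and pass the remaining columns through.
ct = ColumnTransformer(
    [("cat", OrdinalEncoder(), ["Sex", "Blood"])],
    remainder="passthrough",
)
X = ct.fit_transform(df)   # encoded columns first, then the passthrough ones
print(X)
```

The resulting `ct` object can be used as the first step of a `sklearn.pipeline.Pipeline`, followed by an estimator.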
