LabelEncoder order of fit for a pandas df

Question

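I'm fitting a scikit-learn LabelEncoder on a column in a pandas df.

How is the order in which the encountered strings are mapped to integers determined? Is it deterministic?

More importantly, can I specify this order?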

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()
le.fit(df['x'])
print(le.classes_.tolist())
# this prints ['first', 'fourth', 'second', 'third']
encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
# this prints [0 2 3 1]


I would expect le.classes_ to be ["first", "second", "third", "fourth"], and encoded to be [0 1 2 3], since this is the order in which the strings appear in the column. Can this be done?

Answer 1

Mep*_*phy 7

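It's done in sorted order. In the case of strings, that means alphabetical order. There's no documentation for this, but looking at the source code for LabelEncoder.transform, we can see that the work is mostly delegated to the function numpy.setdiff1d, which has the following documentation:

Find the set difference of two arrays.

Return the sorted, unique values in ar1 that are not in ar2.

(The emphasis on "sorted" is mine.)

Note that since this is not documented, it is probably implementation-defined and can change between versions. It could be that only the version I looked at uses sorted order, and other versions of scikit-learn might change this behavior (by not using numpy.setdiff1d).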
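
As a rough illustration of why sorted unique values give the mapping seen in the question, here is a minimal sketch using numpy.unique. It is not taken from scikit-learn's source; it only assumes that the classes are the sorted unique labels and that each label is encoded as its position in that sorted array.

```python
import numpy as np

values = np.array(["first", "second", "third", "fourth"])

# np.unique returns the unique values in sorted order; with
# return_inverse=True it also returns, for each input element,
# its index in that sorted array.
classes, codes = np.unique(values, return_inverse=True)

print(classes.tolist())  # ['first', 'fourth', 'second', 'third']
print(codes.tolist())    # [0, 2, 3, 1]
```

The output matches what the question reports for le.classes_ and le.transform, which is consistent with alphabetical ordering of the labels.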

Answer 2

SaT*_*aTa 6

I was also a bit surprised that I could not provide an order to LabelEncoder. A one-line solution could look like this:

df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third', 'fourth'].index(x))
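
Two pandas-only alternatives can be sketched as well, as a suggestion rather than part of the original answer. The first assumes the desired order is known up front and uses pd.Categorical; the second uses pd.factorize, which numbers values in order of first appearance. The column names x_explicit and x_appearance are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"x": ["first", "second", "third", "fourth", "second"]})

# Option 1: specify the order explicitly. Each value is encoded as its
# position in `order`; values missing from `order` would get the code -1.
order = ["first", "second", "third", "fourth"]
df["x_explicit"] = pd.Categorical(df["x"], categories=order).codes

# Option 2: let the order of first appearance in the column decide.
codes, uniques = pd.factorize(df["x"])
df["x_appearance"] = codes

print(df)
print(list(uniques))
```

Unlike the list.index approach, both options avoid a linear search through the list for every row, which may matter for large columns.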

