Python sklearn - determining the encoding order of LabelEncoder

Question


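I want to control the labels that sklearn's LabelEncoder assigns (i.e. 0, 1, 2, 3, ...) so that they follow a specific order of the categorical variable's possible values (e.g. ['b', 'a', 'c', 'd']). LabelEncoder seems to assign labels in lexicographic order, as this example shows: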

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(['b', 'a', 'c', 'd'])
le.classes_
# array(['a', 'b', 'c', 'd'], dtype='<U1')
le.transform(['a', 'b'])
# array([0, 1])


How can I force the encoder to keep the order in which the values are first encountered in .fit (i.e. encode 'b' as 0, 'a' as 1, 'c' as 2 and 'd' as 3)?

Answer 1

Viv*_*mar 7

You cannot do that with the stock LabelEncoder.

LabelEncoder.fit() uses numpy.unique, which always returns its result sorted, as shown in the source:

def fit(...):
    y = column_or_1d(y, warn=True)
    self.classes_ = np.unique(y)
    return self

So if you want this behaviour, you need to override fit(). Something like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d

class MyLabelEncoder(LabelEncoder):

    def fit(self, y):
        y = column_or_1d(y, warn=True)
        self.classes_ = pd.Series(y).unique()
        return self

Then you can do this:

le = MyLabelEncoder()
le.fit(['b', 'a', 'c', 'd' ])
le.classes_
#Output:  array(['b', 'a', 'c', 'd'], dtype=object)

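If only the mapping itself is needed, and not the rest of the LabelEncoder API, the same first-seen ordering can be sketched with a plain dictionary. The class name FirstSeenEncoder below is hypothetical and is not part of sklearn; it only mimics the fit / transform / inverse_transform shape for illustration.

```python
class FirstSeenEncoder:
    """Hypothetical stand-alone encoder (not part of sklearn).

    Labels are numbered in the order they first appear in fit().
    """

    def fit(self, y):
        # dict.fromkeys keeps insertion order (Python 3.7+) and drops repeats.
        self.classes_ = list(dict.fromkeys(y))
        # Map each label to its position in classes_.
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, y):
        # Raises KeyError for labels that were not seen during fit().
        return [self._index[label] for label in y]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]


enc = FirstSeenEncoder().fit(['b', 'a', 'c', 'd'])
print(enc.classes_)                # ['b', 'a', 'c', 'd']
print(enc.transform(['a', 'b']))   # [1, 0]
```

This returns plain lists rather than numpy arrays, so it is a sketch of the idea rather than a drop-in replacement.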

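Here I use pandas.Series.unique() to get the unique classes, since it returns them in order of appearance rather than sorted. If you cannot use pandas for some reason, see this question on doing the same with numpy:

numpy unique without sort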
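For reference, one common numpy-only approach is sketched below. It is not necessarily the exact code from the linked question; the helper name unique_in_order is made up for this example. It relies on the return_index option of numpy.unique, which reports where each distinct value first occurs.

```python
import numpy as np


def unique_in_order(values):
    """Return the distinct values in order of first appearance.

    Hypothetical helper for illustration, not part of numpy or sklearn.
    """
    arr = np.asarray(values)
    # np.unique sorts its result, but return_index also gives the position
    # of the first occurrence of each distinct value.
    _, first_idx = np.unique(arr, return_index=True)
    # Sorting those positions restores the original encounter order.
    return arr[np.sort(first_idx)]


classes = unique_in_order(['b', 'a', 'c', 'd', 'a', 'b'])
print(classes)  # ['b' 'a' 'c' 'd']
```

The result could be assigned to self.classes_ in the overridden fit() in place of pd.Series(y).unique().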
