LabelEncoder: encoding missing values

Question

I am using a label encoder to convert categorical data into numeric values.

How does LabelEncoder handle missing values?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)

Output:

array([1, 2, 3, 0, 4, 1])

In the example above, the label encoder turned the NaN value into a category. How can I tell which category represents the missing values?

Answer 1

duk*_*ody 13

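Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you are using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source code, it calls numpy.unique on the data to be encoded, and that raises a TypeError when missing values are present. If you want to encode the missing values, first change their type to string: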

a[pd.isnull(a)]  = 'NaN'
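
Once the missing values have been replaced by a placeholder string, the fitted encoder itself can tell you which code stands for them. The following is a minimal sketch, assuming the placeholder 'NaN' does not already occur as a real category; the variable names are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same values as in the question, held in a 1-D Series
s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'], dtype=object)

# Replace missing values with a placeholder string (assumed not to be a real category)
filled = s.fillna('NaN')

le = LabelEncoder()
codes = le.fit_transform(filled)

# classes_ lists the categories in code order; transform() looks up a single code
print(le.classes_)
missing_code = le.transform(['NaN'])[0]
print(missing_code)
```

Here `le.classes_[missing_code]` is the placeholder, so any row whose code equals `missing_code` was originally missing.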

The model treats a missing value (nan) and the string "NaN" differently. One workaround is to apply LabelEncoder only to the non-missing values and leave the nan values untouched: df['col'] = df['col'].map(lambda x: le.transform([x])[0] if type(x) == str else x) (2 upvotes)

Answer 2

ulr*_*ich 8

You can also label-encode first and then use a mask to put the original missing values back into the data frame.

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN

original = df
mask = df.isnull()
       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)

    A   B   C
0   1.0 0   1.0
1   NaN 1   0.0
2   2.0 2   NaN

Answer 3

Ker*_*m T 6

Hi, here is a small hack I put together for my own work:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = a.iloc[:,0].apply(lambda x: le.transform([x])[0] if type(x) == str else x)

Answer 4

Nic*_*ivi 5

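Here is my solution, since I was not satisfied with the ones posted here. I needed a LabelEncoder that keeps my missing values as NaN so that I can use an Imputer afterwards. So I wrote my own LabelEncoder class. It works on DataFrames.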

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].to_numpy()
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed DataFrame
        return x

You can pass in a DataFrame, not just a one-dimensional Series. With col you choose which columns should be encoded.

I would like to get some feedback on this.
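
As written, the class above fills the DataFrame that is passed in, in place, and leaves the string 'NaN' where values were missing. For comparison, here is a minimal sketch of a variant that works on a copy and keeps real NaN values. The helper name `encode_keep_nan` and its return shape are assumptions made for illustration; they are not part of the original answer or of scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(df, cols):
    """Hypothetical helper: label-encode `cols`, leaving missing values as NaN.

    Returns a new DataFrame plus a dict holding the fitted encoder per column.
    """
    out = df.copy()  # do not modify the caller's DataFrame
    encoders = {}
    for col in cols:
        mask = df[col].notna()
        le = LabelEncoder()
        # Start from an all-NaN float column, then fill in codes for non-missing rows
        encoded = pd.Series(np.nan, index=df.index, dtype=float)
        encoded[mask] = le.fit_transform(df.loc[mask, col])
        out[col] = encoded
        encoders[col] = le
    return out, encoders

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9]})
encoded_df, encoders = encode_keep_nan(df, ['A'])
print(encoded_df)
```

The encoded column is float because NaN cannot be stored in a plain integer column; the fitted encoders are returned so that the codes can be mapped back with `inverse_transform` on the non-missing rows.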

Archived:	9 years, 8 months ago
Viewed:	21151 times
Last activity:	6 years, 2 months ago