LabelEncoder: encoding missing values

sau*_*wal 26 python pandas scikit-learn

I am using LabelEncoder to convert categorical data into numeric values.

How does LabelEncoder handle missing values?

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)

Output:

array([1, 2, 3, 0, 4, 1])

In the example above, LabelEncoder turned the NaN value into a category. How can I tell which category represents the missing value?

duk*_*ody 13

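Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you are using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source code, numpy.unique is applied to the data being encoded, and it raises a TypeError when it finds a missing value among the strings. If you want to encode missing values, first convert them to a string: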

a[pd.isnull(a)]  = 'NaN'
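
To answer the "which category is the missing value" part, here is a minimal sketch, assuming a reasonably recent scikit-learn and pandas. The placeholder string 'NaN' is an arbitrary choice for illustration; any string that does not collide with a real category would do. After fitting, the encoder's classes_ attribute lists the labels in sorted order, and the position of a label is its code.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'])

# Replace missing entries with a placeholder string (an arbitrary choice here)
filled = s.fillna('NaN')

le = LabelEncoder()
codes = le.fit_transform(filled.to_numpy(dtype=object))

# classes_ is sorted; the index of a label in classes_ is its integer code
missing_code = int(le.transform(['NaN'])[0])
print(list(le.classes_))
print(codes, missing_code)
```

With this data the placeholder sorts after 'D', so it receives the highest code. With other category names it could land anywhere, which is why looking it up through le.transform is safer than assuming a position.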

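  • A model will treat a missing value (nan) and the string "NaN" differently. One workaround is to apply the LabelEncoder only to the non-missing values and leave the nan values unchanged: df['col'] = df['col'].map(lambda x: le.transform([x])[0] if type(x) == str else x) (2 upvotes)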
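
A vectorized variant of the workaround in the comment above could look like the sketch below. It assumes a single column named 'col' (a hypothetical name) and uses a boolean mask instead of a per-element lambda. The result is a float column, because NaN cannot be stored in an integer NumPy column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'col': ['A', 'B', 'C', np.nan, 'D', 'A']})

# True where a real value is present, False where it is missing
present = df['col'].notna()
values = df.loc[present, 'col'].to_numpy(dtype=object)

# Fit on the non-missing values only, so NaN never becomes a class
le = LabelEncoder()
le.fit(values)

# Start from an all-NaN column and fill in codes only where a value exists
encoded = pd.Series(np.nan, index=df.index, dtype='float64')
encoded.loc[present] = le.transform(values)
df['col_encoded'] = encoded
print(df)
```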

ulr*_*ich 8

You can also encode everything and then use a mask to restore the missing values from the original data frame:

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN

original = df
mask = df.isnull()
       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)

    A   B   C
0   1.0 0   1.0
1   NaN 1   0.0
2   2.0 2   NaN

Ker*_*m T 6

Hi, I put together a small hack for my own work:

from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = a.iloc[:,0].apply(lambda x: le.transform([x])[0] if type(x) == str else x)

Nic*_*ivi 5

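Here is my solution, since I was not satisfied with the ones posted here. I needed a LabelEncoder that keeps my missing values as NaN so that I can use an Imputer afterwards, so I wrote my own LabelEncoder class. It works with DataFrames.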

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self,col):
        #List of column names in the DataFrame that should be encoded
        self.col = col
        #Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el]!='NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self,x,y=None):
        #Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            #Only transform the values that are not 'NaN'
            a = x[el][x[el]!='NaN']
            #Store an ndarray of the current column
            b = x[el].to_numpy(copy=True)
            #Replace the elements in the ndarray that are not 'NaN'
            #using the transformer
            b[b!='NaN'] = self.le_dic[el].transform(a)
            #Overwrite the column in the DataFrame
            x[el]=b
        #return the transformed DataFrame
        return x

You can pass in a DataFrame, not just a one-dimensional Series. Use col to select the columns that should be encoded.

I would appreciate some feedback on this.
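
As a rough illustration of the imputation step mentioned above, the sketch below assumes a recent scikit-learn, where the old Imputer class has been replaced by sklearn.impute.SimpleImputer. The encoded column is made-up example data, and 'most_frequent' is just one plausible strategy for label codes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical encoded column: label codes with missing entries kept as NaN
encoded = pd.DataFrame({'col': [0.0, 1.0, np.nan, 1.0, 2.0]})

# Replace each NaN with the most frequent code in its column
imputer = SimpleImputer(strategy='most_frequent')
filled = imputer.fit_transform(encoded)
print(filled.ravel())
```

Keeping the missing entries as real NaN values (rather than a placeholder string) is what lets the imputer recognize them without extra configuration.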