sau*_*wal 26 python pandas scikit-learn
I am using a label encoder to convert categorical data into numeric values.
How does LabelEncoder handle missing values?
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
le.fit_transform(a)
Output:
array([1, 2, 3, 0, 4, 1])
In the example above, the label encoder turned the NaN value into a category. How can I tell which category represents the missing values?
duk*_*ody 13
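Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you are using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().
As you can see in the source, it calls numpy.unique on the data to be encoded, which raises a TypeError if missing values are found. If you want to encode the missing values, first change their type to string: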
a[pd.isnull(a)] = 'NaN'
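Once the missing values have been replaced by a string, the fitted encoder's classes_ attribute shows which integer stands for that placeholder. Here is a minimal sketch; it uses a Series and fillna instead of the DataFrame assignment above, and the variable names are only for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same values as in the question, held in a Series (illustrative choice)
s = pd.Series(['A', 'B', 'C', np.nan, 'D', 'A'], dtype=object)

# Replace missing values with a placeholder string before encoding
s = s.fillna('NaN')

le = LabelEncoder()
codes = le.fit_transform(s)

# classes_ is sorted; the position of a class is its integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
nan_code = mapping['NaN']

print(mapping)
print(nan_code)
```

Because classes_ is sorted, the code assigned to the placeholder depends on how the string 'NaN' sorts relative to the other labels, so it is safer to look it up than to assume a fixed value.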
You can also use a mask to put the missing values from the original DataFrame back after encoding:
df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
A B C
0 x 1 2.0
1 NaN 6 1.0
2 z 9 NaN
original = df
mask = df.isnull()
A B C
0 False False False
1 True False False
2 False False True
df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)
A B C
0 1.0 0 1.0
1 NaN 1 0.0
2 2.0 2 NaN
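Note that in the output above the string 'nan' still takes up one code per column, even though it is masked out afterwards (column A ends up with codes 1 and 2, not 0 and 1). A variation, sketched here under my own assumptions and not part of the answer above, fits the encoder only on the non-null entries. The helper name encode_keep_nan is hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def encode_keep_nan(col):
    """Hypothetical helper: label-encode one Series, leaving NaN untouched."""
    mask = col.notna()
    # Start from an all-NaN float column so missing entries stay NaN
    out = pd.Series(np.nan, index=col.index, dtype=float)
    le = LabelEncoder()
    # Fit and transform only the entries that are actually present
    out.loc[mask] = le.fit_transform(col[mask].astype(str))
    return out, le

col = pd.Series(['x', np.nan, 'z'], dtype=object)
encoded, le = encode_keep_nan(col)
print(encoded)
print(le.classes_)
```

To encode a whole DataFrame this way, the helper could be called once per column, keeping the returned encoders if the codes need to be inverted later.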
Hi, here is a small hack I put together for my own work:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
a = pd.DataFrame(['A','B','C',np.nan,'D','A'])
le = LabelEncoder()
### fit with the desired col, col in position 0 for this example
fit_by = pd.Series([i for i in a.iloc[:,0].unique() if type(i) == str])
le.fit(fit_by)
### Set transformed col leaving np.NaN as they are
a["transformed"] = a.iloc[:,0].apply(lambda x: le.transform([x])[0] if type(x) == str else x)
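Here is my solution, since I was not satisfied with the ones posted here. I needed a LabelEncoder that keeps my missing values as NaN so that I can run an Imputer afterwards. So I wrote my own LabelEncoder class. It works on DataFrames.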
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelEncoder

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # List of column names in the DataFrame that should be encoded
        self.col = col
        # Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self, x, y=None):
        # Fill missing values with the string 'NaN' (this modifies x in place)
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the encoder
            a = x[el][x[el] != 'NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self, x, y=None):
        # Fill missing values with the string 'NaN' (this modifies x in place)
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only pass the values that are not 'NaN' to the encoder
            a = x[el][x[el] != 'NaN']
            # Store a writable ndarray copy of the current column
            b = x[el].to_numpy(copy=True)
            # Replace the elements in the ndarray that are not 'NaN'
            # using the fitted encoder
            b[b != 'NaN'] = self.le_dic[el].transform(a)
            # Overwrite the column in the DataFrame
            x[el] = b
        # Return the transformed DataFrame
        return x
You can pass in a DataFrame, not just a 1-dimensional Series. With col you choose which columns should be encoded.
I would welcome some feedback on this.