nig*_*bat 52 python pandas scikit-learn imputation
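I have pandas data in which some columns are of text type, and these text columns contain some NaN values. What I am trying to do is impute those NaNs with sklearn.preprocessing.Imputer (replacing NaN with the most frequent value). The problem is in the implementation. Suppose there is a pandas DataFrame df with 30 columns, 10 of which are categorical. Once I run: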
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
imp.fit(df)
Python raises an error: 'could not convert string to float: 'run1'', where 'run1' is an ordinary (non-missing) value from the first column with categorical data.
Any help would be very welcome.
sve*_*ser 84
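To use the mean for numeric columns and the most frequent value for non-numeric columns, you could do something like the following. You could further distinguish between integers and floats; I guess it might make sense to use the median for integer columns instead.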
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin

class DataFrameImputer(TransformerMixin):

    def __init__(self):
        """Impute missing values.

        Columns of dtype object are imputed with the most frequent value
        in column.

        Columns of other types are imputed with mean of column.
        """

    def fit(self, X, y=None):
        self.fill = pd.Series([X[c].value_counts().index[0]
            if X[c].dtype == np.dtype('O') else X[c].mean() for c in X],
            index=X.columns)
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)

data = [
    ['a', 1, 2],
    ['b', 1, 1],
    ['b', 2, 2],
    [np.nan, np.nan, np.nan]
]
X = pd.DataFrame(data)
xt = DataFrameImputer().fit_transform(X)
print('before...')
print(X)
print('after...')
print(xt)
which prints:
before...
     0   1   2
0    a   1   2
1    b   1   1
2    b   2   2
3  NaN NaN NaN
after...
   0         1         2
0  a  1.000000  2.000000
1  b  1.000000  1.000000
2  b  2.000000  2.000000
3  b  1.333333  1.666667
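The suggestion of treating integer and float columns differently could be sketched roughly as below. This is only an illustration under some assumptions: the helper name compute_fill_values and the column names are hypothetical, and the integer column uses pandas' nullable "Int64" dtype, because an ordinary integer column containing NaN is stored as float and would no longer be recognised as integer.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def compute_fill_values(df):
    """Pick one fill value per column, depending on the column dtype.

    Hypothetical helper: median for integer columns, mean for float
    columns, most frequent value for everything else.
    """
    fill = {}
    for name in df.columns:
        col = df[name]
        if is_integer_dtype(col):
            # Round so the fill value is still a valid integer.
            fill[name] = int(round(col.median()))
        elif is_float_dtype(col):
            fill[name] = col.mean()
        else:
            fill[name] = col.value_counts().index[0]
    return fill


df = pd.DataFrame({
    "label": ["a", "b", "b", np.nan],
    # Nullable integer dtype keeps the column integer despite the gap.
    "count": pd.array([1, 1, 2, None], dtype="Int64"),
    "score": [2.0, 1.0, 2.0, np.nan],
})

fill = compute_fill_values(df)
filled = df.fillna(fill)
print(filled)
```

The dictionary returned by the helper plays the same role as self.fill in the class above, so it could be computed in fit and applied in transform.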
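You can use sklearn_pandas.CategoricalImputer for the categorical columns. Details:
First (from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow), you can have subpipelines for numerical and string/categorical features, where the first transformer of each subpipeline is a selector that takes a list of column names (and full_pipeline.fit_transform() takes a pandas DataFrame):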
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Return the selected columns as a NumPy array.
        return X[self.attribute_names].values
You can then combine these subpipelines with sklearn.pipeline.FeatureUnion, for example:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline)
])
Now, in num_pipeline you can simply use sklearn.preprocessing.Imputer(), but in cat_pipeline you can use CategoricalImputer() from the sklearn_pandas package.
Note: the sklearn-pandas package can be installed with pip install sklearn-pandas, but it is imported as import sklearn_pandas.
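Since num_pipeline and cat_pipeline are not spelled out above, here is one way they might be put together. Treat it as a sketch under assumptions: the column names "age" and "run" are made up, and sklearn.impute.SimpleImputer (available in scikit-learn 0.20 and later) is swapped in for both Imputer and CategoricalImputer so that the example does not depend on sklearn_pandas or on the older Imputer class.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline


class DataFrameSelector(TransformerMixin, BaseEstimator):
    """Select a list of DataFrame columns and return them as an array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


# Hypothetical column names, for illustration only.
num_attribs = ["age"]
cat_attribs = ["run"]

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_attribs)),
    ("imputer", SimpleImputer(strategy="mean")),
])
cat_pipeline = Pipeline([
    ("selector", DataFrameSelector(cat_attribs)),
    # Assumption: SimpleImputer stands in for CategoricalImputer here.
    ("imputer", SimpleImputer(strategy="most_frequent")),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

df = pd.DataFrame({
    "age": [1.0, 2.0, np.nan, 3.0],
    "run": pd.Series(["run1", "run2", "run2", np.nan], dtype=object),
})

# The union stacks the numeric and categorical results side by side.
result = full_pipeline.fit_transform(df)
print(result)
```

Because the two subpipelines produce a float array and an object array, the stacked result is an object array; a later encoding step for the categorical part would normally go inside cat_pipeline.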
There is a package, sklearn-pandas, that has an option for imputing categorical variables:
https://github.com/scikit-learn-contrib/sklearn-pandas#categoricalimputer
>>> import numpy as np
>>> from sklearn_pandas import CategoricalImputer
>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
>>> imputer = CategoricalImputer()
>>> imputer.fit_transform(data)
array(['a', 'b', 'b', 'b'], dtype=object)
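With sklearn.preprocessing.Imputer, strategy='most_frequent' can be used only with quantitative features, not with qualitative ones. The custom imputer below can be used for both qualitative and quantitative features. Also, the scikit-learn Imputer can be applied either to the whole data frame (if all features are quantitative) or in a for loop over a list of features/columns of similar type (see the example below), whereas the custom imputer can be used with any combination.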
from sklearn.preprocessing import Imputer
impute = Imputer(strategy='mean')
for cols in ['quantitative_column', 'quant']:  # here both are quantitative features.
    xx[cols] = impute.fit_transform(xx[[cols]])
Custom imputer:
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.base import TransformerMixin

class CustomImputer(TransformerMixin):
    def __init__(self, cols=None, strategy='mean'):
        self.cols = cols
        self.strategy = strategy

    def transform(self, df):
        X = df.copy()
        impute = Imputer(strategy=self.strategy)
        if self.cols is None:
            self.cols = list(X.columns)
        for col in self.cols:
            if X[col].dtype == np.dtype('O'):
                # Object columns: fill with the most frequent value.
                X[col] = X[col].fillna(X[col].value_counts().index[0])
            else:
                # Other columns: impute with the chosen strategy.
                X[col] = impute.fit_transform(X[[col]])
        return X

    def fit(self, *_):
        return self
Data frame:
X = pd.DataFrame({'city': ['tokyo', np.nan, 'london', 'seattle', 'san francisco', 'tokyo'],
                  'boolean': ['yes', 'no', np.nan, 'no', 'no', 'yes'],
                  'ordinal_column': ['somewhat like', 'like', 'somewhat like', 'like',
                                     'somewhat like', 'dislike'],
                  'quantitative_column': [1, 11, -.5, 10, np.nan, 20]})
            city boolean ordinal_column  quantitative_column
0          tokyo     yes  somewhat like                  1.0
1            NaN      no           like                 11.0
2         london     NaN  somewhat like                 -0.5
3        seattle      no           like                 10.0
4  san francisco      no  somewhat like                  NaN
5          tokyo     yes        dislike                 20.0
1) It can be used with a list of features of similar type.
cci = CustomImputer(cols=['city', 'boolean']) # here default strategy = mean
cci.fit_transform(X)
2) It can be used with strategy = 'median'.
sd = CustomImputer(['quantitative_column'], strategy = 'median')
sd.fit_transform(X)
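3) It can be used with the whole data frame, in which case it uses the default mean (or we can change it to the median). For qualitative features it uses strategy = 'most_frequent', and for quantitative features the mean/median.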
call = CustomImputer()
call.fit_transform(X)
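One thing to be aware of is that the custom imputer above computes its statistics inside transform, so new data would be filled with values derived from that new data. If the fill values should instead come from the training data, a variant along the following lines might work. It is only a sketch: the class name FittedCustomImputer and the toy frames are hypothetical, and it relies on plain pandas rather than on the scikit-learn Imputer.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin


class FittedCustomImputer(TransformerMixin, BaseEstimator):
    """Hypothetical variant that learns fill values in fit()."""

    def __init__(self, cols=None, strategy="mean"):
        self.cols = cols
        self.strategy = strategy

    def fit(self, df, y=None):
        cols = list(df.columns) if self.cols is None else list(self.cols)
        self.fill_ = {}
        for col in cols:
            if is_numeric_dtype(df[col]):
                # Quantitative column: mean or median of the training data.
                if self.strategy == "median":
                    self.fill_[col] = df[col].median()
                else:
                    self.fill_[col] = df[col].mean()
            else:
                # Qualitative column: most frequent training value.
                self.fill_[col] = df[col].value_counts().index[0]
        return self

    def transform(self, df):
        # Only the columns seen in fit() are filled.
        return df.fillna(self.fill_)


train = pd.DataFrame({
    "city": ["tokyo", np.nan, "london", "tokyo"],
    "quantitative_column": [1.0, 3.0, np.nan, 5.0],
})
new_data = pd.DataFrame({
    "city": ["paris", np.nan],
    "quantitative_column": [2.0, np.nan],
})

imputer = FittedCustomImputer().fit(train)
out = imputer.transform(new_data)
print(out)
```

Here the missing city in new_data is filled with 'tokyo' and the missing number with 3.0, both taken from train rather than from new_data.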