Vir*_*mar 67 scikit-learn data-science
I'm new to machine learning and have been working with unsupervised learning techniques.
The image shows my sample data after all cleaning (screenshot: sample data).
I have two pipelines for cleaning the data:

    num_attribs = list(housing_num)
    cat_attribs = ["ocean_proximity"]

    print(type(num_attribs))

    num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizer())
    ])

Then I combined the two pipelines with a FeatureUnion; the code is shown below:

    from sklearn.pipeline import FeatureUnion

    full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

Now I am trying to call fit_transform on the data, but it raises an error.
Transformation code:

    housing_prepared = full_pipeline.fit_transform(housing)
    housing_prepared

Error message: fit_transform() takes 2 positional arguments but 3 were given
Zai*_* E. 62
The problem:
The pipeline assumes that LabelBinarizer's fit_transform method is defined to take three positional arguments:

    def fit_transform(self, x, y):
        ...  # rest of the code

whereas it is defined to take only two:

    def fit_transform(self, x):
        ...  # rest of the code

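The mismatch can be checked directly by inspecting the two signatures and by running a one-step pipeline. This is a minimal sketch, assuming scikit-learn 0.19 or later; the toy labels and the variable names are made up for illustration and are not part of the question's code.

```python
import inspect

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, StandardScaler

# Parameter names accepted by each fit_transform.
binarizer_params = list(inspect.signature(LabelBinarizer.fit_transform).parameters)
scaler_params = list(inspect.signature(StandardScaler.fit_transform).parameters)
print(binarizer_params)
print(scaler_params)

# A Pipeline forwards both X and y to the last step's fit_transform,
# which LabelBinarizer does not accept.
toy_labels = ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]  # made-up data
try:
    Pipeline([("label_binarizer", LabelBinarizer())]).fit_transform(toy_labels)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

The ordinary transformer accepts an X and an optional y, while LabelBinarizer accepts a single y, so the extra positional argument passed by the pipeline triggers the TypeError.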
Possible solution
This can be solved by writing a custom transformer that can handle three positional arguments.
Import and create a new class:

    from sklearn.base import TransformerMixin  # gives fit_transform method for free
    from sklearn.preprocessing import LabelBinarizer

    class MyLabelBinarizer(TransformerMixin):
        def __init__(self, *args, **kwargs):
            self.encoder = LabelBinarizer(*args, **kwargs)

        def fit(self, x, y=0):
            self.encoder.fit(x)
            return self

        def transform(self, x, y=0):
            return self.encoder.transform(x)

Keep your code the same, except that instead of LabelBinarizer() you use the class we created: MyLabelBinarizer().
To expose the LabelBinarizer attributes (such as classes_) on the wrapper, add the following line to the fit method, after the call to self.encoder.fit(x):

    self.classes_, self.y_type_, self.sparse_input_ = self.encoder.classes_, self.encoder.y_type_, self.encoder.sparse_input_

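For illustration, here is a self-contained variant of the same wrapper idea run on made-up data. The class name PipelineLabelBinarizer, the toy array, and the choice to inherit from BaseEstimator and to flatten a single-column input are assumptions of this sketch, not part of the answer above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class PipelineLabelBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper whose fit() accepts the y that a Pipeline passes."""

    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output

    def fit(self, X, y=None):
        # y is accepted and ignored; the categories are learned from X.
        self.encoder_ = LabelBinarizer(sparse_output=self.sparse_output)
        self.encoder_.fit(np.asarray(X).ravel())
        self.classes_ = self.encoder_.classes_
        return self

    def transform(self, X):
        return self.encoder_.transform(np.asarray(X).ravel())


# Made-up single categorical column, shaped like a selector's output.
toy = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]])
pipeline = Pipeline([("binarizer", PipelineLabelBinarizer())])
encoded = pipeline.fit_transform(toy)
learned_classes = list(pipeline.named_steps["binarizer"].classes_)
print(learned_classes)
print(encoded)
```

Because fit_transform comes from TransformerMixin and calls fit(X, y) followed by transform(X), the wrapper tolerates the extra argument without any change to the pipeline.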
Ste*_*ley 55
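I believe your example comes from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. Unfortunately, I ran into this problem as well. A recent change in scikit-learn (0.19.0) altered LabelBinarizer's fit_transform method. LabelBinarizer was never intended to be used the way that example uses it. You can see information about the change here and here.
Until they come up with a solution for this, you can install the previous version (0.18.0) as follows:

    $ pip install scikit-learn==0.18.0

After running that, your code should run without issue.
In the future, it looks like the correct solution may be to use a CategoricalEncoder class or something similar. They have apparently been trying to solve this problem for years. You can see the new class here, and further discussion of the problem here.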
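To see whether an installation falls on the affected side of that change, the installed version can be checked from Python. This is a rough sketch: the 0.19 threshold is taken from the answer above, and the variable names are made up.

```python
import re

import sklearn

# Pull the major and minor numbers out of a version string such as "0.19.0".
match = re.match(r"(\d+)\.(\d+)", sklearn.__version__)
major, minor = int(match.group(1)), int(match.group(2))

# Per the answer above, the fit_transform behaviour changed in 0.19.0.
affected = (major, minor) >= (0, 19)
print(sklearn.__version__, "affected:", affected)
```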
Since LabelBinarizer does not accept more than 2 positional arguments, you should create a custom binarizer:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import LabelBinarizer

    class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
        def __init__(self, sparse_output=False):
            self.sparse_output = sparse_output

        def fit(self, X, y=None):
            # Fit here so that transform() reuses the classes learned
            # from the training data instead of refitting on new data.
            self.encoder_ = LabelBinarizer(sparse_output=self.sparse_output)
            self.encoder_.fit(X)
            return self

        def transform(self, X, y=None):
            return self.encoder_.transform(X)

    num_attribs = list(housing_num)
    cat_attribs = ['ocean_proximity']

    num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy='median')),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scalar', StandardScaler())
    ])

    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', CustomLabelBinarizer())
    ])

    full_pipeline = FeatureUnion(transformer_list=[
        ('num_pipeline', num_pipeline),
        ('cat_pipeline', cat_pipeline)
    ])

    housing_prepared = full_pipeline.fit_transform(new_housing)

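One point worth checking with any wrapper of this kind is where the encoder learns its classes. The sketch below, on made-up train and test labels, compares an encoder fitted once on the training data with one refitted on the data being transformed; the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up data: the test split happens to contain only two of the categories.
train = np.array(["INLAND", "NEAR BAY", "ISLAND", "INLAND"])
test = np.array(["INLAND", "ISLAND"])

# Fitted once on the training data: the test rows keep the training layout.
encoder = LabelBinarizer().fit(train)
consistent = encoder.transform(test)

# Refitted on the test data: the column layout depends on what the split contains.
refit = LabelBinarizer().fit_transform(test)

print(consistent.shape, refit.shape)
```

With a refit, the number and meaning of the output columns can differ between the training and test sets, so a wrapper that learns its classes in fit() and only applies them in transform() is the safer design.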
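I ran into the same problem and got it working by applying the workaround given in the book's GitHub repo.
Warning: earlier versions of the book used the LabelBinarizer class at this point. Again, this was incorrect: just like the LabelEncoder class, the LabelBinarizer class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming CategoricalEncoder class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from Pull Request #9151).
To save you some grepping, here is the workaround; just paste it and run it in a cell before the one that uses it: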

    # Definition of the CategoricalEncoder class, copied from PR #9151.
    # Just run this cell, or copy it to your code, do not try to understand it (yet).
    # The built-in object/int/bool replace the np.object/np.int/np.bool aliases,
    # which newer NumPy releases no longer provide.

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils import check_array
    from sklearn.preprocessing import LabelEncoder
    from scipy import sparse

    class CategoricalEncoder(BaseEstimator, TransformerMixin):
        def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                     handle_unknown='error'):
            self.encoding = encoding
            self.categories = categories
            self.dtype = dtype
            self.handle_unknown = handle_unknown

        def fit(self, X, y=None):
            """Fit the CategoricalEncoder to X.
            Parameters
            ----------
            X : array-like, shape [n_samples, n_features]
                The data to determine the categories of each feature.
            Returns
            -------
            self
            """
            if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
                template = ("encoding should be either 'onehot', 'onehot-dense' "
                            "or 'ordinal', got %s")
                # Report the offending encoding value (not handle_unknown).
                raise ValueError(template % self.encoding)

            if self.handle_unknown not in ['error', 'ignore']:
                template = ("handle_unknown should be either 'error' or "
                            "'ignore', got %s")
                raise ValueError(template % self.handle_unknown)

            if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
                raise ValueError("handle_unknown='ignore' is not supported for"
                                 " encoding='ordinal'")

            X = check_array(X, dtype=object, accept_sparse='csc', copy=True)
            n_samples, n_features = X.shape

            self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

            for i in range(n_features):
                le = self._label_encoders_[i]
                Xi = X[:, i]
                if self.categories == 'auto':
                    le.fit(Xi)
                else:
                    valid_mask = np.in1d(Xi, self.categories[i])
                    if not np.all(valid_mask):
                        if self.handle_unknown == 'error':
                            diff = np.unique(Xi[~valid_mask])
                            msg = ("Found unknown categories {0} in column {1}"
                                   " during fit".format(diff, i))
                            raise ValueError(msg)
                    le.classes_ = np.array(np.sort(self.categories[i]))

            self.categories_ = [le.classes_ for le in self._label_encoders_]

            return self

        def transform(self, X):
            """Transform X using one-hot encoding.
            Parameters
            ----------
            X : array-like, shape [n_samples, n_features]
                The data to encode.
            Returns
            -------
            X_out : sparse matrix or a 2-d array
                Transformed input.
            """
            X = check_array(X, accept_sparse='csc', dtype=object, copy=True)
            n_samples, n_features = X.shape
            X_int = np.zeros_like(X, dtype=int)
            X_mask = np.ones_like(X, dtype=bool)

            for i in range(n_features):
                valid_mask = np.in1d(X[:, i], self.categories_[i])

                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(X[~valid_mask, i])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during transform".format(diff, i))
                        raise ValueError(msg)
                    else:
                        # Set the problematic rows to an acceptable value and
                        # continue. The rows are marked in `X_mask` and will be
                        # removed later.
                        X_mask[:, i] = valid_mask
                        X[:, i][~valid_mask] = self.categories_[i][0]
                X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

            if self.encoding == 'ordinal':
                return X_int.astype(self.dtype, copy=False)

            mask = X_mask.ravel()
            n_values = [cats.shape[0] for cats in self.categories_]
            n_values = np.array([0] + n_values)
            indices = np.cumsum(n_values)

            column_indices = (X_int + indices[:-1]).ravel()[mask]
            row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                    n_features)[mask]
            data = np.ones(n_samples * n_features)[mask]

            out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                    shape=(n_samples, indices[-1]),
                                    dtype=self.dtype).tocsr()
            if self.encoding == 'onehot-dense':
                return out.toarray()
            else:
                return out

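For readers on a newer installation: as far as I know, the CategoricalEncoder name never appeared in a stable scikit-learn release. Its one-hot and ordinal modes were released in 0.20 as OneHotEncoder (which then accepted string categories) and the new OrdinalEncoder. The sketch below assumes scikit-learn 0.20 or later and uses made-up data; it is an approximate equivalent, not the code from the pull request.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up single categorical column.
cats = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]], dtype=object)

# Roughly what encoding='ordinal' produced: one integer code per category.
ordinal = OrdinalEncoder().fit_transform(cats)

# Roughly what encoding='onehot' produced: a sparse one-hot matrix.
onehot = OneHotEncoder().fit_transform(cats)
dense = onehot.toarray()

print(ordinal.ravel())
print(dense)
```

Both encoders take X with an optional y in fit_transform, so they can be used inside a Pipeline without a wrapper.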
小智 6
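I think you are working through the examples in the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. I ran into the same problem while going through the example in Chapter 2.
As others have mentioned, the problem lies with sklearn's LabelBinarizer. Its fit_transform method takes fewer arguments than those of the other transformers in the pipeline (only y, whereas other transformers normally take both X and y; see here for details). That is why, when we run pipeline.fit_transform, we feed more arguments into this transformer than it accepts.
A simple fix I used is to use OneHotEncoder instead and set "sparse" to False, so that the output is a NumPy array just like the num_pipeline output. (This way you don't need to write your own custom encoder.)
Your original cat_pipeline:

    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizer())
    ])

You can simply change this part to:

    cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('one_hot_encoder', OneHotEncoder(sparse=False))
    ])

You can carry on from here, and everything should work as expected.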
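A note for newer installations: recent scikit-learn releases renamed the `sparse` argument of OneHotEncoder to `sparse_output`. The sketch below picks whichever name the installed version accepts. The toy column and the names dense_kwarg and cat_pipeline_demo are made up, the selector step is left out, and string categories are assumed to be supported (scikit-learn 0.20 or later).

```python
import inspect

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Use the keyword that this scikit-learn version understands.
params = inspect.signature(OneHotEncoder.__init__).parameters
dense_kwarg = {"sparse_output": False} if "sparse_output" in params else {"sparse": False}

cat_pipeline_demo = Pipeline([
    ("one_hot_encoder", OneHotEncoder(**dense_kwarg)),
])

# Made-up single categorical column, shaped like a selector's output.
toy = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"]], dtype=object)
dense = cat_pipeline_demo.fit_transform(toy)
print(type(dense), dense.shape)
```

With the dense option set, the result is a plain NumPy array that can be stacked next to the numeric pipeline's output.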
小智 5
Put simply, what you can do is define the following class just before your pipeline:

    from sklearn.preprocessing import LabelBinarizer

    class NewLabelBinarizer(LabelBinarizer):
        def fit(self, X, y=None):
            return super(NewLabelBinarizer, self).fit(X)

        def transform(self, X, y=None):
            return super(NewLabelBinarizer, self).transform(X)

        def fit_transform(self, X, y=None):
            return super(NewLabelBinarizer, self).fit(X).transform(X)

Then the rest of the code is as given in the book, with one small modification to cat_pipeline before the pipelines are joined, as follows:

    cat_pipeline = Pipeline([
        ("selector", DataFrameSelector(cat_attribs)),
        ("label_binarizer", NewLabelBinarizer())])

You're done!
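As a quick sanity check of the subclass approach, the sketch below runs such a subclass as the only step of a Pipeline on made-up labels and compares the result with calling LabelBinarizer directly. The toy data and variable names are assumptions for illustration, and the selector step is left out.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer


class NewLabelBinarizer(LabelBinarizer):
    """LabelBinarizer whose methods tolerate the extra y a Pipeline passes."""

    def fit(self, X, y=None):
        return super().fit(X)

    def transform(self, X, y=None):
        return super().transform(X)

    def fit_transform(self, X, y=None):
        return super().fit(X).transform(X)


labels = np.array(["INLAND", "NEAR BAY", "INLAND", "ISLAND"])  # made-up data

pipeline_output = Pipeline([
    ("label_binarizer", NewLabelBinarizer()),
]).fit_transform(labels)
direct_output = LabelBinarizer().fit_transform(labels)

print(np.array_equal(pipeline_output, direct_output))
```

The subclass changes only the method signatures, so the encoded output is the same as what LabelBinarizer produces on its own.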