cls*_*udt 5 python machine-learning scikit-learn
I am working through an example that preprocesses the well-known Titanic dataset with ColumnTransformer and LabelEncoder. The data X looks like this:
    Age Embarked     Fare     Sex
0  22.0        S   7.2500    male
1  38.0        C  71.2833  female
2  26.0        S   7.9250  female
3  35.0        S  53.1000  female
4  35.0        S   8.0500    male
Calling the transformer like this:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder

ColumnTransformer(
    transformers=[
        ("label-encode categorical", LabelEncoder(), ["Sex", "Embarked"])
    ]
).fit(X).transform(X)
results in:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-54-fd5a05b7e47e> in <module>
4 ("label-encode categorical", LabelEncoder(), ["Sex", "Embarked"])
5 ]
----> 6 ).fit(X).transform(X)
~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit(self, X, y)
418 # we use fit_transform to make sure to set sparse_output_ (for which we
419 # need the transformed data) to have consistent output type in predict
--> 420 self.fit_transform(X, y=y)
421 return self
422
~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in fit_transform(self, X, y)
447 self._validate_remainder(X)
448
--> 449 result = self._fit_transform(X, y, _fit_transform_one)
450
451 if not result:
~/anaconda3/lib/python3.7/site-packages/sklearn/compose/_column_transformer.py in _fit_transform(self, X, y, func, fitted)
391 _get_column(X, column), y, weight)
392 for _, trans, column, weight in self._iter(
--> 393 fitted=fitted, replace_strings=True))
394 except ValueError as e:
395 if "Expected 2D array, got 1D array instead" in str(e):
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
--> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.7/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, **fit_params)
612 def _fit_transform_one(transformer, X, y, weight, **fit_params):
613 if hasattr(transformer, 'fit_transform'):
--> 614 res = transformer.fit_transform(X, y, **fit_params)
615 else:
616 res = transformer.fit(X, y, **fit_params).transform(X)
TypeError: fit_transform() takes 2 positional arguments but 3 were given
What is the problem with **fit_params? To me this looks like a bug in sklearn, or at least an incompatibility.
Ven*_*lam 19
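There are two main reasons why this does not work for your purpose.
LabelEncoder() is designed for the target variable (y). That is why you get the positional-argument error when ColumnTransformer() tries to feed it X, y=None, fit_params={}. From the documentation:
Encode labels with value between 0 and n_classes-1.
fit(y)
Fit label encoder. Parameters:
y : array-like of shape (n_samples,)
Target values.
LabelEncoder() cannot take a 2D array (that is, several features at once), because it expects only 1D y values. Short answer: we should not use LabelEncoder() for input features.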
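The 1D restriction can be seen in a small sketch. The sample values below are taken from the question's Sex and Embarked columns; the variable names are only illustrative, and the exact wording of the error message depends on the installed scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# A 1D array of labels: the kind of input LabelEncoder is documented for.
y = np.array(["male", "female", "female", "female", "male"])

le = LabelEncoder()
encoded = le.fit_transform(y)
print(le.classes_)  # classes are stored in sorted order
print(encoded)      # integer code for each label

# A 2D array with two feature columns is rejected.
X2d = np.array([["male", "S"], ["female", "C"]])
raised = False
try:
    LabelEncoder().fit(X2d)
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

The encoder works on the single label column but refuses the two-column input, which is why it cannot be dropped into a ColumnTransformer that selects several features.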
So what is the solution for encoding input features?
Use OrdinalEncoder() if your features are ordinal, or OneHotEncoder() if they are nominal.
Example:
>>> import numpy as np
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
>>> # A mixed numeric/string list becomes a NumPy array of strings
>>> X = np.array([[1000., 100., 'apple', 'green'],
...               [1100., 100., 'orange', 'blue']])
>>> ct = ColumnTransformer(
...     [("ordinal", OrdinalEncoder(), [0, 1]),
...      ("nominal", OneHotEncoder(), [2, 3])])
>>> ct.fit_transform(X)
array([[0., 0., 1., 0., 0., 1.],
       [1., 0., 0., 1., 1., 0.]])
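Applied to the data from the question, the same idea might look like the sketch below. It rebuilds the five sample rows as a pandas DataFrame and treats Sex and Embarked as nominal features. The transformer name "onehot" and the choices of remainder="passthrough" and sparse_threshold=0 are assumptions made for illustration, not part of the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The five sample rows shown in the question.
X = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Embarked": ["S", "C", "S", "S", "S"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Sex": ["male", "female", "female", "female", "male"],
})

ct = ColumnTransformer(
    transformers=[
        # One-hot encode the two categorical columns.
        ("onehot", OneHotEncoder(), ["Sex", "Embarked"]),
    ],
    remainder="passthrough",  # keep Age and Fare unchanged
    sparse_threshold=0,       # always return a dense array
)

Xt = ct.fit_transform(X)
print(Xt.shape)
print(Xt[0])
```

The encoded columns come first (Sex, then Embarked), followed by the passed-through Age and Fare columns.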
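I believe this is actually an issue with LabelEncoder. The LabelEncoder.fit method accepts only self and y as arguments (which is odd, since most transformer objects follow the pattern fit(X, y=None, **fit_params)). In a pipeline, however, the transformer is always called as fit_transform(X, y, **fit_params), whatever you pass in. In this particular case LabelEncoder.fit_transform receives X and y=None as positional arguments, plus an empty fit_params dictionary that unpacks to nothing. Together with self that makes three positional arguments where only two are accepted, hence the error.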
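The signature mismatch can be checked with the standard-library inspect module. This is a minimal sketch: the second part imitates the positional call shown in the traceback (transformer.fit_transform(X, y, **fit_params)) with y=None, and the exact error text may differ between Python and scikit-learn versions.

```python
import inspect
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Compare the parameter names of the two fit_transform methods.
label_params = list(inspect.signature(LabelEncoder.fit_transform).parameters)
ordinal_params = list(inspect.signature(OrdinalEncoder.fit_transform).parameters)
print("LabelEncoder.fit_transform:  ", label_params)
print("OrdinalEncoder.fit_transform:", ordinal_params)

# Imitate the call made inside the pipeline: data first, then y=None.
got_type_error = False
try:
    LabelEncoder().fit_transform(["male", "female"], None)
except TypeError as exc:
    got_type_error = True
    print("TypeError:", exc)
```

OrdinalEncoder accepts both X and y, so the pipeline's call succeeds, while LabelEncoder has no slot for the second positional argument.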
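From my point of view this is a bug in LabelEncoder, but you should take it up with the sklearn developers, since they may have a reason for implementing the fit method differently.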