scikit-learn pipeline: grid search over transformer parameters to generate data

Question

Mil*_*ell 5 python scikit-learn cross-validation grid-search

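I want to use the first step of a scikit-learn pipeline to generate a toy dataset so that I can evaluate the performance of my analysis. A simple example solution I came up with looks like this: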

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV  # removed in scikit-learn 0.20; newer releases use sklearn.model_selection
from sklearn.base import TransformerMixin
from sklearn import cluster

class FeatureGenerator(TransformerMixin):

    def __init__(self, num_features=None):
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, **transform_params):
        return np.array(
            range(self.num_features*self.num_features)
        ).reshape(self.num_features,
                  self.num_features)

    def get_params(self, deep=True):
        return {"num_features": self.num_features}

    def set_params(self, **parameters):
        self.num_features = parameters["num_features"]
        return self

For example, this transformer can be run like this:

pipeline = Pipeline([
    ('pick_features', FeatureGenerator(100)),
    ('kmeans', cluster.KMeans())
])

pipeline = pipeline.fit(None)
classes = pipeline.predict(None)
print(classes)

It gets tricky when I try to run a grid search over this pipeline:

parameter_sets = {
    'pick_features__num_features' : [10,20,30],
    'kmeans__n_clusters' : [2,3,4]
}

pipeline = Pipeline([
    ('pick_features', FeatureGenerator()),
    ('kmeans', cluster.KMeans())
])

g_search_estimator = GridSearchCV(pipeline, parameter_sets)

g_search_estimator.fit(None,None)

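The grid search expects samples and labels as input and is not as robust as the pipeline, which does not complain about None as an input argument: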

TypeError: Expected sequence or array-like, got <type 'NoneType'>

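This makes sense, because the grid search needs to split the dataset into different CV partitions.

Beyond the example above, I have many parameters that can be tuned in the dataset-generation step. I therefore need a solution that includes this step in my parameter-selection cross-validation.

Question: Is there a way to set the Xs and ys of the GridSearch from inside the first transformer? Or, what would a solution look like that calls GridSearch with multiple different datasets (preferably in parallel)? Or has anyone tried to customize GridSearchCV, or can anyone point to some reading material on this?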

Answer 1

ldi*_*rer 1

Your code is very clean, so it is a pleasure to offer you this quick and dirty solution:

g_search_estimator.fit([1., 1., 1.],[1., 0., 0.])
g_search_estimator.best_params_

Output:

[tons of int64 to float64 conversion warnings]
{'kmeans__n_clusters': 4, 'pick_features__num_features': 10}

Note that you need 3 samples because you are doing the (default) 3-fold cross-validation.
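
The link between the number of placeholder samples and the number of folds can be checked in isolation. The following is a minimal sketch, assuming a scikit-learn release that ships `sklearn.model_selection`; `dummy_X` is a hypothetical placeholder array, not something from the original post. Releases from 0.22 onward default to 5-fold cross-validation, so passing `cv=3` to GridSearchCV explicitly is the safer choice there.

```python
import numpy as np
from sklearn.model_selection import KFold

dummy_X = np.ones((3, 1))  # three placeholder samples (hypothetical)

# Three folds over three samples: each test split holds exactly one sample.
splits = list(KFold(n_splits=3).split(dummy_X))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)

# Asking for more folds than there are samples is rejected.
too_many_folds_rejected = False
try:
    list(KFold(n_splits=5).split(dummy_X))
except ValueError as exc:
    too_many_folds_rejected = True
    print("ValueError:", exc)
```

So the input only has to be long enough to be split; its values never reach KMeans when the first step replaces them.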

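The error you get is raised by a check that the GridSearchCV object performs, so it happens before your transformer has a chance to do anything. So I would say "no" to your first question:

Is there a way to set the Xs and ys of the GridSearch from inside the first transformer?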

EDIT:
I realize this was unnecessarily confusing; the following three lines are equivalent:

g_search_estimator.fit([1., 1., 1.], [1., 0., 0.])
g_search_estimator.fit([1., 1., 1.], None)
g_search_estimator.fit([1., 1., 1.])

Sorry for hastily throwing random ys in there.

Some explanation of how the grid search computes scores for the different grid points: when you pass scoring=None to the GridSearchCV constructor (this is the default, so that's what you have here), it asks the estimator for a score function. If there is such a function, it is used for scoring. For KMeans the default score function is essentially the negative of the sum of squared distances to the cluster centers.
This is an unsupervised metric, so y is not necessary here.
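
To make the default KMeans score concrete, here is a small sketch that recomputes it by hand. The random data and the variable names (`sq_dist`, `manual_score`) are illustrative assumptions; only `KMeans.score` and `cluster_centers_` come from scikit-learn itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))  # arbitrary toy data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every sample to each fitted center,
# then keep the nearest center per sample and sum up.
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_score = -sq_dist.min(axis=1).sum()

print(km.score(X), manual_score)
```

Because the score is a negated cost, larger values (closer to zero) are better, which is what GridSearchCV maximizes.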

Wrapping it up, you will always be able to:

set the Xs of the GridSearch from inside the first transformer

Just 'transform' the input X into something totally unrelated, no one will complain about it. You do need some input random_X though.
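
Put together, the idea might look like the sketch below. It assumes a recent scikit-learn (`sklearn.model_selection`); `ToyDataGenerator` and `dummy_X` are hypothetical names, and inheriting from `BaseEstimator` is a substitution for the hand-written `get_params`/`set_params` in the question, not the asker's original code.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ToyDataGenerator(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that ignores its input and emits toy data."""

    def __init__(self, num_features=10):
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        n = self.num_features
        # Same toy matrix as in the question: n rows, n columns.
        return np.arange(n * n, dtype=float).reshape(n, n)


pipeline = Pipeline([
    ("pick_features", ToyDataGenerator()),
    ("kmeans", KMeans(n_init=10, random_state=0)),
])

param_grid = {
    "pick_features__num_features": [10, 20, 30],
    "kmeans__n_clusters": [2, 3, 4],
}

# Placeholder input: GridSearchCV only uses it to build the CV splits.
dummy_X = np.ones((3, 1))

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(dummy_X)
print(search.best_params_)
```

Since the generator discards whatever split it receives, every fold sees the same generated matrix, so the cross-validation here is a formality rather than a real held-out evaluation.
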
Now if you want to use supervised metrics (I have this feeling from your question) you'll need to specify y as well.
An easy scenario is one where you have a fixed y vector and you want to try several X with that. Then you can just do:

g_search_estimator = GridSearchCV(pipeline, parameter_sets, scoring=my_scoring_function)
g_search_estimator.fit(random_X, y)

and it should run fine. If you want to search over different values of y it will probably be a bit trickier.
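
For the fixed-y scenario, a supervised scorer could be wired in roughly as follows. This is a sketch under several assumptions: `NoiseAdder` and `ari_scorer` are hypothetical, the data comes from `make_blobs` rather than from the question, and the transformer deliberately keeps one output row per input row so that y stays aligned with the transformed X. A scoring callable takes `(estimator, X, y)` and is handed to the GridSearchCV constructor.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class NoiseAdder(TransformerMixin, BaseEstimator):
    """Hypothetical row-preserving transformer: perturbs X with Gaussian noise."""

    def __init__(self, scale=0.0, seed=0):
        self.scale = scale
        self.seed = seed

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rng = np.random.default_rng(self.seed)
        return np.asarray(X) + rng.normal(scale=self.scale, size=np.shape(X))


def ari_scorer(estimator, X, y):
    # Scorer callables receive the fitted estimator and the held-out split.
    return adjusted_rand_score(y, estimator.predict(X))


X, y = make_blobs(n_samples=90, centers=3, random_state=0)

pipeline = Pipeline([
    ("noise", NoiseAdder()),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

search = GridSearchCV(
    pipeline,
    {"noise__scale": [0.0, 1.0, 5.0]},
    scoring=ari_scorer,  # passed to the constructor, not to fit()
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

A generator that changes the number of rows, like the one in the question, would break this alignment, which is one reason searching over different y values is harder.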
