scikit-learn pipeline: grid search over transformer parameters to generate data

Mil*_*ell 5 python scikit-learn cross-validation grid-search

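I would like to use the first step of a scikit-learn pipeline to generate a toy dataset, in order to evaluate the performance of my analysis. A simple example solution I came up with looks like this: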

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection from 0.18 on; sklearn.grid_search was removed in 0.20
from sklearn.base import TransformerMixin
from sklearn import cluster

class FeatureGenerator(TransformerMixin):

    def __init__(self, num_features=None):
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, **transform_params):
        return np.array(
            range(self.num_features*self.num_features)
        ).reshape(self.num_features,
                  self.num_features)

    def get_params(self, deep=True):
        return {"num_features": self.num_features}

    def set_params(self, **parameters):
        self.num_features = parameters["num_features"]
        return self

This working transformer can be called, for example, like this:

pipeline = Pipeline([
    ('pick_features', FeatureGenerator(100)),
    ('kmeans', cluster.KMeans())
])

pipeline = pipeline.fit(None)
classes = pipeline.predict(None)
print(classes)

When I try to run a grid search over this pipeline, things get tricky for me:

parameter_sets = {
    'pick_features__num_features' : [10,20,30],
    'kmeans__n_clusters' : [2,3,4]
}

pipeline = Pipeline([
    ('pick_features', FeatureGenerator()),
    ('kmeans', cluster.KMeans())
])

g_search_estimator = GridSearchCV(pipeline, parameter_sets)

g_search_estimator.fit(None,None)

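The grid search expects samples and labels as input and is not as robust as the pipeline, which does not complain about None as an input argument: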

TypeError: Expected sequence or array-like, got <type 'NoneType'>

This makes sense, because the grid search needs to split the dataset into the different CV partitions.


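Beyond the example above, I have many more parameters that can be tuned in the dataset generation step. I therefore need a solution that includes this step in my parameter-selection cross-validation.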

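Question: Is there a way to set the Xs and ys of the GridSearch from inside the first transformer? Or, what would a solution look like that calls GridSearch with multiple different datasets (preferably in parallel)? Or has anyone tried to customize GridSearchCV, or can anyone point to some reading material on this?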

ldi*_*rer 1

Your code is very clean, so it is a pleasure to offer you this quick and dirty solution:

g_search_estimator.fit([1., 1., 1.],[1., 0., 0.])
g_search_estimator.best_params_

Output:

[tons of int64 to float64 conversion warnings]
{'kmeans__n_clusters': 4, 'pick_features__num_features': 10}

Note that you need 3 samples because you are doing (default) 3-fold cross-validation.
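The sample-count requirement can be checked directly against the splitter. This is a minimal sketch that assumes a recent scikit-learn, where `KFold` lives in `sklearn.model_selection` and the number of folds is passed explicitly (newer releases no longer default to 3 folds). It only illustrates why a k-fold split needs at least k samples.

```python
from sklearn.model_selection import KFold

dummy_X = [1., 1., 1.]  # three placeholder samples, as in the call above

# With 3 folds and 3 samples, every fold holds out exactly one sample.
three_fold = KFold(n_splits=3)
test_folds = [test_idx.tolist() for _, test_idx in three_fold.split(dummy_X)]
print(test_folds)

# With fewer samples than folds, the splitter refuses to partition the data.
try:
    list(three_fold.split(dummy_X[:2]))
    too_few_samples_rejected = False
except ValueError:
    too_few_samples_rejected = True
print(too_few_samples_rejected)
```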

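The error you get comes from a check performed by the GridSearchCV object, so it occurs before your transformer has a chance to do anything. So I would say "no" to your first question: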

Is there a way to set the Xs and ys of the GridSearch from inside the first transformer?

EDIT:
I realize this was unnecessarily confusing, the three following lines are equivalent:

g_search_estimator.fit([1., 1., 1.], [1., 0., 0.])
g_search_estimator.fit([1., 1., 1.], None)
g_search_estimator.fit([1., 1., 1.])

Sorry for hastily throwing random ys in there.

Some explanations about how the grid search computes scores for the different grid points: when you pass scoring=None to the GridSearchCV constructor (this is the default, so that's what you have here), it asks the estimator for a score function. If there is such a function, it is used for scoring. For KMeans the default score function is essentially the opposite of the sum of squared distances to the nearest cluster centers.
This is an unsupervised metric, so y is not necessary here.
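To make the default score concrete, here is a small pure-Python sketch of the quantity it corresponds to: the negative sum of squared distances from each sample to its nearest center. The function name `negative_inertia` and the toy points are illustrative assumptions, not scikit-learn code.

```python
def negative_inertia(points, centers):
    """Return minus the sum of squared distances to the nearest center."""
    total = 0.0
    for point in points:
        # Squared Euclidean distance from this point to every center.
        squared_distances = [
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        ]
        total += min(squared_distances)
    return -total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

score_one = negative_inertia(points, one_center)
score_two = negative_inertia(points, two_centers)
print(score_one, score_two)
```

No labels appear anywhere in this computation, and a score closer to zero is better. Well-placed additional centers tend to improve it, which is consistent with the search above preferring the largest n_clusters.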

Wrapping it up, you will always be able to:

set the Xs of the GridSearch from inside the first transformer

Just 'transform' the input X into something totally unrelated, no one will complain about it. You do need some input random_X though.
Now if you want to use supervised metrics (I have this feeling from your question) you'll need to specify y as well.
An easy scenario is one where you have a fixed y vector and you want to try several X with that. Then you can just do:

g_search_estimator = GridSearchCV(pipeline, parameter_sets, scoring=my_scoring_function)
g_search_estimator.fit(random_X, y)

and it should run fine. If you want to search over different values of y it will probably be a bit trickier.
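For the unsupervised case, the whole approach can be sketched end to end as follows. This assumes a recent scikit-learn (`sklearn.model_selection`, with `BaseEstimator` supplying `get_params`/`set_params`). The names `dummy_X` and `search`, and the explicit `cv=3`, `n_init`, and `random_state` settings, are illustrative choices and not part of the original question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class FeatureGenerator(TransformerMixin, BaseEstimator):
    """Ignore the incoming X and emit a generated toy dataset instead."""

    def __init__(self, num_features=10):
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        n = self.num_features
        # The input X is deliberately discarded; only num_features matters.
        return np.arange(n * n, dtype=float).reshape(n, n)


pipeline = Pipeline([
    ("pick_features", FeatureGenerator()),
    ("kmeans", KMeans(n_init=10, random_state=0)),
])

parameter_sets = {
    "pick_features__num_features": [10, 20, 30],
    "kmeans__n_clusters": [2, 3, 4],
}

# Placeholder input: it only has to be splittable into cv folds.
dummy_X = np.zeros((3, 1))

search = GridSearchCV(pipeline, parameter_sets, cv=3)
search.fit(dummy_X)  # no y: the KMeans score method is used for scoring
print(search.best_params_)
```

Because the transformer regenerates the same data for every fold, the folds carry no extra information in this sketch. The cross-validation machinery is effectively being used as a plain parameter sweep.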