Mil*_*ell 5 python scikit-learn cross-validation grid-search
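I would like to use the first step of a scikit-learn pipeline to generate a toy dataset, so that I can evaluate the performance of my analysis. A simple example solution I came up with looks like this: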
import numpy as np
from sklearn.pipeline import Pipeline
# Formerly sklearn.grid_search, which was removed in scikit-learn 0.20.
from sklearn.model_selection import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn import cluster

class FeatureGenerator(TransformerMixin):

    def __init__(self, num_features=None):
        self.num_features = num_features

    def fit(self, X, y=None, **fit_params):
        # Nothing to learn: the data is generated in transform().
        return self

    def transform(self, X, **transform_params):
        # Ignores X and returns a num_features x num_features toy matrix.
        return np.array(
            range(self.num_features*self.num_features)
        ).reshape(self.num_features,
                  self.num_features)

    def get_params(self, deep=True):
        return {"num_features": self.num_features}

    def set_params(self, **parameters):
        self.num_features = parameters["num_features"]
        return self
This transformer can be used, for example, like this:
pipeline = Pipeline([
    ('pick_features', FeatureGenerator(100)),
    ('kmeans', cluster.KMeans())
])
pipeline = pipeline.fit(None)
classes = pipeline.predict(None)
print(classes)
It gets tricky when I try to run a grid search over this pipeline:
parameter_sets = {
    'pick_features__num_features' : [10,20,30],
    'kmeans__n_clusters' : [2,3,4]
}
pipeline = Pipeline([
    ('pick_features', FeatureGenerator()),
    ('kmeans', cluster.KMeans())
])
g_search_estimator = GridSearchCV(pipeline, parameter_sets)
g_search_estimator.fit(None,None)
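The grid search expects samples and labels as input and is not as forgiving as the pipeline, which does not complain about None as an input argument: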
TypeError: Expected sequence or array-like, got <type 'NoneType'>
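This makes sense, because the grid search needs to split the dataset into different CV partitions.
Beyond the example above, I have many parameters that can be tuned in the dataset generation step. I therefore need a solution that includes this step in my parameter-selection cross-validation.
Question: Is there a way to set the Xs and ys of the GridSearch from inside the first transformer? Or what would a solution look like that calls GridSearch with several different datasets (preferably in parallel)? Or has anyone tried to customize GridSearchCV, or can point to some reading material on this?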
Your code is very clean, so I am happy to offer you this quick and dirty solution:
g_search_estimator.fit([1., 1., 1.],[1., 0., 0.])
g_search_estimator.best_params_
Output:
[tons of int64 to float64 conversion warnings]
{'kmeans__n_clusters': 4, 'pick_features__num_features': 10}
Note that you need 3 samples because you are doing (the default) 3-fold cross-validation.
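To illustrate the sample-count constraint, here is a minimal sketch using `KFold` from `sklearn.model_selection` (the modern module path, which is an assumption about your installed version). It only shows how the splitter behaves on a placeholder input. Also keep in mind that from scikit-learn 0.22 on the default is 5-fold, so a placeholder of length 3 would need an explicit `cv=3`.

```python
from sklearn.model_selection import KFold

placeholder_X = [1.0, 1.0, 1.0]

# Three samples split cleanly into three folds with one test sample each.
splits = list(KFold(n_splits=3).split(placeholder_X))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)

# Two samples cannot be split three ways, so the splitter raises ValueError.
try:
    list(KFold(n_splits=3).split(placeholder_X[:2]))
    too_few_error = None
except ValueError as exc:
    too_few_error = exc
print(too_few_error)
```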
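The error you get is caused by a check performed by the GridSearchCV object, so it happens before your transformer has a chance to do anything. So I would say "no" to your first question:
Is there a way to set the Xs and ys of the GridSearch from inside the first transformer?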
EDIT:
I realize this was unnecessarily confusing. The following three lines are equivalent:
g_search_estimator.fit([1., 1., 1.], [1., 0., 0.])
g_search_estimator.fit([1., 1., 1.], None)
g_search_estimator.fit([1., 1., 1.])
Sorry for hastily throwing random ys in there.
Some explanations about how the grid search computes scores for the different grid points: when you pass scoring=None to the GridSearchCV constructor (this is the default, so that's what you have here), it asks the estimator for a score function. If there is such a function, it is used for scoring. For KMeans the default score function is essentially the negative of the sum of squared distances from each sample to its nearest cluster center.
This is an unsupervised metric, so y is not necessary here.
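As a quick check of what that default score measures, the sketch below compares `KMeans.score` with the negative sum of squared distances to the nearest fitted center, computed by hand. The random data, the seed, and `n_init=10` are arbitrary choices made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance from every sample to every fitted center, shape (50, 3).
sq_dist = ((X[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Negative sum of each sample's squared distance to its nearest center.
manual_score = -sq_dist.min(axis=1).sum()

print(km.score(X), manual_score)
```

Because the score is a negated cost, higher (closer to zero) is better, which is the convention GridSearchCV expects.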
Wrapping it up, you will always be able to:
set the Xs of the GridSearch from inside the first transformer
Just 'transform' the input X into something totally unrelated; no one will complain about it. You do need some input random_X, though.
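Here is a rough sketch of that idea against a recent scikit-learn. The `ToyDataGenerator` class, its `seed` parameter, and the placeholder shape are hypothetical names I made up for illustration. Inheriting from `BaseEstimator` supplies `get_params`/`set_params`, so they need not be written by hand.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ToyDataGenerator(TransformerMixin, BaseEstimator):
    """Hypothetical generator: uses only the number of rows in X."""

    def __init__(self, num_features=2, seed=0):
        self.num_features = num_features
        self.seed = seed

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # Replace the placeholder rows with freshly generated samples.
        rng = np.random.RandomState(self.seed)
        return rng.normal(size=(len(X), self.num_features))


pipeline = Pipeline([
    ("pick_features", ToyDataGenerator()),
    ("kmeans", KMeans(n_init=10, random_state=0)),
])
parameter_sets = {
    "pick_features__num_features": [2, 5, 10],
    "kmeans__n_clusters": [2, 3, 4],
}

random_X = np.zeros((30, 1))  # placeholder input; only its length matters
search = GridSearchCV(pipeline, parameter_sets, cv=3)
search.fit(random_X)
print(search.best_params_)
```

One caveat with this sketch: because the seed is fixed, the data generated for a test fold repeats rows already generated for the training fold, so the scores are not true held-out scores. Scores for different `num_features` values are also on different scales, so compare them with care.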
Now if you want to use supervised metrics (I have this feeling from your question) you'll need to specify y as well.
An easy scenario is one where you have a fixed y vector and you want to try several X with that. Then you can just do:
g_search_estimator = GridSearchCV(pipeline, parameter_sets, scoring=my_scoring_function)
g_search_estimator.fit(random_X, y)
and it should run fine. If you want to search over different values of y it will probably be a bit trickier.
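For the fixed-y scenario, a minimal sketch could look like the following. Everything named here is an assumption for illustration: `IndexedBlobGenerator`, its `spread` parameter, the three blob centers, and the choice of the adjusted Rand index as the supervised metric. The one design trick is to pass row indices as the placeholder X, so each CV fold maps to its own rows of the synthetic dataset and stays aligned with y. A scoring callable takes `(estimator, X, y)` and is passed to the GridSearchCV constructor.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

N_SAMPLES = 60
CENTERS = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])


class IndexedBlobGenerator(TransformerMixin, BaseEstimator):
    """Hypothetical generator: treats X as row indices into a synthetic dataset."""

    def __init__(self, spread=1.0, seed=0):
        self.spread = spread
        self.seed = seed

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        idx = np.asarray(X).ravel().astype(int)
        rng = np.random.RandomState(self.seed)
        noise = rng.normal(scale=self.spread, size=(N_SAMPLES, 2))
        # Row i belongs to blob i % 3, matching the fixed label vector below.
        return CENTERS[idx % 3] + noise[idx]


def my_scoring_function(estimator, X, y):
    # Supervised metric: agreement between cluster assignments and fixed labels.
    return adjusted_rand_score(y, estimator.predict(X))


indices = np.arange(N_SAMPLES).reshape(-1, 1)  # placeholder X: row indices
y = indices.ravel() % 3                        # fixed label vector

pipeline = Pipeline([
    ("pick_features", IndexedBlobGenerator()),
    ("kmeans", KMeans(n_init=10, random_state=0)),
])
parameter_sets = {
    "pick_features__spread": [0.5, 2.0, 4.0],
    "kmeans__n_clusters": [2, 3, 4],
}
search = GridSearchCV(pipeline, parameter_sets, scoring=my_scoring_function, cv=3)
search.fit(indices, y)
print(search.best_params_, search.best_score_)
```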