将自定义函数放入 Sklearn 管道中

Fra*_*cis 10 pipeline machine-learning feature-selection scikit-learn cross-validation

在我的分类方案中,有几个步骤,包括:

  1. SMOTE(合成少数过采样技术)
  2. 特征选择的 Fisher 标准
  3. 标准化(Z-score 标准化)
  4. SVC(支持向量分类器)

在上述方案中要调整的主要参数是百分位数 (2.) 和 SVC (4.) 的超参数,我想通过网格搜索进行调整。

当前的解决方案构建了一个“部分”管道,包括方案中的第 3 步和第 4 步,并将方案clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))]) 分解为两部分:

  1. 调整特征的百分位数以保持通过第一次网格搜索

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for percentile in percentiles:
            # Fisher returns the indices of the selected features specified by the parameter 'percentile'
            selected_ind = Fisher(X_train, y_train, percentile) 
            X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_predict, y_test)
    
    Run Code Online (Sandbox Code Playgroud)

    f1 分数将被存储,然后通过所有百分位数的所有折叠分区进行平均,并返回具有最佳 CV 分数的百分位数。将“percentile for loop”作为内部循环的目的是为了允许公平竞争,因为我们在所有百分位数的所有折叠分区中拥有相同的训练数据(包括合成数据)。

  2. 确定百分位数后,通过第二次网格搜索调整超参数

    skf = StratifiedKFold(y)
    for train_ind, test_ind in skf:
        X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
        # SMOTE synthesizes the training data (we want to keep test data intact)
        X_train, y_train = SMOTE(X_train, y_train)
        for parameters in parameter_comb:
            # Select the features based on the tuned percentile
            selected_ind = Fisher(X_train, y_train, best_percentile) 
            X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
            clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
            model = clf.fit(X_train_selected, y_train)
            y_predict = model.predict(X_test_selected)
            f1 = f1_score(y_predict, y_test)
    
    Run Code Online (Sandbox Code Playgroud)

它以非常相似的方式完成,除了我们为 SVC 调整超参数而不是要选择的特征的百分比。

我的问题是:

  1. 在当前的解决方案中,我只涉及 3. 和 4.clf并在如上所述的两个嵌套循环中执行 1. 和 2. 有点“手动”。有没有办法将所有四个步骤都包含在一个管道中并一次完成整个过程?

  2. 如果可以保留第一个嵌套循环,那么是否可以(以及如何)使用单个管道简化下一个嵌套循环

    clf_all = Pipeline([('smote', SMOTE()),
                        ('fisher', Fisher(percentile=best_percentile))
                        ('normal',preprocessing.StandardScaler()),
                        ('svc',svm.SVC(class_weight='auto'))]) 
    
    Run Code Online (Sandbox Code Playgroud)

    并仅GridSearchCV(clf_all, parameter_comb)用于调整?

    请注意,仅对每个折叠分区中的训练数据都必须执行SMOTEFisher(排名标准)。

任何评论都将不胜感激。

SMOTEFisher显示在下面:

def Fscore(X, y, percentile=None):
    X_pos, X_neg = X[y==1], X[y==0]
    X_mean = X.mean(axis=0)
    X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
    deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) +(1.0/(shape(X_neg[0]-1))*X_neg.var(axis=0)
    num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
    F = num/deno
    sort_F = argsort(F)[::-1]
    n_feature = (float(percentile)/100)*shape(X)[1]
    ind_feature = sort_F[:ceil(n_feature)]
    return(ind_feature)
Run Code Online (Sandbox Code Playgroud)

SMOTE来自https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py,它返回合成数据。我修改了它以返回与合成数据堆叠在一起的原始输入数据及其标签和合成数据。

def smote(X, y):
    n_pos = sum(y==1), sum(y==0)
    n_syn = (n_neg-n_pos)/float(n_pos) 
    X_pos = X[y==1]
    X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
    y_syn = np.ones(shape(X_syn)[0])
    X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
    return(X, y)
Run Code Online (Sandbox Code Playgroud)

Arj*_*hra 21

scikit 在 0.17 版本中创建了一个FunctionTransformer作为预处理类的一部分。它的使用方式与上面答案中 David 对 Fisher 类的实现类似,但灵活性较差。如果函数的输入/输出配置正确,变压器可以为函数实现 fit/transform/fit_transform 方法,从而允许它在 scikit 管道中使用。

例如,如果管道的输入是一个序列,则变压器将如下所示:


def trans_func(input_series):
    return output_series

from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(trans_func)

sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)
Run Code Online (Sandbox Code Playgroud)

其中 vect 是 tf_idf 转换器,clf 是分类器,train 是训练数据集。“train.desc”是输入到管道的系列文本。


Dav*_*vid 7

我不知道你的SMOTE()Fisher()函数来自哪里,但答案是肯定的,你绝对可以做到这一点。为此,您需要围绕这些函数编写一个包装类。最简单的方法是继承 sklearnBaseEstimatorTransformerMixin类,请参见示例:http : //scikit-learn.org/stable/auto_examples/hetero_feature_union.html

如果这对您没有意义,请发布至少一个函数的详细信息(它来自的库或您自己编写的代码),我们可以从那里开始。

编辑:

我很抱歉,我没有仔细观察你的函数,以意识到除了你的训练数据(即 X 和 y)之外,它们还转换了你的目标。Pipeline 不支持转换到您的目标,因此您将像原来一样事先进行转换。供您参考,这里是为您的 Fisher 过程编写自定义类的样子,如果函数本身不需要影响您的目标变量,该类将起作用。

>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>> 
>>> class Fisher(BaseEstimator, TransformerMixin):
...     def __init__(self,percentile=0.95):
...             self.percentile = percentile
...     def fit(self, X, y):
...             from numpy import shape, argsort, ceil
...             X_pos, X_neg = X[y==1], X[y==0]
...             X_mean = X.mean(axis=0)
...             X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
...             deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
...             num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
...             F = num/deno
...             sort_F = argsort(F)[::-1]
...             n_feature = (float(self.percentile)/100)*shape(X)[1]
...             self.ind_feature = sort_F[:ceil(n_feature)]
...             return self
...     def transform(self, x):
...             return x[self.ind_feature,:]
... 
>>> 
>>> data = load_iris()
>>> 
>>> pipeline = Pipeline([
...     ('fisher', Fisher()),
...     ('normal',StandardScaler()),
...     ('svm',SVC(class_weight='auto'))
... ])
>>> 
>>> grid = {
...     'fisher__percentile':[0.75,0.50],
...     'svm__C':[1,2]
... }
>>> 
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
    return self._fit(X, y, ParameterGrid(self.param_grid))
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
    for parameters in parameter_iterable
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
    self.dispatch(function, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
    job = ImmediateApply(func, args, kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
    self.results = func(*args, **kwargs)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
    self.steps[-1][-1].fit(Xt, y, **fit_params)
  File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
Run Code Online (Sandbox Code Playgroud)