Custom scoring function for Scikit-Learn GridSearch

Question

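I need to perform kernel PCA on a dataset of dimension (5000, 26421) to obtain a lower-dimensional representation. To choose the number of components (say k), I reduce the data, reconstruct it back to the original space, and compute the mean squared error between the reconstructed and original data for different values of k.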

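I came across sklearn's grid search functionality and want to use it for the parameter estimation above. Since kernel PCA has no score function, I implemented a custom scoring function and passed it to GridSearchCV.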

from sklearn.decomposition.kernel_pca import KernelPCA
from sklearn.model_selection import GridSearchCV
import numpy as np
import math

def scorer(clf, X):
    Y1 = clf.inverse_transform(X)
    error = math.sqrt(np.mean((X - Y1)**2))
    return error

param_grid = [
    {'degree': [1, 10], 'kernel': ['poly'], 'n_components': [100, 400, 100]},
    {'gamma': [0.001, 0.0001], 'kernel': ['rbf'], 'n_components': [100, 400, 100]},
]

kpca = KernelPCA(fit_inverse_transform=True, n_jobs=30)
clf = GridSearchCV(estimator=kpca, param_grid=param_grid, scoring=scorer)
clf.fit(X)

However, it results in the following error:

/usr/lib64/python2.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X=array([[ 2.,  2.,  1., ...,  0.,  0.,  0.],
    ....,  0.,  1., ...,  0.,  0.,  0.]], dtype=float32), Y=array([[-0.05904257, -0.02796719,  0.00919842, ....        0.00148251, -0.00311711]], dtype=float32), precomputed=False, dtype=<type 'numpy.float32'>)
    117                              "for %d indexed." %
    118                              (X.shape[0], X.shape[1], Y.shape[0]))
    119     elif X.shape[1] != Y.shape[1]:
    120         raise ValueError("Incompatible dimension for X and Y matrices: "
    121                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 122                              X.shape[1], Y.shape[1]))
        X.shape = (1667, 26421)
        Y.shape = (112, 100)
    123 
    124     return X, Y
    125 
    126 

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 26421 while Y.shape[1] == 100

Can someone point out what exactly I am doing wrong?

Answer 1

Moh*_*hif 6

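The signature of your scoring function is incorrect. A metric function only needs to take the true and predicted values of the classifier. So this is how you declare your custom scoring function: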

def my_scorer(y_true, y_predicted):
    # Root-mean-square error between the true and predicted values
    error = math.sqrt(np.mean((y_true - y_predicted)**2))
    return error

You can then pass it to GridSearch by wrapping it with the make_scorer function from Sklearn. Be sure to set the greater_is_better parameter accordingly:

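Whether score_func is a score function (default), meaning high is good, or a loss function, meaning low is good. In the latter case, the scorer object will sign-flip the outcome of the score_func.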

I assume you are computing an error, so this parameter should be set to False, since a lower error is better:

from sklearn.metrics import make_scorer
my_func = make_scorer(my_scorer, greater_is_better=False)

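
The sign flip can be checked by hand with a minimal sketch. The toy data and the DummyRegressor below are assumptions introduced only for illustration, not part of the original question; they are chosen so that the error is easy to verify.

```python
import math

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def rmse(y_true, y_pred):
    # Root-mean-square error: lower is better
    return math.sqrt(np.mean((y_true - y_pred) ** 2))


# Wrap the loss so that model selection can maximize it
neg_rmse = make_scorer(rmse, greater_is_better=False)

# Toy data (hypothetical): the mean of y is 1.0, so the dummy model
# predicts 1.0 everywhere and every residual has magnitude 1
X_toy = np.zeros((4, 1))
y_toy = np.array([0.0, 0.0, 2.0, 2.0])
model = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

raw = rmse(y_toy, model.predict(X_toy))   # plain error
scored = neg_rmse(model, X_toy, y_toy)    # same error with the sign flipped
print(raw, scored)
```

Because GridSearchCV always picks the highest score, the negated error means the parameter set with the smallest error wins.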

Then pass it to GridSearch:

GridSearchCV(estimator=my_clf, param_grid=param_grid, scoring=my_func)

where my_clf is your classifier.
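
As an alternative for an unsupervised estimator such as KernelPCA, where there is no y to predict, scikit-learn also accepts a plain callable with the signature (estimator, X, y) as the scoring argument. The shapes in the traceback (26421 columns against 100) suggest that inverse_transform received the original data rather than its 100-component projection, so the sketch below calls transform first. This is only a sketch under assumptions: the random data, the small parameter grid, and the helper name neg_reconstruction_rmse are hypothetical and not from the question.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.model_selection import GridSearchCV


def neg_reconstruction_rmse(estimator, X, y=None):
    # inverse_transform expects data with n_components columns,
    # so project the held-out data into the reduced space first
    X_reduced = estimator.transform(X)
    X_back = estimator.inverse_transform(X_reduced)
    rmse = np.sqrt(np.mean((X - X_back) ** 2))
    # Negate the error because GridSearchCV maximizes the score
    return -float(rmse)


# Hypothetical stand-in data, far smaller than the (5000, 26421) set
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 8))

param_grid = {"kernel": ["rbf"], "gamma": [0.01, 0.1], "n_components": [2, 5]}
kpca = KernelPCA(fit_inverse_transform=True)
search = GridSearchCV(kpca, param_grid, scoring=neg_reconstruction_rmse, cv=3)
search.fit(X_demo)  # no y is needed for this scorer
print(search.best_params_, search.best_score_)
```

The score is computed on each held-out fold, so it measures how well the fitted mapping reconstructs data it was not fitted on.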

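One more thing: I don't think GridSearchCV is quite what you are looking for. It basically works on data in the form of train and test splits, whereas here you only want to transform your input data. You need to use Pipeline in Sklearn. Take a look at the example mentioned here that combines PCA and GridSearchCV.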
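
A minimal sketch of that Pipeline pattern follows. The synthetic dataset, the LogisticRegression step, and the step names "pca" and "clf" are assumptions chosen for illustration; the linked example may differ in its details.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical labelled data standing in for a real dataset
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Dimensionality reduction followed by a supervised estimator
pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter name>
param_grid = {"pca__n_components": [2, 5, 10]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Here the number of components is selected by the downstream classifier's cross-validated accuracy rather than by reconstruction error, which requires labels for the data.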

Asked:	8 years, 2 months ago
Viewed:	2755 times
Last active:	6 years, 7 months ago