如何使用 GridSearchCV 比较多个模型以及 python 中的管道和超参数调整

Question

如何使用 GridSearchCV 比较多个模型以及 python 中的管道和超参数调整

Lij*_*raj 4 python pipeline python-3.x scikit-learn gridsearchcv

我使用两个估计器，随机森林和 SVM

random_forest_pipeline=Pipeline([   
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('random_forest',RandomForestClassifier())
])
svm_pipeline=Pipeline([
    ('vectorizer',CountVectorizer(stop_words='english')),
    ('svm',LinearSVC())
])

Run Code Online (Sandbox Code Playgroud)

我想首先对数据进行矢量化，然后使用估计器，我正在阅读这个在线教程。然后我使用超参数如下

parameters=[
    {
        'vectorizer__max_features':[500,1000,1500],
        'random_forest__min_samples_split':[50,100,250,500]
    },
    {
        'vectorizer__max_features':[500,1000,1500],
        'svm__C':[1,3,5]
    }
]

Run Code Online (Sandbox Code Playgroud)

并传递给GridSearchCV

pipelines=[random_forest_pipeline,svm_pipeline]
grid_search=GridSearchCV(pipelines,param_grid=parameters,cv=3,n_jobs=-1)
grid_search.fit(x_train,y_train)

Run Code Online (Sandbox Code Playgroud)

但是，当我运行代码时出现错误

类型错误：估计器应该是实现“fit”方法的估计器

不知道为什么我会收到此错误

Answer 1

Ven*_*lam 6

根据此处的示例，很有可能在单个Pipeline/中完成此操作。GridSearchCV

您只需明确提及scoring管道的方法，因为我们最初并未声明最终的估计器。

例子：

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC


my_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('clf', 'passthrough')
])


parameters = [
    {
        'vectorizer__max_features': [500, 1000],
        'clf':[RandomForestClassifier()],
        'clf__min_samples_split':[50, 100,]
    },
    {
        'vectorizer__max_features': [500, 1000],
        'clf':[LinearSVC()],
        'clf__C':[1, 3]
    }
]

grid_search = GridSearchCV(my_pipeline, param_grid=parameters, cv=3, n_jobs=-1, scoring='accuracy')
grid_search.fit(X, y)

grid_search.best_params_

> # {'clf': RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
#                         criterion='gini', max_depth=None, max_features='auto',
#                         max_leaf_nodes=None, max_samples=None,
#                         min_impurity_decrease=0.0, min_impurity_split=None,
#                         min_samples_leaf=1, min_samples_split=100,
#                         min_weight_fraction_leaf=0.0, n_estimators=100,
#                         n_jobs=None, oob_score=False, random_state=None,
#                         verbose=0, warm_start=False),
#  'clf__min_samples_split': 100,
#  'vectorizer__max_features': 1000}




pd.DataFrame(grid_search.cv_results_)[['param_vectorizer__max_features',
                                       'param_clf__min_samples_split',
                                       'param_clf__C','mean_test_score',
                                       'rank_test_score']]

Run Code Online (Sandbox Code Playgroud)

归档时间：	5 年，8 月前
查看次数：	3708 次
最近记录：	4 年，5 月前