tkj*_*kja 1 python scikit-learn
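I am trying to understand how to get the scorer's values out of GridSearchCV. The example code below sets up a small pipeline on text data.
It then sets up a grid search over different n-gram ranges.
Scoring is done with the F1 measure: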
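# imports (module paths for the older scikit-learn releases that expose grid_scores_)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# set up the pipeline
tfidf_vec = TfidfVectorizer(analyzer='word', min_df=0.05, max_df=0.95)
linearsvc = LinearSVC()
clf = Pipeline([('tfidf_vec', tfidf_vec), ('linearsvc', linearsvc)])
# set up the grid search
parameters = {'tfidf_vec__ngram_range': [(1, 1), (1, 2)]}
gs_clf = GridSearchCV(clf, parameters, n_jobs=-1, scoring='f1')
gs_clf = gs_clf.fit(docs_train, y_train)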
Now I can print the scores:
print gs_clf.grid_scores_
[mean: 0.81548, std: 0.01324, params: {'tfidf_vec__ngram_range': (1, 1)},
mean: 0.82143, std: 0.00538, params: {'tfidf_vec__ngram_range': (1, 2)}]
print gs_clf.grid_scores_[0].cv_validation_scores
array([ 0.83234714,  0.8       ,  0.81409002])
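What is not clear to me from the documentation:
Is gs_clf.grid_scores_[0].cv_validation_scores an array holding, for each fold, the score defined by the scoring parameter (in this case, the F1 measure per fold)? If not, what is it?
If I instead choose another metric, such as scoring='f1_micro', will each array in gs_clf.grid_scores_[i].cv_validation_scores contain the f1_micro metric of the folds for that particular grid search parameter choice?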
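One way to sanity-check the first point, using only the numbers printed above: if cv_validation_scores holds one score per fold, then averaging its three values should reproduce the mean and std reported in grid_scores_ for the same parameter setting. The sketch below assumes the summary is a plain unweighted mean and a population standard deviation; it shows that the array and the summary are consistent, not which metric produced them.

```python
from statistics import mean, pstdev

# Per-fold scores copied from the printed cv_validation_scores array above.
fold_scores = [0.83234714, 0.8, 0.81409002]

# Assumption: the reported summary is an unweighted mean and a
# population standard deviation over the folds.
fold_mean = mean(fold_scores)
fold_std = pstdev(fold_scores)

print("mean: %.5f, std: %.5f" % (fold_mean, fold_std))
# prints "mean: 0.81548, std: 0.01324", matching the first grid_scores_ entry
```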
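I wrote the following function to convert a grid_scores_ object to a pandas.DataFrame. Hopefully the DataFrame view helps clear up your confusion, since it is a more intuitive format: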
def grid_scores_to_df(grid_scores):
    """
    Convert a sklearn.grid_search.GridSearchCV.grid_scores_ attribute to a tidy
    pandas DataFrame where each row is a hyperparameter-fold combination.
    """
    rows = list()
    for grid_score in grid_scores:
        for fold, score in enumerate(grid_score.cv_validation_scores):
            # Copy so the fitted search object's parameter dict is not mutated.
            row = grid_score.parameters.copy()
            row['fold'] = fold
            row['score'] = score
            rows.append(row)
    df = pd.DataFrame(rows)
    return df
You need the following import for this to work: import pandas as pd.
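As a rough illustration of what the function returns, here is a self-contained sketch that runs it without fitting a model. FakeScore is a hypothetical stand-in for the entries of grid_scores_, assumed to expose the same field names (parameters, mean_validation_score, cv_validation_scores). The values are the ones printed in the question for ngram_range (1, 1). The per-fold scores for (1, 2) were not shown, so that setting is left out.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for one grid_scores_ entry (not a scikit-learn class).
FakeScore = namedtuple(
    "FakeScore",
    ["parameters", "mean_validation_score", "cv_validation_scores"],
)


def grid_scores_to_df(grid_scores):
    """One row per hyperparameter-fold combination."""
    rows = []
    for grid_score in grid_scores:
        for fold, score in enumerate(grid_score.cv_validation_scores):
            row = grid_score.parameters.copy()
            row['fold'] = fold
            row['score'] = score
            rows.append(row)
    return pd.DataFrame(rows)


# Values copied from the output printed in the question.
grid_scores = [
    FakeScore(
        parameters={'tfidf_vec__ngram_range': (1, 1)},
        mean_validation_score=0.81548,
        cv_validation_scores=[0.83234714, 0.8, 0.81409002],
    ),
]

df = grid_scores_to_df(grid_scores)
print(df)                  # three rows: one per fold
print(df['score'].mean())  # average of the per-fold scores
```

Each row carries the parameter setting, the fold index, and that fold's score, so the per-fold values can be inspected or aggregated with ordinary DataFrame operations.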