GridSearch for the best model: saving and loading parameters

Question

Chr*_*her 5 python parameters pickle scikit-learn grid-search

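I would like to run the following workflow:

1. Select a model for text vectorization
2. Define a list of parameters
3. Apply a pipeline with GridSearchCV on the parameters, using LogisticRegression() as a baseline, to find the best model parameters
4. Save the best model (parameters)
5. Load the best model parameters so that we can apply a range of other classifiers on this defined model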

Here is code you can use to reproduce it:

Grid search:

%%time
import numpy as np
import pandas as pd
import joblib  # sklearn.externals.joblib was deprecated and later removed from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(0)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=0)

# Find best Tfidf model using LR
pipeline = Pipeline([
  ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
  ('clf', LogisticRegression())
  ])

parameters = {
              'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
              'tfidf__smooth_idf': (True, False),
              'tfidf__norm': ('l1', 'l2', None),
              }

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)

print(grid.best_params_)

# Save model
#joblib.dump(grid.best_estimator_, 'best_tfidf.pkl', compress = 1) # this unfortunately includes the LogReg
joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters


Fitting 2 folds for each of 24 candidates, totalling 48 fits
{'tfidf__smooth_idf': True, 'tfidf__norm': 'l2', 'tfidf__max_df': 0.25}

Load the model with the best parameters:

from sklearn.model_selection import GridSearchCV

# Load best parameters
tfidf_params = joblib.load('best_tfidf.pkl')

pipeline = Pipeline([
  ('vec', TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**tfidf_params)), # here is the issue?
  ('clf', LogisticRegression())
  ])

cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
print("Cross-Validation Score: %s" % (np.mean(cval)))


ValueError: Invalid parameter tfidf for estimator TfidfVectorizer(analyzer='word', binary=False, decode_error='strict', dtype=, encoding='utf-8', input='content', lowercase=True, max_df=1.0, max_features=None, min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=, smooth_idf=True, stop_words=None, strip_accents=None, sublinear_tf=False, token_pattern='(?u)\b\w\w+\b', tokenizer=None, use_idf=True, vocabulary=None). Check the list of available parameters with `estimator.get_params().keys()`.

Question:

How can I load the best parameters for the Tfidf model?

Answer 1

Viv*_*mar 4

This line:

joblib.dump(grid.best_params_, 'best_tfidf.pkl', compress = 1) # Only best parameters


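saves the parameters of the pipeline, not of the TfidfVectorizer. Every key in grid.best_params_ carries the step prefix tfidf__ (for example tfidf__max_df), and only a pipeline that contains a step named tfidf can resolve such keys. Passing them to set_params on a bare TfidfVectorizer raises the ValueError shown in the question. So rebuild the pipeline with the same step name and set the parameters on the pipeline: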

pipeline = Pipeline([
  # Change the name to be same as before
  ('tfidf', TfidfVectorizer(preprocessor=' '.join, tokenizer=None)),
  ('clf', LogisticRegression())
  ])

pipeline.set_params(**tfidf_params)
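
For the last step of the workflow in the question (trying other classifiers on top of the tuned vectorizer), something along these lines might work. This is only a sketch, not part of the original answer: the helper name make_tuned_pipeline is hypothetical, MultinomialNB is an arbitrary example of a second classifier, and the dictionary literal stands in for the result of joblib.load('best_tfidf.pkl').

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Stand-in for: tfidf_params = joblib.load('best_tfidf.pkl')
tfidf_params = {"tfidf__smooth_idf": True, "tfidf__norm": "l2", "tfidf__max_df": 0.25}


def make_tuned_pipeline(clf):
    """Build a pipeline whose 'tfidf' step uses the saved parameters.

    The step must be named 'tfidf' so that keys such as 'tfidf__max_df' resolve.
    """
    pipe = Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=" ".join, tokenizer=None)),
        ("clf", clf),
    ])
    return pipe.set_params(**tfidf_params)  # set_params returns the pipeline itself


# Hypothetical selection of classifiers to compare on the same vectorizer settings.
candidates = {"logreg": LogisticRegression(), "nb": MultinomialNB()}
pipelines = {name: make_tuned_pipeline(clf) for name, clf in candidates.items()}
```

Each entry in pipelines can then be passed to cross_val_score in the same way as the single pipeline in the question.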

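
If the parameters really should be applied to a standalone TfidfVectorizer rather than to a pipeline, one alternative is to strip the step prefix from the saved keys first. The following is a minimal sketch under that assumption; the helper name strip_step_prefix is hypothetical, and the dictionary literal again stands in for what joblib.load('best_tfidf.pkl') returns.

```python
def strip_step_prefix(params, step):
    """Keep only the entries that belong to pipeline step `step`
    and drop the '<step>__' prefix from their keys."""
    prefix = step + "__"
    return {
        key[len(prefix):]: value
        for key, value in params.items()
        if key.startswith(prefix)
    }


# Stand-in for: tfidf_params = joblib.load('best_tfidf.pkl')
tfidf_params = {"tfidf__smooth_idf": True, "tfidf__norm": "l2", "tfidf__max_df": 0.25}

vectorizer_params = strip_step_prefix(tfidf_params, "tfidf")
print(vectorizer_params)  # {'smooth_idf': True, 'norm': 'l2', 'max_df': 0.25}
```

The resulting keys are plain TfidfVectorizer parameter names, so they can be passed as TfidfVectorizer(preprocessor=' '.join, tokenizer=None).set_params(**vectorizer_params) without triggering the ValueError.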
