I am reading the following example on Pipelines and GridSearchCV in Python: http://www.davidsbatista.net/blog/2017/04/01/document_classification/
Logistic regression:
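from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# stop_words is assumed to be defined elsewhere (e.g. a list of stop words)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'))),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__C': [0.01, 0.1, 1],
    'clf__estimator__class_weight': ['balanced', None],
}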
SVM:
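from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# stop_words is assumed to be defined elsewhere (e.g. a list of stop words)
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
parameters = {
    'tfidf__max_df': (0.25, 0.5, 0.75),
    'tfidf__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__estimator__C': [0.01, 0.1, 1],
    'clf__estimator__class_weight': ['balanced', None],
}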
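Is there a way to combine logistic regression and SVM into a single pipeline? Say I have a TfidfVectorizer and would like to test multiple classifiers against it, with each classifier then reporting its best model/parameters.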
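In sklearn, you can define a sequential pipeline so that all consecutive steps of the pipeline are tuned together to the best combination of hyperparameters. A sequential pipeline can be implemented as follows: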
from sklearn.svm import SVC
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
digits = datasets.load_digits()
X_train = digits.data
y_train = digits.target
# Use Principal Component Analysis to reduce dimensionality
# and improve generalization
pca = decomposition.PCA()
# Use an SVC (the default kernel is RBF, not linear)
svm = SVC()
# Combine PCA and SVC into a pipeline
pipe = Pipeline(steps=[('pca', pca), ('svm', svm)])
# Candidate numbers of PCA components to search over
n_components = [20, 40, 64]
params_grid = {
    'svm__C': …
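For the original question (one TfidfVectorizer, several classifiers), one option is to treat the classifier step itself as a hyperparameter: GridSearchCV accepts a list of grids, and each grid can replace the 'clf' step with a different estimator. The following is a minimal sketch under stated assumptions, not the blog author's method. The toy corpus, the labels, the grid values and the helper dictionary best_per_clf are all hypothetical and exist only to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only here so the sketch runs end to end.
docs = [
    "the striker scored a late goal", "the team won the football match",
    "the coach praised the goalkeeper", "fans cheered at the stadium",
    "the senate passed the new bill", "the president signed the law",
    "voters elected a new parliament", "the minister announced a tax reform",
    "the new phone has a fast processor", "the laptop ships with more memory",
    "developers released a software update", "the chip maker unveiled a new processor",
]
labels = ["sport"] * 4 + ["politics"] * 4 + ["tech"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    # Placeholder classifier; each grid below swaps in its own 'clf' step.
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

# Vectorizer settings shared by both classifiers (illustrative values).
shared = {"tfidf__ngram_range": [(1, 1), (1, 2)]}

# One grid per classifier; the 'clf' key replaces the whole pipeline step.
param_grid = [
    {**shared,
     "clf": [OneVsRestClassifier(LogisticRegression(max_iter=1000))],
     "clf__estimator__C": [0.1, 1, 10]},
    {**shared,
     "clf": [OneVsRestClassifier(LinearSVC())],
     "clf__estimator__C": [0.1, 1, 10]},
]

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)

# Collect the best-scoring candidate for each classifier type.
best_per_clf = {}
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    name = type(params["clf"].estimator).__name__
    if name not in best_per_clf or score > best_per_clf[name][0]:
        settings = {k: v for k, v in params.items() if k != "clf"}
        best_per_clf[name] = (score, settings)

for name, (score, settings) in best_per_clf.items():
    print(name, round(score, 3), settings)
```

search.best_estimator_ holds the overall winner, while best_per_clf gives the best parameters found for each classifier separately. An alternative with the same effect is to loop over the classifiers and run one GridSearchCV per classifier, which is simpler to read but repeats the setup for each model.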