使用 python sklearn 进行逻辑回归和 GridSearchCV

Question

使用 python sklearn 进行逻辑回归和 GridSearchCV

use*_*622 8 python scikit-learn logistic-regression

我正在尝试此页面中的代码。我跑到这个部分LR (tf-idf)并得到了类似的结果

之后我决定尝试一下GridSearchCV。我的问题如下：

1）

#lets try gridsearchcv
#https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

from sklearn.model_selection import GridSearchCV

grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression(solver = 'liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)

#tuned hpyerparameters :(best parameters)  {'C': 10.0, 'penalty': 'l2'}
#best score : 0.7390325593588823

Run Code Online (Sandbox Code Playgroud)

然后我手动计算了f1分数。为什么不匹配？

logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]
final_prediction=np.where(logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]>=0.5,1,0)
#https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score
#calculate F1 score
f1_score(y_train, final_prediction)
0.9839388145315489

Run Code Online (Sandbox Code Playgroud)

如果我尝试scoring='precision'为什么会出现以下错误？我不清楚主要是因为我有相对平衡的数据集（55-45％）并且f1需要precision计算没有任何问题

#lets try gridsearchcv #https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

from sklearn.model_selection import GridSearchCV

grid={"C":np.logspace(-3,3,7), "penalty":["l2"]}# l1 lasso l2 ridge
logreg=LogisticRegression(solver = 'liblinear')
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='precision')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hpyerparameters :(best parameters) ",logreg_cv.best_params_)
print("best score :",logreg_cv.best_score_)



/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
tuned hpyerparameters :(best parameters)  {'C': 0.1, 'penalty': 'l2'}
best score : 0.9474200393672962

Run Code Online (Sandbox Code Playgroud)

有没有更简单的方法来获取列车数据的预测？我们已经有了这个 logreg_cv对象。我使用下面的方法来获取预测。有更好的方法来做同样的事情吗？

logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]

############################

############更新1

请回答上面的问题1。在问题的评论中说The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimators. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger compared to the results in the grid search

这就是我得到以下结果的原因#tuned hpyerparameters :(best parameters) {'C': 10.0, 'penalty': 'l2'} #best score : 0.7390325593588823

但是当我手动做时我得到 f1_score(y_train, final_prediction) 0.9839388145315489

2）

我尝试f1_micro按照下面答案中的建议进行调整。没有错误消息。我仍然不清楚为什么失败f1_micro时没有失败precision

from sklearn.model_selection import GridSearchCV

grid={"C":np.logspace(-3,3,7), "penalty":["l2"], "solver":['liblinear','newton-cg'], 'class_weight':[{ 0:0.95, 1:0.05 }, { 0:0.55, 1:0.45 }, { 0:0.45, 1:0.55 },{ 0:0.05, 1:0.95 }]}# l1 lasso l2 ridge
#logreg=LogisticRegression(solver = 'liblinear')
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=3,scoring='f1_micro')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

tuned hpyerparameters :(best parameters)  {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'newton-cg'}
best score : 0.7894909688013136

Run Code Online (Sandbox Code Playgroud)

Answer 1

Stu*_*olf 5

你最终会得到精确的错误，因为你的一些惩罚对于这个模型来说太强了，如果你检查结果，当 C = 0.001 和 C = 0.01 时，你的 f1 分数为 0

res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:,res.columns.str.contains("split[0-9]_test_score|params",regex=True)]
 
                           params  split0_test_score  split1_test_score  split2_test_score
0   {'C': 0.001, 'penalty': 'l2'}           0.000000           0.000000           0.000000
1    {'C': 0.01, 'penalty': 'l2'}           0.000000           0.000000           0.000000
2     {'C': 0.1, 'penalty': 'l2'}           0.973568           0.952607           0.952174
3     {'C': 1.0, 'penalty': 'l2'}           0.863934           0.851064           0.836449
4    {'C': 10.0, 'penalty': 'l2'}           0.811634           0.769547           0.787838
5   {'C': 100.0, 'penalty': 'l2'}           0.789826           0.762162           0.773438
6  {'C': 1000.0, 'penalty': 'l2'}           0.781003           0.750000           0.763871

Run Code Online (Sandbox Code Playgroud)

你可以检查一下：

lr = LogisticRegression(C=0.01).fit(X_train_vectors_tfidf,y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])

Run Code Online (Sandbox Code Playgroud)

预测的概率会向截距漂移：

# expected probability
np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))
array([0.41764462])

lr.predict_proba(X_train_vectors_tfidf)
 
array([[0.58732636, 0.41267364],
       [0.57074279, 0.42925721],
       [0.57219143, 0.42780857],
       ...,
       [0.57215605, 0.42784395],
       [0.56988186, 0.43011814],
       [0.58966184, 0.41033816]])

Run Code Online (Sandbox Code Playgroud)

对于“获取列车数据的预测”的问题，我认为这是唯一的方法。使用最佳参数在整个训练集上重新拟合模型，但不存储预测或预测概率。如果您正在寻找训练/测试期间获得的值，您可以检查cross_val_predict

归档时间：	4 年，5 月前
查看次数：	5109 次
最近记录：	4 年，5 月前