Question (GPB*_*GPB, 16 votes) — tags: python, scikit-learn
I am using scikit-learn's RFECV to select the most important features for a logistic regression, using cross-validation. Assume X is an [n, x] dataframe of features and y is the response variable:
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm
import numpy as np
# Create a logistic regression estimator
logreg = lm.LogisticRegression()
# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(y, 3), scoring='roc_auc')
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
#
pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression())
# Define a range of hyper parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']
skf = StratifiedKFold(y, 3)
param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)
grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')
grid.fit(X_new, y)
Two questions:
a) Is this the correct process for feature selection, hyperparameter selection, and fitting?
b) Where can I find the fitted coefficients of the selected features?
Answer (Irm*_*rer, 24 votes)
Is this the correct process for feature selection?
This is one of many ways to approach feature selection. Recursive feature elimination is an automated approach; the scikit-learn documentation lists several others. They all have different pros and cons, and in practice feature selection is usually a mix of common sense and trying models with different feature sets. RFE is a quick way to select a good set of features, but it does not necessarily give you the ultimately best one. By the way, you do not need to build your StratifiedKFold separately: if you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use a StratifiedKFold when the y values are binary or multiclass, which is most likely your case since you are using LogisticRegression. You can also combine
# Fit the features to the response variable
rfecv.fit(X, y)
# Put the best features into new df X_new
X_new = rfecv.transform(X)
into
X_new = rfecv.fit_transform(X, y)
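Putting both simplifications together, the feature-selection step from the question reduces to a sketch like this (same logreg, X and y as above):
# an integer cv is enough here; RFECV builds a StratifiedKFold
# internally because y is binary/multiclass
rfecv = RFECV(estimator=logreg, cv=3, scoring='roc_auc')
X_new = rfecv.fit_transform(X, y)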
Is this the correct process for hyperparameter selection?
GridSearchCV is basically an automated way to systematically try a whole set of combinations of model parameters and pick the best among them according to some performance metric. Yes, it is a good way to find suitable parameters.
Is this the correct process for fitting?
Yes, this is a valid way to fit the model. When you call grid.fit(X_new, y), it builds a grid of LogisticRegression estimators (one for each tried combination of parameters) and fits each of them. It keeps the best-performing one in grid.best_estimator_, that estimator's parameters in grid.best_params_, and that estimator's performance score in grid.best_score_. Note that grid.fit returns the grid object itself, not the best estimator. Keep in mind that for incoming new X values that you want the model to predict on, you must first apply the transform of the fitted RFECV model. So you could actually add this step to the pipeline as well.
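As a minimal sketch of that last suggestion (assuming the same X, y and imports as in the question; X_incoming is a hypothetical batch of new data):
# with RFECV inside the pipeline, new data is scaled and feature-selected
# automatically before the final LogisticRegression predicts
pipe = make_pipeline(preprocessing.StandardScaler(),
                     RFECV(estimator=lm.LogisticRegression(), cv=3, scoring='roc_auc'),
                     lm.LogisticRegression())
pipe.fit(X, y)
# pipe.predict(X_incoming)  # transform + predict in one call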
Where can I find the fitted coefficients of the selected features?
The grid.best_estimator_ attribute is a LogisticRegression object with all of this information, so grid.best_estimator_.coef_ has all the coefficients (and grid.best_estimator_.intercept_ is the intercept). Note that to be able to get this grid.best_estimator_, the refit parameter of GridSearchCV needs to be set to True, but that is the default anyway.
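One caveat not spelled out in the answer above: because pipe was built with make_pipeline, grid.best_estimator_ in the question's setup is actually a Pipeline, so the fitted LogisticRegression sits one step deeper. A short sketch of pulling the coefficients out in that case:
# fetch the LogisticRegression step by its auto-generated name
best_logreg = grid.best_estimator_.named_steps['logisticregression']
print(best_logreg.coef_)       # one coefficient per feature kept by RFECV
print(best_logreg.intercept_)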
In practice, you want a train/validation/test structure for your sample data: the train set is used to fit the ordinary model parameters, the validation set to tune the hyperparameters in the grid search, and a held-out test set for performance evaluation. Here is one way to do it.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import numpy as np
import pandas as pd
# simulate some artificial data so that I can show you the result of each intermediate step
# 1000 obs, X dim 1000-by-100, 2 different y labels with unbalanced weights
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, n_classes=2, weights=[0.1, 0.9])
X.shape
Out[78]: (1000, 100)
y.shape
Out[79]: (1000,)
# Nested cross-validation: this returns a train/test index iterator
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
# to take a look at the split, you will see it has 5 tuples
list(split)
# the 1st fold
train_index = list(split)[0][0]
Out[80]: array([ 0, 1, 2, ..., 997, 998, 999])
test_index = list(split)[0][1]
Out[81]: array([ 5, 12, 17, ..., 979, 982, 984])
# let's play with just one iteration for now
# your pipe
pipe = make_pipeline(StandardScaler(), LogisticRegression())
# set up params
params_space = dict(logisticregression__C=10.0**np.arange(-5, 1),
                    logisticregression__penalty=['l1', 'l2'],
                    logisticregression__class_weight=[None, 'auto'])
# apply the grid search only on the train data, but with a further cv step,
# so the original train set splits into [gscv_train, gscv_validation], where the latter tunes the hyperparameters;
# performance is still evaluated on a separate held-out 'test' set
grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
# fit the data on train set
grid.fit(X[train_index], y[train_index])
# to get the params of your estimator, call your gscv
grid.best_estimator_
Out[82]:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.10000000000000001, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=100,
multi_class='ovr', penalty='l1', random_state=None,
solver='liblinear', tol=0.0001, verbose=0))])
# the performance in validation set
grid.grid_scores_
Out[83]:
[mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.87975, std: 0.01753, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87985, std: 0.01746, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88033, std: 0.01707, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87975, std: 0.01732, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88245, std: 0.01732, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87955, std: 0.01686, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.88746, std: 0.02318, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87990, std: 0.01634, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.94002, std: 0.02959, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.87419, std: 0.02174, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.93508, std: 0.03101, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.87091, std: 0.01860, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
mean: 0.88013, std: 0.03246, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
mean: 0.85247, std: 0.02712, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
mean: 0.88904, std: 0.02906, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
mean: 0.85197, std: 0.02097, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'}]
# or the best score among them
grid.best_score_
Out[84]: 0.94002188482393367
# after training the estimator, predict on the test set
y_pred = grid.predict(X[test_index])
# since LogisticRegression is a probability-based model, we have the luxury of getting the probability for each obs
y_pred_probs = grid.predict_proba(X[test_index])
Out[87]:
array([[ 0.0632, 0.9368],
[ 0.0236, 0.9764],
[ 0.0227, 0.9773],
...,
[ 0.0108, 0.9892],
[ 0.2903, 0.7097],
[ 0.0113, 0.9887]])
# to get the evaluation result:
print(classification_report(y[test_index], y_pred))
precision recall f1-score support
0 0.93 0.59 0.72 22
1 0.95 0.99 0.97 179
avg / total 0.95 0.95 0.95 201
# to put it all together with the nested cross-validation
# generate a pandas dataframe to store the prediction probabilities
kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
report = []  # to store the classification reports
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
for train_index, test_index in split:
    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    grid.fit(X[train_index], y[train_index])
    y_pred_probs = grid.predict_proba(X[test_index])
    kfold_df.iloc[test_index, :] = y_pred_probs
    y_pred = grid.predict(X[test_index])
    report.append(classification_report(y[test_index], y_pred))
# your result
print(kfold_df)
Out[88]:
0 1
0 0.1710 0.8290
1 0.0083 0.9917
2 0.2049 0.7951
3 0.0038 0.9962
4 0.0536 0.9464
5 0.0632 0.9368
6 0.1243 0.8757
7 0.1150 0.8850
8 0.0796 0.9204
9 0.4096 0.5904
.. ... ...
990 0.0505 0.9495
991 0.2128 0.7872
992 0.0270 0.9730
993 0.0434 0.9566
994 0.8078 0.1922
995 0.1452 0.8548
996 0.1372 0.8628
997 0.0127 0.9873
998 0.0935 0.9065
999 0.0065 0.9935
[1000 rows x 2 columns]
for r in report:
    print(r)
precision recall f1-score support
0 0.93 0.59 0.72 22
1 0.95 0.99 0.97 179
avg / total 0.95 0.95 0.95 201
precision recall f1-score support
0 0.86 0.55 0.67 22
1 0.95 0.99 0.97 179
avg / total 0.94 0.94 0.93 201
precision recall f1-score support
0 0.89 0.38 0.53 21
1 0.93 0.99 0.96 179
avg / total 0.93 0.93 0.92 200
precision recall f1-score support
0 0.88 0.33 0.48 21
1 0.93 0.99 0.96 178
avg / total 0.92 0.92 0.91 199
precision recall f1-score support
0 0.88 0.33 0.48 21
1 0.93 0.99 0.96 178
avg / total 0.92 0.92 0.91 199
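As a possible follow-up (not part of the original answer), the out-of-fold probabilities collected in kfold_df can be scored directly against the true labels; here column 1 holds the predicted P(y == 1):
from sklearn.metrics import roc_auc_score
# overall out-of-fold AUC from the nested cross-validation predictions
print(roc_auc_score(y, kfold_df[1]))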