Python feature selection in a pipeline: how to determine the feature names?

Question

fig*_*ggy 6 pipeline feature-selection scikit-learn

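I use a pipeline and a grid search to select the best parameters, then use those parameters to fit the best pipeline ('best_pipe'). However, because feature selection (SelectKBest) happens inside the pipeline, no fit has been applied to a standalone SelectKBest object that I could inspect.

I need to know the names of the 'k' selected features. Any ideas how to retrieve them? Thank you in advance.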

from sklearn import (model_selection, feature_selection, pipeline,
                     preprocessing, linear_model)

# X (a pandas DataFrame), y and target (pandas Series) are assumed to be defined already.
folds = 5
# random_state is only accepted when shuffle=True in recent scikit-learn versions
split = model_selection.StratifiedKFold(n_splits=folds, shuffle=False)

scores = []
for k, (train, test) in enumerate(split.split(X, target)):

    # positional indexing; the old .ix accessor has been removed from pandas
    X_train, X_test, y_train, y_test = X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test]

    top_feat = feature_selection.SelectKBest()

    # the liblinear solver supports both the 'l1' and 'l2' penalties searched below
    pipe = pipeline.Pipeline([('scaler', preprocessing.StandardScaler()),
                                 ('feat', top_feat),
                                 ('clf', linear_model.LogisticRegression(solver='liblinear'))])

    K = [40, 60, 80, 100]
    C = [1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
    penalty = ['l1', 'l2']

    param_grid = [{'feat__k': K,
                  'clf__C': C,
                  'clf__penalty': penalty}]

    scoring = 'precision'

    gs = model_selection.GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring)
    gs.fit(X_train, y_train)

    best_score = gs.best_score_
    scores.append(best_score)

    print("Fold: {} {} {:.4f}".format(k+1, scoring, best_score))
    print(gs.best_params_)

best_pipe = pipeline.Pipeline([('scale', preprocessing.StandardScaler()),
                          ('feat', feature_selection.SelectKBest(k=80)),
                          ('clf', linear_model.LogisticRegression(C=.0001, penalty='l2'))])

best_pipe.fit(X_train, y_train)
best_pipe.predict(X_test)

Answer 1

jak*_*vdp 6

You can access the feature selector by its step name in best_pipe:

features = best_pipe.named_steps['feat']

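You can then call transform() on an array of column indices to get the names of the selected columns. Recent scikit-learn versions require 2D input, so the indices are passed as a single row (this needs import numpy as np):

X.columns[features.transform(np.arange(len(X.columns)).reshape(1, -1))[0]]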

The output here will be the names of the eighty columns selected in the pipeline.
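
For anyone who wants to try the index-array trick end to end, here is a minimal sketch. It assumes synthetic data from make_classification, made-up column names (f0 ... f9), and k=3 instead of the asker's 80; none of these come from the original question.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data; the column names are hypothetical.
X_arr, y_arr = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)
X = pd.DataFrame(X_arr, columns=["f{}".format(i) for i in range(10)])

best_pipe = Pipeline([("scale", StandardScaler()),
                      ("feat", SelectKBest(k=3)),
                      ("clf", LogisticRegression())])
best_pipe.fit(X, y_arr)

# Fetch the fitted selector by the name it was given in the pipeline.
features = best_pipe.named_steps["feat"]

# Push a single row of column positions through the fitted selector;
# the positions that survive are those of the selected columns.
positions = np.arange(len(X.columns)).reshape(1, -1)
selected = X.columns[features.transform(positions)[0]]
print(list(selected))
```

The trick works because the selector only drops columns, so whatever values sit in the kept columns come out unchanged, and here those values are the column positions themselves.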

Answer 2

bwe*_*t87 6

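Jake's answer works perfectly. But depending on which feature selector you are using, I think there is another option that is cleaner. This one worked for me: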

X.columns[features.get_support()]

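It gave me the same result as Jake's answer. You can read more about it in the docs, but get_support returns an array of True/False values, one per input column, indicating whether that column was selected. It is also worth noting that X must have the same number of columns, in the same order, as the training data the feature selector was fitted on.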
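
To make the boolean mask concrete, here is a small sketch of the get_support() approach. The data is synthetic and the column names (col_0 ... col_7) and k=4 are assumptions chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the column names are hypothetical.
X_arr, y_arr = make_classification(n_samples=200, n_features=8,
                                   n_informative=3, random_state=1)
X = pd.DataFrame(X_arr, columns=["col_{}".format(i) for i in range(8)])

best_pipe = Pipeline([("scale", StandardScaler()),
                      ("feat", SelectKBest(k=4)),
                      ("clf", LogisticRegression())])
best_pipe.fit(X, y_arr)

features = best_pipe.named_steps["feat"]

# Boolean mask with one entry per column seen during fit.
mask = features.get_support()

# The mask only lines up with X.columns if X has the same columns,
# in the same order, as the data the selector was fitted on.
if len(mask) != X.shape[1]:
    raise ValueError("X does not match the data the selector was fitted on")

names = X.columns[mask]
print(mask)
print(list(names))
```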

Answer 3

xim*_*iki 5

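This may be an instructive alternative: I ran into a need similar to what the OP asked about. If you want to get the indices of the k best features directly from GridSearchCV: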

finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)

Then, by indexing with those positions, you can get your finalFeatureList:

finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]
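
A self-contained sketch of this route, from grid search to feature names, might look like the following. The data is synthetic, the names in initialFeatureList are invented for the example, and the grid is deliberately much smaller than the one in the question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature names are hypothetical.
X_arr, y_arr = make_classification(n_samples=200, n_features=10,
                                   n_informative=4, random_state=0)
initialFeatureList = ["feature_{}".format(i) for i in range(10)]
X = pd.DataFrame(X_arr, columns=initialFeatureList)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("feat", SelectKBest()),
                 ("clf", LogisticRegression())])

# Small illustrative grid; step names prefix the parameter names.
param_grid = {"feat__k": [2, 4, 6], "clf__C": [1.0, 0.1]}

gs = GridSearchCV(pipe, param_grid=param_grid, scoring="precision", cv=3)
gs.fit(X, y_arr)

# With the default refit=True, best_estimator_ is the pipeline refitted
# on all of X with the best parameters, so its selector is already fitted.
finalFeatureIndices = gs.best_estimator_.named_steps["feat"].get_support(indices=True)
finalFeatureList = [initialFeatureList[i] for i in finalFeatureIndices]

print(gs.best_params_)
print(finalFeatureList)
```

This avoids rebuilding best_pipe by hand: the number of names returned always matches whichever feat__k the search picked.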

Archived: 10 years, 2 months ago
Views: 7472
Last activity: 8 years, 4 months ago