Exhaustive feature selection in scikit-learn?

Question

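Is there a built-in way to do brute-force feature selection in scikit-learn? That is, exhaustively evaluate all possible combinations of the input features and then find the best subset. I am familiar with the "Recursive feature elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after the other.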

Answer 1

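Combining Fred Foo's answer with the comments by nopper, ihadanny and jimijazz, the following code gives the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) of the book "An Introduction to Statistical Learning with Applications in R".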

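from itertools import combinations

import numpy as np
# cross_val_score now lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import cross_val_score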

def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
       estimator must have fit and score methods.
       X must be a DataFrame.'''

    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1) 
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k: # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores


See the notebook http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection
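As a rough usage sketch of the two-stage idea in this answer (pick the best subset of each size by training score, then compare sizes by cross-validation), the condensed variant below runs on synthetic data. The helper name `best_subset_demo`, the synthetic data, and the choice of `LinearRegression` are assumptions made for illustration and are not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def best_subset_demo(estimator, X, y, max_size=3, cv=5):
    """Condensed sketch: best subset per size by training score, then CV across sizes."""
    n_features = X.shape[1]
    best_per_size = []
    for k in range(1, min(n_features, max_size) + 1):
        # Within one size, the training score is a fair way to rank subsets.
        best_per_size.append(max(
            combinations(range(n_features), k),
            key=lambda s: estimator.fit(X.iloc[:, list(s)], y).score(X.iloc[:, list(s)], y),
        ))
    # Across sizes, compare with the cross-validated score instead.
    cv_scores = [cross_val_score(estimator, X.iloc[:, list(s)], y, cv=cv).mean()
                 for s in best_per_size]
    best = int(np.argmax(cv_scores))
    return best_per_size[best], cv_scores[best], best_per_size, cv_scores


# Synthetic data (an assumption): only columns 0 ("a") and 2 ("c") carry signal.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
y = 3.0 * X["a"] - 2.0 * X["c"] + rng.normal(scale=0.1, size=200)

subset, score, per_size, scores = best_subset_demo(LinearRegression(), X, y)
print(subset, round(score, 3))
```

On data like this, the selected subset should include the two informative columns. Whether an extra noise column also gets picked depends on the cross-validation split.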

Answer 2

Fre*_*Foo 6

No, best subset selection is not implemented. The easiest way to do it is to write it yourself. This should get you started:

from itertools import chain, combinations

import numpy as np
from sklearn.model_selection import cross_val_score

def best_subset_cv(estimator, X, y, cv=3):
    n_features = X.shape[1]
    # every non-empty subset of column indices, from size 1 up to n_features
    subsets = chain.from_iterable(combinations(range(n_features), k + 1)
                                  for k in range(n_features))

    best_score = -np.inf
    best_subset = None
    for subset in subsets:
        # X is assumed to be a NumPy array; the index tuple is turned into a list
        score = cross_val_score(estimator, X[:, list(subset)], y, cv=cv).mean()
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score


This performs k-fold cross-validation inside the loop, so it fits k·2^p estimators when given data with p features.
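To get a feel for that count, the small sketch below tallies the subsets for a few values of p and multiplies by the number of folds. Strictly speaking, the loop visits the 2^p − 1 non-empty subsets; the helper name `n_fits` is made up for illustration.

```python
from math import comb


def n_fits(p, k):
    """Number of estimator fits for k-fold CV over every non-empty subset of p features."""
    n_subsets = sum(comb(p, size) for size in range(1, p + 1))  # equals 2**p - 1
    return k * n_subsets


for p in (5, 10, 20):
    print(p, n_fits(p, k=3))
```

The count roughly doubles with every added feature, which is why the exhaustive search is only practical for a small number of features.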

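The code as originally posted has a bug: `combinations(xrange(k), k + 1)` should draw from `xrange(n_features)` instead of `xrange(k)`. (2 upvotes)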
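A quick check of what this comment describes, using only the standard library: drawing k + 1 items from `range(k)` is impossible, so the generator as originally posted yields no subsets at all, whereas drawing from `range(n_features)` enumerates every non-empty subset. The variable names here are for illustration only.

```python
from itertools import chain, combinations

n_features = 3

# As originally posted: k + 1 items cannot be drawn from only k candidates.
buggy = list(chain.from_iterable(combinations(range(k), k + 1)
                                 for k in range(n_features)))

# Corrected: always draw from the full set of feature indices.
fixed = list(chain.from_iterable(combinations(range(n_features), k + 1)
                                 for k in range(n_features)))

print(buggy)
print(fixed)
```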
