I am using sklearn's RFECV module to perform recursive feature elimination with cross-validation. RFE involves repeatedly training an estimator on the full set of features, then removing the least informative features, until it converges on the optimal number of features.
To get the best performance out of the estimator, I want to select the best hyperparameters for each number of features (edited for clarity). The estimator is a linear SVM, so I am only looking at the C parameter.
Initially my code was as follows. However, this only does one grid search for C at the start, and then uses the same C for every iteration.
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn import svm, grid_search

def get_best_feats(data, labels, c_values):
    parameters = {'C': c_values}
    # svm1 is passed to clf, which grid searches for the best parameters
    svm1 = svm.SVC(kernel='linear')
    clf = grid_search.GridSearchCV(svm1, parameters, refit=True)
    clf.fit(data, labels)
    #print 'best gamma', clf.best_params_['gamma']
    # svm2 uses the optimal hyperparameters found for svm1
    svm2 = svm.SVC(C=clf.best_params_['C'], kernel='linear')
    # svm2 is then passed to RFECV as the estimator for recursive feature elimination
    rfecv = RFECV(estimator=svm2, step=1, cv=StratifiedKFold(labels, 5)) …
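One possible way to get a separate C for each number of features (a sketch of my own, not the code above, written against the newer sklearn.model_selection API; data, labels and c_values are assumed from the question) is to run a plain RFE for each candidate feature count and grid search C on just the selected columns:

from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def tune_c_per_feature_count(data, labels, c_values, feature_counts):
    # For each candidate number of features: select them with RFE using a
    # baseline linear SVM, then grid search C on the reduced matrix.
    results = {}
    for n in feature_counts:
        rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=n, step=1)
        reduced = rfe.fit_transform(data, labels)
        search = GridSearchCV(SVC(kernel='linear'), {'C': c_values},
                              cv=StratifiedKFold(n_splits=5))
        search.fit(reduced, labels)
        results[n] = (search.best_params_['C'], search.best_score_)
    return results

This only approximates re-tuning C inside every elimination step (the features themselves are still chosen with the default C), but it does give a tuned C and a CV score per feature count.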
I am using recursive feature elimination (RFE) for feature selection. It works by iteratively taking an estimator (e.g. an SVM classifier), fitting it to the data, and removing the features with the lowest weights (coefficients).
I am able to fit it to the data and perform feature selection. However, I then want to recover the learned weight for each feature from the RFE.
I use the following code to initialise the classifier and RFE objects and fit them to the data.
svc = SVC(C=1, kernel="linear")
rfe = RFE(estimator=svc, n_features_to_select=300, step=0.1)
rfe.fit(all_training, training_labels)
I then try to print the coefficients:
print('coefficients', svc.coef_)
and get:
AttributeError: 'RFE' object has no attribute 'dual_coef_'
According to the sklearn documentation, the classifier object should have this attribute:
coef_ : array, shape = [n_class-1, n_features]
Weights assigned to the features (coefficients in the primal problem). This is only
available in the case of a linear kernel.
coef_ is a readonly property derived from dual_coef_ and support_vectors_.
I am using a linear kernel, so that is not the issue.
Can anyone explain why I cannot recover the coefficients? Is there a way around this?
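One likely explanation (my reading, not stated in the post): RFE clones the estimator it is given, so the svc object passed in is never fitted itself; the fitted copy lives on the RFE object as rfe.estimator_, trained only on the selected columns. A minimal sketch of pulling the weights out and mapping them back to the original feature indices, assuming the rfe object from the code above:

import numpy as np

# Weights of the linear SVM that RFE fitted on the surviving features;
# shape is (1, n_features_to_select) for a two-class problem.
weights = rfe.estimator_.coef_

# Indices of the original columns that were kept, in their original order.
selected = np.where(rfe.support_)[0]

for idx, w in zip(selected, weights.ravel()):
    print(idx, w)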
I am trying to use the CRX dataset from the UCI Machine Learning Repository. This particular dataset contains some features that are not continuous variables, so I need to convert them into numeric values before they can be passed to an SVM.
I initially looked at using a one-hot encoder, which takes integer values and converts them into a matrix (e.g. if a feature has three possible values, 'red', 'blue' and 'green', this would be converted into three binary features: 1,0,0 for 'red', 0,1,0 for 'blue' and 0,0,1 for 'green'). This would be ideal for my needs, except that the encoder can only handle integer features.
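As an aside (not part of the original code): scikit-learn's DictVectorizer performs the same one-hot expansion but accepts string values directly, so the colour example above can be encoded without an integer step. A minimal sketch with a made-up 'colour' feature:

from sklearn.feature_extraction import DictVectorizer

rows = [{'colour': 'red'}, {'colour': 'blue'}, {'colour': 'green'}]
vec = DictVectorizer(sparse=False)
# One binary column per distinct value, e.g. colour=blue, colour=green, colour=red
print(vec.fit_transform(rows))

The code below (from the question) instead passes the raw rows to FeatureHasher.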
import csv
import numpy as np
from sklearn.feature_extraction import FeatureHasher

def get_crx_data(debug=False):
    with open("/Volumes/LocalDataHD/jt306/crx.data", "rU") as infile:
        features_array = []
        reader = csv.reader(infile, dialect=csv.excel_tab)
        for row in reader:
            features_array.append(str(row).translate(None, "[]'").split(","))
        features_array = np.array(features_array)
        print features_array.shape
        print features_array[0]
        # column 15 holds the class label, columns 0-14 the features
        labels_array = features_array[:, 15]
        features_array = features_array[:, :15]
        print features_array.shape
        print labels_array.shape
        print("FeatureHasher on frequency dicts")
        hasher = FeatureHasher(n_features=44)
        X = hasher.fit_transform(line for line in features_array)
        print X.shape

get_crx_data()
This returns:
Reading CRX data from disk
Traceback (most recent call last):
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 38, in <module>
    get_crx_data()
  File "/Volumes/LocalDataHD/PycharmProjects/FeatureSelectionPython278/Crx2.py", line 32, in get_crx_data
    X = …
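Since the traceback above is cut off, this is only a guess at the cause: FeatureHasher's default input_type='dict' expects an iterable of dict objects, not a NumPy array of raw strings. One way to keep the hashing trick for these categorical columns is to turn each row into 'column=value' tokens and hash those with input_type='string'. A sketch, reusing the features_array built in the code above:

from sklearn.feature_extraction import FeatureHasher

def hash_rows(features_array, n_features=44):
    # FeatureHasher is stateless, so transform can be called directly.
    hasher = FeatureHasher(n_features=n_features, input_type='string')
    tokens = (['f%d=%s' % (i, v) for i, v in enumerate(row)]
              for row in features_array)
    return hasher.transform(tokens)

# X = hash_rows(features_array)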