EmJ*_*EmJ 7 python classification machine-learning scikit-learn cross-validation
I am using RandomForestClassifier() with 10-fold cross-validation as follows.
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print(accuracy.mean())
I want to identify the important features in my feature space. Getting the feature importances for a single fitted classifier seems straightforward, as shown below.
print("Features sorted by their score:")
feature_importances = pd.DataFrame(clf.feature_importances_,
index = X_train.columns,
columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
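However, I could not find how to compute feature importances under cross-validation in sklearn.
In short, I want to identify the most effective features across the 10 cross-validation folds (e.g., by using the average importance score).
I am happy to provide more details if needed.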
cross_val_score() does not return the estimators fitted on each train-test fold combination.
You need to use cross_validate() and set return_estimator=True.
Here is a working example:
from sklearn import datasets
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
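# Note: load_diabetes() has a continuous (regression) target; it is used here
# only to demonstrate the API, so the accuracy scores are not meaningful.
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target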
clf=RandomForestClassifier(n_estimators =10, random_state = 42, class_weight="balanced")
output = cross_validate(clf, X, y, cv=2, scoring = 'accuracy', return_estimator =True)
for idx,estimator in enumerate(output['estimator']):
print("Features sorted by their score for estimator {}:".format(idx))
feature_importances = pd.DataFrame(estimator.feature_importances_,
index = diabetes.feature_names,
columns=['importance']).sort_values('importance', ascending=False)
print(feature_importances)
Output:
Features sorted by their score for estimator 0:
importance
s6 0.137735
age 0.130152
s5 0.114561
s2 0.113683
s3 0.112952
bmi 0.111057
bp 0.108682
s1 0.090763
s4 0.056805
sex 0.023609
Features sorted by their score for estimator 1:
importance
age 0.129671
bmi 0.125706
s2 0.125304
s1 0.113903
bp 0.111979
s6 0.110505
s5 0.106099
s3 0.098392
s4 0.054542
sex 0.023900
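The question asked for an average importance score over the folds, which the per-estimator printout above does not quite give. One possible way to get there is to stack each fold's feature_importances_ into an array and take the mean (and spread) per feature. The sketch below is only an illustration under some assumptions: load_breast_cancer stands in for the asker's X and y, n_estimators=50 is an arbitrary choice, and the names per_fold and summary are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Assumption: the breast-cancer dataset is a stand-in for the asker's X and y.
data = load_breast_cancer()
X, y = data.data, data.target

# Same setup as in the question, but via cross_validate so the fitted
# estimators are returned (n_estimators=50 is an arbitrary choice).
clf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
output = cross_validate(clf, X, y, cv=k_fold, scoring="accuracy", return_estimator=True)

# Stack importances: one row per fold, one column per feature.
per_fold = np.array([est.feature_importances_ for est in output["estimator"]])

# Mean and standard deviation of each feature's importance across the folds,
# sorted so the most important features (on average) come first.
summary = pd.DataFrame(
    {
        "mean_importance": per_fold.mean(axis=0),
        "std_importance": per_fold.std(axis=0),
    },
    index=data.feature_names,
).sort_values("mean_importance", ascending=False)

print("Mean accuracy:", output["test_score"].mean())
print(summary.head(10))
```

The standard deviation column gives a rough sense of how stable each feature's ranking is from fold to fold; a feature with a high mean but a large spread may deserve a closer look before it is treated as important.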