I am using xgboost to build a model and am trying to get the importance of each feature with get_fscore(), but it returns {}.
My training code is:
dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
Is there something wrong with my training? How do I get feature importance in xgboost?
MLK*_*ing 19
In your code, you can get the importance of each feature as a dict:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
Explanation: the Booster returned by the train() API has a get_score() method, defined as:
get_score(fmap='', importance_type='weight')
https://xgboost.readthedocs.io/en/latest/python/python_api.html
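Because get_score() returns a plain dict, ranking the features or turning the scores into shares needs only the standard library. Here is a minimal sketch that reuses the gain values shown above. The variable names are my own and no XGBoost call is made:

```python
# Gain scores copied from the get_score(importance_type='gain') output above.
scores = {
    "ftr_col1": 77.21064539577829,
    "ftr_col2": 10.28690566363971,
    "ftr_col3": 24.225014841466294,
    "ftr_col4": 11.234086283060112,
}

# Rank features from most to least important.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Express each score as a fraction of the total so the values sum to 1.
total = sum(scores.values())
shares = {name: value / total for name, value in ranked}

for name, share in shares.items():
    print(f"{name}: {share:.3f}")
```

Shares are only comparable within one importance_type. Mixing 'weight' and 'gain' numbers in the same table would be misleading.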
Ses*_*ism 12
Using the sklearn API and XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, X must be a pandas.DataFrame when you call regr.fit (or clf.fit).
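When the model is fitted on a NumPy array instead, XGBoost has no column names to work with. It labels the features f0, f1, ... in the returned dict and leaves out features that were never used in a split. Here is a small sketch of mapping such a dict back to column names. The raw_scores values, the columns list and the name_scores helper are invented for illustration:

```python
# Hypothetical output of get_score() for a model fitted on a NumPy array.
raw_scores = {"f0": 12.5, "f2": 3.0}

# Hypothetical column names, in the same order as the columns of X.
columns = ["age", "income", "height"]


def name_scores(raw, names):
    """Return {column name: score}, using 0.0 for features missing from raw."""
    return {name: raw.get(f"f{i}", 0.0) for i, name in enumerate(names)}


named = name_scores(raw_scores, columns)
print(named)
```

Filling the missing entries with 0.0 is a choice made here so that every column appears in the result. The mapping is only valid if the columns list is in the same order as the array passed to fit.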
小智 7
Try this (in newer XGBoost versions, booster() has been replaced by get_booster()):
fscore = clf.best_estimator_.booster().get_fscore()
I don't know for sure how to get the values, but there is a good way to plot feature importance:
model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12,18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()
For feature importance, try the following:
Classification:
# list(...) turns the dict items view into a sequence of (feature, score) rows
pd.DataFrame(list(bst.get_fscore().items()), columns=['feature','importance']).sort_values('importance', ascending=False)
Regression:
xgb.plot_importance(bst)
First, build the model with XGBoost
from xgboost import XGBClassifier, plot_importance
model = XGBClassifier()
model.fit(train, label)
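The fitted model exposes the importances as an array (model.feature_importances_), so we can sort it in descending order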
import numpy as np
# Indices of the features, ordered from most to least important
sorted_idx = np.argsort(model.feature_importances_)[::-1]
Then print each sorted importance together with its column name as a list (I assume the data was loaded with Pandas)
for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]])
In addition, we can use XGBoost's built-in function to plot the importance
from matplotlib import pyplot
plot_importance(model, max_num_features = 15)
pyplot.show()
If needed, the max_num_features argument of plot_importance can be used to limit the number of features shown.
According to this post, there are 3 different ways to get feature importance from Xgboost: built-in feature importance, permutation-based importance, and SHAP-based importance.
Code example:
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")
Please be aware of what type of feature importance you are using. There are several types of importance; see the docs. The scikit-learn-like API of Xgboost returns gain importance, while get_fscore returns the weight type.
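The difference between the types is easiest to see on a toy list of splits. In XGBoost, 'weight' counts how often a feature is used to split, 'gain' is the average gain of those splits, and 'total_gain' is their sum. The sketch below recomputes these numbers by hand from an invented list of (feature, gain) pairs. It does not call XGBoost:

```python
from collections import defaultdict

# Invented splits: (feature used, gain of that split).
splits = [("a", 10.0), ("b", 1.0), ("b", 1.5), ("b", 0.5)]

weight = defaultdict(int)        # number of splits per feature
total_gain = defaultdict(float)  # summed gain per feature
for feature, gain in splits:
    weight[feature] += 1
    total_gain[feature] += gain

# Average gain per split, i.e. the 'gain' importance type.
avg_gain = {f: total_gain[f] / weight[f] for f in weight}

print(dict(weight))   # feature "b" is used most often
print(avg_gain)       # feature "a" has the larger average gain
```

The two rankings disagree here: "b" wins by weight and "a" wins by gain. This is why numbers from get_fscore and from feature_importances_ should not be compared directly.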
perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
This is my preferred way to compute the importance. However, it can fail in the case of highly collinear features, so be careful! It uses permutation_importance from scikit-learn.
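The idea behind permutation importance can be sketched by hand: shuffle one column, re-score the model, and record how much the error grows. The example below is an illustration only. It uses synthetic data and an ordinary least-squares fit as a stand-in for the XGBoost model, shuffles each column once (scikit-learn repeats the shuffle and averages), and scores on the training data for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on column 0 only; column 1 is pure noise.
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Stand-in model: ordinary least squares instead of an XGBoost model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)


def mse(features):
    """Mean squared error of the fitted linear model on the given features."""
    return float(np.mean((features @ coef - y) ** 2))


baseline = mse(X)
importance = []
for col in range(X.shape[1]):
    shuffled = X.copy()
    # Shuffle one column to break its link with y, then measure the damage.
    shuffled[:, col] = rng.permutation(shuffled[:, col])
    importance.append(mse(shuffled) - baseline)

print(importance)
```

Column 0 gets a large score because the model relies on it, while column 1 stays near zero. With two highly collinear columns, a model can fall back on the unshuffled twin, which is the failure case mentioned above.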
explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
To use the above code, you need to have the shap package installed.
I ran the example analysis on the Boston data (house price regression from scikit-learn). Below are the 3 feature importance plots:
All plots are for the same model! As you see, there is a difference in the results. I prefer permutation-based importance because I have a clear picture of which feature impacts the performance of the model (if there is no high collinearity).
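For anyone who runs into this problem while using xgb.XGBRegressor(), the workaround I used is to keep the data as a pandas.DataFrame() or numpy.array() rather than converting it to a DMatrix. Also, I had to make sure the gamma parameter was not specified for XGBRegressor.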
fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)
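After fitting, the regressor's fit.feature_importances_ returns an array of weights, which I assume is in the same order as the feature columns of the pandas dataframe.
My current setup is Ubuntu 16.04, Anaconda distribution, python 3.6, xgboost 0.6 and sklearn 18.1.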
Get a table containing the scores and feature names, then plot it.
feature_important = model.get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())
data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by = "score", ascending=False)
data.plot(kind='barh')