How to get feature importance in xgboost?

mod*_*kzs 17 python xgboost

I'm using xgboost to build a model, and I try to find the importance of each feature using get_fscore(), but it returns {}.

My training code is:

import xgboost as xgb

# X, Y: training features and labels
dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)

Is there something wrong with how I train? And how do I get the feature importance in xgboost?

MLK*_*ing 19

From your code you can get the feature importance of each feature in dict form:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: the get_score() method of the booster returned by the train() API is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str (optional)) – the name of the feature map file.
  • importance_type
    • 'weight' – the number of times a feature is used to split the data across all trees.
    • 'gain' – the average gain across all splits where the feature is used.
    • 'cover' – the average coverage across all splits where the feature is used.
    • 'total_gain' – the total gain across all splits where the feature is used.
    • 'total_cover' – the total coverage across all splits where the feature is used.

https://xgboost.readthedocs.io/en/latest/python/python_api.html
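
For illustration, here is a minimal sketch (reusing the bst trained in the question; total_gain/total_cover need a recent xgboost, as listed above) that prints every importance type side by side:

# Compare all importance types for the same trained booster `bst`
for imp_type in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print(imp_type, bst.get_score(importance_type=imp_type))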

  • @arash You need to use `bst.get_booster().get_score(importance_type='gain')` instead (4 upvotes)
  • Why do I get the following error: AttributeError: 'XGBClassifier' object has no attribute 'get_score' @MLKing (3 upvotes)

Ses*_*ism 12

With the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For this to work properly, X must be a pandas.DataFrame when you call regr.fit (or clf.fit).
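
As a minimal end-to-end sketch (the toy columns age and income are made up for illustration), this shows why the DataFrame matters: the booster picks up its feature names from the DataFrame columns:

import pandas as pd
from xgboost import XGBRegressor

# Hypothetical toy data; the feature names come from the DataFrame columns
X = pd.DataFrame({'age': [23, 42, 31, 56], 'income': [10, 40, 25, 60]})
y = [0, 1, 0, 1]

regr = XGBRegressor(n_estimators=10)
regr.fit(X, y)

# Keys are 'age'/'income' instead of the generic f0/f1
print(regr.get_booster().get_score(importance_type="gain"))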

  • For some reason xgboost seems to have broken model.feature_importances_, so this is what I was looking for. Thank you. (2 upvotes)

小智 7

Try this:

fscore = clf.best_estimator_.booster().get_fscore()
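
Note that booster() was renamed in later xgboost releases; if the call above raises an AttributeError, the equivalent on newer versions should be:

fscore = clf.best_estimator_.get_booster().get_fscore()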


Kir*_*tov 7

I don't know about getting the values themselves, but there is a good way to plot feature importance:

import xgboost as xgb
import matplotlib.pyplot as plt

model = xgb.train(params, d_train, 1000, watchlist)
fig, ax = plt.subplots(figsize=(12, 18))
xgb.plot_importance(model, max_num_features=50, height=0.8, ax=ax)
plt.show()


Roo*_*beh 6

For feature importance, try this:

Classification:

import pandas as pd
pd.DataFrame(bst.get_fscore().items(), columns=['feature','importance']).sort_values('importance', ascending=False)

Regression:

xgb.plot_importance(bst)

  • None of these solutions currently work for me. For some reason the model loses the feature names and returns an empty dict. (3 upvotes)

Ste*_* Hu 6

First, build the model with XGBoost:

from xgboost import XGBClassifier, plot_importance

# train: feature DataFrame, label: target values
model = XGBClassifier()
model.fit(train, label)

model.feature_importances_ is an array, so we can sort it in descending order:

import numpy as np

sorted_idx = np.argsort(model.feature_importances_)[::-1]

Then we can print all the sorted importances together with the column names (I assume the data was loaded with Pandas):

for index in sorted_idx:
    print([train.columns[index], model.feature_importances_[index]]) 

Also, we can plot the importances with XGBoost's built-in function:

from matplotlib import pyplot

plot_importance(model, max_num_features=15)
pyplot.show()

If needed, max_num_features can be passed to plot_importance to limit the number of features shown.


ppl*_*ski 6

According to this post there are 3 different ways to get feature importance from Xgboost:

  • use built-in feature importance,
  • use permutation based importance,
  • use shap based importance.

Built-in feature importance

Code example:

import matplotlib.pyplot as plt
from xgboost import XGBRegressor

# boston: the scikit-learn Boston housing data, split into X_train/y_train
xgb = XGBRegressor(n_estimators=100)
xgb.fit(X_train, y_train)
sorted_idx = xgb.feature_importances_.argsort()
plt.barh(boston.feature_names[sorted_idx], xgb.feature_importances_[sorted_idx])
plt.xlabel("Xgboost Feature Importance")

Please be aware of what type of feature importance you are using. There are several types of importance, see the docs. The scikit-learn-like API of Xgboost returns gain importance while get_fscore returns weight type.
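
If you want the sklearn-style feature_importances_ to report a different type, recent xgboost versions accept importance_type in the estimator constructor (an assumption; check your version's docs), e.g.:

from xgboost import XGBRegressor

# importance_type switches what feature_importances_ reports
# ('gain', 'weight', 'cover', ...); assumes a recent xgboost release
xgb_w = XGBRegressor(n_estimators=100, importance_type='weight')
xgb_w.fit(X_train, y_train)
print(xgb_w.feature_importances_)  # weight-based instead of gain-based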

Permutation based importance

from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(xgb, X_test, y_test)
sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(boston.feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")

This is my preferred way to compute the importance. However, it can fail in the case of highly collinear features, so be careful! It's using permutation_importance from scikit-learn.

SHAP based importance

import shap

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To use the above code, you need to have the shap package installed.
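
If you want the SHAP importances as numbers rather than a plot, one common approach (a sketch, not from the original post) is to average the absolute SHAP values per feature, which is the quantity the bar plot shows:

import numpy as np

# Mean absolute SHAP value per feature, sorted descending
mean_abs_shap = np.abs(shap_values).mean(axis=0)
for name, value in sorted(zip(boston.feature_names, mean_abs_shap),
                          key=lambda pair: -pair[1]):
    print(name, value)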

I was running the example analysis on the Boston data (house price regression from scikit-learn). Below are the 3 feature importances:

Built-in importance

[image: built-in xgboost importance plot]

Permutation based importance

[image: permutation importance plot]

SHAP importance

[image: SHAP importance plot]

All plots are for the same model! As you can see, there is a difference in the results. I prefer permutation-based importance because it gives me a clear picture of which features impact the performance of the model (provided there is no high collinearity).


BCR*_*BCR 5

For anyone who runs into this issue while using xgb.XGBRegressor(), the workaround is to keep the data as a pandas.DataFrame() or numpy.array() instead of converting it to a dmatrix(). Also, I had to make sure the gamma parameter was not specified for the XGBRegressor.

import pandas as pd

fit = alg.fit(dtrain[ft_cols].values, dtrain['y'].values)  # alg: an xgb.XGBRegressor
ft_weights = pd.DataFrame(fit.feature_importances_, columns=['weights'], index=ft_cols)

After fitting, fit.feature_importances_ returns an array of weights, which I assume is in the same order as the feature columns of the pandas DataFrame.

My current setup is Ubuntu 16.04, Anaconda distribution, Python 3.6, xgboost 0.6, and sklearn 18.1.


Cat*_*lts 5

Get a table containing the feature names with their scores, then plot it:

import pandas as pd

feature_important = model.get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.plot(kind='barh')

For example:

[image: horizontal bar chart of the feature scores]