Print feature importance in percentage

Question

Ole*_*siy 2 python feature-selection lightgbm

I fitted a basic LGBM model in Python.

from lightgbm import LGBMRegressor

# Create an instance
LGBM = LGBMRegressor(random_state=123, importance_type='gain')  # `split` can also be selected here

# Fit the model (subset of data)
LGBM.fit(X_train_subset, y_train_subset)

# Predict y_pred
y_pred = LGBM.predict(X_test)

I was looking at the documentation:

importance_type (string, optional (default=“split”)) – How the importance is calculated. If “split”, the result contains the number of times the feature is used in the model. If “gain”, the result contains the total gains of the splits that use the feature.
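
As a rough illustration of the two modes described in that passage, the sketch below computes both from a hand-written list of splits. The split records and the helper name `summarize_importance` are hypothetical and exist only for this example; this is not how LightGBM stores or walks its trees internally.

```python
from collections import defaultdict

# Hypothetical split records: (feature used by the split, gain of that split).
# The numbers are made up for illustration only.
splits = [
    ("home", 120.0),
    ("PA_avg", 30.0),
    ("home", 80.0),
    ("next_home", 300.0),
]


def summarize_importance(split_records):
    """Return (split_counts, total_gains), each keyed by feature name."""
    split_counts = defaultdict(int)
    total_gains = defaultdict(float)
    for feature, gain in split_records:
        split_counts[feature] += 1    # "split": how often the feature is used
        total_gains[feature] += gain  # "gain": sum of gains over those splits
    return dict(split_counts), dict(total_gains)


counts, gains = summarize_importance(splits)
print(counts)  # {'home': 2, 'PA_avg': 1, 'next_home': 1}
print(gains)   # {'home': 200.0, 'PA_avg': 30.0, 'next_home': 300.0}
```

In this toy data "home" is used most often, yet "next_home" has the largest total gain, so the two modes can rank features differently.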

I used gain, and it printed the total gains.

import pandas as pd

# Print features by importance
pd.DataFrame([X_train.columns, LGBM.feature_importances_]).T.sort_values([1], ascending=[True])

                     0        1
59           SLG_avg_p        0
4               PA_avg   2995.8
0                 home  5198.55
26           next_home  11824.2
67  first_time_pitcher  15042.1
etc

I tried:

import matplotlib.pyplot as plt

# get importance
importance = LGBM.feature_importances_
# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.show()

and got the values and a plot:

Feature: 0, Score: 5198.55005
Feature: 1, Score: 20688.87198
Feature: 2, Score: 49147.90228
Feature: 3, Score: 71734.03088
etc

I also tried:

# feature importance
print(LGBM.feature_importances_)
# plot
plt.bar(range(len(LGBM.feature_importances_)), LGBM.feature_importances_)
plt.show()

How can I print the importances of this model as percentages? For some reason, I was sure they were calculated automatically.

Answer 1

Dav*_* M. 7

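The percentage option is available in the R version but not in the Python one. In Python you can do the following (using a made-up example, since I do not have your data):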

from sklearn.datasets import make_regression
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
import pandas as pd

X, y = make_regression(n_samples=1000, n_features=10, n_informative=10, random_state=1)
feature_names = [f'Feature {i+1}' for i in range(10)]
X = pd.DataFrame(X, columns=feature_names)

model = LGBMRegressor(importance_type='gain')
model.fit(X, y)

feature_importances = (model.feature_importances_ / sum(model.feature_importances_)) * 100

results = pd.DataFrame({'Features': feature_names,
                        'Importances': feature_importances})
results.sort_values(by='Importances', inplace=True)

ax = plt.barh(results['Features'], results['Importances'])
plt.xlabel('Importance percentages')
plt.show()

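Output: a horizontal bar chart of the features, sorted by importance, with the x-axis labeled "Importance percentages".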
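
Since the question asks about printing rather than plotting, here is a minimal sketch of the same normalization as a text report. The helper name `importance_percentages` and the sample scores are assumptions made up for illustration; in practice you would pass your own column names and `model.feature_importances_`.

```python
def importance_percentages(names, scores):
    """Pair each feature name with its share of the total importance, in percent."""
    scores = [float(s) for s in scores]  # also accepts a NumPy array
    total = sum(scores)
    if total == 0:
        # All importances are zero: avoid dividing by zero.
        return [(name, 0.0) for name in names]
    shares = [(name, 100.0 * s / total) for name, s in zip(names, scores)]
    # Largest share first
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration; in practice pass
# X.columns and model.feature_importances_ instead.
names = ["Feature 1", "Feature 2", "Feature 3", "Feature 4"]
scores = [50.0, 150.0, 0.0, 300.0]

report = importance_percentages(names, scores)
for name, pct in report:
    print(f"{name}: {pct:.2f}%")
# Feature 4: 60.00%
# Feature 2: 30.00%
# Feature 1: 10.00%
# Feature 3: 0.00%
```

The percentages sum to 100 whether the model was built with importance_type='gain' or 'split', because only each score's ratio to the total is used.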

Archived:	5 years, 3 months ago
Views:	5846
Last activity:	5 years, 2 months ago