Random forest feature importance vs. correlation matrix

Question

ran*_*ent 1 python machine-learning scikit-learn

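I want to look at the correlations between my variables. First I used a correlation matrix, which showed me the correlation between every pair of variables. Then I built a random forest regressor model. In an article I found that it has a feature_importances_ attribute, which was described as giving the correlation between the independent variables and the dependent variable. I tried it, and it seems to show the same correlation values as the correlation matrix. So my question is: what is the difference between a correlation matrix and random forest feature importance?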

Answer 1

ASH*_*ASH 5

Take a look at the code below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
# Note: load_boston was removed in scikit-learn 1.2, so this needs an older version
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

# Load the Boston housing dataset as an example
boston = load_boston()

X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

# Fit a random forest and print the features ranked by importance
reg = RandomForestRegressor()
reg.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda x: round(x, 4), reg.feature_importances_), names),
             reverse=True))

# Put the features into a DataFrame with named columns
boston_pd = pd.DataFrame(boston.data, columns=boston.feature_names)
print(boston_pd.head())

# Correlation matrix between the features, drawn as an annotated heatmap
corrMatrix = boston_pd.corr()
sn.heatmap(corrMatrix, annot=True)
plt.show()


# Plot the importances from the fitted model (reg) as a horizontal bar chart
features = boston.feature_names
importances = reg.feature_importances_
indices = np.argsort(importances)  # ascending, so the largest bar ends up on top

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='#8f63f4', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()


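So, feature selection often relies on correlation analysis to determine the best features to use: which features (independent variables) have the greatest statistical influence in helping determine the target variable (dependent variable). Correlation is a statistical term that describes how close the relationship between two variables is to linear. Feature selection is one of the first steps in any machine learning task, and arguably one of the most important. A feature in a dataset is a column of data. When working with any dataset, we have to understand which columns (features) will have a statistically significant effect on the output variable. Adding many irrelevant features to a model only makes it worse (Garbage In, Garbage Out). That is why we do feature selection. Pearson correlation is a very popular way to measure how relevant each independent variable is to the target variable (dependent variable).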
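Since Pearson correlation only captures linear relationships, it can disagree with the forest's importances. Below is a minimal sketch on synthetic data (not the Boston dataset). The column names `linear`, `quadratic` and `noise`, and the formula for `y`, are made up purely for illustration. It puts each feature's Pearson correlation with the target next to its impurity-based importance from `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration, not from the question)
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "linear": rng.normal(size=n),             # affects y linearly
    "quadratic": rng.uniform(-3, 3, size=n),  # affects y through its square
    "noise": rng.normal(size=n),              # does not affect y at all
})
y = 2.0 * X["linear"] + X["quadratic"] ** 2 + rng.normal(scale=0.1, size=n)

# Pearson correlation of each feature with the target (linear association only)
pearson = X.corrwith(y)

# Impurity-based importances from a fitted random forest
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)
importances = pd.Series(reg.feature_importances_, index=X.columns)

# Side-by-side comparison of the two measures
comparison = pd.DataFrame({"pearson_r": pearson, "rf_importance": importances})
print(comparison.round(3))
```

In this setup the `quadratic` column should have a Pearson correlation close to zero, because its effect on `y` is symmetric around zero, while the forest still gives it a large importance. The two numbers also differ in kind: a correlation is signed and lies between -1 and 1, whereas the importances are non-negative and sum to 1.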
