为什么“值”之和不等于 scikit-learn RandomForestClassifier 中“样本”的数量？

Question

为什么“值”之和不等于 scikit-learn RandomForestClassifier 中“样本”的数量？

Rya*_*yan 4 python machine-learning decision-tree random-forest scikit-learn

我通过 RandomForestClassifier 构建了一个随机森林并绘制了决策树。参数“值”（红色箭头所指）是什么意思？为什么[]中两个数字的总和不等于“样本”的数量？我看到了其他一些例子，[]中的两个数字之和等于“样本”的数量。为什么我的情况没有？

df = pd.read_csv("Dataset.csv")
df.drop(['Flow ID', 'Inbound'], axis=1, inplace=True)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace = True)
df.Label[df.Label == 'BENIGN'] = 0
df.Label[df.Label == 'DrDoS_LDAP'] = 1
Y = df["Label"].values
Y = Y.astype('int')
X = df.drop(labels = ["Label"], axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.5)
model = RandomForestClassifier(n_estimators = 20)
model.fit(X_train, Y_train)
Accuracy = model.score(X_test, Y_test)
        
for i in range(len(model.estimators_)):
    fig = plt.figure(figsize=(15,15))
    tree.plot_tree(model.estimators_[i], feature_names = df.columns, class_names = ['Benign', 'DDoS'])
    plt.savefig('.\\TheForest\\T'+str(i))

Run Code Online (Sandbox Code Playgroud)

Answer 1

des*_*aut 5

不错的收获。

虽然没有记录，但这是由于随机森林模型中默认进行引导采样（请参阅我的答案Why is Random Forest with a single tree much better than a Decision Tree classifier?有关 RF 算法详细信息及其差异的更多信息来自仅仅是“一堆”决策树）。

让我们看一个数据示例iris：

from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

rf = RandomForestClassifier(max_depth = 3)
rf.fit(iris.data, iris.target)

tree.plot_tree(rf.estimators_[0]) # take the first tree

Run Code Online (Sandbox Code Playgroud)

这里的结果与您报告的类似：对于除右下节点之外的所有其他节点，sum(value)都不等于samples，因为它应该是“简单”决策树的情况。

谨慎的观察者可能会注意到这里看起来很奇怪的其他事情：虽然 iris 数据集有 150 个样本：

print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class

Run Code Online (Sandbox Code Playgroud)

树的基节点应该包括所有这些，samples第一个节点的基节点只有 89。

为什么会这样？到底发生了什么？为了看看，让我们拟合第二个 RF 模型，这次没有引导采样（即使用bootstrap=False）：

rf2 = RandomForestClassifier(max_depth = 3, bootstrap=False) # no bootstrap sampling
rf2.fit(iris.data, iris.target)

tree.plot_tree(rf2.estimators_[0]) # take again the first tree

Run Code Online (Sandbox Code Playgroud)

value好吧，既然我们已经禁用了引导采样，一切看起来都“很好”：每个节点中的总和等于samples，并且基本节点确实包含整个数据集（150 个样本）。

因此，您描述的行为似乎确实是由于引导采样造成的，在创建带有替换的样本时（即最终为集合的每个单独的决策树提供了重复的样本），这些重复的样本不会反映sample在树节点，显示唯一样本的数量；尽管如此，它还是反映在节点中value。

这种情况与 RF 回归模型以及 Bagging 分类器完全类似 - 分别参见：

归档时间：	3 年，10 月前
查看次数：	958 次
最近记录：	3 年，9 月前