Understanding partial dependence for Gradient Boosted Regression Trees

Question

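I am working through a tutorial on partial dependence plots in Python. Neither the tutorial nor the documentation gives an equation. The documentation of the R function gives the formula I expected:

This does not seem to make sense for the results shown in the Python tutorial. If the plot shows an average of house price predictions, how can it be negative and so small? I would expect values in the millions. Am I missing something?

Update:

For regression, it seems that the mean is subtracted from the formula above. How can I add it back? For my trained model, I can get the partial dependence with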

from sklearn.ensemble.partial_dependence import partial_dependence
partial_dependence, independent_value = partial_dependence(model, features.index(independent_feature),X=df2[features])

I would like to add the mean back (?). Could I do this simply by calling model.predict() on df2 with the independent_feature values overwritten?

Answer 1

Dav*_*ale 6

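How the R formula works

The formula from R quoted in the question applies to randomForest. Each tree in a random forest tries to predict the target variable directly. The prediction of every tree therefore lies in the expected interval (in your case, all house prices are positive), and the prediction of the ensemble is just the average of all individual predictions.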

ensemble_prediction = mean(tree_predictions)

This is what the formula tells you: take the predictions of all trees at x and average them.
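As a quick check of the averaging rule, here is a minimal sketch. It uses scikit-learn's RandomForestRegressor as a stand-in for R's randomForest (an assumption; the two implementations differ in details), and the synthetic data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for this illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 10 * X[:, 0] + 50 + rng.normal(scale=5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# One row of predictions per tree, each already on the scale of y
tree_predictions = np.array([tree.predict(X) for tree in forest.estimators_])

# Averaging over trees should reproduce the ensemble prediction
manual_prediction = tree_predictions.mean(axis=0)
print(np.allclose(manual_prediction, forest.predict(X)))
```

Because every tree predicts the target directly, a partial dependence curve built from such a forest stays on the scale of y without any offset.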

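Why the Python PDP values are small

In sklearn, however, partial dependence is computed for a GradientBoostingRegressor. In gradient boosting, each tree predicts the derivative of the loss function at the current prediction, which is only indirectly related to the target variable. For GB regression, the prediction is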

ensemble_prediction = initial_prediction + sum(tree_predictions * learning_rate)

For GB classification, the predicted probability is

ensemble_prediction = softmax(initial_prediction + sum(tree_predictions * learning_rate))

In both cases, the partial dependence is reported as just

sum(tree_predictions * learning_rate)

Thus initial_prediction is not included in the PDP (for GradientBoostingRegressor(loss='ls') it simply equals the mean of the training y), which is why the reported values can be negative.
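The regression decomposition can be checked on a fitted model. The sketch below is an illustration on made-up data with the default squared-error loss. It assumes that model.init_ exposes a predict method and that model.estimators_ holds one regression tree per boosting stage, which may not hold for every scikit-learn version or loss.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, invented for this illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 2))
y = 10 * X[:, 0] + 50 + rng.normal(scale=5, size=500)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Constant starting point of the boosting process (the training mean of y here)
initial_prediction = np.ravel(model.init_.predict(X))

# Sum of the raw tree outputs, scaled by the learning rate;
# this is the part that the partial dependence is built from
tree_sum = sum(stage[0].predict(X) for stage in model.estimators_)
boosted_part = model.learning_rate * tree_sum

manual_prediction = initial_prediction + boosted_part
print(np.allclose(manual_prediction, model.predict(X)))
print(initial_prediction[0], boosted_part.mean())
```

In this setup the boosted part is centered around zero on the training data, so a curve built only from it takes both negative and positive values even though every house price is positive.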

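As for the small range of the values: y_train in your example is small as well. The mean house value is roughly 2, so house prices are probably expressed in millions.

How the sklearn formula actually works

I have already said that the value of the partial dependence function in sklearn is an average over all trees. There is one more adjustment: all irrelevant features are averaged away. To describe how the averaging is actually done, I will just quote the sklearn documentation:

For each value of the "target" features in the grid the partial dependence function needs to marginalize the predictions of a tree over all possible values of the "complement" features. In decision trees this function can be evaluated efficiently without reference to the training data. For each grid point a weighted tree traversal is performed: if a split node involves a "target" feature, the corresponding left or right branch is followed, otherwise both branches are followed, each branch being weighted by the fraction of training samples that entered that branch. Finally, the partial dependence is given by a weighted average of all visited leaves. For tree ensembles the results of each individual tree are again averaged.

If you are still not satisfied, see the source code.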
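The weighted traversal described in the quoted documentation can be sketched for a single fitted tree. This is a simplified illustration and not sklearn's actual implementation. The helper name tree_partial_dependence is hypothetical, it handles one target feature and one grid value at a time, and it reads the fitted structure from the tree_ attribute of a DecisionTreeRegressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_partial_dependence(tree, target_feature, grid_value):
    """Weighted traversal of one fitted regression tree (hypothetical helper)."""
    t = tree.tree_
    total = 0.0
    stack = [(0, 1.0)]  # (node id, weight carried into that node), starting at the root
    while stack:
        node, weight = stack.pop()
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:
            # Leaf: add its value, weighted by the share of samples routed here
            total += weight * t.value[node, 0, 0]
        elif t.feature[node] == target_feature:
            # Split on the target feature: follow only the matching branch
            child = left if grid_value <= t.threshold[node] else right
            stack.append((child, weight))
        else:
            # Split on a complement feature: follow both branches,
            # weighted by the fraction of training samples in each
            n = t.weighted_n_node_samples[node]
            stack.append((left, weight * t.weighted_n_node_samples[left] / n))
            stack.append((right, weight * t.weighted_n_node_samples[right] / n))
    return total

# Synthetic data, invented for this illustration only
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = 10 * X[:, 0] + 3 * X[:, 1] + 50 + rng.normal(scale=5, size=300)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

pd_at_zero = tree_partial_dependence(tree, target_feature=0, grid_value=0.0)
print(pd_at_zero)
```

No training rows are touched during the traversal; the sample counts stored in the tree stand in for the data distribution. For a boosted ensemble, the same traversal would be applied to every tree and the results combined.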

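An example

To see that the prediction is already on the scale of the dependent variable (and is merely centered), you can look at a very simple toy example: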

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble.partial_dependence import plot_partial_dependence

np.random.seed(1)
X = np.random.normal(size=[1000, 2])
# yes, I will try to fit a linear function!
y = X[:, 0] * 10 + 50 + np.random.normal(size=1000, scale=5) 
# mean target is 50, range is roughly 20 to 80, i.e. +/- 30 (about 3 standard deviations of 10 * X[:, 0])
model = GradientBoostingRegressor().fit(X, y)

fig, subplots = plot_partial_dependence(model, X, [0, 1], percentiles=(0.0, 1.0), n_cols=2)
subplots[0].scatter(X[:, 0], y - y.mean(), s=0.3)
subplots[1].scatter(X[:, 1], y - y.mean(), s=0.3)
plt.suptitle('Partial dependence plots and scatters of centered target')
plt.show()

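You can see that the partial dependence plots reflect the true distribution of the centered target variable quite well.

If you want not only the units but also the mean to coincide with your y, you have to add the "lost" mean to the result of the partial_dependence function and plot the result manually: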

from sklearn.ensemble.partial_dependence import partial_dependence
pdp_y, [pdp_x] = partial_dependence(model, X=X, target_variables=[0], percentiles=(0.0, 1.0))
plt.scatter(X[:, 0], y, s=0.3)
# add the initial prediction (the training mean of y) back to the centered PDP
plt.plot(pdp_x, pdp_y.ravel() + model.init_.mean)
# set the title before plt.show(), otherwise it does not appear on the figure
plt.title('Partial dependence plot in the original coordinates')
plt.show()
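The alternative raised in the question, calling model.predict() on copies of the data with one feature overwritten, also yields a curve in the original units, because predict() already includes the initial prediction. The sketch below shows this brute-force approach on made-up data. The helper name brute_force_pdp is hypothetical, and this is not the tree-traversal method quoted above, so the two curves need not match exactly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, invented for this illustration only
rng = np.random.RandomState(1)
X = rng.normal(size=(1000, 2))
y = 10 * X[:, 0] + 50 + rng.normal(scale=5, size=1000)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def brute_force_pdp(model, X, feature, grid):
    """Average full predictions with one feature forced to each grid value."""
    values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v  # overwrite the feature for every row
        values.append(model.predict(X_mod).mean())
    return np.array(values)

# Grid between the 5th and 95th percentile of the feature
grid = np.linspace(np.percentile(X[:, 0], 5), np.percentile(X[:, 0], 95), 20)
pdp = brute_force_pdp(model, X, feature=0, grid=grid)
print(pdp.min(), pdp.max())
```

This costs one full pass over the data per grid point, which is slower than the traversal but works with any model that has a predict method.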
