How is Cover calculated in xgboost?

Question

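Can someone explain how the Cover column in the xgboost R package is calculated in the xgb.model.dt.tree function?

The documentation says that Cover "is a metric to measure the number of observations affected by the split".

When you run the following code, given in the xgboost documentation for this function, Cover for node 0 of tree 0 is 1628.2500.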

library(xgboost)
data(agaricus.train, package='xgboost')

# Both datasets are lists with two items: a sparse matrix and labels
# (labels = outcome column which will be learned).
# Each column of the sparse matrix is a feature in one-hot encoding format.
train <- agaricus.train

bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")

# agaricus.train$data@Dimnames[[2]] holds the column names of the sparse matrix.
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)

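There are 6513 observations in the train dataset, so can anyone explain why Cover for node 0 of tree 0 is a quarter of this number (1628.25)?

Also, Cover for node 1 of tree 1 is 788.852 - how is this number calculated?

Any help would be much appreciated. Thanks.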

Answer 1

T. *_*arf 22

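Cover is defined in xgboost as:

the sum of second order gradient of training data classified to the leaf, if it is square loss, this simply corresponds to the number of instances in that branch. Deeper in the tree a node is, lower this metric will be

https://github.com/dmlc/xgboost/blob/f5659e17d5200bd7471a2e735177a81cb8d3012b/R-package/man/xgb.plot.tree.Rd Not particularly well documented....

To calculate the cover, we need to know the predictions at that point in the tree, and the second derivative of the loss function with respect to those predictions.

Luckily, the prediction for every data point (all 6513 of them) in the 0-0 node of your example is .5. This is a global default setting: your first prediction, at t = 0, is .5.

base_score [default = 0.5] the initial prediction score of all instances, global bias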

http://xgboost.readthedocs.org/en/latest/parameter.html

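The gradient of binary logistic (which is your objective function) is p - y, where p = your prediction and y = the true label.

Thus, the hessian (which is what we need here) is p*(1-p). Note: the hessian can be determined without y, the true labels.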
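For readers who want the intermediate step, the two derivatives can be worked out as below. This is a sketch that assumes the loss is the log loss written in terms of the raw score z (the log-odds), with p = σ(z); the symbols z, σ and ℓ are notation introduced here and do not appear in the original answer.

```latex
\begin{aligned}
p &= \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \frac{\partial p}{\partial z} = p\,(1-p) \\
\ell(y, z) &= -\bigl[\, y \log p + (1-y)\log(1-p) \,\bigr] \\
\frac{\partial \ell}{\partial z} &= p - y \\
\frac{\partial^2 \ell}{\partial z^2} &= p\,(1-p)
\end{aligned}
```

The label y drops out of the second derivative, which is why the cover depends only on the current predictions.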

So (bringing it home):

6513*(.5)*(1 - .5)= 1628.25
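The same arithmetic can be reproduced in base R without fitting a model. This is a minimal sketch, assuming the root of tree 0 contains all 6513 rows and every row starts at the default base_score of 0.5; the variable names are illustrative only.

```r
# Assumptions: 6513 training rows, all starting at base_score = 0.5.
n_obs <- 6513
p0 <- rep(0.5, n_obs)        # initial prediction for every row

# Hessian of the binary logistic loss for each row: p * (1 - p).
hess <- p0 * (1 - p0)

# Cover of the root node = sum of the hessians of the rows it contains.
cover_root <- sum(hess)
print(cover_root)            # 1628.25
```

Summing per-row hessians rather than multiplying by the row count keeps the sketch valid for later trees, where the predictions differ from row to row.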

In the second tree, the predictions at that point are no longer all .5, so let's get the predictions after one tree

p <- predict(bst, newdata = train$data, ntreelimit = 1)  # use only the first tree

head(p)
[1] 0.8471184 0.1544077 0.1544077 0.8471184 0.1255700 0.1544077

sum(p*(1-p))  # sum of the hessians in that node (the root node has all the data)
[1] 788.8521
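The drop from 1628.25 to 788.8521 follows from the shape of p*(1-p): it peaks at 0.25 when p = 0.5 and shrinks as predictions move toward 0 or 1. The sketch below illustrates this using only the three distinct values visible in the head(p) output above; it is not a recomputation over the full dataset.

```r
# Distinct predictions copied from the head(p) output above.
p <- c(0.8471184, 0.1544077, 0.1255700)

# Per-row hessian of the binary logistic loss.
hess <- p * (1 - p)
print(hess)

# Every hessian is below the 0.25 that each row contributed in tree 0.
print(all(hess < 0.25))
```

Because each row now contributes less than 0.25, the root of the second tree has a smaller cover even though it still contains all 6513 rows.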

Note that for linear (squared error) regression the hessian is always one, so the cover indicates how many examples are in that leaf.
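A small sketch of the squared-error case, assuming the loss is written as 0.5 * (y_hat - y)^2 so that the second derivative is exactly 1. The predictions and labels below are made up for illustration.

```r
# Hypothetical regression predictions and labels, for illustration only.
y_hat <- c(2.0, 3.5, 1.0, 4.2)
y     <- c(1.5, 3.0, 2.0, 4.0)

# Squared-error loss 0.5 * (y_hat - y)^2 (assumed scaling):
grad <- y_hat - y                 # first derivative w.r.t. y_hat
hess <- rep(1, length(y_hat))     # second derivative is constant 1

# Cover = sum of hessians, which here equals the number of rows in the node.
cover <- sum(hess)
print(cover == length(y))
```

Unlike the logistic case, the hessian here does not depend on the prediction, so the cover of a node stays equal to its row count in every boosting round.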

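The big takeaway is that cover is defined by the hessian of the objective function. There is plenty of material available on deriving the gradient and hessian of the binary logistic function.

These slides are helpful for seeing why he uses hessians as a weighting, and they also explain how xgboost splits differently from standard trees. https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf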
