Pet*_*wth 7 r poisson offset xgboost
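I am trying to use XGBoost to model claim frequency on data generated from exposure periods of unequal length, but I cannot get the model to treat exposure correctly. I would normally do this by setting log(exposure) as an offset. Is it possible to do this in XGBoost?
(A similar question was posted here: xgboost, offsetting exposure?)
To illustrate the problem, the R code below generates some data with the fields x1, x2, exposure, frequency and claims.
The goal is to predict frequency using x1 and x2. The true model is: frequency = 2 if x1 = x2 = 1, and frequency = 1 otherwise.
Exposure cannot be used to predict frequency, because it is not known at the start of a policy. The only way we can use it is through the relationship: expected number of claims = frequency * exposure.
The code tries to predict this with XGBoost in two ways: (1) setting exposure as a weight in the xgb.DMatrix, and (2) setting log(exposure) as an offset in the model matrix.
Below these, I have shown how I would handle the situation with a tree (rpart) or a gbm.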
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE, prob = c(0.5, 0.5)),
  x2 = sample(c(0, 1), size, replace = TRUE, prob = c(0.5, 0.5)),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)
#### Try to fit using XGBoost
require(xgboost)
param0 <- list(
  "objective" = "count:poisson",
  "eval_metric" = "logloss",
  "eta" = 1,
  "subsample" = 1,
  "colsample_bytree" = 1,
  "min_child_weight" = 1,
  "max_depth" = 2
)
## 1 - set exposure as weight in xgb.DMatrix
xgtrain <- xgb.DMatrix(as.matrix(d[, c("x1", "x2")]), label = d$claims, weight = d$exposure)
xgb <- xgb.train(
  nrounds = 1,
  params = param0,
  data = xgtrain
)
d$XGB_P_1 <- predict(xgb, xgtrain)
## 2 - set as offset in xgb.DMatrix (via the model formula)
xgtrain.mf <- model.frame(as.formula("claims~x1+x2+offset(log(exposure))"), d)
xgtrain.m <- model.matrix(attr(xgtrain.mf, "terms"), data = d)
xgtrain <- xgb.DMatrix(xgtrain.m, label = d$claims)
xgb <- xgb.train(
  nrounds = 1,
  params = param0,
  data = xgtrain
)
# predict with the model fitted just above (the original referred to an undefined `model`)
d$XGB_P_2 <- predict(xgb, xgtrain)
#### Fit a tree
require(rpart)
d[,"tree_response"] <- cbind(d$exposure, d$claims)
tree <- rpart(tree_response ~ x1 + x2,
              data = d,
              method = "poisson")
d$Tree_F <- predict(tree, newdata = d)
#### Fit a GBM
require(gbm)
gbm <- gbm(claims ~ x1 + x2 + offset(log(exposure)),
           data = d,
           distribution = "poisson",
           n.trees = 1,
           shrinkage = 1,
           interaction.depth = 2,
           bag.fraction = 0.5)
d$GBM_F <- predict(gbm, newdata = d, n.trees = 1, type = "response")
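At least with the glm function in R, modeling count ~ x1 + x2 + offset(log(exposure)) with family=poisson(link='log') is equivalent to modeling I(count/exposure) ~ x1 + x2 with family=poisson(link='log') and weights=exposure. That is, normalize your counts by exposure to get a frequency, and model the frequency with exposure as the weights. When using glm for Poisson regression, your estimated coefficients should be the same in both cases. Try it for yourself with a sample data set.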
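The equivalence can be checked with base R alone. The sketch below regenerates data in the same way as the question (the variable names `fit_offset` and `fit_rate` are just illustrative) and fits both formulations. `suppressWarnings` is used only because the Poisson family warns about non-integer responses when fitting rates.

```r
# Simulate data as in the question: frequency is 2 when x1 = x2 = 1, else 1
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

# Formulation A: model counts, with log(exposure) as an offset
fit_offset <- glm(claims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson(link = "log"), data = d)

# Formulation B: model the rate (claims / exposure), weighted by exposure
fit_rate <- suppressWarnings(
  glm(I(claims / exposure) ~ x1 + x2,
      family = poisson(link = "log"), weights = exposure, data = d)
)

# Show the two sets of coefficients side by side and compare them
print(cbind(offset = coef(fit_offset), weighted_rate = coef(fit_rate)))
print(all.equal(coef(fit_offset), coef(fit_rate), tolerance = 1e-6))
```

The two fits agree because the Poisson score equations coincide: weighting the residual (claims/exposure - rate) by exposure gives the same term as the unweighted residual (claims - exposure * rate).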
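I am not entirely sure what objective='count:poisson' corresponds to, but I would expect that setting your target variable to frequency (count/exposure) and using exposure as the weight in xgboost would be the way to go when exposures vary.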
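A minimal sketch of that suggestion, assuming the xgboost R package is installed. The data generation is repeated from the question so the block runs on its own. The settings `eta = 0.3` and `nrounds = 100` are illustrative choices rather than values from the original post, and the names `dtrain`, `bst` and `xgb_freq` are hypothetical.

```r
library(xgboost)

# Simulate data as in the question: frequency is 2 when x1 = x2 = 1, else 1
set.seed(1)
size <- 10000
d <- data.frame(
  x1 = sample(c(0, 1), size, replace = TRUE),
  x2 = sample(c(0, 1), size, replace = TRUE),
  exposure = runif(size, 1, 10) * 0.3
)
d$frequency <- 2^(d$x1 == 1 & d$x2 == 1)
d$claims <- rpois(size, lambda = d$frequency * d$exposure)

# Label is the observed rate (claims per unit exposure); weight is the exposure
dtrain <- xgb.DMatrix(
  as.matrix(d[, c("x1", "x2")]),
  label = d$claims / d$exposure,
  weight = d$exposure
)

# Assumed, illustrative settings: enough rounds for the leaf values to settle
bst <- xgb.train(
  params = list(objective = "count:poisson", eta = 0.3, max_depth = 2),
  data = dtrain,
  nrounds = 100
)

# With count:poisson, predictions are on the response scale, i.e. a frequency
d$xgb_freq <- predict(bst, dtrain)
print(aggregate(xgb_freq ~ x1 + x2, data = d, FUN = mean))
```

If this works as expected, the predicted frequency should be close to 2 for the x1 = x2 = 1 cell and close to 1 elsewhere. To predict a claim count for a policy, multiply the predicted frequency by that policy's exposure.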