Why do R gbm model predictions not match the model fit?

Tags: r, machine-learning, predict, r-caret

I'm fitting a gbm model with caret. When I call trainedGBM$finalModel$fit, the output looks correct.

But when I call predict(trainedGBM$finalModel, origData, type="response"), I get very different results, and predict(trainedGBM$finalModel, type="response") produces yet another set of results, even though origData is attached. To my mind these calls should produce identical output. Can someone help me identify the problem?

library(caret)
library(gbm)

attach(origData)
gbmGrid <- expand.grid(.n.trees = c(2000), 
                       .interaction.depth = c(14:20), 
                       .shrinkage = c(0.005))
trainedGBM <- train(y ~ ., method = "gbm", distribution = "gaussian", 
                    data = origData, tuneGrid = gbmGrid, 
                    trControl = trainControl(method = "repeatedcv", number = 10, 
                                             repeats = 3, verboseIter = FALSE, 
                                             returnResamp = "all"))
ntrees <- gbm.perf(trainedGBM$finalModel, method="OOB")
data.frame(y, 
           finalModelFit = trainedGBM$finalModel$fit, 
           predictDataSpec = predict(trainedGBM$finalModel, origData, type="response", n.trees=ntrees), 
           predictNoDataSpec = predict(trainedGBM$finalModel, type="response", n.trees=ntrees))

The code above produces the following partial results:

   y finalModelFit predictDataSpec predictNoDataSpec
9000     6138.8920        2387.182          2645.993
5000     3850.8817        2767.990          2467.157
3000     3533.1183        2753.551          2044.578
2500     1362.9802        2672.484          1972.361
1500     5080.2112        2449.185          2000.568
 750     2284.8188        2728.829          2063.829
1500     2672.0146        2359.566          2344.451
5000     3340.5828        2435.137          2093.939
   0     1303.9898        2377.770          2041.871
 500      879.9798        2691.886          2034.307
3000     2928.4573        2327.627          1908.876

Answer by Jim M. (7 votes)

In your gbmGrid, only the interaction depth varies (from 14 to 20); the shrinkage and the number of trees are fixed at 0.005 and 2000. So the tuning in trainedGBM can only find the best interaction depth under those settings. Your ntrees computed by gbm.perf then asks: given the best interaction depth between 14 and 20, what is the optimal number of trees according to the OOB criterion? Because predictions depend on the number of trees in the model, trainedGBM$finalModel$fit is based on the full n.trees = 2000 used in training, while a prediction that passes n.trees = ntrees uses only the (smaller) number of trees estimated by gbm.perf. That explains the difference between trainedGBM$finalModel$fit and predict(trainedGBM$finalModel, type="response", n.trees=ntrees).
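A quick way to confirm this (a sketch; it assumes the trainedGBM, origData, and ntrees objects from the question's code are in scope):

```r
# Sketch, assuming trainedGBM, origData and ntrees from the question.
# finalModel$fit is the training-set fit using all 2000 trees, so
# predicting with n.trees = 2000 should reproduce it:
p2000 <- predict(trainedGBM$finalModel, origData, type = "response",
                 n.trees = 2000)
all.equal(unname(p2000), unname(trainedGBM$finalModel$fit))
# Passing n.trees = ntrees instead uses only the OOB-estimated number
# of trees, which is what produced the predictDataSpec column above.
```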

An example based on the iris dataset, using gbm as a classification rather than a regression model:

library(caret)
library(gbm)

set.seed(42)

gbmGrid <- expand.grid(.n.trees = 100, 
                       .interaction.depth = 1:4, 
                       .shrinkage = 0.05)

trainedGBM <- train(Species ~ ., method = "gbm", distribution = 'multinomial',
                    data = iris, tuneGrid = gbmGrid, 
                    trControl = trainControl(method = "repeatedcv", number = 10, 
                                             repeats = 3, verboseIter = FALSE, 
                                             returnResamp = "all"))
print(trainedGBM)        

# Resampling results across tuning parameters:

#  interaction.depth  Accuracy  Kappa  Accuracy SD  Kappa SD
#   1                  0.947     0.92   0.0407       0.061   
#   2                  0.947     0.92   0.0407       0.061   
#   3                  0.944     0.917  0.0432       0.0648  
#   4                  0.944     0.917  0.0395       0.0592  

# Tuning parameter 'n.trees' was held constant at a value of 100
# Tuning parameter 'shrinkage' was held constant at a value of 0.05
# Accuracy was used to select the optimal model using  the largest value.
# The final values used for the model were interaction.depth = 1, n.trees = 100
# and shrinkage = 0.05.     

Find the optimal number of trees given the best interaction depth:

ntrees <- gbm.perf(trainedGBM$finalModel, method="OOB")
# Giving ntrees = 50
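The effect of n.trees on predictions can also be seen without caret (a self-contained sketch calling gbm directly on iris, with parameters mirroring the grid above):

```r
library(gbm)

set.seed(42)
# Fit the same kind of model directly with gbm:
fit <- gbm(Species ~ ., data = iris, distribution = "multinomial",
           n.trees = 100, interaction.depth = 1, shrinkage = 0.05)
ntrees.oob <- gbm.perf(fit, method = "OOB", plot.it = FALSE)

# Predicted class probabilities change with n.trees -- the same effect
# that made fit and predict disagree in the question:
p.oob  <- predict(fit, iris, n.trees = ntrees.oob, type = "response")
p.full <- predict(fit, iris, n.trees = 100, type = "response")
max(abs(p.oob - p.full))  # nonzero
```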

If instead we train the model varying both the number of trees and the interaction depth:

gbmGrid2 <- expand.grid(.n.trees = 1:100, 
                        .interaction.depth = 1:4, 
                        .shrinkage = 0.05)

trainedGBM2 <- train(Species ~ ., method = "gbm", 
                     data = iris, tuneGrid = gbmGrid2, 
                     trControl = trainControl(method = "repeatedcv", number = 10, 
                                              repeats = 3, verboseIter = FALSE, 
                                              returnResamp = "all"))

print(trainedGBM2) 

# Tuning parameter 'shrinkage' was held constant at a value of 0.05
# Accuracy was used to select the optimal model using  the largest value.
# The final values used for the model were interaction.depth = 2, n.trees = 39
# and shrinkage = 0.05. 

Note that when both the number of trees and the interaction depth are allowed to vary, the optimal number of trees (39) is quite close to the number computed by gbm.perf (50).
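That closeness can be checked directly (a sketch, assuming trainedGBM2 and ntrees from the code above are in scope):

```r
# caret's cross-validated optimum vs. gbm's OOB estimate:
trainedGBM2$bestTune$n.trees  # 39 in the run above
ntrees                        # 50 from gbm.perf
# Both are far below the 2000 trees fixed in the question's grid,
# which is why $fit (2000 trees) and predict(..., n.trees = ntrees)
# disagreed there.
```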