在模型调整中使用交叉验证,我从caret::train的results对象中获得不同的错误率,并自己计算其对象上的错误pred。我想了解它们为何不同,以及理想情况下如何使用折叠错误率进行模型选择、绘制模型性能等。
该pred对象包含折叠外的预测。该文档非常清楚,trainControl(..., savePredictions = "final")保存了最佳超参数值的折叠预测:“应保存每次重采样的保留预测量的指标......“最终”保存了最佳调整的预测参数。” (保留“所有”预测然后过滤到最佳调整值并不能解决问题。)
文档train说该results对象是“训练错误率的数据框......”我不确定这意味着什么,但最佳行的值始终与 上计算的指标不同pred。为什么它们不同以及如何使它们对齐?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Run Code Online (Sandbox Code Playgroud)
由reprex 包(v0.2.0)于 2018-04-09 创建。
交叉验证的 RMSE 不是按照您显示的方式计算的,而是针对每次折叠计算的,然后求平均值。完整示例:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
Run Code Online (Sandbox Code Playgroud)
最佳模型的 RMSE 为0.6106390
现在计算每次折叠的 RMSE 和平均值:
m$pred %>%
group_by(Resample) %>%
mutate(rmse = caret::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
m$pred %>%
group_by(Resample) %>%
mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
summarise(mean = mean(rmse)) %>%
pull(mean) %>%
mean
#output
0.610639
Run Code Online (Sandbox Code Playgroud)