Dav*_*vid 4 r resampling cross-validation r-caret
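I really like using caret, at least for the early stages of modeling, especially because it makes resampling methods so easy to use. However, I am working on a model where the training set has had a fair number of cases added via semi-supervised self-training, and my cross-validation results are badly skewed because of it. My solution is to use a validation set to measure model performance, but I cannot see a way to use a validation set directly within caret. Am I missing something, or is this just not supported? I know I could write my own wrappers to do some of what caret would normally do for me, but it would be really nice if there were a workaround that avoids that.
Here is a trivial example of what I am experiencing: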
> library(caret)
> set.seed(1)
>
> #training/validation sets
> i <- sample(150,50)
> train <- iris[-i,]
> valid <- iris[i,]
>
> #make my model
> tc <- trainControl(method="cv")
> model.rf <- train(Species ~ ., data=train,method="rf",trControl=tc)
>
> #model parameters are selected using CV results...
> model.rf
100 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 90, 90, 90, 89, 90, 92, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa Accuracy SD Kappa SD
2 0.971 0.956 0.0469 0.0717
3 0.971 0.956 0.0469 0.0717
4 0.971 0.956 0.0469 0.0717
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
>
> #have to manually check validation set
> valid.pred <- predict(model.rf,valid)
> table(valid.pred,valid$Species)
valid.pred setosa versicolor virginica
setosa 17 0 0
versicolor 0 20 1
virginica 0 2 10
> mean(valid.pred==valid$Species)
[1] 0.94
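I had originally thought I could do this by creating a custom summaryFunction() for the trainControl() object, but I cannot see how to reference my model object to get predictions from the validation set (the documentation, http://caret.r-forge.r-project.org/training.html, lists only "data", "lev" and "model" as possible parameters). For example, this clearly will not work: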
tc$summaryFunction <- function(data, lev = NULL, model = NULL){
data.frame(Accuracy=mean(predict(<model object>,valid)==valid$Species))
}
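EDIT: In an attempt to come up with a truly ugly fix, I have been looking at whether I can access the model object from the scope of another function, but I do not even see the model stored anywhere. Hopefully there is some elegant solution that I am not even close to seeing...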
> tc$summaryFunction <- function(data, lev = NULL, model = NULL){
+ browser()
+ data.frame(Accuracy=mean(predict(model,valid)==valid$Species))
+ }
> train(Species ~ ., data=train,method="rf",trControl=tc)
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
Called from: trControl$summaryFunction(testOutput, classLevels, method)
Browse[1]> lapply(sys.frames(),function(x) ls(envi=x))
[[1]]
[1] "x"
[[2]]
[1] "cons" "contrasts" "data" "form" "m" "na.action" "subset"
[8] "Terms" "w" "weights" "x" "xint" "y"
[[3]]
[1] "x"
[[4]]
[1] "classLevels" "funcCall" "maximize" "method" "metric" "modelInfo"
[7] "modelType" "paramCols" "ppMethods" "preProcess" "startTime" "testOutput"
[13] "trainData" "trainInfo" "trControl" "tuneGrid" "tuneLength" "weights"
[19] "x" "y"
[[5]]
[1] "data" "lev" "model"
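Take a look at trainControl. There are now options to directly specify which rows of the data are used to fit the model (the index argument) and which rows should be used to compute the hold-out estimates (called indexOut). I think that is what you are looking for.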
Max
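To make the suggestion above concrete, here is a minimal sketch of how index and indexOut might be applied to the iris example from the question. It is an illustration under stated assumptions rather than the answerer's own code: the name train_rows and the list name Holdout are invented here, and the full data frame is passed to train() so that the row numbers in both lists refer to the same object.

```r
library(caret)
set.seed(1)

# Same split as in the question: 50 validation rows, 100 training rows
i <- sample(150, 50)
train_rows <- setdiff(seq_len(nrow(iris)), i)  # hypothetical helper name

# A single "resample": fit on train_rows, score on the validation rows i.
# Both lists hold row numbers of the data frame given to train().
tc <- trainControl(
  method   = "cv",
  index    = list(Holdout = train_rows),
  indexOut = list(Holdout = i)
)

model.rf <- train(Species ~ ., data = iris, method = "rf", trControl = tc)

# Accuracy and Kappa here are computed on the validation rows only
model.rf$results
```

With only one hold-out set, the SD columns in the results are NA. Note also that train() refits the final model on every row of the data it is given, so in this sketch the validation rows end up in the final fit; if that is not wanted, the chosen mtry can be used to refit on the training rows alone. In the semi-supervised case, the self-labelled rows would presumably be listed in index only, with indexOut restricted to rows whose labels are trusted.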