doc*_*ate 5 r classification cross-validation r-caret
这个问题是这里相同主题的延续.以下是本书的最小工作示例:
Wehrens R. Chemometrics在自然科学和生命科学领域进行R多元数据分析.第1版.海德堡; 纽约:施普林格.2011.(第250页).
这个例子取自本书及其包ChemometricsWithR.它强调了使用交叉验证技术进行建模时的一些缺陷.
目标:
一种交叉验证的方法,使用相同的重复CV集来执行已知的策略,PLS通常由LDA逻辑回归,SVM,C5.0,CART或caret包的精神进行.因此,每次调用等待分类器之前都需要PLS,以便对PLS 得分空间进行分类而不是对观察本身进行分类.PCA在使用任何分类器建模之前,插入符号包中最近的方法是作为预处理步骤.下面是一个PLS-LDA程序,只有一个交叉验证来测试分类器的性能,没有10倍的CV或任何重复.下面的代码取自上面提到的书,但有一些更正否则会引发错误:
library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var
odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test
prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)
Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))
tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test
prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])
predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class)
library(caret)
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret
Run Code Online (Sandbox Code Playgroud)
输出:
Confusion Matrix and Statistics
Reference
Prediction bph control pca
bph 4 1 9
control 1 35 7
pca 34 4 68
Overall Statistics
Accuracy : 0.6564
95% CI : (0.5781, 0.7289)
No Information Rate : 0.5153
P-Value [Acc > NIR] : 0.0001874
Kappa : 0.4072
Mcnemar's Test P-Value : 0.0015385
Statistics by Class:
Class: bph Class: control Class: pca
Sensitivity 0.10256 0.8750 0.8095
Specificity 0.91935 0.9350 0.5190
Pos Pred Value 0.28571 0.8140 0.6415
Neg Pred Value 0.76510 0.9583 0.7193
Prevalence 0.23926 0.2454 0.5153
Detection Rate 0.02454 0.2147 0.4172
Detection Prevalence 0.08589 0.2638 0.6503
Balanced Accuracy 0.51096 0.9050 0.6643
Run Code Online (Sandbox Code Playgroud)
然而,混淆矩阵与书中的不一致,无论如何,书中的代码确实破了,但这里有一个与我合作!
注意:
虽然这只是一个简历,但目的是首先就这种方法达成一致,sd并且mean将火车组应用于测试集,PLUS根据特定数量的PC转换为PLS分数ncomp.我希望在插入符号的每一轮CV中都能发生这种情况.如果作为代码的方法在这里是正确的,那么它可以在修改插入符号包的代码时用作最小工作示例的良好开端.
附注:
对于缩放和居中可能会非常混乱,我认为R中的一些PLS功能在内部进行缩放,无论是否有居中,我都不确定,因此在插入中构建自定义模型时应小心处理以避免缺乏或多次缩放或中心(我对这些事情保持警惕).
多重居中/缩放
的危险下面的代码只是为了说明多片段居中/缩放如何改变数据,这里只显示居中,但同样的缩放问题也适用.
set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply (list(xNotCentered= x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)
Run Code Online (Sandbox Code Playgroud)
输出:
xNotCentered xCentered1 xCentered2 xCentered3
2.035540e+00 1.897798e-16 -5.603699e-18 -5.332377e-18
Run Code Online (Sandbox Code Playgroud)
如果我遗漏了本课程某处的内容,请发表评论.谢谢.
如果您想要使用这些类型的模型caret,则需要在CRAN上使用最新版本.创建了最后一次更新,以便人们可以根据需要使用非标准模型.
我下面的方法是联合拟合PLS和其他模型(我在下面的例子中使用随机森林)并同时调整它们.因此,对于每一个折,的2D网格ncomp和mtry被使用.
"技巧"是将PLS加载附加到随机森林对象,以便在预测时间内使用它们.以下是定义模型的代码(仅限分类):
modelInfo <- list(label = "PLS-RF",
library = c("pls", "randomForest"),
type = "Classification",
parameters = data.frame(parameter = c('ncomp', 'mtry'),
class = c("numeric", 'numeric'),
label = c('#Components',
'#Randomly Selected Predictors')),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
mtry = 1:len)
grid <- subset(grid, mtry <= ncomp)
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the random forest object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- randomForest(scores, y, mtry = param$mtry, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
Run Code Online (Sandbox Code Playgroud)
这是对以下内容的呼吁train:
library(ChemometricsWithR)
data(prostate)
set.seed(1)
inTrain <- createDataPartition(prostate.type, p = .90)
trainX <-prostate[inTrain[[1]], ]
trainY <- prostate.type[inTrain[[1]]]
testX <-prostate[-inTrain[[1]], ]
testY <- prostate.type[-inTrain[[1]]]
## These will take a while for these data
set.seed(2)
plsrf <- train(trainX, trainY, method = modelInfo,
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
## How does random forest do on its own?
set.seed(2)
rfOnly <- train(trainX, trainY, method = "rf",
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
Run Code Online (Sandbox Code Playgroud)
只是为了踢,我得到了:
> getTrainPerf(plsrf)
TrainAccuracy TrainKappa method
1 0.7940423 0.65879 custom
> getTrainPerf(rfOnly)
TrainAccuracy TrainKappa method
1 0.7794082 0.6205322 rf
Run Code Online (Sandbox Code Playgroud)
和
> postResample(predict(plsrf, testX), testY)
Accuracy Kappa
0.7741935 0.6226087
> postResample(predict(rfOnly, testX), testY)
Accuracy Kappa
0.9032258 0.8353982
Run Code Online (Sandbox Code Playgroud)
马克斯