While trying to train a random forest model with the caret package, I noticed that the execution time was inexplicably long:
> set.seed(1);
> n = 500;
> m = 30;
> x = matrix(rnorm(n * m), nrow = n);
> y = factor(sample.int(2, n, replace = T), labels = c("yes", "no"))
> require(caret);
> require(randomForest);
> print(system.time({rf <- randomForest(x, y);}));
user system elapsed
0.99 0.00 0.98
> print(system.time({rfmod <- train(x = x, y = y,
+ method = "rf",
+ metric = "Accuracy",
+ trControl = trainControl(classProbs = T)
+ );}));
user system elapsed
95.83 0.71 97.26 …
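The gap is expected rather than mysterious: by default train() does not fit one forest but evaluates a small tuning grid (three candidate mtry values for method = "rf") over 25 bootstrap resamples, so on the order of 75 forests are grown before the final model, which is in line with the roughly 100-fold difference in the timings above. A minimal sketch, reusing x and y from above, of how to make train() do a single fit for comparison (no resampling, one fixed mtry):

# Sketch only: switch off resampling and fix mtry so train() grows one forest,
# much like a bare randomForest() call.
ctrl_single <- trainControl(method = "none")
print(system.time({
  rf_single <- train(x = x, y = y,
                     method = "rf",
                     tuneGrid = data.frame(mtry = floor(sqrt(ncol(x)))),
                     trControl = ctrl_single)
}))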
I want to compare a standard neural network approach with an extreme learning machine classifier (based on the ROC metric), using method "nnet" and method "elm" in the R caret package. For nnet everything works fine, but with method = "elm" the following error occurs:

Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, :
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted …
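A minimal sketch of the kind of change both messages point to (x2 and y2 stand in for the question's training data, which is not shown): ROC-based selection needs classProbs = TRUE together with a summary function that computes ROC, and the outcome's class levels must be syntactically valid R names.

# Sketch only: x2/y2 are placeholders for the question's data.
levels(y2) <- make.names(levels(y2))            # e.g. "X0"/"X1" rather than "0"/"1"
ctrl_roc <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
elm_fit <- train(x = x2, y = y2,
                 method = "elm",                # requires the elmNN package
                 metric = "ROC",
                 trControl = ctrl_roc)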
Main question:

After reading the documentation and searching Google, I still find it hard to work out in which situations resampling indices should be predefined, for example:
resamples <- createResample(classVector_training, times = 500, list=TRUE)
or seeds predefined, like:
seeds <- vector(mode = "list", length = 501) #length is = (n_repeats*nresampling)+1
for(i in 1:501) seeds[[i]]<- sample.int(n=1000, 1)
My plan is to train a bunch of different, reproducible models using parallel processing via the doParallel package. Since a seed has already been set, is it unnecessary to predefine the resampling? Do I need to predefine seeds in the way shown above, rather than setting seeds = NULL in the trainControl object, because I intend to use parallel processing? Is there any reason to predefine both indices and seeds, as I have seen done at least once while searching Google? And what is the reason for using indexOut?
Question:
So far, I have managed to get a train call that runs well for RF:
rfControl <- trainControl(method="oob", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
mtryGrid <- expand.grid(mtry = 9480^0.5) #set mtry parameter to the square root of the number of variables
rfTrain <- train(x = training, y = classVector_training, method …
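For what it's worth, a minimal sketch of how index and seeds fit together in one trainControl call, reusing classVector_training from above (B and M are illustrative values): seeds needs one integer vector per resample, each at least as long as the number M of tuning-parameter combinations, plus a final single seed for the last model fit, while index fixes which rows are used for fitting in each resample.

# Sketch: B resamples, M candidate tuning values (here a single mtry).
B <- 25
M <- 1
set.seed(123)
resample_index <- createResample(classVector_training, times = B)
seeds <- vector(mode = "list", length = B + 1)
for (i in 1:B) seeds[[i]] <- sample.int(10000, M)    # one seed per tuning combination
seeds[[B + 1]] <- sample.int(10000, 1)               # seed for the final model fit

ctrl_repro <- trainControl(method = "boot", number = B,
                           index = resample_index,
                           seeds = seeds,
                           allowParallel = TRUE)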
I'm getting the following error and I don't know what could be wrong. I'm using RStudio with R version 3.1.3 for Windows 8.1 and the caret package for data mining.

I have the following training data:
str(training)
'data.frame': 212300 obs. of 21 variables:
$ FL_DATE_MDD_MMDD : int 101 101 101 101 101 101 101 101 101 101 ...
$ FL_DATE : int 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 1012013 ...
$ UNIQUE_CARRIER : Factor w/ 13 levels "9E","AA","AS",..: 11 10 2 5 8 9 11 10 10 10 ...
$ DEST : Factor w/ 150 levels "ABE","ABQ","ALB",..: 111 70 82 8 8 31 110 44 53 80 …

Hello, and thanks in advance. I'm using caret to cross-validate a neural network from the nnet package. In the method argument of trainControl I can specify the type of cross-validation, but all of those options pick the observations for cross-validation at random. Is there any way I can use caret to cross-validate on specific observations in my data, chosen by ID or hard-coded? For example, here is my current code:
library(nnet)
library(caret)
library(datasets)
data(iris)
train.control <- trainControl(
method = "repeatedcv"
, number = 4
, repeats = 10
, verboseIter = T
, returnData = T
, savePredictions = T
)
tune.grid <- expand.grid(
size = c(2,4,6,8)
,decay = 2^(-3:1)
)
nnet.train <- train(
x = iris[,1:4]
, y = iris[,5]
, method = "nnet"
, preProcess = c("center","scale")
, metric = "Accuracy"
, trControl = train.control
  , tuneGrid = tune.grid …
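One way to pin down exactly which observations are held out (a sketch, not something prescribed by caret's documentation beyond the index/indexOut arguments) is to build the fold membership yourself and hand it to trainControl: index lists the rows used for fitting in each resample and indexOut the rows used for evaluation, so any hard-coded rule or ID lookup that yields row numbers will do.

# Sketch: deterministic 5-fold assignment by row position for iris; replace
# the split() rule with whatever ID-based rule defines your folds.
fold_of_row  <- rep(1:5, length.out = nrow(iris))
holdout_rows <- split(seq_len(nrow(iris)), fold_of_row)      # rows evaluated in each fold
fit_rows     <- lapply(holdout_rows,
                       function(idx) setdiff(seq_len(nrow(iris)), idx))

train.control.custom <- trainControl(
    method = "cv"
  , index = fit_rows           # rows used to fit the model in each resample
  , indexOut = holdout_rows    # rows used to compute performance
  , savePredictions = TRUE
)
nnet.custom <- train(
    x = iris[,1:4]
  , y = iris[,5]
  , method = "nnet"
  , trace = FALSE
  , preProcess = c("center","scale")
  , metric = "Accuracy"
  , trControl = train.control.custom
  , tuneGrid = expand.grid(size = 2, decay = 0.125)
)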
I am training a model in R using the caret package:

ctrl <- trainControl(method = "repeatedcv", repeats = 3, summaryFunction = twoClassSummary)
logitBoostFit <- train(LoanStatus~., credit, method = "LogitBoost", family=binomial, preProcess=c("center", "scale", "pca"),
trControl = ctrl)
I get the following warning:
Warning message:
In train.default(x, y, weights = w, ...) : The metric "Accuracy" was not in the result set. ROC will be used instead.
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric …
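These messages are consistent with twoClassSummary being requested without the pieces it needs: it computes ROC from class probabilities, so classProbs = TRUE must be set and the metric named explicitly, and the levels of LoanStatus must be valid R names. In addition, caTools::LogitBoost() has no family argument, so passing family = binomial is likely to make every resampled fit fail, which would explain the missing performance values. A sketch of the adjusted call (credit and LoanStatus are the question's objects, so this is untested here):

# Sketch: class probabilities on, explicit ROC metric, family argument dropped.
ctrl <- trainControl(method = "repeatedcv", repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
logitBoostFit <- train(LoanStatus ~ ., data = credit,
                       method = "LogitBoost",
                       metric = "ROC",
                       preProcess = c("center", "scale", "pca"),
                       trControl = ctrl)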
I'm building a CART model and I'm trying to tune two rpart parameters, cp and maxdepth. While the caret package works fine with one parameter at a time, it keeps throwing an error when both are used, and I can't figure out why.

library(caret)
data(iris)
tc <- trainControl("cv",10)
rpart.grid <- expand.grid(cp=seq(0,0.1,0.01), minsplit=c(10,20))
train(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length, data=iris, method="rpart",
trControl=tc, tuneGrid=rpart.grid)
I get the following error:
Error in train.default(x, y, weights = w, ...) :
The tuning parameter grid should have columns cp
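The error reflects how caret registers rpart: method = "rpart" exposes only cp as a tuning parameter, and method = "rpart2" exposes only maxdepth, so a tuneGrid with an extra column such as minsplit is rejected. A sketch of two workarounds (assuming I am reading caret's rpart wrapper correctly, the extra settings can be fixed through an rpart.control() object passed via train()'s ...):

library(caret)
library(rpart)
data(iris)
tc <- trainControl("cv", 10)

# tune cp only; hold minsplit (and anything else) fixed via rpart.control()
fit_cp <- train(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length,
                data = iris, method = "rpart", trControl = tc,
                tuneGrid = expand.grid(cp = seq(0, 0.1, 0.01)),
                control = rpart.control(minsplit = 20))

# or tune tree depth instead of cp
fit_depth <- train(Petal.Width ~ Petal.Length + Sepal.Width + Sepal.Length,
                   data = iris, method = "rpart2", trControl = tc,
                   tuneGrid = expand.grid(maxdepth = 2:6))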
下面的示例加载库
library(dplyr)
library(caret)
library(C50)
Load the churn dataset from the C50 library:
data(churn)
Create the x and y variables:
churn_x <- subset(churnTest, select= -churn)
churn_y <- churnTest[[20]]
Use createFolds() to create 5 CV folds on churn_y, the target variable:
myFolds <- createFolds(churn_y, k = 5)
Create a trainControl object, myControl:
myControl <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
Fit the glmnet model, model_glmnet:
model_glmnet <- train(
x = churn_x, y = churn_y,
metric = "ROC",
method = "glmnet",
trControl = myControl
)
I get the following error:
Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
  NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs, :
  NAs introduced by coercion
I have checked, and there are no missing values in the churn_x variable:
sum(is.na(churn_x))
Does anyone know the answer?
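A zero from sum(is.na(churn_x)) does not rule out the usual suspect here: glmnet works on a numeric matrix, and churn_x built from churnTest still contains factor columns (the state, area code and plan indicators), which end up as NA when the data frame is coerced to a numeric matrix, matching the "NAs introduced by coercion" warning. A sketch of one way around it, expanding the factors into dummy variables with caret's dummyVars() before training:

# Sketch: expand factor predictors into numeric dummy columns first.
factor_cols <- sapply(churn_x, is.factor)
str(churn_x[, factor_cols, drop = FALSE])        # shows which columns are factors

dummies     <- dummyVars(~ ., data = churn_x)
churn_x_num <- predict(dummies, newdata = churn_x)

model_glmnet <- train(
  x = churn_x_num, y = churn_y,
  metric = "ROC",
  method = "glmnet",
  trControl = myControl
)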
I'm running a logistic regression with caret in R:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
savePredictions = TRUE)
mod_fit <- train(Y ~ ., data=df, method="glm", family="binomial",
trControl = ctrl)
print(mod_fit)
The default metrics printed are accuracy and Cohen's kappa. I would like to extract the matching metrics, such as sensitivity, specificity, positive predictive value and so on, but I can't find an easy way to do it. The final model is provided, but it is trained on all of the data (as far as I can tell from the documentation), so I can't use it to re-predict.
confusionMatrix calculates all the required parameters, but passing it as a summary function doesn't work:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10,
savePredictions = TRUE, summaryFunction = confusionMatrix)
mod_fit <- train(Y ~ ., data=df, method="glm", family="binomial",
trControl = ctrl)
Error: `data` and `reference` should be factors with the same levels.
13. stop("`data` and `reference` should be factors with the same levels.", …
I'm trying to use the quanteda and caret packages to classify text based on a trained sample. As a trial run, I wanted to compare the built-in naive Bayes classifier of quanteda with those available through caret. However, I can't seem to get caret to work properly.

Here is some code for reproduction. First, the quanteda side:
library(quanteda)
library(quanteda.corpora)
library(caret)
corp <- data_corpus_movies
set.seed(300)
id_train <- sample(docnames(corp), size = 1500, replace = FALSE)
# get training set
training_dfm <- corpus_subset(corp, docnames(corp) %in% id_train) %>%
dfm(stem = TRUE)
# get test set (documents not in id_train, make features equal)
test_dfm <- corpus_subset(corp, !docnames(corp) %in% id_train) %>%
dfm(stem = TRUE) %>%
dfm_select(pattern = training_dfm,
selection = "keep")
# train model on sentiment
nb_quanteda <- textmodel_nb(training_dfm, …
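The question is cut off here, but for the caret side the usual stumbling block is the input format: train() does not accept a dfm, so the sparse matrix has to be converted first and the labels supplied as a factor. A rough sketch continuing from the objects above; the "Sentiment" docvar name is an assumption about data_corpus_movies, densifying a large dfm is memory-hungry, and caret's "naive_bayes" method (naivebayes package) is a Gaussian/kernel naive Bayes rather than the multinomial model behind textmodel_nb(), so the two need not agree exactly.

# Sketch: convert the dfm to a dense matrix and build a label factor.
# "Sentiment" as the docvar name is an assumption, not taken from the question.
training_labels <- factor(docvars(corpus_subset(corp, docnames(corp) %in% id_train),
                                  "Sentiment"))
training_m <- convert(training_dfm, to = "matrix")

nb_caret <- train(x = training_m, y = training_labels,
                  method = "naive_bayes",            # requires the naivebayes package
                  trControl = trainControl(method = "none"),
                  tuneGrid = data.frame(laplace = 1,
                                        usekernel = FALSE,
                                        adjust = 1))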