I am attempting to build a model to predict whether a product will get sold on an ecommerce website with 1 or 0 being the output.
My data consists of a handful of categorical variables (one with a large number of levels), a couple of binary variables, and one continuous variable (the price), with an output variable of 1 or 0 indicating whether or not the product listing got sold.
This is my code:
inTrainingset<-createDataPartition(C$Sale, p=.75, list=FALSE)
CTrain<-C[inTrainingset,]
CTest<-C[-inTrainingset,]
gbmfit<-gbm(Sale~., data=C,distribution="bernoulli",n.trees=5,interaction.depth=7,shrinkage= .01,)
plot(gbmfit)
gbmTune<-train(Sale~.,data=CTrain, method="gbm")
ctrl<-trainControl(method="repeatedcv",repeats=5)
gbmTune<-train(Sale~.,data=CTrain,
method="gbm",
verbose=FALSE,
trControl=ctrl)
ctrl<-trainControl(method="repeatedcv", repeats=5, classProbs=TRUE, summaryFunction = twoClassSummary)
gbmTune<-trainControl(Sale~., data=CTrain,
method="gbm",
metric="ROC",
verbose=FALSE ,
trControl=ctrl)
grid<-expand.grid(.interaction.depth=seq(1,7, by=2), .n.trees=seq(100,300, by=50), .shrinkage=c(.01,.1))
gbmTune<-train(Sale~., data=CTrain,
method="gbm",
metric="ROC",
tunegrid= grid,
verebose=FALSE,
trControl=ctrl)
set.seed(1)
gbmTune <- train(Sale~., data = CTrain,
method = "gbm",
metric = "ROC",
tuneGrid = grid,
verbose = FALSE,
trControl = ctrl)
I am running into two problems. The first is that when I try to add summaryFunction = twoClassSummary and then tune, I get this:
Error in trainControl(Sale ~ ., data = CTrain, method = "gbm", metric = "ROC", :
unused arguments (data = CTrain, metric = "ROC", trControl = ctrl)
If I decide to bypass the summaryFunction, the second problem is that I get this error when I try to run the model:
Error in evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels, :
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
In addition: Warning message:
In train.default(x, y, weights = w, ...) :
cannnot compute class probabilities for regression
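I tried changing the output variable from the numeric values 1 and 0 to plain text values in Excel, but that made no difference.
Any help on how to fix the fact that it is interpreting this model as a regression, or on the first error message I am encountering, would be greatly appreciated.
Best,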
Will will@nubimetrics.com
Your outcome is:
Sale = c(1L, 0L, 1L, 1L, 0L))
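Although gbm expects it this way, it is a pretty unnatural way to encode the data. Almost every other function uses factors.
So if you give train numeric 0/1 data, it thinks that you want to do regression. If you convert the outcome to a factor and use "0" and "1" as the levels (and if you want class probabilities), you should see a warning that says "At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to...". That is not an idle warning.
Use factor levels that are valid R variable names and you should be fine.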
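As a minimal sketch of that recoding in base R: the data frame below is a hypothetical stand-in built from the five Sale values shown above, and the labels "NotSold" and "Sold" are illustrative choices, not names from the original post. It also shows why "0" and "1" are a problem as levels: make.names() has to rewrite them.

```r
# Hypothetical stand-in for the outcome column; the real data frame C is not shown.
C <- data.frame(Sale = c(1L, 0L, 1L, 1L, 0L))

# "0" and "1" are not syntactically valid R names, so make.names() alters them.
zero_one_valid <- all(make.names(c("0", "1")) == c("0", "1"))
print(zero_one_valid)

# Recode the numeric 0/1 outcome as a factor with valid level names.
# "NotSold" and "Sold" are illustrative labels (an assumption for this sketch).
C$Sale <- factor(C$Sale, levels = c(0, 1), labels = c("NotSold", "Sold"))

# Levels are valid R names when make.names() leaves them unchanged.
levels_ok <- all(make.names(levels(C$Sale)) == levels(C$Sale))
print(levels(C$Sale))
print(levels_ok)
```

With the outcome stored as a factor like this, train should treat the problem as classification, so classProbs = TRUE and metric = "ROC" become usable. The recoding has to happen before C is split into CTrain and CTest, so that both pieces carry the factor.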
Max