我是R的新手(第2天),他的任务是建造一片随意的森林.每个随机森林将使用不同的训练集建立,我们将结束所有森林进行预测.我在R中实现这个,并且在使用不使用相同集合构建的两个森林时遇到一些困难.我的尝试如下:
d1 = read.csv("../data/rr/train/10/chunk0.csv",header=TRUE)
d2 = read.csv("../data/rr/train/10/chunk1.csv",header=TRUE)
rf1 = randomForest(A55~., data=d1, ntree=10)
rf2 = randomForest(A55~., data=d2, ntree=10)
rf = combine(rf1,rf2)
Run Code Online (Sandbox Code Playgroud)
这当然会产生错误:
Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
non-conformable arrays
In addition: Warning message:
In rf$oob.times + rflist[[i]]$oob.times :
longer object length is not a multiple of shorter object length
Run Code Online (Sandbox Code Playgroud)
我已经浏览网页一段时间了解这一点,但尚未取得任何成功.这里的任何帮助将非常感激.
我已经使用非常大的数据集训练了一个来自Python Sckit Learn Module的RandomForestClassifier,但问题是我怎么可能保存这个模型并让其他人在它们的末尾应用它.谢谢!
寻找一种有效的方法在rstudio,H2O的Flow或h2o的RF和GBM模型的本地html页面中绘制树,类似于下面链接中的图像.具体来说,如何通过解析h2o.download_pojo(rf1)或h2o.download_pojo(gbm1)来为下面的代码生成的对象(拟合模型)rf1和gbm2绘制树?
# # The following two commands remove any previously installed H2O packages for R.
# if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
# if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# # Next, we download packages that H2O depends on.
# pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
# for (pkg in pkgs) {
# if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
# }
#
# # Now we download, install h2o package
# install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/3/R")))
library(h2o)
h2o.init(nthreads = …Run Code Online (Sandbox Code Playgroud) 是否有可能在Xgboost中训练具有多个连续输出的模型(多元回归)?培养这种模型的目标是什么?
在此先感谢您的任何建议
我最初有一个由N行12列组成的数据框.最后一列是我的班级(0或1).我不得不将我的整个数据框转换为数字
training <- sapply(training.temp,as.numeric)
Run Code Online (Sandbox Code Playgroud)
但后来我觉得我需要将class列作为因子列来使用randomforest()工具作为分类器,所以我做了
training[,"Class"] <- factor(training[,ncol(training)])
Run Code Online (Sandbox Code Playgroud)
我继续创建树
training_rf <- randomForest(Class ~., data = trainData, importance = TRUE, do.trace = 100)
Run Code Online (Sandbox Code Playgroud)
但我得到两个错误:
1: In Ops.factor(training[, "Status"], factor(training[, ncol(training)])) :
<= this is not relevant for factors (roughly translated)
2: In randomForest.default(m, y, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
Run Code Online (Sandbox Code Playgroud)
如果有人能指出我正在制作的格式错误,我将不胜感激.
谢谢!
在64位Linux机器上使用R 3.2.0 with caret 6.0-41和randomForest 4.6-10.
当尝试使用公式对使用包中的函数训练predict()的randomForest对象使用该方法时,该函数返回错误.当通过训练和/或使用和而不是公式,这一切都顺利进行.train()caretrandomForest()x=y=
这是一个工作示例:
library(randomForest)
library(caret)
data(imports85)
imp85 <- imports85[, c("stroke", "price", "fuelType", "numOfDoors")]
imp85 <- imp85[complete.cases(imp85), ]
imp85[] <- lapply(imp85, function(x) if (is.factor(x)) x[,drop=TRUE] else x) ## Drop empty levels for factors.
modRf1 <- randomForest(numOfDoors~., data=imp85)
caretRf <- train( numOfDoors~., data=imp85, method = "rf" )
modRf2 <- caretRf$finalModel
modRf3 <- randomForest(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"])
caretRf <- train(x=imp85[,c("stroke", "price", "fuelType")], y=imp85[, "numOfDoors"], method = "rf")
modRf4 …Run Code Online (Sandbox Code Playgroud) 考虑到上述(树枝)条件存在,我猜测它是条件概率.但是,我不清楚.
如果您想了解有关所用数据的更多信息或我们如何获得此图表,请访问:http://machinelearningmastery.com/visualize-gradient-boosting-decision-trees-xgboost-python/
我目前正在使用H2O作为分类问题数据集.我H2ORandomForestEstimator在python 3.6环境中测试它.我注意到预测方法的结果给出了0到1之间的值(我假设这是概率).
在我的数据集中,目标属性是数字,即True值为1,False值为0.我确保将类型转换为目标属性的类别,我仍然得到相同的结果.
然后我修改了代码,将目标列转换为asfactor()H2OFrame上的因子使用方法,结果没有任何变化.
但是当我将目标属性中的值分别更改为1和0时的True和False时,我得到了预期结果(即)输出是分类而不是概率.
这是我的代码:
set.seed(1)
#Boruta on the HouseVotes84 data from mlbench
library(mlbench) #has HouseVotes84 data
library(h2o) #has rf
#spin up h2o
myh20 <- h2o.init(nthreads = -1)
#read in data, throw some away
data(HouseVotes84)
hvo <- na.omit(HouseVotes84)
#move from R to h2o
mydata <- as.h2o(x=hvo,
destination_frame= "mydata")
#RF columns (input vs. output)
idxy <- 1
idxx <- 2:ncol(hvo)
#split data
splits <- h2o.splitFrame(mydata,
c(0.8,0.1))
train <- h2o.assign(splits[[1]], key="train")
valid <- h2o.assign(splits[[2]], key="valid")
# make random forest
my_imp.rf<- h2o.randomForest(y=idxy,x=idxx,
training_frame = train,
validation_frame = …Run Code Online (Sandbox Code Playgroud) 我通过scikit-learn图书馆学习机器学习.我使用以下代码将决策树分类器和随机森林分类器应用于我的数据:
def decision_tree(train_X, train_Y, test_X, test_Y):
clf = tree.DecisionTreeClassifier()
clf.fit(train_X, train_Y)
return clf.score(test_X, test_Y)
def random_forest(train_X, train_Y, test_X, test_Y):
clf = RandomForestClassifier(n_estimators=1)
clf = clf.fit(X, Y)
return clf.score(test_X, test_Y)
Run Code Online (Sandbox Code Playgroud)
为什么随机森林分类器的结果更好(100次运行,随机抽样2/3的数据用于训练,1/3用于测试)?
100%|???????????????????????????????????????| 100/100 [00:01<00:00, 73.59it/s]
Algorithm: Decision Tree
Min : 0.3883495145631068
Max : 0.6476190476190476
Mean : 0.4861783113770316
Median : 0.48868030937802126
Stdev : 0.047158171852401135
Variance: 0.0022238931724605985
100%|???????????????????????????????????????| 100/100 [00:01<00:00, 85.38it/s]
Algorithm: Random Forest
Min : 0.6846846846846847
Max : 0.8653846153846154
Mean : 0.7894823428836184
Median : 0.7906101571063208
Stdev : 0.03231671150915106
Variance: 0.0010443698427656967 …Run Code Online (Sandbox Code Playgroud) python machine-learning decision-tree random-forest scikit-learn
random-forest ×10
python ×4
r ×4
h2o ×3
scikit-learn ×2
xgboost ×2
formula ×1
gbm ×1
gini ×1
predict ×1
r-caret ×1
statistics ×1