When trying to fit a Random Forest Regressor model with y data that looks like this:
[ 0.00000000e+00 1.36094276e+02 4.46608221e+03 8.72660888e+03
1.31375786e+04 1.73580193e+04 2.29420671e+04 3.12216341e+04
4.11395711e+04 5.07972062e+04 6.14904935e+04 7.34275322e+04
7.87333933e+04 8.46302456e+04 9.71074959e+04 1.07146672e+05
1.17187952e+05 1.26953374e+05 1.37736003e+05 1.47239359e+05
1.53943242e+05 1.78806710e+05 1.92657725e+05 2.08912711e+05
2.22855152e+05 2.34532982e+05 2.41391255e+05 2.48699216e+05
2.62421197e+05 2.79544300e+05 2.95550971e+05 3.13524275e+05
3.23365158e+05 3.24069067e+05 3.24472999e+05 3.24804951e+05
and X data that looks like this:
[ 735233.27082176 735234.27082176 735235.27082176 735236.27082176
735237.27082176 735238.27082176 735239.27082176 735240.27082176
735241.27082176 735242.27082176 735243.27082176 735244.27082176
735245.27082176 735246.27082176 735247.27082176 735248.27082176
using the following code:
regressor = RandomForestRegressor(n_estimators=150, min_samples_split=1)
rgr = regressor.fit(X,y)
I get this error:
ValueError: Number of labels=600 does not match number of samples=1
I assume one of my sets of values is in the wrong format, but this is not clear to me from the documentation.
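The message "Number of labels=600 does not match number of samples=1" is consistent with X being a 1-D array that gets treated as one sample with many features. A minimal sketch, assuming that is the cause: the arrays below are hypothetical stand-ins for the data in the question, and `min_samples_split` is left at its default because recent scikit-learn releases reject an integer value of 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-ins for the question's data: 600 daily timestamps
# and a steadily increasing target.
X = 735233.27082176 + np.arange(600, dtype=float)   # shape (600,)
y = np.cumsum(np.linspace(0.0, 1000.0, 600))        # shape (600,)

# scikit-learn expects X with shape (n_samples, n_features),
# so a single feature has to become a column.
X_2d = X.reshape(-1, 1)                              # shape (600, 1)

regressor = RandomForestRegressor(n_estimators=150, random_state=0)
rgr = regressor.fit(X_2d, y)
predictions = rgr.predict(X_2d)
print(X.shape, X_2d.shape, predictions.shape)
```

Printing `X.shape` before calling `fit` is a quick way to check whether the reshape is needed at all.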
I am using RandomForestRegressor in Python and I want to create a chart that shows the ranking of the feature importances. This is the code I used:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
MT = pd.read_csv("MT_reduced.csv")
df = MT.reset_index(drop=False)
columns2 = df.columns.tolist()
# Filter the columns to remove ones we don't want.
columns2 = [c for c in columns2 if c not in ["Violent_crime_rate", "Change_Property_crime_rate", "State", "Year"]]
# Store the variable we'll be predicting on.
target = "Property_crime_rate"
# Let’s randomly split our data with 80% as the train set and 20% as the test set:
# Generate the training set. Set random_state to be able to replicate results.
train2 = …

How can I use the R package randomForest with observation weights? I know the package has no such option. I have two questions:
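Is there a workaround for this using the randomForest package? At the moment I am resampling the data according to the weights, so that I can at least simulate it: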
m = dim(data)[1]
# Resample row indices with replacement, weighted by the observation weights
data[sample(m, m, replace=TRUE, prob=weights), ]
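Is there any other (better) solution?
Are there any alternatives to randomForest? I found the party package (cforest), but it is terrible at memory management (or I cannot use it the way I use the randomForest package). I have about 200k observations and 30-40 variables.
Edit:
Sorry for not clarifying the details. I am using the randomForest package for a regression problem (not classification). It is a time series, and each observation has its own weight. Later, this weight is used to determine model performance on the test observations. The y variable is continuous.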
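The resampling workaround described above amounts to a weighted bootstrap: rows are drawn with replacement, with probability proportional to their weights, and the model is then trained on the resampled rows. A minimal sketch of that idea in Python, using only the standard library; the function name `weighted_bootstrap` and the toy rows are hypothetical.

```python
import random

def weighted_bootstrap(rows, weights, seed=None):
    """Draw len(rows) rows with replacement, with probability proportional to weights."""
    rng = random.Random(seed)
    # Sample row indices rather than rows, so duplicate rows stay distinguishable.
    indices = rng.choices(range(len(rows)), weights=weights, k=len(rows))
    return [rows[i] for i in indices]

# Toy data: (label, y) pairs with one weight per observation.
rows = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]
weights = [0.0, 1.0, 1.0, 8.0]

resampled = weighted_bootstrap(rows, weights, seed=42)
print(resampled)
```

A row with weight 0 never appears in the resample, and heavily weighted rows tend to appear several times. This only approximates true observation weights, since each forest is fitted to one particular random draw.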
I have researched this extensively without finding a solution. I have cleaned my dataset as follows:
library("raster")
impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x) ,
mean(x, na.rm = TRUE))
losses <- apply(losses, 2, impute.mean)
colSums(is.na(losses))
isinf <- function(x) (NA <- is.infinite(x))
infout <- apply(losses, 2, is.infinite)
colSums(infout)
isnan <- function(x) (NA <- is.nan(x))
nanout <- apply(losses, 2, is.nan)
colSums(nanout)
The problem arises when running the prediction:
options(warn=2)
p <- predict(default.rf, losses, type="prob", inf.rm = TRUE, na.rm=TRUE, nan.rm=TRUE)
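All my research suggests that there must be NA, Inf, or NaN values in the data, but I cannot find any. I am making the data and the randomForest summary available for investigation at [removed]. The traceback does not reveal much (to me, anyway):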
4: .C("classForest", mdim = as.integer(mdim), ntest = as.integer(ntest),
nclass = as.integer(object$forest$nclass), maxcat = as.integer(maxcat),
nrnodes = as.integer(nrnodes), jbt = as.integer(ntree), …

I have a training set of size 38 MB (12 attributes with 420000 rows). I am running the R snippet below to train a model using randomForest. This takes hours for me.
rf.model <- randomForest(
Weekly_Sales~.,
data=newdata,
keep.forest=TRUE,
importance=TRUE,
ntree=200,
do.trace=TRUE,
na.action=na.roughfix
)
I think it takes a long time to execute because of na.roughfix. There are a lot of NAs in the training set.
Could someone tell me how I can improve the performance?
My system configuration is:
Intel(R) Core i7 CPU @ 2.90 GHz
RAM - 8 GB
HDD - 500 GB
64 bit OS
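For numeric columns, na.roughfix replaces missing values with the column median. One way to test the suspicion above is to do that imputation once, up front, and time the training separately on the already-imputed data. A minimal sketch of per-column median imputation in Python: the function name `impute_median` and the column names are hypothetical, and every column is assumed to have at least one observed value.

```python
import math
import statistics

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or math.isnan(value)

def impute_median(columns):
    """Return a copy of {name: values} with missing entries replaced by the column median."""
    imputed = {}
    for name, values in columns.items():
        observed = [v for v in values if not is_missing(v)]
        median = statistics.median(observed)
        imputed[name] = [median if is_missing(v) else v for v in values]
    return imputed

# Hypothetical numeric columns with a few missing entries.
data = {
    "Temperature": [60.0, float("nan"), 70.0, 80.0],
    "Fuel_Price": [2.5, 3.5, None, 3.0],
}
clean = impute_median(data)
print(clean)
```

If training on the pre-imputed data is still slow, the imputation step is unlikely to be the main cost.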
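I am trying to extract the feature importances of a random forest object I trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel.
How can I extract feature importances from a RandomForestModel regressor or classifier in PySpark?
Here is the sample code provided in the documentation to get us started; however, it makes no mention of feature importances.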
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, …

I am just trying to do a simple RandomForestRegressor example. But when testing the accuracy, I get this error:
/Users/noppanit/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.pyc in accuracy_score(y_true, y_pred, normalize, sample_weight)
    177
    178     # Compute accuracy for each possible representation
--> 179     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    180     if y_type.startswith('multilabel'):
    181         differing_labels = count_nonzero(y_true - y_pred, axis=1)

/Users/noppanit/anaconda/lib/python2.7/site-packages/sklearn/metrics/classification.pyc in _check_targets(y_true, y_pred)
     90     if (y_type not in ["binary", "multiclass", "multilabel-indicator",
     91                        "multilabel-sequences"]):
---> 92         raise ValueError("{0} is not supported".format(y_type))
     93
     94     if y_type in ["binary", "multiclass"]:

ValueError: continuous is not supported
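The "continuous is not supported" error comes from `accuracy_score`, which is a classification metric and rejects continuous targets. For a regressor, a regression metric such as mean squared error or R² is the usual substitute. A minimal sketch, assuming a float target; the synthetic data below is hypothetical and stands in for the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical data: 200 rows, 5 float features, continuous target.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 3.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Simple hold-out split: first 150 rows for training, the rest for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)   # lower is better
r2 = r2_score(y_test, y_pred)              # 1.0 is a perfect fit
score = model.score(X_test, y_test)        # a regressor's score() is also R^2
print(mse, r2, score)
```

`accuracy_score` remains appropriate only when the model is a classifier and the target holds discrete class labels.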
This is a sample of the data. I cannot show the real data.
target, func_1, func_2, func_2, ... func_200
float, float, float, float, ... float
Here is my code.
import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import tree
train = pd.read_csv('data.txt', sep='\t')
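labels = …

I am using random forest for the first time and have run into a problem I cannot figure out. When I run the analysis on my whole dataset (about 3000 rows), I get no error message. But when I run the same analysis on a subset of my dataset (about 300 rows), I get an error: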
dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names (dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB+ dataset2$predictorC + dataset2$predictorD + dataset2$predictorE + dataset2$predictorF + dataset2$predictorG + dataset2$predictorH + dataset2$predictorI, data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)
# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB+ groupA$predictorC + groupA$predictorD + groupA$predictorE + groupA$predictorF + groupA$predictorG + groupA$predictorH + groupA$predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)
Error in randomForest.default(m, y, ...) : Can't have empty classes …

One new feature of R 3.0.0 is the introduction of long vectors. However, .C() and .Fortran() do not accept long vector inputs. On R-bloggers I found:
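This is a precaution, as it is very unlikely that existing code has been written to handle long vectors (and the R wrappers often assume that length(x) is an integer)
I use the R package randomForest, and this package evidently needs .Fortran(), since it crashes with the error message
Error in randomForest.default: long vectors (argument 20) are not supported in .Fortran
How can I overcome this problem? I am using randomForest 4.6-7 (built under R 3.0.2) on a Windows 7 64-bit computer.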
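When running Spark's RandomForest algorithm, I seem to get different splits in the trees on different runs, even when using the same seed. Can anyone explain whether I am doing something wrong (probably), or whether the implementation is buggy (unlikely, I think)? Here is my setup: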
//read data into rdd
//convert string rdd to LabeledPoint rdd
// train_LP_RDD is RDD of LabeledPoint
// call random forest
val seed = 123417
val numTrees = 10
val numClasses = 2
val categoricalFeaturesInfo: Map[Int, Int] = Map()
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 10
val rfmodel = RandomForest.trainClassifier(train_LP_RDD, numClasses, categoricalFeaturesInfo,
numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,seed)
println(rfmodel.toDebugString)
The output of this snippet differs between two runs. For example, a diff of the two results shows:
sdiff -bBWs run1.debug run2.debug
If (feature 2 <= 15.96) | If (feature …