Tag: random-forest

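Why do pickle.dump(obj) and sys.getsizeof(obj) report different sizes? How do I save a variable to a file?

I am using the random forest classifier from Python's scikit-learn library for an exercise. The results change on every run, so I run it 1000 times and take the average.

I save the object rf to a file with pickle.dump() so that I can use it for prediction later, and the resulting file is about 4 MB. However, sys.getsizeof(rf) reports only 36 bytes.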

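import pickle
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=50)
rf.fit(matX, vecY)
with open('var.sav', 'wb') as f:  # pickle.dump() needs an open binary file, not a file name
    pickle.dump(rf, f)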

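My questions:

Measuring the size of a RandomForestClassifier object with sys.getsizeof() seems to give the wrong result, doesn't it? Why?
How can I save the object in a zip file to make it smaller?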
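
A minimal sketch of why the two numbers differ, assuming a plain dictionary of lists as a stand-in for the fitted forest (scikit-learn itself is not used here). `sys.getsizeof` reports only the shallow size of the object it is given and does not follow references to the data that object holds, whereas `pickle` serializes everything reachable from it. Writing the pickle through `gzip.open` is one way to get a smaller file; how much it shrinks depends on the data.

```python
import gzip
import os
import pickle
import sys
import tempfile

# Stand-in for a fitted model: a small container that references a lot of data.
model = {"trees": [[float(i) for i in range(1000)] for _ in range(100)]}

shallow_size = sys.getsizeof(model)   # the dict itself, not what it points to
pickled = pickle.dumps(model)         # everything reachable from the dict
compressed = gzip.compress(pickled)   # the same bytes, gzip-compressed

print("getsizeof:", shallow_size, "bytes")
print("pickle:   ", len(pickled), "bytes")
print("gzip:     ", len(compressed), "bytes")

# Save to a gzip-compressed file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "var.sav.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f)
    with gzip.open(path, "rb") as f:
        restored = pickle.load(f)
```

The same `gzip.open` pattern should work for any picklable object, including a fitted estimator.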

python random-forest

Author

2017 03-15

Votes: 0 | Answers: 1 | Views: 345

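Using Python generators in scikit-learn

I would like to know whether, and how, a Python generator can be used as the data input to the .fit() method of a scikit-learn classifier. Because the data set is huge, this seems to make sense to me.

In particular, I am about to implement a random forest approach.

Regards, K.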
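
As far as I know, scikit-learn's `.fit()` expects the whole data set as an array-like, so a generator cannot be passed in directly, and `RandomForestClassifier` has no `partial_fit` method. Estimators that do offer `partial_fit` (for example `SGDClassifier`) can be trained batch by batch. The sketch below shows only the batching side, using the standard library; `read_rows` and `batches` are hypothetical helpers, and the call to `partial_fit` is indicated in a comment rather than executed.

```python
from itertools import islice


def read_rows(n):
    """Hypothetical data source: yields (features, label) pairs one at a time."""
    for i in range(n):
        yield [i, i * 2], i % 2


def batches(rows, size):
    """Group a (possibly huge) generator into lists of at most `size` rows."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


batch_sizes = []
for chunk in batches(read_rows(10), size=4):
    X = [features for features, _ in chunk]
    y = [label for _, label in chunk]
    batch_sizes.append(len(X))
    # With an estimator that supports incremental learning, each batch
    # could be passed on here, e.g. clf.partial_fit(X, y, classes=[0, 1]).

print(batch_sizes)
```

Only one batch is held in memory at a time, which is the point of feeding the data from a generator.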

python generator random-forest scikit-learn

Krn*_*Krn

lucky-day

Votes: 0 | Answers: 1 | Views: 2720

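Fast Random Forest algorithm implementation

I implemented a small Java application using the Weka library and Random Forest. I trained a few classifiers on sample data and got about 85% accuracy. However, when I use Fast Random Forest (https://code.google.com/p/fast-random-forest/), it starts throwing errors.

I set up Fast Random Forest and built it with the current jar file. However, when we evaluate the classifier on the training data, it keeps producing the following error: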

 "The method evaluateModel(Classifier, Instances, Object...) 
  in the type Evaluation is not applicable for the arguments 
  (FastRandomForest, Instances) "

For this code:

    FastRandomForest rTree = new FastRandomForest();        
    rTree.buildClassifier(trainingData);

    showTree(rTree);

    System.out.println("records: " + trainingData.attribute(classIndex));
    System.out.println("number of instances: " + trainingData.numInstances());
    System.out.println(trainingData.instance(1));
    System.out.println("target: " + trainingData.classAttribute());
    //System.out.println(rTree.classifyInstance(trainingData.instance(1)));


    /* Evaluate the classifier on Training data */
    Evaluation eTest = new Evaluation(trainingData);
    eTest.evaluateModel(rTree, trainingData); 
    String strSummary = eTest.toSummaryString(); 
    System.out.println(strSummary);

Any help is appreciated!

java algorithm weka random-forest

use*_*165

lucky-day

Votes: 0 | Answers: 1 | Views: 4047

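Scikit: how do I check whether an object is a RandomizedSearchCV or a RandomForestClassifier?

I have some classifiers that were created using grid search, and others that were created directly as random forests.

The random forests have type sklearn.ensemble.forest.RandomForestClassifier, and the ones created with grid search have type sklearn.grid_search.RandomizedSearchCV.

I am trying to check the estimator's type programmatically (to decide whether I need to go through best_estimator_ to get the feature importances), but I can't seem to find a good way to do it.

if type(estimator) == 'sklearn.grid_search.RandomizedSearchCV' was my first guess, but it is clearly wrong.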
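
A minimal sketch of two ways to make this check. `type(estimator)` returns a class object, so comparing it with a string is always false; `isinstance` compares against the class itself. The two classes below are hypothetical stand-ins so that the example runs without scikit-learn; with the real library one would import `RandomizedSearchCV` and `RandomForestClassifier` instead. The `getattr` variant avoids the type check by falling back to the estimator itself when it has no `best_estimator_` attribute.

```python
class RandomForestClassifier:
    """Hypothetical stand-in for sklearn's random forest."""

    feature_importances_ = [0.7, 0.3]


class RandomizedSearchCV:
    """Hypothetical stand-in for a fitted search object wrapping an estimator."""

    def __init__(self, estimator):
        self.best_estimator_ = estimator


def importances(estimator):
    # Option 1: explicit type check against the class, not a string.
    if isinstance(estimator, RandomizedSearchCV):
        estimator = estimator.best_estimator_
    return estimator.feature_importances_


def importances_duck(estimator):
    # Option 2: use best_estimator_ if present, else the estimator itself.
    return getattr(estimator, "best_estimator_", estimator).feature_importances_


forest = RandomForestClassifier()
search = RandomizedSearchCV(forest)

string_check = type(search) == "RandomizedSearchCV"  # a class never equals a string
print(string_check)
print(importances(forest), importances(search))
print(importances_duck(forest), importances_duck(search))
```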

types python-2.7 random-forest scikit-learn grid-search

sap*_*ico

lucky-day

Votes: 0 | Answers: 1 | Views: 1633

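Random forest raises an index-out-of-bounds exception during training

I am trying to run MLlib's random forest model, but I am getting an out-of-bounds exception: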

15/09/15 01:53:56 INFO scheduler.DAGScheduler: ResultStage 5 (collect at DecisionTree.scala:977) finished in 0.147 s
15/09/15 01:53:56 INFO scheduler.DAGScheduler: Job 5 finished: collect at DecisionTree.scala:977, took 0.161129 s
15/09/15 01:53:57 INFO rdd.MapPartitionsRDD: Removing RDD 4 from persistence list
15/09/15 01:53:57 INFO storage.BlockManager: Removing RDD 4
Traceback (most recent call last):
  File "/root/random_forest/random_forest_spark.py", line 142, in <module>
    main()
  File "/root/random_forest/random_forest_spark.py", line 121, in main
    trainModel(dset)
  File "/root/random_forest/random_forest_spark.py", line 136, in trainModel
    impurity='gini', maxDepth=4, maxBins=32)
  File "/root/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 352, in trainClassifier …

python random-forest apache-spark apache-spark-mllib

fob*_*122

2016 04-25

Votes: 0 | Answers: 1 | Views: 2497

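Removing rows from a random forest training data set

I want to run my random forest on a modified version of my training data set. My training data contains several columns, one of which is called attribute and takes values from 0 to 6. My idea was to drop only the rows with value 0 and keep the rest, using the following code: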

training_data4 <- training_data3[!training_data3$attribute == "0", ]

However, when I run the random forest on this training data, I get the following error message:

rf200 <- randomForest(attribute ~ ., data=training_data4, importance=T, 
                      proximity=F, ntree=200 )

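Error in randomForest.default(m, y, ...) : Can't have empty classes in y.

I already know that something must be wrong with my training_data4, because I tried the same call with my original training set and did not have this problem.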

r random-forest

JCr*_*Cra

2017 07-18

Votes: 0 | Answers: 1 | Views: 422

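Differences between random forest implementations

Is there a performance difference between the random forest implementation in H2O and the standard random forest libraries?

Has anyone run or analyzed a comparison of these two implementations?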

classification machine-learning random-forest h2o data-science

chi*_*n s

lucky-day

Votes: 0 | Answers: 1 | Views: 1084

Predicting with sklearn's RandomForestRegressor

This is probably a very silly question, so go easy on me, but here goes.

So this is what my data looks like...

date,locale,category,site,alexa_rank,sessions,user_logins
20170110,US,1,google,1,500,5000
20170110,EU,1,google,2,400,2000
20170111,US,2,facebook,2,400,2000

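... and so on. This is just a toy data set I made up, but it resembles the original data.

I am trying to use sklearn's RandomForestRegressor.

I do the usual things, encoding the categories as labels, and I have trained my model on the first eight months of the year. Now I want to predict the logins and sessions for the ninth month. I created one model trained on logins and another trained on sessions.

My test data set has the same form: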

date,locale,category,site,alexa_rank,sessions,user_logins
20170910,US,1,google,1,500,5000
20170910,EU,1,google,2,400,2000
20170911,US,2,facebook,2,400,2000

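Ideally, I would like to pass in the test data set without the columns I need to predict, but RandomForestRegressor complains that the training and test sets have different dimensions.

When I pass the test data set in its current form, the model predicts the exact values in the sessions and user_logins columns in most cases, and otherwise the predictions differ only slightly.

I zeroed out the sessions and user_logins columns in the test data and passed it to the model, but the model then predicted almost all zeros.

Is my workflow correct? Am I using RandomForestRegressor correctly?
How can I get so close to the actual values when my test data set does contain the actual values? Are the actual values in the test data being used in the prediction?
If the model is working properly, should I get the same predictions when I zero out the columns to be predicted (sessions and user_logins)?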
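
One possible explanation for these symptoms, stated as an assumption because the training code is not shown, is that the target columns were left among the input features during training, so the model largely learned to copy them. The sketch below uses only the standard library to show the usual split: the same feature columns are used for training and prediction, and the target column appears only as `y`. The column names come from the sample data; `FEATURES`, `split` and the variable names are illustrative.

```python
import csv
import io

TRAIN_CSV = """date,locale,category,site,alexa_rank,sessions,user_logins
20170110,US,1,google,1,500,5000
20170110,EU,1,google,2,400,2000
20170111,US,2,facebook,2,400,2000
"""

# Inputs only: neither target column may appear here.
FEATURES = ["date", "locale", "category", "site", "alexa_rank"]


def split(csv_text, target):
    """Return (X, y): feature rows and the values of one target column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    X = [[row[name] for name in FEATURES] for row in rows]
    y = [int(row[target]) for row in rows]
    return X, y


X_sessions, y_sessions = split(TRAIN_CSV, "sessions")
X_logins, y_logins = split(TRAIN_CSV, "user_logins")

# Both models see the same inputs. At prediction time the test set is
# reduced to FEATURES in the same way (after encoding the string columns
# numerically), so train and test have the same number of columns.
print(X_sessions[0], y_sessions[0], y_logins[0])
```

With this layout the test set can be passed without the sessions and user_logins columns, because the models were never trained on them as inputs.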

regression machine-learning random-forest scikit-learn

Cra*_*aig

lucky-day

Votes: 0 | Answers: 1 | Views: 2044

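How to visualize a regression tree in Python

I would like to visualize a regression tree built with any of the ensemble methods in scikit-learn (gradient boosting regressor, random forest regressor, bagging regressor). I have already looked at this question and this question, which deal with classification trees. But those questions rely on a "tree" method that is not available for scikit-learn's regression models.

But it doesn't seem to produce a result. I run into problems because the regression versions of these trees have no .tree method (the method exists only for the classification versions). I would like output similar to this, but based on a tree built with scikit-learn.

I have explored the methods associated with the objects, but could not find an answer.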

python machine-learning decision-tree random-forest scikit-learn

use*_*494

2019 10-12

Votes: 0 | Answers: 1 | Views: 4911

Getting the percentage for predicted elements with Scikit Learn

I use the following code to create a scikit-learn RandomForest model, train it, and then save it:

import pandas as pd 
import sklearn
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split
import sklearn.metrics
from sklearn.ensemble import RandomForestClassifier
import pickle

data = pd.read_csv("data_30000_30.csv")
data.head() #Just to give you an idea about how my CSV file looks like
feature_cols = ["width1", "width2", "width3", "width4", "width5", "width6", "width7", "width8", "width9", "width10"]

x = data[feature_cols]
y = data.label
x_train, x_test, y_train, y_test = train_test_split(x, y , test_size = 0.3)



classifier = RandomForestClassifier(n_estimators = 100)
classifier.fit(x_train, …

python pickle random-forest scikit-learn

sin*_*ium

2018 12-21

Votes: 0 | Answers: 1 | Views: 4554

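Incrementally fitting an sklearn RandomForestClassifier

I am working with an environment that generates data at every iteration. I want to keep the model from the previous iterations and add the new data to the existing model.
I want to understand how model fitting works. Does it fit the new data to the existing model, or does it create a new model from the new data?

Calling fit on the new data: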

clf = RandomForestClassifier(n_estimators=100)
for i in customRange:
    get_data()
    clf.fit(new_train_data) #directly fitting new train data
    clf.predict(new_test_data)

Or is keeping a history of the training data and calling fit on all of the historical data the only solution?

clf = RandomForestClassifier(n_estimators=100)
global_train_data = []  # a list, so that append() works
for i in customRange:
    get_data()
    global_train_data.append(new_train_data)  # Appending new train data
    clf.fit(global_train_data)  # Fitting on global train data
    clf.predict(new_test_data)

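My goal is to train the model efficiently, so I don't want to waste CPU time relearning the model.

I would like to confirm the correct approach, and also to know whether this behavior is consistent across all classifiers.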
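
As far as I know, calling `fit` again on a scikit-learn estimator discards the previous fit and trains from scratch on whatever data is passed. Only estimators that provide `partial_fit`, or `warm_start` (which for forests adds new trees rather than updating the existing ones), can continue from earlier state. The toy class below is a hypothetical stand-in, not scikit-learn code; it only illustrates the difference between the two behaviours with a running mean.

```python
class RunningMean:
    """Toy 'model' whose learned state is the mean of the values seen."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def fit(self, values):
        # Refit from scratch: earlier state is thrown away.
        self.count = 0
        self.total = 0.0
        return self.partial_fit(values)

    def partial_fit(self, values):
        # Incremental update: earlier state is kept and extended.
        self.count += len(values)
        self.total += sum(values)
        return self

    def predict(self):
        return self.total / self.count


refit = RunningMean()
incremental = RunningMean()
for batch in ([1.0, 2.0, 3.0], [10.0, 20.0]):
    refit.fit(batch)                 # only the latest batch counts
    incremental.partial_fit(batch)   # all batches so far count

print(refit.predict(), incremental.predict())
```

In the first loop of the question, the model would behave like `refit` here: after each iteration it reflects only the most recent data.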

python machine-learning random-forest scikit-learn

Sac*_*gde

2019 03-13

Votes: 0 | Answers: 1 | Views: 1026

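Error from the "partial" function in the pdp package when using random forest

I get an error message when I use the partial function from the pdp package on my random forest model. I am trying to draw partial dependence plots with this package.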

library(randomForest)
library(pdp)

# random forest model
set.seed(101)
model_rf <- randomForest(Rec ~ ., data = sample, importance = TRUE)

# from pdp package
p1 <- partial(model_rf, pred.var = "HDI", plot = TRUE)

Then I get this error when running the last line:

Error: .f must be a function, not a randomForest.formula/randomForest object

I'm not sure what .f refers to. I found exactly the same code online, where the partial function works with a random forest model.

r partial random-forest

Author

2019 07-18

Votes: 0 | Answers: 1 | Views: 1987

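How do I stop do.call() from printing all the data frame entries in the model's "call" element?

If I run a model using do.call() with the arguments supplied as a list, the "call" returned with the model lists every entry of any data frame among the arguments. For large data sets this prints extremely long model output.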

library(randomForest)
data(iris)
do.call(randomForest, list(Species ~ ., data=iris))
#Call:
# randomForest(formula = Species ~ ., data = structure(list(Sepal.Length = c(5.1, 4.9, 4.7,
#4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, 5.4, 4.8,...

Is it possible to keep the data frame entries from being printed, so that the output matches that of a normal model call, such as this one for randomForest?

randomForest(Species ~ ., data=iris)
#Call:
# randomForest(formula = Species ~ ., data = iris)

I could try to rebuild and replace the "call" slot in the model object after assigning it, or set it to NULL, but that seems like a poor solution.

mod <- do.call(randomForest, list(Species ~ ., data=iris))
mod$call <- 'randomForest(formula = Species ~ ., data = iris)'
mod
#Call:
# "randomForest(formula = Species ~ ., data = iris)"

I'm sure there is a better, simpler solution, but I can't find it. Thanks in advance for any help.

r random-forest

Ben*_*Ben

lucky-day

Votes: 0 | Answers: 1 | Views: 78

Tag statistics

random-forest ×13

python ×6

scikit-learn ×6

machine-learning ×4

r ×3

algorithm ×1

apache-spark ×1

apache-spark-mllib ×1

classification ×1

data-science ×1

decision-tree ×1

generator ×1

grid-search ×1

h2o ×1

java ×1

partial ×1

pickle ×1

python-2.7 ×1

regression ×1

types ×1

weka ×1
