Parallel random forest with doSMP and foreach drastically increases memory usage (on Windows)

Question

use*_*616 7 memory parallel-processing r random-forest

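When I run a random forest serially, it uses 8 GB of RAM on my system; when I run it in parallel, it uses more than twice as much (18 GB). How can I keep it at 8 GB when running in parallel? Here is the code: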

install.packages('foreach')
install.packages('doSMP')
install.packages('randomForest')

library('foreach')
library('doSMP')
library('randomForest')

NbrOfCores <- 8 
workers <- startWorkers(NbrOfCores) # number of cores
registerDoSMP(workers)
getDoParName() # check name of parallel backend
getDoParVersion() # check version of parallel backend
getDoParWorkers() # check number of workers


# Create the data and set the options for the random forests.
# If you run this, please adapt it so it won't crash your system! This amount of data uses up to 18 GB of RAM.
x <- matrix(runif(500000), 100000)
y <- gl(2, 50000)
# Options
set.seed(1)
ntree <- 1000
ntree2 <- ntree / NbrOfCores  # trees per worker


gc()

# Run the serial version of the random forest

system.time(
  rf1 <- randomForest(x, y, ntree = ntree)
)


gc()


# Run the parallel version of the random forest:
# each worker grows ntree2 trees and the partial forests are merged with combine()

system.time(
  rf2 <- foreach(ntree = rep(ntree2, NbrOfCores), .combine = combine,
                 .packages = "randomForest") %dopar%
    randomForest(x, y, ntree = ntree)
)


Answer 1

cra*_*ola 0

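I think what is happening is the following. When your parent process spawns child processes, memory is shared, i.e. memory usage does not increase significantly. However, once the child processes start building random forests, they create many new intermediate objects that are not in shared memory and may be quite large.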
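A quick check, which is not part of the original answer, is consistent with this: the inputs from the question are small, so even the worst case of every worker holding a private copy of `x` and `y` would account for only a few tens of megabytes, not the extra 10 GB. The sketch below uses base R only; `n_workers` is an illustrative name mirroring `NbrOfCores` from the question.

```r
# Size of the inputs from the question (base R only).
set.seed(1)
x <- matrix(runif(500000), 100000)  # 100,000 rows x 5 columns of doubles
y <- gl(2, 50000)                   # two-level factor with 100,000 values

n_workers <- 8  # illustrative; same value as NbrOfCores in the question

x_bytes <- as.numeric(object.size(x))
y_bytes <- as.numeric(object.size(y))

# Worst case: every worker holds its own private copy of x and y.
copies_mb <- n_workers * (x_bytes + y_bytes) / 1024^2

cat("x:", format(object.size(x), units = "MB"), "\n")
cat("y:", format(object.size(y), units = "MB"), "\n")
cat("private copies of x and y for all workers:", round(copies_mb, 1), "MB\n")
```

Since the data cannot explain the growth, whatever each worker allocates while growing its trees is the more likely source.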

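So my answer, disappointingly, is that there is probably no easy way around this, at least with the randomForest package, although I would be very interested if anyone knows of one.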
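One thing that might be worth experimenting with, though it is an assumption on my part and not something verified against the 18 GB case, is limiting how large each tree can grow. `randomForest()` accepts `maxnodes` (maximum number of terminal nodes per tree) and `sampsize` (rows drawn per tree). The sketch below uses a scaled-down stand-in for the question's data (the names `x_small`, `y_small`, `rf_default` and `rf_capped` are illustrative) and compares the size of the fitted objects.

```r
library(randomForest)

set.seed(1)
# Scaled-down stand-in for the question's data so the sketch runs quickly.
n <- 2000
x_small <- matrix(runif(n * 5), n)
y_small <- gl(2, n / 2)

# Default settings: nodesize is 1 for classification, so trees grow deep.
rf_default <- randomForest(x_small, y_small, ntree = 50)

# Capped trees: at most 16 terminal nodes each, grown on 500 sampled rows.
rf_capped <- randomForest(x_small, y_small, ntree = 50,
                          maxnodes = 16, sampsize = 500)

size_default <- as.numeric(object.size(rf_default))
size_capped  <- as.numeric(object.size(rf_capped))
cat("default forest:", round(size_default / 1024), "KB\n")
cat("capped forest: ", round(size_capped / 1024), "KB\n")
```

Capping trees changes the fitted model and may cost accuracy, and the size of the returned object is only a proxy for peak memory during fitting, so this is a trade-off to measure rather than a fix.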

Archived:	14 years, 1 month ago
Views:	1,726
Last recorded:	14 years, 1 month ago