Shared memory in parallel foreach in R

Question


Sta*_*lav 24 parallel-processing foreach r r-bigmemory

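Problem description:

I have a large matrix c that fits in RAM. My goal is to give parallel workers read-only access to it. However, as soon as I create the connections, RAM usage increases dramatically, whether I use doSNOW, doMPI, big.matrix, etc.

Is there a way to properly create shared memory that all processes can read from, without creating a local copy of all the data?

Example: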

libs<-function(libraries){# Installs missing libraries and then loads them
  for (lib in libraries){
    if( !is.element(lib, .packages(all.available = TRUE)) ) {
      install.packages(lib)
    }
    library(lib,character.only = TRUE)
  }
}

libra<-list("foreach","parallel","doSNOW","bigmemory")
libs(libra)

# create a matrix of approximately 1 GB
c<-matrix(runif(10000^2),10000,10000)
# convert it to a big.matrix
x<-as.big.matrix(c)
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections    
cl <- makeCluster(detectCores ())
registerDoSNOW(cl)
out<-foreach(linID = 1:10, .combine=c) %dopar% {
  #load bigmemory
  require(bigmemory)
  # attach the matrix via shared memory??
  m <- attach.big.matrix(mdesc)
  # dummy expression to test data acquisition
  c<-m[1,1]
}
closeAllConnections()


Memory: [image not preserved]



Answer 1

NoB*_*own 14

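I think the solution to the problem can be found in a post by Steve Weston, the author of the foreach package, here. There he states:

The doParallel package will auto-export variables to the workers that are referenced in the foreach loop.

So I think the problem is that in your code your big matrix c is referenced in the assignment c<-m[1,1]. Try xyz <- m[1,1] instead and see what happens.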
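The effect of the variable name can be illustrated with base R alone. This is only a rough sketch, assuming the export decision depends on which symbol names appear in the loop body; foreach's own analysis is not reproduced here, and `body_bad` and `body_good` are hypothetical names for the two versions of the loop body.

```r
# Two versions of the loop body, kept as unevaluated expressions
body_bad  <- quote({ m <- attach.big.matrix(mdesc); c <- m[1, 1] })
body_good <- quote({ m <- attach.big.matrix(mdesc); xyz <- m[1, 1] })

# all.vars() lists the variable names used in an expression
# (names that only appear as called functions are skipped by default)
vars_bad  <- all.vars(body_bad)
vars_good <- all.vars(body_good)

"c" %in% vars_bad   # TRUE: a global object named c is a candidate for export
"c" %in% vars_good  # FALSE: the large matrix c is never mentioned
```

With the second body, nothing in the loop refers to the large matrix by name, so there is no reason for it to be shipped to the workers.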

Here is an example with a file-backed big.matrix:

# assumes foreach, doSNOW and bigmemory are loaded as in the question
# create a matrix of approximately 1 GB
n <- 10000
m <- 10000
c <- matrix(runif(n*m),n,m)
# convert it to a file-backed big.matrix
x <- as.big.matrix(x = c, type = "double", 
                 separated = FALSE, 
                 backingfile = "example.bin", 
                 descriptorfile = "example.desc")
# get a description of the matrix
mdesc <- describe(x)
# Create the required connections
cl <- makeCluster(detectCores())
registerDoSNOW(cl)
## 1) No referencing
# .packages loads bigmemory on each worker so attach.big.matrix() is available
out <- foreach(linID = 1:4, .combine=c, .packages = "bigmemory") %dopar% {
  t <- attach.big.matrix("example.desc")
  for (i in seq_len(30L)) {
    for (j in seq_len(m)) {
      y <- t[i,j]
    }
  }
  return(0L)
}


## 2) Referencing
out <- foreach(linID = 1:4, .combine=c, .packages = "bigmemory") %dopar% {
  invisible(c) ## c is referenced and thus exported to workers
  t <- attach.big.matrix("example.desc")
  for (i in seq_len(30L)) {
    for (j in seq_len(m)) {
      y <- t[i,j]
    }
  }
  return(0L)
}
closeAllConnections()


Answer 2

Ada*_*ski 5

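Alternatively, if you are on Linux/Mac and want copy-on-write (CoW) shared memory, use forks. First load all your data in the main process, then launch the worker processes (forks) with the general-purpose function mcparallel from the parallel package.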
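A minimal sketch of the fork approach for read-only data, assuming a Unix-like system (forking is not available on Windows). The matrix `big` and the job names `row1` and `row2` are hypothetical stand-ins, not part of the original answer.

```r
library(parallel)  # mcparallel()/mccollect() rely on fork

# Hypothetical stand-in for the large read-only matrix, created before forking
big <- matrix(runif(1e6), nrow = 1000, ncol = 1000)

# Each child is a fork of this process and reads `big` directly;
# pages are only duplicated if a child writes to them
jobs <- list(
  mcparallel(sum(big[1, ]), name = "row1"),
  mcparallel(sum(big[2, ]), name = "row2")
)

# Wait for the children and gather their return values by job name
res <- mccollect(jobs)
res[["row1"]]
```

Nothing is exported or serialized on the way in; only the small return values travel back to the parent.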

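You can collect their results with mccollect, or use truly shared memory (the Rdsm library builds on bigmemory, which is used directly here), like this: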

library(parallel)
library(bigmemory) # for shared variables
size <- 1 # number of rows; left undefined in the original, 1 is enough here
shared <- bigmemory::big.matrix(nrow = size, ncol = 1, type = 'double')
shared[1] <- 1 # initialise the shared memory with some number

job <- mcparallel({shared[1] <- 23}) # ...change it in another forked process
shared[1,1] # ...and confirm that it gets changed (once the child has run)
# [1] 23


If you delay the write, you can confirm that the value really is updated in the background:

fn <- function()
{
  Sys.sleep(1) # one-second delay
  shared[1] <- 11
}

job <- mcparallel(fn())
shared[1] # execute immediately after the previous command
# [1] 23
shared[1,1] # execute after one second
# [1] 11
mccollect() # to reap all forked processes (and possibly collect their output)


To control concurrency and avoid race conditions, use locks:

library(synchronicity) # for locks
m <- boost.mutex() # create a mutex "m"

bad.incr <- function() # this function does not protect the shared resource with a lock
{
  a <- shared[1]
  Sys.sleep(1)
  shared[1] <- a+1
}

good.incr <- function() # the read-modify-write is done while holding the mutex
{
  lock(m)
  a <- shared[1]
  Sys.sleep(1)
  shared[1] <- a+1
  unlock(m)
}

shared[1] <- 1
for (i in 1:5) job <- mcparallel(bad.incr())
shared[1] # you can verify that the value was not increased 5 times, due to race conditions

mccollect() # to clear all forked processes, not to get the values
shared[1] <- 1
for (i in 1:5) job <- mcparallel(good.incr())
shared[1] # as expected, after about 5 seconds of waiting you eventually get 6
# [1] 6

mccollect()


Edit:

I changed the example by replacing Rdsm::mgrmakevar with bigmemory::big.matrix. mgrmakevar calls big.matrix internally anyway, and we don't need anything more.