Running multiple R functions in parallel

hm6*_*hm6 | 6 | foreach, r, data.table, doparallel

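I have a dataset with a few numeric columns and more than 100 million rows, stored as a data.table object. I want to run grouped operations on some columns, grouping by other columns. For example, counting the unique elements of column "a" within each category of column "d":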

my_data[, a_count := uniqueN(col_a), col_d]
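To make the operation concrete, here is a small sketch on made-up data. The table `toy` and its values are assumptions for illustration only; the column names follow the ones used in the question.

```r
library(data.table)

# Toy stand-in for my_data; the values are invented for illustration.
toy <- data.table(
  callId = 1:6,
  col_a  = c(1, 1, 2, 3, 3, 3),
  col_d  = c("x", "x", "x", "y", "y", "y")
)

# Add a_count by reference: the number of distinct col_a values
# within each col_d group, repeated on every row of that group.
toy[, a_count := uniqueN(col_a), by = col_d]
print(toy)
```

Every row in group "x" gets the distinct count for that group, and likewise for "y", so the result has the same number of rows as the input.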


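I have many operations like this. They are independent of each other, so it would be great to run them in parallel. I found that the following code can run different functions in parallel.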

library(parallel)
library(data.table)

# Each function adds one per-group distinct count and returns it next to callId.
fun1 = function(x){
  x[, a_count := uniqueN(col_a), by = col_d]
  return(x[, .(callId, a_count)])
}
fun2 = function(x){
  x[, b_count := uniqueN(col_b), by = col_d]
  return(x[, .(callId, b_count)])
}
fun3 = function(x){
  x[, c_count := uniqueN(col_c), by = col_d]
  return(x[, .(callId, c_count)])
}

tasks = list(job1 = fun1, job2 = fun2, job3 = fun3)

cl = makeCluster(3)
# Load data.table on every worker instead of exporting its functions by name,
# then ship the data; fun1..fun3 are sent to the workers as elements of tasks.
clusterEvalQ(cl, library(data.table))
clusterExport(cl, 'my_data')

out = clusterApply(cl, tasks, function(f) f(my_data))
stopCluster(cl)


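How can I improve this solution? For example, it would be better to pass only the columns each function needs rather than the entire data frame.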
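One possible direction, as a rough sketch rather than a tested answer for data of this size: split off a two-column table per task (the grouping column plus one value column), let each worker return only a per-group summary, and join the summaries back onto the full table. The toy `my_data`, the helper `count_unique`, and the names `pieces` and `summary_dt` are assumptions made up for this illustration.

```r
library(parallel)
library(data.table)

# Toy stand-in for my_data; the values are invented for illustration.
my_data <- data.table(
  callId = 1:6,
  col_a  = c(1, 1, 2, 3, 3, 3),
  col_b  = c(5, 6, 7, 8, 8, 9),
  col_c  = c(0, 0, 0, 1, 2, 3),
  col_d  = c("x", "x", "x", "y", "y", "y")
)

value_cols <- c("col_a", "col_b", "col_c")

# One small table per task, so each worker receives only col_d
# and the single column it has to count.
pieces <- lapply(value_cols, function(v) my_data[, c("col_d", v), with = FALSE])

# Hypothetical helper: one row per col_d group holding the distinct
# count of the remaining column (.SD is every non-grouping column).
count_unique <- function(dt) dt[, lapply(.SD, uniqueN), by = col_d]

cl <- makeCluster(3)
clusterEvalQ(cl, library(data.table))  # make uniqueN available on the workers
counts <- clusterApply(cl, pieces, count_unique)
stopCluster(cl)

# Combine the small per-group summaries, then attach them with an update join.
summary_dt <- Reduce(function(x, y) merge(x, y, by = "col_d"), counts)
setnames(summary_dt, value_cols, c("a_count", "b_count", "c_count"))
my_data[summary_dt, on = "col_d",
        `:=`(a_count = i.a_count, b_count = i.b_count, c_count = i.c_count)]
print(my_data)
```

With this layout the workers send back one row per group instead of one row per input row, which keeps the return traffic small. Whether it actually beats a single in-process data.table call depends on how expensive it is to copy the columns to the workers, so it is worth benchmarking on a sample first.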
