"These samplers cannot be used in parallel code"

Question

hej*_*seb 4 r rcpp

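I am reading the vignette of the rgen package, which provides headers for sampling from some common distributions. In the first paragraph, it says:

Note that these samplers, just like the ones in armadillo, cannot be used in parallelized code, because the underlying generation routines rely on R calls, which are single-threaded.

This was news to me, and I have been using RcppArmadillo for quite a while now. I was wondering whether someone could elaborate on this point (or provide a reference where I could read up on the issue). I am particularly interested in learning what "cannot be used" means here: will the results be wrong, or will the code simply not run in parallel?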

Answer 1

Ral*_*ner 5

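These functions use R's random number generator, which must not be used in parallel code, since doing so leads to undefined behavior. Undefined behavior can lead to virtually anything. From my point of view, you are lucky if the program crashes, since that clearly tells you that something is going wrong.

The HPC task view lists some random number generators that are suitable for parallel computation. But you cannot easily use them with the distributions provided by rgen or RcppDist. Instead, one could do the following:

Copy the function for the multivariate normal distribution from rgen and adjust its signature so that it takes a std::function<double()> as the source of N(0, 1) distributed random numbers.
Use a fast RNG instead of R's RNG.
Use the same fast RNG in parallel mode.

In code, as a quick hack: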
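
The first step, passing in the source of N(0, 1) draws as a std::function<double()>, can be illustrated in plain standard C++ without R or Armadillo. This is only a minimal sketch: the names draw_normals and draw_with_seed are hypothetical, and std::mt19937_64 with std::normal_distribution stands in for the fast RNG used in the answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

// Hypothetical helper: fills a vector with n draws taken from whatever
// N(0, 1) source the caller passes in. The sampler itself holds no RNG state.
std::vector<double> draw_normals(std::size_t n,
                                 const std::function<double()>& source) {
    std::vector<double> out(n);
    for (double& x : out) {
        x = source();
    }
    return out;
}

// Hypothetical caller: owns the engine and the distribution, and hands them
// to the sampler through a lambda. Swapping the RNG only changes this function.
std::vector<double> draw_with_seed(std::size_t n, std::uint64_t seed) {
    std::mt19937_64 engine(seed);                      // stand-in for a fast RNG
    std::normal_distribution<double> dist(0.0, 1.0);   // N(0, 1)
    return draw_normals(n, [&]() { return dist(engine); });
}
```

Because the sampler only sees a callable, the same function can be fed by R's norm_rand in serial code or by a generator owned by a single thread in parallel code.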

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
#include <dqrng_distribution.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
#include <functional>  // std::function
#include <vector>      // std::vector

inline arma::mat rmvnorm(unsigned int n, const arma::vec& mu, const arma::mat& S,
                         std::function<double()> rnorm = norm_rand){
  unsigned int ncols = S.n_cols;
  arma::mat Y(n, ncols);
  Y.imbue( rnorm ) ;
  return arma::repmat(mu, 1, n).t() + Y * arma::chol(S);
}

// [[Rcpp::export]]
arma::mat defaultRNG(unsigned int n, const arma::vec& mu, const arma::mat& S) {
  return rmvnorm(n, mu, S);
}

// [[Rcpp::export]]
arma::mat serial(unsigned int n, const arma::vec& mu, const arma::mat& S) {
  dqrng::normal_distribution dist(0.0, 1.0);
  dqrng::xoshiro256plus rng(42);
  return rmvnorm(n, mu, S, [&](){return dist(rng);});
}

// [[Rcpp::export]]
std::vector<arma::mat> parallel(unsigned int n, const arma::vec& mu, const arma::mat& S, unsigned int ncores = 1) {
  dqrng::normal_distribution dist(0.0, 1.0);
  dqrng::xoshiro256plus rng(42);
  std::vector<arma::mat> res(ncores);

  #pragma omp parallel num_threads(ncores)
  {
    dqrng::xoshiro256plus lrng(rng);      // make thread local copy of rng 
    lrng.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps 
    res[omp_get_thread_num()] = rmvnorm(n, mu, S, [&](){return dist(lrng);});
  }
  return res;
}


/*** R
set.seed(42)
N <- 1000000
M <- 100
mu <- rnorm(M)
S <- matrix(rnorm(M*M), M, M)
S <- S %*% t(S)
system.time(defaultRNG(N, mu, S))
system.time(serial(N, mu, S))
system.time(parallel(N/2, mu, S, 2))
*/

Results:

> system.time(defaultRNG(N, mu, S))
   user  system elapsed 
  6.984   1.380   6.881 

> system.time(serial(N, mu, S))
   user  system elapsed 
  4.008   1.448   3.971 

> system.time(parallel(N/2, mu, S, 2))
   user  system elapsed 
  4.824   2.096   3.080

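Here the real performance improvement comes from using a faster RNG, which is understandable since the workload is dominated by generating many random numbers rather than by matrix operations. If I shift the balance toward matrix operations by using N <- 100000 and M <- 1000, I get: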

> system.time(defaultRNG(N, mu, S))
   user  system elapsed 
 16.740   1.768   9.725 

> system.time(serial(N, mu, S))
   user  system elapsed 
 13.792   1.864   6.792 

> system.time(parallel(N/2, mu, S, 2))
   user  system elapsed 
 14.112   3.900   5.859

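Here we clearly see that in all cases the user time is greater than the elapsed time. The reason is that I am using a parallel BLAS implementation (OpenBLAS). So there are many factors to consider before deciding on an approach.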
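
The parallel function in the code gives each thread its own copy of the generator and advances that copy by a different number of jumps, so that the threads do not share RNG state. A rough standard C++ analogue of that idea is sketched below. It is an assumption-laden illustration rather than the dqrng method: std::mt19937_64 has no jump(), so discard() is used to skip ahead instead, and the helper name worker_block is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: returns the raw 64-bit outputs that worker number
// `worker` would consume, assuming every worker uses exactly `block` outputs.
// The base engine is only copied, never modified, so the calls for different
// workers are independent of each other and could run on separate threads.
std::vector<std::uint64_t> worker_block(const std::mt19937_64& base,
                                        unsigned worker,
                                        std::size_t block) {
    std::mt19937_64 local(base);  // private copy of the generator
    // Skip the outputs reserved for workers 0 .. worker-1 (stand-in for jump()).
    local.discard(static_cast<unsigned long long>(worker) * block);
    std::vector<std::uint64_t> out(block);
    for (std::uint64_t& x : out) {
        x = local();
    }
    return out;
}
```

With this layout the blocks of workers 0, 1, 2, ... are consecutive, non-overlapping slices of the single serial stream. Note that discard() steps through the skipped outputs one by one, so generators with a dedicated jump or long-jump function are the better fit for large offsets.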

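OK, I think I see the difference now. If I use `foreach` or `parSapply` to run a function foo implemented in C++, I am fine as long as I set up the RNG streams sensibly, but if I make parallel calls inside C++ I need to be careful. My plan is to go C++ all the way, so even though I am safe for now, your answer is very valuable for that transition. (2 upvotes)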
