在R中有效地应用sample()

Mr.*_*Zen 2 r sample probability apply

我需要在给定具有行方式结果概率的矩阵的情况下对结果变量进行采样.

set.seed(1010) #reproducibility

#create a matrix of probabilities
#three possible outcomes, 10.000 cases
probabilities <- matrix(runif(10000*3),nrow=10000,ncol=3)
probabilities <- probabilities / Matrix::rowSums(probabilities)
Run Code Online (Sandbox Code Playgroud)

我能想出的最快方法是apply()和sample()的组合.

#row-wise sampling using these probabilities
classification <- apply(probabilities, 1, function(x) sample(1:3, 1, prob = x))
Run Code Online (Sandbox Code Playgroud)

但是,在我正在做的事情中,这是计算瓶颈.您是否知道如何加快此代码速度/如何更有效地进行采样?

谢谢!

duc*_*ayr 5

RLave的意见,即Rcpp可能是要走的路是现货(你还需要RcppArmadillo进行sample()); 我使用以下C++代码来创建这样的函数:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadilloExtensions/sample.h>

using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector sample_matrix(NumericMatrix x, IntegerVector choice_set) {
    int n = x.nrow();
    IntegerVector result(n);
    for ( int i = 0; i < n; ++i ) {
        result[i] = RcppArmadillo::sample(choice_set, 1, false, x(i, _))[0];
    }
    return result;
}
Run Code Online (Sandbox Code Playgroud)

然后我通过我的R会话使该功能可用

Rcpp::sourceCpp("sample_matrix.cpp")
Run Code Online (Sandbox Code Playgroud)

现在我们可以在R中针对您的初始方法以及其他使用建议进行测试,purrr::map()并且lapply():

set.seed(1010) #reproducibility

#create a matrix of probabilities
#three possible outcomes, 10.000 cases
probabilities <- matrix(runif(10000*3),nrow=10000,ncol=3)
probabilities <- probabilities / Matrix::rowSums(probabilities)
probabilities_list <- split(probabilities, seq(nrow(probabilities)))

library(purrr)
library(microbenchmark)

microbenchmark(
    apply = apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
    map = map(probabilities_list, function(x) sample(1:3, 1, prob = x)),
    lapply = lapply(probabilities_list, function(x) sample(1:3, 1, prob = x)),
    rcpp = sample_matrix(probabilities, 1:3),
    times = 100
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval
  apply 307.44702 321.30051 339.85403 342.36421 350.86090 434.56007   100
    map 254.69721 265.10187 282.85592 286.21680 295.48886 363.95898   100
 lapply 249.68224 259.70178 280.63066 279.87273 287.10062 691.21359   100
   rcpp  12.16787  12.55429  13.47837  13.81601  14.25198  16.84859   100
 cld
   c
  b 
  b 
 a  
Run Code Online (Sandbox Code Playgroud)

节省的时间相当可观.

  • @ Mr.Zen当然可以!我很快就会编辑以包含该功能. (2认同)