稀疏矩阵 - 稀疏矩阵乘法算法.R的性能改进

Gho*_*osh 5 r openmp sparse-matrix armadillo rcpp

我的mac的R与openblas相关联.当我使用Armadillo在R或Rcpp中执行稀疏稀疏乘法时查看"%CPU"使用情况时,与密集密集乘法不同,它似乎不会使用多线程.速度方面,R或Armadillo中的单线程稀疏稀疏乘法似乎也比Matlab慢.

为了解决这个问题,我已经实现了FG Gustavson的算法(https://dl.acm.org/citation.cfm?id=355796),用于使用Armadillo的spMat容器在Rcpp中执行稀疏稀疏矩阵乘法.

我可以看到的改善(请参见下文),如果我不理它是直接执行该算法的行进行排序,然而,标准订货使得它比的r(慢市价修改mtall的评论).我不是Rcpp/RcppArmadillo/C++的专家,我正在寻找两个具体方面的帮助:

  • 以编程方式,如何基于单线程应用程序使sp_sp_gc_ord函数更高效,更快速?

  • 我在多线程跛脚企图sp_sp_gc_ord的OpenMP导致R键崩溃.我已经注释掉了下面的omp命令.我看过关于OpenMP http://gallery.rcpp.org/tags/openmp/的 Rcpp画廊讨论,但无法弄清楚问题

我将不胜感激任何帮助.下面是一个可重现的代码示例和相应的微基准测试:

#### Rcpp functions

#include <RcppArmadillo.h>
#include<omp.h>
#include<Rcpp.h>

using namespace Rcpp;
using namespace arma;

// [[Rcpp::plugins(openmp)]]
// [[Rcpp::depends(RcppArmadillo)]]


// [[Rcpp::export]]
sp_mat sp_sp_gc_ord(const arma::sp_mat &A, const arma::sp_mat &B, double p){

  // This function evaluates A * B where both A & B are sparse and the resultant
  // product is also sparse

  // define matrix sizes
  const int mA= A.n_rows;
  const int nB= B.n_cols;

  // number of non-zeros in the resultant matrix
  const int nnzC = ceil(mA * nB * p);

  // initialize colptr, row_index and value vectors for the resultant sparse matrix
  urowvec colptrC(nB+1);
  colptrC.zeros();
  uvec rowvalC(nnzC);
  rowvalC.zeros();
  colvec nzvalC(nnzC);

  //setenv("OMP_STACKSIZE","500M",1);

  // counters and other variables
  unsigned int i, jp, j, kp, k, vp; 
  unsigned int ip = 0;
  double nzB, nzA; 
  ivec xb(mA);
  xb.fill(-1);
  vec x(mA);

  // loop logic: outer loop over columns of B and inner loop over columns of A and then aggregate

  //  #pragma omp parallel for shared(colptrC,rowvalC,nzvalC,x,xb,ip,A,B) private(j,nzA,nzB,kp,i,jp,kp,k,vp) default(none) schedule(auto) 
  for(i=0; i< nB; i++) { 

    colptrC.at(i) = ip;

    for ( jp = B.col_ptrs[i]; jp < B.col_ptrs[i+1]; jp++) {

      j = B.row_indices[jp];
      nzB = B.values[jp];

      for ( kp = A.col_ptrs[j]; kp < A.col_ptrs[j+1]; kp++ ){

        k = A.row_indices[kp];
        nzA = A.values[kp];

        if (xb.at(k) != i){
          rowvalC.at(ip) = k;
          ip +=1;
          // Rcpp::print(wrap(ip));
          xb.at(k) = i;
          x.at(k) = nzA * nzB;
        } else {
          x.at(k) += nzA * nzB;
        }
      }
    }

    // put in the value vector of resultant matrix

    if(ip>0){        
      for ( vp= colptrC.at(i); vp <= (ip-1); vp++ ) {
        nzvalC.at(vp) = x(rowvalC.at(vp));
      }

    }
  }

  // resize and put in the spMat container
  colptrC.at(nB) = ip;
  sp_mat C(rowvalC.subvec(0,(ip-1)),colptrC,nzvalC.subvec(0,(ip-1)),mA,nB);

  // Gustavson's algorithm produces unordered rows for each column: a standard way to address this is: (X.t()).t()

  return (C.t()).t();
}  


 // [[Rcpp::export]]
sp_mat sp_sp_arma(const sp_mat &A, const sp_mat &B){

  return A * B; 

} 


// [[Rcpp::export]]
mat dense_dense_arma(const mat &A, const mat &B){

  return A * B; 

} 

#### End 
Run Code Online (Sandbox Code Playgroud)

R中对应的微基准部分:

#### Microbenchmark 

library(Matrix)
library(microbenchmark) 

## define two matrices
m<- 1000
n<- 6000
p<- 2000

A<-  matrix(runif(m*n),m,n)
B<-  matrix(runif(n*p),n,p)
A[abs(A)> .01] = B[abs(B)> .01] = 0
A <- as(A,'dgCMatrix')
B<- as(B,'dgCMatrix')
Adense<- as.matrix(A)
Bdense<- as.matrix(B)

## sp_sp_gc is the function without ordering 

microbenchmark(sp_sp_gc(A,B,.5),sp_sp_arma(A,B),A%*%B,
dense_dense_arma(Adense,Bdense),Adense %*% Bdense,Adense %*% B, times=100)

Unit: milliseconds
                             expr       min        lq      mean    median        uq       max neval
              sp_sp_gc(A, B, 0.5)  16.09809  21.75001  25.76436  24.44657  26.96300  99.30778   100
          sp_sp_gc_ord(A, B, 0.5)  36.78781  44.64558  49.82102  47.64348  51.87361 116.85013   100
                 sp_sp_arma(A, B)  47.45203  52.77132  59.37077  59.24010  62.41710  86.15647   100
                          A %*% B  23.64307  28.99649  32.88566  32.10017  35.21816  59.16251   100
 dense_dense_arma(Adense, Bdense) 286.22358 302.95170 345.66766 317.75786 340.50143 862.15116   100
                Adense %*% Bdense 292.32099 317.10795 342.48345 329.80950 342.21333 697.56468   100
                     Adense %*% B 167.87248 186.63499 219.11872 195.19197 212.50286 843.17172   100   
####
Run Code Online (Sandbox Code Playgroud)

sessionInfo():

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /usr/local/Cellar/openblas/0.3.3/lib/libopenblas_haswellp-r0.3.3.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Matrix_1.2-14           RcppArmadillo_0.8.500.0 Rcpp_0.12.18           

loaded via a namespace (and not attached):
[1] compiler_3.5.1  grid_3.5.1      lattice_0.20-35
Run Code Online (Sandbox Code Playgroud)

RCPP和RcppArmadillo从源代码安装的Mac安装clang4继没穿外衣,的链接https://github.com/coatless/r-macos-rtools