Why is Armadillo slower than C-style arrays for a simple row-wise computation task

Question

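I am currently computing a small quantity for every value of a large matrix (millions of rows, fewer than 1000 columns), treating each row independently.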

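More precisely, for each value M(i,j) in row i, column j of this matrix, the quantity is [M(i,j) - mean(i,s)] / std(i,s), where s is the subset M(i,:) - j. In other words, s is the subset of all values in row i except the value at column j.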
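The quantity described above can be sketched without any matrix library. This is a minimal illustration under stated assumptions, not the code from the question: `leave_one_out_scores` is a hypothetical name, the row is assumed to be a plain `std::vector<double>`, and the standard deviation is assumed to use the N-1 normalization. The sketch keeps the row's sum and sum of squares, so leaving out element j costs O(1) instead of a fresh pass over the row. The extra `sqrt(1 + 1/(ncols-1))` factor that appears in the question's code is left out to match the formula as stated.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each x[j], returns (x[j] - mean(s)) / std(s),
// where s is x without element j and std uses the N-1 (sample) normalization.
std::vector<double> leave_one_out_scores(const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(n >= 3);  // the subset needs at least two values for a sample std

    // One pass over the row: total sum and total sum of squares.
    double sum = 0.0, sum_sq = 0.0;
    for (double v : x) {
        sum += v;
        sum_sq += v * v;
    }

    const double m = static_cast<double>(n - 1);  // size of each subset s
    std::vector<double> out(n);
    for (std::size_t j = 0; j < n; ++j) {
        const double s = sum - x[j];              // sum over s
        const double sq = sum_sq - x[j] * x[j];   // sum of squares over s
        const double mean = s / m;
        // Sample variance of s: (sum of squares - m * mean^2) / (m - 1).
        const double var = (sq - m * mean * mean) / (m - 1.0);
        out[j] = (x[j] - mean) / std::sqrt(var);
    }
    return out;
}
```

Calling `leave_one_out_scores` once per row replaces the inner loops over k. One caveat: the sum-of-squares shortcut can lose precision when the values have a large mean relative to their spread, in which case a two-pass computation per subset is the safer choice.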

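I compared two implementations, one using C-style arrays and one using Armadillo, and Armadillo is roughly twice as slow in terms of execution time. I expected similar or slightly slower execution times, but plain C arrays seem to improve performance significantly.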

Is there a particular reason for this, or something I have missed somewhere? Here is an example, compiled with: -O2 -lstdc++ -DARMA_DONT_USE_WRAPPER -lopenblas -llapack -lm. I also tried ARMA_NO_DEBUG, without success.

    #include <string>
    #include <vector>
    #include <iostream>
    #include <fstream>
    #include <algorithm>
    #include <cmath>    // std::sqrt
    #include <cstdlib>  // malloc, free
    #include <armadillo>
    #include <chrono>

    using namespace std::chrono;

    /***************************
     * main()
     ***************************/
    int main( int argc, char *argv[] )
    {
        unsigned nrows = 2000000; //number of rows
        unsigned ncols = 100;     //number of cols
        const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
        const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_cols-1, huge_mat.n_cols); //create a vector of [0,...,n]
        arma::rowvec inds = arma::zeros<arma::rowvec>( huge_mat.n_cols-1 ); //-1 since we remove only one value at each step.
        arma::colvec simuT = arma::zeros<arma::colvec>( ncols ); //let's store the results in this simuT vector.

        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        //compute some normalization over each value of line of this huge matrix:
        for(unsigned i=0; i < nrows; i++) {
            const arma::rowvec current_line = huge_mat.row(i); //extract current line
            //for each observation in current_line:
            for(unsigned j=0; j < ncols; j++) {
                //Take care of side effects first:
                if( j == 0 )
                    inds = current_line(arma::span(1, ncols-1));
                else if( j == 1 ) {
                    inds(0) = current_line(0);
                    inds(arma::span(1, ncols-2)) = current_line( arma::span(2, ncols-1) );
                } else
                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

                //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)])
                //can compute the mean and std first... for each line.
                simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) ncols-1) ) * arma::stddev(inds) );
            }
        }
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = duration_cast<seconds>( t2 - t1 ).count();
        std::cout << "ARMADILLO: " << duration << " secs\n";

        //------------------PLAIN C Array
        double *Mat_full;
        double *output;
        unsigned int i,j,k;
        double mean=0, stdd=0;
        double sq_diff_sum = 0, sum=0;
        double diff = 0;
        Mat_full = (double *) malloc(ncols * nrows * sizeof(double));
        output = (double *) malloc(nrows * ncols * sizeof(double));
        std::vector< std::vector<double> > V(huge_mat.n_rows);

        //Some UGLY copy from arma::mat to double* using a vector:
        for (size_t i = 0; i < huge_mat.n_rows; ++i)
            V[i] = arma::conv_to< std::vector<double> >::from(huge_mat.row(i));

        //then dump to Mat_full array:
        for (i=0; i < V.size(); i++)
            for (j=0; j < V[i].size(); j++)
                Mat_full[i + huge_mat.n_rows * j] = V[i][j];

        t1 = high_resolution_clock::now();
        for(i=0; i < nrows; i++)
            for(j=0; j < ncols; j++) {
                //compute mean of subset-------------------
                sum = 0;
                for(k = 0; k < ncols; k++)
                    if(k!=j) {
                        sum = sum + Mat_full[i+k*nrows];
                    }
                mean = sum / (ncols-1);

                //compute standard deviation of subset-----
                sq_diff_sum = 0;
                for(k = 0; k < ncols; k++)
                    if(k!=j) {
                        diff = Mat_full[i+k*nrows] - mean;
                        sq_diff_sum += diff * diff;
                    }
                stdd = std::sqrt(sq_diff_sum / (ncols-2));

                //export to plain C array:
                output[i*ncols+j] = (Mat_full[i+j*nrows] - mean) / (std::sqrt(1+1/(((double) ncols)-1))*stdd);
            }
        t2 = high_resolution_clock::now();
        duration = duration_cast<seconds>( t2 - t1 ).count();
        std::cout << "C ARRAY: " << duration << " secs\n";

        //release the plain C buffers:
        free(Mat_full);
        free(output);
    }
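In particular, the calls to arma::mean and arma::stddev seem to perform poorly when comparing execution times. I have not done any in-depth analysis of how the matrix size affects performance, but for smaller nrows the plain C version tends to be (much) faster. For a simple test with this setup, I get: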

    ARMADILLO: 111 secs
    C ARRAY: 79 secs
in execution time.

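EDIT: Here is a modification in which we work by columns instead of rows and process each column independently, as suggested by @rubenvb and @mtall. The resulting execution time decreases slightly (ARMADILLO: 104 secs now), which shows some improvement over working by rows: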

    #include <string>
    #include <vector>
    #include <iostream>
    #include <fstream>
    #include <algorithm>
    #include <cmath>    // std::sqrt
    #include <armadillo>
    #include <chrono>

    using namespace std::chrono;

    /***************************
     * main()
     ***************************/
    int main( int argc, char *argv[] )
    {
        unsigned nrows = 100;     //number of rows
        unsigned ncols = 2000000; //number of cols
        const arma::mat huge_mat = arma::randn(nrows, ncols); //create huge matrix
        const arma::uvec vec = arma::linspace<arma::uvec>( 0, huge_mat.n_rows-1, huge_mat.n_rows); //create a vector of [0,...,n]
        arma::colvec inds = arma::zeros<arma::colvec>( huge_mat.n_rows-1 ); //-1 since we remove only one value at each step.
        arma::rowvec simuT = arma::zeros<arma::rowvec>( nrows ); //let's store the results in this simuT vector.

        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        //compute some normalization over each value of line of this huge matrix:
        for(unsigned i=0; i < ncols; i++) {
            const arma::colvec current_line = huge_mat.col(i); //extract current line
            //for each observation in current_line:
            for(unsigned j=0; j < nrows; j++) {
                //Take care of side effects first:
                if( j == 0 )
                    inds = current_line(arma::span(1, nrows-1));
                else if( j == 1 ) {
                    inds(0) = current_line(0);
                    inds(arma::span(1, nrows-2)) = current_line( arma::span(2, nrows-1) );
                } else
                    inds(arma::span(0, j-1)) = current_line( arma::span(0, j-1) );

                //Let's do some computation: huge_mat(i,j) - mean[huge_mat(i,:)] / std([huge_mat(i,:)])
                //can compute the mean and std first... for each line.
                simuT(j) = (current_line(j) - arma::mean(inds)) / ( std::sqrt( 1+1/((double) nrows-1) ) * arma::stddev(inds) );
            }
        }
        high_resolution_clock::time_point t2 = high_resolution_clock::now();
        auto duration = duration_cast<seconds>( t2 - t1 ).count();
        std::cout << "ARMADILLO: " << duration << " secs\n";
    }

Answer 1

The*_*ist 5

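The reason is that Armadillo uses column-major ordering in mat, whereas C arrays use row-major ordering. This is a big deal because your processor can use instruction vectorization to process several elements at once, and that requires contiguous blocks of memory.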
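To make the two orderings concrete, here is a small sketch of how element (i, j) maps to a flat offset in each layout. The function names are hypothetical and the code uses no Armadillo calls; it only illustrates the index arithmetic. When checking a hand-written buffer, it helps to see which formula it actually uses: the `Mat_full[i + k*nrows]` indexing in the question matches the column-major one.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j) in a column-major buffer:
// each column is contiguous, so walking down a column has stride 1.
inline std::size_t col_major_index(std::size_t i, std::size_t j, std::size_t n_rows) {
    return i + j * n_rows;
}

// Offset of element (i, j) in a row-major buffer (e.g. a C array a[rows][cols]):
// each row is contiguous, so walking along a row has stride 1.
inline std::size_t row_major_index(std::size_t i, std::size_t j, std::size_t n_cols) {
    return i * n_cols + j;
}

// Distance, in elements, between (i, j) and (i, j + 1) in a column-major buffer.
// Walking along a row therefore jumps n_rows elements per step.
inline std::size_t row_step_in_col_major(std::size_t i, std::size_t j, std::size_t n_rows) {
    assert(n_rows > 0);
    return col_major_index(i, j + 1, n_rows) - col_major_index(i, j, n_rows);
}
```

With 2,000,000 rows, `row_step_in_col_major` returns 2,000,000, so consecutive values of one row sit about 16 MB apart as doubles, which defeats both cache lines and vector loads.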

To verify that this is the cause, do the same computation over columns instead of rows and check the difference.
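One way to run that check in isolation is sketched below, assuming a plain column-major `std::vector<double>` in place of an `arma::mat`. The names `sum_by_columns`, `sum_by_rows` and `elapsed_ms` are hypothetical helpers, not part of any library. Both traversals compute the same total; only the memory access pattern differs.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Sum a column-major n_rows x n_cols buffer, reading each column contiguously.
double sum_by_columns(const std::vector<double>& a, std::size_t n_rows, std::size_t n_cols) {
    assert(a.size() == n_rows * n_cols);
    double total = 0.0;
    for (std::size_t j = 0; j < n_cols; ++j)
        for (std::size_t i = 0; i < n_rows; ++i)
            total += a[i + j * n_rows];  // consecutive reads are adjacent in memory
    return total;
}

// Same sum, reading each row: consecutive reads are n_rows elements apart.
double sum_by_rows(const std::vector<double>& a, std::size_t n_rows, std::size_t n_cols) {
    assert(a.size() == n_rows * n_cols);
    double total = 0.0;
    for (std::size_t i = 0; i < n_rows; ++i)
        for (std::size_t j = 0; j < n_cols; ++j)
            total += a[i + j * n_rows];  // strided access
    return total;
}

// Run f once and return the elapsed wall-clock time in milliseconds.
template <typename F>
double elapsed_ms(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A typical use is `double sink = 0; double ms = elapsed_ms([&] { sink = sum_by_rows(a, n_rows, n_cols); });`, followed by printing `sink` so the compiler cannot discard the work. Reporting milliseconds as a floating-point value also avoids the truncation to whole seconds that `duration_cast<seconds>` performs.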

Thanks, I have made the suggested modification in an edit to the original post, and it improved efficiency. (2 upvotes)
