Related questions and solutions (0)

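How to optimize matrix multiplication (matmul) code to run fast on a single processor core

I am studying parallel programming concepts and trying to optimize a matrix multiplication example on a single core. The fastest implementation I have come up with so far is: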

/* This routine performs a dgemm operation
 *  C := C + A * B
 * where A, B, and C are n-by-n matrices stored in column-major format.
 * On exit, A and B maintain their input values. */
void square_dgemm (int n, double* A, double* B, double* C)
{
  /* For each row i of A */
  for (int i = 0; i < n; ++i)
    /* For each column j of B */
    for (int j = …
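The excerpt above is cut off, so the asker's actual loop body is not shown. As a point of comparison only, here is a minimal sketch of one common single-core technique, cache blocking (tiling), for the same C := C + A * B operation on column-major matrices. The function name `square_dgemm_blocked`, the helper `min_int`, and the tile size `BLOCK` are hypothetical names chosen for illustration and do not come from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size (an assumption, not from the question). Tune it so
 * that three BLOCK x BLOCK tiles of doubles fit in the cache being targeted. */
#define BLOCK 32

static int min_int(int a, int b) { return a < b ? a : b; }

/* C := C + A * B for n-by-n matrices in column-major format:
 * element (i, j) of a matrix M is stored at M[i + j * n]. */
void square_dgemm_blocked(int n, const double *A, const double *B, double *C)
{
  assert(n >= 0);
  /* Outer three loops step over tiles of C, A, and B. */
  for (int jj = 0; jj < n; jj += BLOCK) {
    int j_end = min_int(jj + BLOCK, n);
    for (int kk = 0; kk < n; kk += BLOCK) {
      int k_end = min_int(kk + BLOCK, n);
      for (int ii = 0; ii < n; ii += BLOCK) {
        int i_end = min_int(ii + BLOCK, n);
        /* Inner three loops multiply one tile. The innermost loop runs
         * over i, so A and C are read with unit stride (down a column). */
        for (int j = jj; j < j_end; ++j)
          for (int k = kk; k < k_end; ++k) {
            double b_kj = B[k + (size_t)j * n];
            for (int i = ii; i < i_end; ++i)
              C[i + (size_t)j * n] += A[i + (size_t)k * n] * b_kj;
          }
      }
    }
  }
}
```

The j-k-i loop order is chosen because, in column-major storage, consecutive values of i are adjacent in memory. Whether tiling helps, and which `BLOCK` value works best, depends on the matrix size and the cache hierarchy, so it is worth measuring rather than assuming.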

c c++ parallel-processing optimization matrix-multiplication

Votes: 6
Answers: 2
Views: 2175