我想优化这个短循环

Ing*_*ngo 12 c++ algorithm performance

我想优化这个简单的循环:

unsigned int i;
while(j-- != 0){ //j is an unsigned int with a start value of about N = 36.000.000
   float sub = 0;
   i=1;
   unsigned int c = j+s[1];
   while(c < N) {
       sub += d[i][j]*x[c];//d[][] and x[] are arrays of float
       i++;
       c = j+s[i];// s[] is an array of unsigned int with 6 entries.
   }
   x[j] -= sub;                        // only one memory-write per j
}
Run Code Online (Sandbox Code Playgroud)

使用4000 MHz AMD Bulldozer,该循环的执行时间约为1秒.我想过SIMD和OpenMP(我通常用它来获得更快的速度),但这个循环是递归的.

有什么建议?

小智 10

认为你可能想要转置矩阵d - 意味着以这样的方式存储它你可以交换索引 - 使我成为外部索引:

    sub += d[j][i]*x[c];
Run Code Online (Sandbox Code Playgroud)

代替

    sub += d[i][j]*x[c];
Run Code Online (Sandbox Code Playgroud)

这应该会带来更好的缓存性能.


Dav*_*ave 6

我同意转置更好的缓存(但最后看到我对此的评论),还有更多要做的事情,所以让我们看看我们可以用完整的功能做些什么......

原始功能,供参考(有些整理我的理智):

void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *x, float *b){
    //We want to solve L D Lt x = b where D is a diagonal matrix described by Diagonals[0] and L is a unit lower triagular matrix described by the rest of the diagonals.
    //Let D Lt x = y. Then, first solve L y = b.

    float *y = new float[n];
    float **d = IncompleteCholeskyFactorization->Diagonals;
    unsigned int *s = IncompleteCholeskyFactorization->StartRows;
    unsigned int M = IncompleteCholeskyFactorization->m;
    unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i, j;
    for(j = 0; j != N; j++){
        float sub = 0;
        for(i = 1; i != M; i++){
            int c = (int)j - (int)s[i];
            if(c < 0) break;
            if(c==j) {
                sub += d[i][c]*b[c];
            } else {
                sub += d[i][c]*y[c];
            }
        }
        y[j] = b[j] - sub;
    }

    //Now, solve x from D Lt x = y -> Lt x = D^-1 y
    // Took this one out of the while, so it can be parallelized now, which speeds up, because division is expensive
#pragma omp parallel for
    for(j = 0; j < N; j++){
        x[j] = y[j]/d[0][j];
    }

    while(j-- != 0){
        float sub = 0;
        for(i = 1; i != M; i++){
            if(j + s[i] >= N) break;
            sub += d[i][j]*x[j + s[i]];
        }
        x[j] -= sub;
    }
    delete[] y;
}
Run Code Online (Sandbox Code Playgroud)

由于关于并行除法的评论给出了速度提升(尽管只是O(N)),我假设函数本身被调用很多.为什么要分配内存?只需标记x__restrict__并更改yx无处不在(__restrict__是GCC扩展,取自C99.您可能想要使用define它.也许库已经有一个).

同样,虽然我猜您无法更改签名,但您可以使该函数只接受一个参数并对其进行修改.b从未使用时,xy已设置.这也意味着你可以摆脱第一个循环中的分支,它运行~N*M次.memcpy如果必须有2个参数,请在开始时使用.

为什么是d一个指针数组?一定是吗?这在原始代码中似乎太深了,所以我不会碰它,但是如果有任何可能使存储的数组变平,即使你不能转置它也会提速(乘法,加法,取消引用更快)而不是取消引用,添加,取消引用).

所以,新代码:

void MultiDiagonalSymmetricMatrix::CholeskyBackSolve(float *__restrict__ x){
    // comments removed so that suggestions are more visible. Don't remove them in the real code!
    // these definitions got long. Feel free to remove const; it does nothing for the optimiser
    const float *const __restrict__ *const __restrict__ d = IncompleteCholeskyFactorization->Diagonals;
    const unsigned int *const __restrict__ s = IncompleteCholeskyFactorization->StartRows;
    const unsigned int M = IncompleteCholeskyFactorization->m;
    const unsigned int N = IncompleteCholeskyFactorization->n;
    unsigned int i;
    unsigned int j;
    for(j = 0; j < N; j++){ // don't use != as an optimisation; compilers can do more with <
        float sub = 0;
        for(i = 1; i < M && j >= s[i]; i++){
            const unsigned int c = j - s[i];
            sub += d[i][c]*x[c];
        }
        x[j] -= sub;
    }

    // Consider using processor-specific optimisations for this
#pragma omp parallel for
    for(j = 0; j < N; j++){
        x[j] /= d[0][j];
    }

    for( j = N; (j --) > 0; ){ // changed for clarity
        float sub = 0;
        for(i = 1; i < M && j + s[i] < N; i++){
            sub += d[i][j]*x[j + s[i]];
        }
        x[j] -= sub;
    }
}
Run Code Online (Sandbox Code Playgroud)

嗯,它看起来更整洁,缺乏内存分配和减少分支,如果没有别的,是一个提振.如果您可以更改sUINT_MAX在末尾包含额外值,则可以删除更多分支(两个分支i<M,再次运行~N*M次).

现在我们不能再使并行循环,我们不能组合循环.如其他答案所示,现在的提升将重新排列d.除了......重新排列所需d的工作与执行循环的工作具有完全相同的缓存问题.它需要分配内存.不好.进一步优化的唯一选择是:改变IncompleteCholeskyFactorization->Diagonals自身的结构,这可能意味着很多变化,或者找到一种不同的算法,它可以更好地处理这个顺序的数据.

如果你想更进一步,你的优化将需要影响相当多的代码(不是坏事;除非有一个很好的理由Diagonals成为一个指针数组,看起来它可以用一个重构).