How to maximise instruction-level parallelism of a sqrt-heavy loop on the Skylake architecture?

Question

P G*_*mes 5 c++ optimization x86 intrinsics avx

To introduce myself to x86 intrinsics (and, to a lesser extent, cache friendliness), I explicitly vectorized a bit of code I use for RBF (radial basis function)-based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if/how I can mask its latency further. This is the scalar computational kernel:

for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];

    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i to j
        double d = sqrt(pow(xi-Xcp[   j   ],2)+
                        pow(yi-Xcp[ j+nCP ],2)+
                        pow(zi-Xcp[j+2*nCP],2));

        // compute the RBF kernel coefficient
        double t = max(0.0,1.0-d);
        t = pow(t*t,2)*(1.0+4.0*d);

        // update coordinates
        for(size_t k=0; k<nDim; ++k) X[i+k*nPt] += t*Ucp[j+k*nCP];
    }
}

nPt is the number of target coordinates, and it is much larger than nCP, the number of source coordinates/displacements. The latter fit in L3, so the innermost loop is always over source points.

First, I worked on 4 target points simultaneously. Source point data was still accessed via scalar loads followed by a broadcast.
Second, I targeted L1 by blocking the loops. Blocking the i-loop was somehow much more important than blocking the j-loop, which gave only a marginal improvement. The innermost loop is still over j to reduce loads/stores.
Third, I loaded 4 control points at a time and used shuffle/permute to go over the 4 combinations of i-j instead of using broadcast.
Fourth, after observing that omitting the square root gives a 1.5x speedup (to about 70% of the FP performance of a large LLT on an i7-7700), I dedicated 4 registers to the computation of the 4 square roots to (maybe?) allow some other computation to take place... 1% improvement vs. the third step.
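
The shuffle/permute sequence of the third step can be checked without any intrinsics. The sketch below emulates the two lane operations used in the code that follows (_mm256_shuffle_pd with immediate 0b0101 and _mm256_permute2f128_pd with immediate 1, both applied to a register paired with itself) on plain index arrays. The names Lanes, swapWithinHalves, swapHalves and pairingSchedule are hypothetical and exist only for this illustration.

```cpp
#include <array>

using Lanes = std::array<int, 4>;

// Emulates _mm256_shuffle_pd(a, a, 0b0101):
// swaps the two doubles inside each 128-bit half.
Lanes swapWithinHalves(const Lanes& a) { return Lanes{{a[1], a[0], a[3], a[2]}}; }

// Emulates _mm256_permute2f128_pd(a, a, 1):
// swaps the low and high 128-bit halves.
Lanes swapHalves(const Lanes& a) { return Lanes{{a[2], a[3], a[0], a[1]}}; }

// For each of the 4 passes, records which j lane is paired with each i lane.
std::array<Lanes, 4> pairingSchedule()
{
    std::array<Lanes, 4> s{};
    Lanes j = {{0, 1, 2, 3}};
    s[0] = j;                           // pass 0: as loaded
    j = swapWithinHalves(j); s[1] = j;  // pass 1: shuffle
    j = swapHalves(j);       s[2] = j;  // pass 2: permute
    j = swapWithinHalves(j); s[3] = j;  // pass 3: shuffle
    return s;
}
```

Over the four passes every i lane meets every j lane exactly once, which is why three in-register shuffles can stand in for four broadcasts per coordinate.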

Current code

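#include <immintrin.h>  // AVX/FMA intrinsics
#include <algorithm>    // min
#include <cstddef>      // size_t
using namespace std;    // assumed: min and size_t are used unqualified below

void deform(size_t nPt, size_t nCP, const double* Xcp, const double* Ucp, double* X)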
{
    const size_t SIMDLEN = 4;

    // tile ("cache block") sizes
    const size_t TILEH = 512;
    const size_t TILEW = 256;

    // fill two registers with the constants we need
    __m256d vone  = _mm256_set1_pd(1.0),
            vfour = _mm256_set1_pd(4.0);

    // explicitly vectorized (multiple i's at a time) and blocked
    // outer most loop over sets of #TILEH points
    for(size_t i0=0; i0<nPt; i0+=TILEH)
    {
        // displacement buffer, due to tiling, coordinates cannot be modified in-place
        alignas(64) double U[3*TILEH];

        // zero the tile displacements
        for(size_t k=0; k<3*TILEH; k+=SIMDLEN)
            _mm256_store_pd(&U[k], _mm256_setzero_pd());

        // stop point for inner i loop
        size_t iend = min(i0+TILEH,nPt);

        // second loop over sets of #TILEW control points
        for(size_t j0=0; j0<nCP; j0+=TILEW)
        {
            // stop point for inner j loop
            size_t jend = min(j0+TILEW,nCP);

            // inner i loop, over #TILEH points
            // vectorized, operate on #SIMDLEN points at a time
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                // coordinates and displacements of points i
                __m256d wi,
                xi = _mm256_load_pd(&X[   i   ]),
                yi = _mm256_load_pd(&X[ i+nPt ]),
                zi = _mm256_load_pd(&X[i+2*nPt]),
                ui = _mm256_load_pd(&U[    i-i0    ]),
                vi = _mm256_load_pd(&U[ i-i0+TILEH ]);
                wi = _mm256_load_pd(&U[i-i0+2*TILEH]);

                // inner j loop, over #TILEW control points, vectorized loads
                for(size_t j=j0; j<jend; j+=SIMDLEN)
                {
                    // coordinates of points j, and an aux var
                    __m256d t,
                    xj = _mm256_load_pd(&Xcp[   j   ]),
                    yj = _mm256_load_pd(&Xcp[ j+nCP ]),
                    zj = _mm256_load_pd(&Xcp[j+2*nCP]);

                    // compute the possible 4 distances from i to j...
                    #define COMPUTE_DIST(D) __m256d                         \
                    D = _mm256_sub_pd(xi,xj);  D = _mm256_mul_pd(D,D);      \
                    t = _mm256_sub_pd(yi,yj);  D = _mm256_fmadd_pd(t,t,D);  \
                    t = _mm256_sub_pd(zi,zj);  D = _mm256_fmadd_pd(t,t,D);  \
                    D = _mm256_sqrt_pd(D)

                    // ...by going through the different permutations
                    #define SHUFFLE(FUN,IMM8)   \
                    xj = FUN(xj,xj,IMM8);       \
                    yj = FUN(yj,yj,IMM8);       \
                    zj = FUN(zj,zj,IMM8)

                    COMPUTE_DIST(d0);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d1);

                    SHUFFLE(_mm256_permute2f128_pd,1);
                    COMPUTE_DIST(d2);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    COMPUTE_DIST(d3);

                    // coordinate registers now hold the displacements
                    xj = _mm256_load_pd(&Ucp[   j   ]);
                    yj = _mm256_load_pd(&Ucp[ j+nCP ]);
                    zj = _mm256_load_pd(&Ucp[j+2*nCP]);

                    // coefficients for each set of distances...
                    #define COMPUTE_COEFF(C)                                \
                    t = _mm256_min_pd(vone,C);  t = _mm256_sub_pd(vone,t);  \
                    t = _mm256_mul_pd(t,t);     t = _mm256_mul_pd(t,t);     \
                    C = _mm256_fmadd_pd(vfour,C,vone);                      \
                    C = _mm256_mul_pd(t,C)

                    // ...+ update i point displacements
                    #define UPDATE_DISP(C)          \
                    COMPUTE_COEFF(C);               \
                    ui = _mm256_fmadd_pd(C,xj,ui);  \
                    vi = _mm256_fmadd_pd(C,yj,vi);  \
                    wi = _mm256_fmadd_pd(C,zj,wi)

                    UPDATE_DISP(d0);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d1);

                    SHUFFLE(_mm256_permute2f128_pd,1);
                    UPDATE_DISP(d2);

                    SHUFFLE(_mm256_shuffle_pd,0b0101);
                    UPDATE_DISP(d3);
                }

                // store updated displacements
                _mm256_store_pd(&U[    i-i0    ], ui);
                _mm256_store_pd(&U[ i-i0+TILEH ], vi);
                _mm256_store_pd(&U[i-i0+2*TILEH], wi);
            }
        }

        // add tile displacements to the coordinates
        for(size_t k=0; k<3; ++k)
        {
            for(size_t i=i0; i<iend; i+=SIMDLEN)
            {
                __m256d
                x = _mm256_load_pd(&X[i+k*nPt]),
                u = _mm256_load_pd(&U[i-i0+k*TILEH]);
                x = _mm256_add_pd(x,u);
                _mm256_stream_pd(&X[i+k*nPt], x);
            }
        }
    }
}

So what more can I do to it? Or, am I doing something very wrong?

Thank you, P. Gomes

Answer 1

Pet*_*des 3

First check perf counters to see if arith.divider_active ~= core clock cycles.

"98% of the function runtime can be explained by taking the number of square roots and the operation throughput."

Or that works too.

If that's the case, you're saturating the (not fully pipelined) divider throughput, and there's not much left to gain from just exposing more ILP.

Algorithmic changes are your only real chance to gain anything, e.g. avoid some of the sqrt operations or use single-precision.

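Single-precision gives you 2x as much work per vector for free. But for a sqrt-heavy workload, there's an additional benefit: vsqrtps throughput per vector is usually better than vsqrtpd. That's the case on Skylake: one per 6 cycles, vs. vsqrtpd at one per 9 to 12 cycles. This might move the bottleneck away from the sqrt/divide unit, to maybe the front-end or the FMA unit.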
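
As a rough illustration of the single-precision option, here is a sketch of the coefficient computation from the question on 8 floats per vector. It is not the asker's kernel: rbfCoeffFloat is a hypothetical helper that takes precomputed squared distances, it uses unaligned loads, and it avoids FMA so that only AVX is needed. When the compiler is not targeting AVX, the scalar loop handles everything.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// c[i] = max(0, 1-d)^4 * (1 + 4d), with d = sqrt(d2[i]), in single precision.
void rbfCoeffFloat(const float* d2, float* c, std::size_t n)
{
    std::size_t i = 0;
#if defined(__AVX__)
    const __m256 vone = _mm256_set1_ps(1.0f), vfour = _mm256_set1_ps(4.0f);
    for (; i + 8 <= n; i += 8)
    {
        // vsqrtps: 8 square roots per instruction instead of 4
        __m256 d = _mm256_sqrt_ps(_mm256_loadu_ps(d2 + i));
        // 1 - min(1, d) equals max(0, 1 - d)
        __m256 t = _mm256_sub_ps(vone, _mm256_min_ps(vone, d));
        t = _mm256_mul_ps(t, t);
        t = _mm256_mul_ps(t, t);
        __m256 p = _mm256_add_ps(_mm256_mul_ps(vfour, d), vone);  // 1 + 4d
        _mm256_storeu_ps(c + i, _mm256_mul_ps(t, p));
    }
#endif
    // scalar tail (and full fallback when AVX is not enabled)
    for (; i < n; ++i)
    {
        float d = std::sqrt(d2[i]);
        float t = std::max(0.0f, 1.0f - d);
        t = (t * t) * (t * t);
        c[i] = t * (1.0f + 4.0f * d);
    }
}
```

Whether float accuracy is acceptable for the deformation is a separate question that only the application can answer.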

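vrsqrtps has been suggested in comments. It's worth considering (if single-precision is an option), but it's not a clear win when you need a Newton-Raphson iteration to get enough accuracy. Bare x * rsqrtps(x) without Newton-Raphson is probably too inaccurate (and needs a cmp/AND to work around x==0.0), but an NR iteration can take too many extra FMA uops to be worth it.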
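
To make the cost of that approach concrete, here is a sketch of sqrt(x) computed as x * rsqrt(x) with one Newton-Raphson step and the cmp/AND fix for x == 0. It uses the 128-bit SSE intrinsic _mm_rsqrt_ps so that it builds without AVX flags; _mm256_rsqrt_ps is the 256-bit analogue. The function name sqrtViaRsqrt and the HAVE_RSQRTPS macro are hypothetical, and no claim is made that this is faster than the hardware sqrt.

```cpp
#include <cmath>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define HAVE_RSQRTPS 1
#endif

// sqrt(x) ~= x * y, where y ~= 1/sqrt(x) is refined by one Newton-Raphson step:
//   y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
void sqrtViaRsqrt(const float* x, float* out, std::size_t n)
{
    std::size_t i = 0;
#ifdef HAVE_RSQRTPS
    const __m128 half = _mm_set1_ps(0.5f), threeHalves = _mm_set1_ps(1.5f);
    for (; i + 4 <= n; i += 4)
    {
        __m128 v  = _mm_loadu_ps(x + i);
        __m128 y  = _mm_rsqrt_ps(v);          // roughly 12-bit estimate
        __m128 hv = _mm_mul_ps(half, v);
        // Newton-Raphson refinement: 3 multiplies and 1 subtract per vector
        y = _mm_mul_ps(y, _mm_sub_ps(threeHalves,
                                     _mm_mul_ps(hv, _mm_mul_ps(y, y))));
        __m128 s = _mm_mul_ps(v, y);          // sqrt(v) = v * rsqrt(v)
        // rsqrt(0) is +inf and 0 * inf is NaN, so zero those lanes explicitly
        s = _mm_and_ps(s, _mm_cmpneq_ps(v, _mm_setzero_ps()));
        _mm_storeu_ps(out + i, s);
    }
#endif
    // scalar tail (and full fallback on non-x86 targets)
    for (; i < n; ++i) out[i] = std::sqrt(x[i]);
}
```

Counting the operations in the loop body shows the trade-off: the single sqrt instruction is replaced by an estimate, several multiplies, a compare and an AND, all of which compete for the same ports as the rest of the kernel.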

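(AVX512 vrsqrt14ps/pd has more accuracy in the approximation, but usually still not enough to use without Newton. Interestingly, though, it does exist for double-precision. Of course, if you're on Xeon Phi, sqrt is very slow and you're meant to use AVX512ER vrsqrt28pd + Newton, or just vrsqrt28ps alone.)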

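Last time I tuned a function for Skylake that included a sqrt in a polynomial approximation, fast-approximate reciprocal wasn't worth it. Hardware single-precision sqrt was the best choice that gave us the required precision (and we weren't even considering needing double). There was more work between sqrt operations than in yours, though.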