Posts by P G*_*mes

How to maximise instruction-level parallelism of a sqrt-heavy loop on the Skylake architecture?

To introduce myself to x86 intrinsics (and, to a lesser extent, cache friendliness), I explicitly vectorized a bit of code I use for radial basis function (RBF) based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if and how I can hide its latency further. This is the scalar computational kernel:

for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];

    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i …

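The excerpt cuts off before the loop body, so the following is only a rough sketch of one common way to expose more instruction-level parallelism in a loop of this shape: unrolling the inner loop over several independent accumulators. Everything beyond the `X`, `nPt` and `nCP` names is an assumption made for illustration: the function name `rbf_distance_sum`, the control-point array `CP` (assumed to use the same x/y/z block layout as `X`), the weights `W`, and the idea that the kernel accumulates a weighted sum of distances.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical kernel: out[i] = sum_j W[j] * distance(point i, control point j).
// The original loop body is truncated, so this weighted sum is an assumption.
// CP is assumed to be laid out like X: all x, then all y, then all z.
void rbf_distance_sum(const double* X, std::size_t nPt,
                      const double* CP, const double* W, std::size_t nCP,
                      double* out)
{
    for (std::size_t i = 0; i < nPt; ++i)
    {
        const double xi = X[i], yi = X[i + nPt], zi = X[i + 2 * nPt];

        // One term of the sum; each call is independent of the others.
        auto term = [&](std::size_t j) {
            const double dx = xi - CP[j];
            const double dy = yi - CP[j + nCP];
            const double dz = zi - CP[j + 2 * nCP];
            return W[j] * std::sqrt(dx * dx + dy * dy + dz * dz);
        };

        // Four independent accumulators remove the single loop-carried
        // dependency chain on the addition.
        double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;

        std::size_t j = 0;
        for (; j + 4 <= nCP; j += 4)
        {
            acc0 += term(j);
            acc1 += term(j + 1);
            acc2 += term(j + 2);
            acc3 += term(j + 3);
        }
        for (; j < nCP; ++j)  // remainder when nCP is not a multiple of 4
            acc0 += term(j);

        out[i] = (acc0 + acc1) + (acc2 + acc3);
    }
}
```

Note that this only shortens the dependency chain through the accumulation. If the loop is already limited by the throughput of the square-root unit rather than by latency, unrolling like this may bring little or no gain, so it is worth measuring before and after. The changed summation order can also alter the result in the last few bits.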

c++ optimization x86 intrinsics avx

P G*_*mes

lucky-day

5
votes

1
answer

112
views