To introduce myself to x86 intrinsics (and, to a lesser extent, cache friendliness), I explicitly vectorized a bit of code I use for RBF (radial basis function)-based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if/how I can mask its latency further. This is the scalar computational kernel:
for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];
    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i …