Posts by P G*_*mes

How to maximise instruction-level parallelism of a sqrt-heavy loop on the Skylake architecture?

To introduce myself to x86 intrinsics (and, to a lesser extent, cache friendliness), I explicitly vectorized a bit of code I use for RBF (radial basis function) based grid deformation. Having found vsqrtpd to be the major bottleneck, I want to know if and how I can hide its latency further. This is the scalar computational kernel:

for(size_t i=0; i<nPt; ++i)
{
    double xi = X[i], yi = X[i+nPt], zi = X[i+2*nPt];

    for(size_t j=0; j<nCP; ++j)
    {
        // compute distance from i …
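Since the kernel above is cut off, here is a minimal sketch of the kind of inner loop the question seems to describe, written so that the reduction uses several independent accumulators. Everything beyond the `X[i]`, `X[i+nPt]`, `X[i+2*nPt]` layout is an assumption: the function name `weightedDistanceSum`, the control-point array `C` (assumed to use the same x-block, y-block, z-block layout), the weights `w`, and the plain weighted distance sum that stands in for the elided RBF evaluation. Splitting the sum this way removes the serial dependency between consecutive additions; whether that helps in practice depends on whether the loop is limited by the throughput of the square-root unit rather than by latency, so it should be measured.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical kernel: for one point (xi, yi, zi), sum w[j] * distance to
// control point j. The weighted distance sum is a stand-in for the RBF
// evaluation that is elided in the post.
// Assumed layout (mirrors X in the post): C = [x0..x(n-1), y0..y(n-1), z0..z(n-1)].
double weightedDistanceSum(double xi, double yi, double zi,
                           const std::vector<double>& C,
                           const std::vector<double>& w)
{
    const std::size_t nCP = w.size();
    assert(C.size() == 3 * nCP);

    const double* cx = C.data();
    const double* cy = cx + nCP;
    const double* cz = cy + nCP;

    // Euclidean distance from the point to control point j.
    auto dist = [&](std::size_t j) {
        const double dx = xi - cx[j], dy = yi - cy[j], dz = zi - cz[j];
        return std::sqrt(dx * dx + dy * dy + dz * dz);
    };

    // Four independent accumulators: the four updates in one iteration do not
    // depend on each other, so the CPU can overlap their sqrt/multiply/add work.
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t j = 0;
    for (; j + 4 <= nCP; j += 4) {
        s0 += w[j]     * dist(j);
        s1 += w[j + 1] * dist(j + 1);
        s2 += w[j + 2] * dist(j + 2);
        s3 += w[j + 3] * dist(j + 3);
    }
    // Remainder loop for nCP not divisible by 4.
    for (; j < nCP; ++j)
        s0 += w[j] * dist(j);

    return (s0 + s1) + (s2 + s3);
}
```

Note that splitting the sum changes the order of floating-point additions, so results can differ from a single-accumulator loop in the last bits. The same pattern carries over to an AVX version by keeping several `__m256d` accumulators instead of scalars.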

c++ optimization x86 intrinsics avx

5 votes
1 answer
112 views

Tag statistics

avx ×1

c++ ×1

intrinsics ×1

optimization ×1

x86 ×1