) work under the hood?

Mic*_*lli 5 intel simd vectorization avx

I would like to understand in more detail how the simd reduction clause used by the Intel compiler works.

In particular, for a loop of the form

double x = x_initial;
#pragma simd reduction(<operator1>:x)
for( int i = 0; i < N; i++ )
  x <operator2> some_value;


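My naive guess is as follows: the compiler initializes a private copy of x for each vector lane, then steps through the loop one vector width at a time. For example, if the vector width is 4 doubles, this corresponds to N/4 iterations plus a peel loop at the end. At each step, each lane's private copy of x is updated using operator2, and at the end the private copies from the 4 vector lanes are combined using operator1. The auto-vectorization guide does not seem to address this directly.

I ran some experiments and found that some results agree with my expectations while others do not. For example, I tried this case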
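To make the guess concrete, here is a minimal scalar sketch of the scheme I have in mind. It is only an emulation of my mental model, not what the compiler actually emits; the function name `naive_simd_reduction`, the `lanes` parameter, and the choice to run the remainder loop after the lanes are combined are all my own assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical scalar emulation of the guessed scheme (not the compiler's code).
// lanes   : assumed vector width in doubles (4 for AVX2), must be >= 1
// update  : stands in for operator2 (applied inside the loop body)
// combine : stands in for operator1 (named in the reduction clause)
double naive_simd_reduction(double x_initial,
                            const std::vector<double>& values,
                            std::size_t lanes,
                            const std::function<double(double, double)>& update,
                            const std::function<double(double, double)>& combine)
{
    // One private copy of x per vector lane, each starting from x_initial.
    std::vector<double> lane_x(lanes, x_initial);

    // Main vector loop: each step consumes `lanes` consecutive elements.
    const std::size_t vec_end = values.size() - values.size() % lanes;
    for (std::size_t i = 0; i < vec_end; i += lanes)
        for (std::size_t l = 0; l < lanes; ++l)
            lane_x[l] = update(lane_x[l], values[i + l]);

    // Horizontal step: fold the lane-private copies together with operator1.
    double x = lane_x[0];
    for (std::size_t l = 1; l < lanes; ++l)
        x = combine(x, lane_x[l]);

    // Scalar remainder loop for the leftover N % lanes elements
    // (assumed here to run after the lanes are combined).
    for (std::size_t i = vec_end; i < values.size(); ++i)
        x = update(x, values[i]);
    return x;
}
```

Calling it with `std::plus<double>()` for both `update` and `combine` and an `x_initial` of 0 reduces to an ordinary sum, which is the one case where this model and a sequential loop must agree.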

double x = 1;
#pragma simd reduction(*:x) assert
for( int i = 0; i < 16; i++ )
  x += a[i];  // All elements of a are equal to 3.0
cout << "x after (*:x), x += a[i] loop:  " << x << endl;


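where operator1 is * and operator2 is +=. When I compile for avx2, which has a vector width of 4 doubles, the output is 28561 = (1 + 4*a[i])^4. This means the code first initializes 4 lane-private copies of x to 1, then steps the 4-double-wide vector across the trip count of 16, adding 3 to each copy 4 times. Each lane-private copy of x now equals 13. Finally, the lane-private copies are combined (reduced) using operator1 (*), yielding 13*13*13*13 = 28561.

However, when I swap the * and + operators, like this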

x = 1;
#pragma simd reduction(+:x) assert
for( int i = 0; i < 16; i++ )
  x *= a[i];
cout << "x after (+:x), x *= a[i] loop:  " << x << endl;


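and compile for avx2 again, the output is 1.0. If my theory were correct, each vector lane should end up holding the value 1*3^4 = 81, and combining the lanes with + would give 4*3^4 = 324. Clearly that is not what happens. What am I missing?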
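For reference, the arithmetic for this second case can be written out as a small sketch. The first function encodes my original guess (every lane copy starts at the initial value of x). The second encodes an alternative that I have not confirmed for `#pragma simd`: it assumes, in the way OpenMP-style reductions are specified, that each lane copy starts at the identity of the reduction operator (0 for +) and that the original x is folded in once at the end. Both function names are hypothetical.

```cpp
#include <array>

// Original guess: each lane-private copy starts at x_initial (1.0).
double predict_copy_init()
{
    const double x_initial = 1.0;
    std::array<double, 4> lane;          // 4 doubles per AVX2 vector
    lane.fill(x_initial);
    for (int i = 0; i < 16; i += 4)      // 4 vector iterations over 16 elements
        for (int l = 0; l < 4; ++l)
            lane[l] *= 3.0;              // every a[i] equals 3.0
    // Combine the lanes with the reduction operator (+).
    return lane[0] + lane[1] + lane[2] + lane[3];
}

// Alternative (assumption): lanes start at the identity of + (0.0),
// and the original x is combined with the lane total at the end.
double predict_identity_init()
{
    const double x_initial = 1.0;
    std::array<double, 4> lane;
    lane.fill(0.0);                      // identity element of +
    for (int i = 0; i < 16; i += 4)
        for (int l = 0; l < 4; ++l)
            lane[l] *= 3.0;              // 0 * 3 stays 0
    return x_initial + (lane[0] + lane[1] + lane[2] + lane[3]);
}
```

Under these assumptions `predict_copy_init()` gives 324 and `predict_identity_init()` gives 1.0. The identity-initialized variant also yields 28561 for the earlier (*:x) case, since 1 is the identity of *, so that experiment alone cannot tell the two models apart.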
