GCC: vectorization difference between two similar loops

las*_*owh 33 c gcc loops vectorization compiler-optimization

When compiling with gcc -O3, why is the following loop not (automatically) vectorized:

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foo () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] = b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

while the following one is?

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foov () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] += b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

The only difference is whether the result of the expression in the inner loop is assigned to a[i] or added to a[i].

For reference, -ftree-vectorizer-verbose=6 gives the following output for the first (non-vectorized) loop:

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20];

v.c:5: note: vectorized 0 loops in function.

The same output for the vectorized loop is:

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: vect_model_load_cost: aligned.
v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 .
v.c:9: note: cost model: prologue peel iters set to vf/2.
v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
v.c:9: note: Cost model analysis:
  Vector inside of loop cost: 3
  Vector outside of loop cost: 27
  Scalar iteration cost: 3
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 8

v.c:9: note:   Profitability threshold = 7

v.c:9: note: Profitability threshold is 7 loop iterations.
v.c:9: note: LOOP VECTORIZED.
v.c:5: note: vectorized 1 loops in function.

Mys*_*ial 30

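In the first case: the code overwrites the same memory location a[i] on every inner iteration. Because the loop iterations are not independent, this inherently makes the loop sequential.
(In practice, only the final iteration is actually needed, so the entire inner loop could be taken out.)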
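To make the parenthetical concrete, here is a minimal sketch of what "taking out the inner loop" would look like. It is not something GCC is claimed to produce. The array size N, the test data, and all function names are assumptions made for illustration; N is kept small so the quadratic reference version runs quickly.

```c
#define N 512  /* assumed small size, not the 65536 from the question */

static int a_ref[N], a_opt[N], b[N], c[N];

/* Fill b and c with deterministic, hypothetical test data. */
static void fill_inputs(void) {
    for (int i = 0; i < N; i++) {
        b[i] = (i * 37) % 101;
        c[i] = (i * 53) % 97;
    }
}

/* Original form: a[i] is overwritten on every inner iteration. */
static void foo_reference(void) {
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++)
            a_ref[i] = b[i] > c[j] ? b[i] : c[j];
}

/* Only the last inner iteration (j == N - 1) survives, and it always
   runs because i <= N - 1, so the inner loop can be dropped. */
static void foo_collapsed(void) {
    const int last = c[N - 1];
    for (int i = 0; i < N; i++)
        a_opt[i] = b[i] > last ? b[i] : last;
}

/* Returns 1 if both versions produce the same array, 0 otherwise. */
int foo_versions_agree(void) {
    fill_inputs();
    foo_reference();
    foo_collapsed();
    for (int i = 0; i < N; i++)
        if (a_ref[i] != a_opt[i])
            return 0;
    return 1;
}
```

Calling foo_versions_agree() compares the two forms element by element on the sample data.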

In the second case: GCC recognizes the loop as a reduction operation - a case it has special handling for when vectorizing.
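As a rough illustration of why a sum reduction can be vectorized, the inner loop of foov can be rewritten by hand with several independent partial sums that are combined at the end. This is only a scalar sketch of the idea, not the code GCC actually emits. LANES = 4 is an assumed vector width (four 32-bit ints in a 128-bit register), and N, the test data, and the function names are hypothetical.

```c
#define N 512    /* assumed small size for a quick check */
#define LANES 4  /* assumed vector width: four 32-bit ints */

static int b[N], c[N];

static int max_int(int x, int y) { return x > y ? x : y; }

/* Scalar inner loop of foov for one fixed i: sum of max(b[i], c[j]). */
int inner_sum_scalar(int i) {
    int sum = 0;
    for (int j = i; j < N; j++)
        sum += max_int(b[i], c[j]);
    return sum;
}

/* Same sum with LANES independent accumulators, mimicking vector lanes. */
int inner_sum_lanes(int i) {
    int acc[LANES] = {0};
    int j = i;
    /* Main body: each lane accumulates its own partial sum. */
    for (; j + LANES <= N; j += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += max_int(b[i], c[j + l]);
    /* Horizontal add: combine the partial sums once, after the loop. */
    int sum = 0;
    for (int l = 0; l < LANES; l++)
        sum += acc[l];
    /* Epilogue: leftover iterations that do not fill a whole vector. */
    for (; j < N; j++)
        sum += max_int(b[i], c[j]);
    return sum;
}

/* Returns 1 if both versions agree for every i, 0 otherwise. */
int reduction_versions_agree(void) {
    for (int k = 0; k < N; k++) {
        b[k] = (k * 37) % 101;
        c[k] = (k * 53) % 97;
    }
    for (int i = 0; i < N; i++)
        if (inner_sum_scalar(i) != inner_sum_lanes(i))
            return 0;
    return 1;
}
```

The reordering is valid here because integer addition is associative and the sample values are small enough not to overflow. The plain assignment in the first loop has no such accumulator to split across lanes.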

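Compiler vectorization is often implemented as some form of "pattern matching": the compiler analyzes the code to see whether it fits a particular pattern that it knows how to vectorize. If it does, the code gets vectorized. If it doesn't, it is left alone.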

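This looks like a corner case where the first loop doesn't fit any of the pre-coded patterns that GCC can handle, while the second one fits the "vectorizable reduction" pattern.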


Here is the relevant part of GCC's source code that emits the "not vectorized: live stmt not supported: " message:

http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c

if (STMT_VINFO_LIVE_P (stmt_info))
{
    ok = vectorizable_reduction (stmt, NULL, NULL);

    if (ok)
        need_to_vectorize = true;
    else
        ok = vectorizable_live_operation (stmt, NULL, NULL);

    if (!ok)
    {
        if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
        {
            fprintf (vect_dump, 
                "not vectorized: live stmt not supported: ");
            print_generic_expr (vect_dump, stmt, TDF_SLIM);
        }
        return false;
    }
}

From this line:

vectorizable_reduction (stmt, NULL, NULL);

it is clear that GCC is checking whether the statement matches a "vectorizable reduction" pattern.