C loop optimization help for final assignment (with compiler optimization disabled)

Bla*_*147 8 c optimization loops compiler-optimization debug-mode

So for my final assignment in my Computer Systems class, we need to optimize these for loops to be faster than the original. The basic grade is under 7 seconds and the full grade is under 5 seconds with our Linux server. This code that I have right here gets about 5.6 seconds. I'm thinking I may need to use pointers with this in some way to make it go faster, but I'm not really sure. Can anyone offer any tips or options that I have? Thank you very much!

QUICK EDIT: the file must remain 50 lines or less, and I'm ignoring the commented lines the instructor has included.

#include <stdio.h>
#include <stdlib.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    register double sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0;
    register int j;
    // ... and this one.

    printf("CS201 - Asgmt 4 - \n");

    for (i = 0; i < N_TIMES; i++)
    {
        // You can change anything between this comment ...
        for (j = 0; j < ARRAY_SIZE; j += 10)
        {
            sum += array[j];
            sum1 += array[j + 1];
            sum2 += array[j + 2];
            sum3 += array[j + 3];
            sum4 += array[j + 4];
            sum5 += array[j + 5];
            sum6 += array[j + 6];
            sum7 += array[j + 7];
            sum8 += array[j + 8];
            sum9 += array[j + 9];
        }
        // ... and this one. But your inner loop must do the same
        // number of additions as this one does.
    }                   

    // You can add some final code between this comment ...
    sum += sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8 + sum9;
    // ... and this one.

    return 0;
}

Pet*_*des 44

Re-posting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as "what else is possible", so I took him at his word and info-dumped about vectorizing and tuning for current CPU hardware. :)

The OP of that question eventually said he wasn't allowed to use compiler options higher than -O0, which I guess is the case here, too.

Summary:

  • Why using -O0 distorts things (unfairly penalizes things that are fine in normal code for a normal compiler).

  • Stuff that's wrong with the assignment.

  • Types of optimizations. FP latency vs. throughput, and dependency chains. Link to Agner Fog's site. (Essential reading for optimization.)

  • Experiments getting the compiler to optimize it (after fixing it to not optimize away). Best result with auto-vectorization (no source changes): gcc: half as fast as an optimal vectorized loop. clang: same speed as a hand-vectorized loop.

  • Some more comments on why bigger expressions are a perf win with -O0 only.

  • Source changes to get good performance without -ffast-math, making the code closer to what we want the compiler to do. Also some rules-lawyering ideas that would be useless in the real world.

  • Vectorizing the loop with GCC architecture-neutral vectors, to see how close the auto-vectorizing compilers came to matching the performance of ideal asm code (since I checked the compiler output).


I think the point of the assignment is to teach assembly-language performance optimizations using C with no compiler optimization. This is silly. It's mixing up things the compiler will do for you in real life with things that require source-level changes.

-O0 doesn't just "not optimize", it makes the compiler store variables to memory after every statement, instead of keeping them in registers. It does this so you get the "expected" results if you set a breakpoint with gdb and modify the value (in memory) of a C variable. Or even if you jump to another line in the same function. So each C statement has to be compiled to an independent block of asm that starts and ends with all variables in memory. For a modern portable compiler like gcc which already transforms through multiple internal representations of program flow on the way from source to asm, this part of -O0 requires explicitly de-optimizing its graph of data flow back into separate C statements. These store/reloads lengthen every loop-carried dependency chain, so it's horrible for tiny loops that keep their loop counter in memory (e.g. 1 cycle per iteration for inc reg vs. 6c for inc [mem], creating a bottleneck on loop counter updates in tight loops).

With -O0, the register keyword lets gcc keep a var in a register instead of memory, and thus can make a big difference in tight loops (Example on the Godbolt Compiler explorer). But that's only with -O0. In real code, register is meaningless: the compiler attempts to optimally use the available registers for variables and temporaries. register is already deprecated in ISO C++11 (but not C11), and there's a proposal to remove it from the language along with other obsolete stuff like trigraphs.

With extra variables involved, -O0 hurts array indexing a bit more than pointer incrementing.

Array indexing usually makes code easier to read. Compilers sometimes fail to optimize stuff like array[i*width + j*width*height], so it's a good idea to change the source to do the strength-reduction optimization of turning the multiplies into += adds.

At an asm level, array indexing vs. pointer incrementing are close to the same performance. (x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi], except on Sandybridge and later.) It's the compiler's job to optimize your code by using pointer incrementing even when the source uses array indexing, when that's faster.

To get good performance, you have to be aware of what the compiler can and can't do. Some optimizations are "brittle", and a seemingly-innocent small change to the source will stop the compiler from doing an optimization that was essential for some code to run fast. (e.g. pulling a constant computation out of a loop, or proving something about how different branch conditions are related to each other, and simplifying.)


Besides all that, it's a garbage assignment because nothing stops a smart compiler from optimizing away the whole thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.

(You can fix this by printing sum at the end. gcc and clang don't seem to realize that calloc returns zeroed memory, and optimize it away to 0.0. See my code below.)

Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).

Also, the other version of this question had an uninitialized variable kicking around. It looks like long int help was introduced by the OP of that question, not the prof. So I will have to downgrade my "utter nonsense" to merely "silly", because the code doesn't even print the result at the end. That's the most common way of getting the compiler not to optimize everything away in a microbenchmark like this.


I assume your prof mentioned a few things about performance. There are a bunch of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.

Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.

Further essential reading:

  • Agner Fog's guides for optimizing C and asm for x86.

  • What Every Programmer Should Know About Memory.

Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. Usually the difference in rounding error is small, so the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it.

Instead of just unrolling, keep multiple accumulators which you only add up at the end, like you're doing with the sum..sum9 unroll-by-10. FP instructions have medium latency but high throughput, so you need to keep multiple FP operations in flight to keep the floating point execution units saturated.

If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)

In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.

compiler options

Let's start by seeing what the compiler can do for us.

I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):

  • gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing/tuning code with -O0 is not useful.
  • -O1: 4.87s
  • -O2: 4.89s
  • -O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
  • -O3 -ffast-math -funroll-loops: 2.439s
  • -O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
  • -Ofast ...: no gain
  • -O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win: it collects the sums every time, instead of giving each thread 1/4 of the outer loop iterations.
  • -Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
    -Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s. profile-guided optimization is a good idea when you can exercise all the relevant code-paths, so the compiler can make better unrolling/inlining decisions.

  • clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang 3.5 is too old to support -march=sandybridge. You should prefer to use a compiler version that's new enough to know about the target architecture you're tuning for, esp. if using -march=native to make code that doesn't need to run on older architectures.)

gcc vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.

clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.

vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4

It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. uop throughput shouldn't be an issue, though, since this code is only getting 0.65 insns per cycle (and 0.93 uops/cycle), according to perf.

Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. It is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. (gcc enables -mavx256-split-unaligned-load and -mavx256-split-unaligned-store for -march=sandybridge, and also for the default tune=generic with -mavx, which is not so good, especially for Haswell, or with memory that's usually aligned but the compiler doesn't know it.)

Source level changes

As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:

for (j = 0; j < ARRAY_SIZE; j+=4) {  // unroll 4 times
    sum0 += array[j];
    sum1 += array[j+1];
    sum2 += array[j+2];
    sum3 += array[j+3];
}

and then don't collect the 4 accumulators into one until after the end of the outer loop.

Your (from the other question) source change of

sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];

actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j[] values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain, and find the parallelism to keep those medium latency, high throughput FP execution units fed.

With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.


Using 4 accumulator variables and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.

gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.


Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:

// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>

// You are only allowed to make changes to this code as specified by the comments in it.

// The code you submit must have these two values.
#define N_TIMES     600000
#define ARRAY_SIZE   10000

int main(void)
{
    double  *array = calloc(ARRAY_SIZE, sizeof(double));
    double  sum = 0;
    int     i;

    // You can add variables between this comment ...
    long int help = 0;

    typedef double v4df __attribute__ ((vector_size (8*4)));
    v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};

    const size_t array_bytes = ARRAY_SIZE*sizeof(double);
    double *aligned_array = NULL;

    // this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
    if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
        exit (1);
    }
    memcpy(aligned_array, array, array_bytes);  // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop

    // ... and this one.

    // Please change 'your name' to your actual name.
    printf("CS201 - Asgmt 4 - I. Forgot\n");

    for (i = 0; i < N_TIMES; i++) {

        // You can change anything between this comment ...
    /*
    #if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
        array = __builtin_assume_aligned(array, 32);
    #else
        // force-align for other compilers.  This loop-invariant will be done outside the loop.
        array = (double*) ((ptrdiff_t)array & ~31);
    #endif
    */

        assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) );  // We don't have a cleanup loop to handle where the array size isn't a multiple of 16


        // incrementing pointers can be more efficient than indexing arrays
        // esp. on recent Intel where micro-fusion only works with one-register addressing modes
        // of course, the compiler can always generate pointer-incrementing asm from array-indexing source
        const double *start = aligned_array;

        while ( (ptrdiff_t)start & 31 ) {
            // annoying loops like this are the reason people use aligned buffers
            sum += *start++;        // scalar until we reach 32B alignment
            // in practice, this loop doesn't run, because we copy into an aligned buffer
            // This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
        }

        const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
        for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
            sum0 += p[0];   // p+=4 increments the pointer by 4 * 4 * 8 bytes
            sum1 += p[1];       // make sure you keep track of what you're incrementing
            sum2 += p[2];
            sum3 += p[3];

        }

        // the compiler might be smart enough to pull this out of the inner loop
        // in fact, gcc turns this into a 64bit movabs outside of both loops :P
        help+= ARRAY_SIZE;

            // ... and this one. But your inner loop must do the same
            // number of additions as this one does.

        /* You could argue legalese and say that
         if (i == 0) {
             for (j ...)
                 sum += array[j];
             sum *= N_TIMES;
         }
         * still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
         */
    }

    // You can add some final code between this comment ...
    sum0 = (sum0 + sum1) + (sum2 + sum3);
    sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
    printf("sum = %g; help=%ld\n", sum, help);  // defeat the compiler.

    free (aligned_array);
    free (array);  // not strictly necessary, because this is the end of main().  Leaving it out for this special case is a bad example for a CS class, though.
    // ... and this one.

    return 0;
}

The inner loop compiles to:

  4007c0:       c5 e5 58 19             vaddpd (%rcx),%ymm3,%ymm3
  4007c4:       48 83 e9 80             sub    $0xffffffffffffff80,%rcx   # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
  4007c8:       c5 f5 58 49 a0          vaddpd -0x60(%rcx),%ymm1,%ymm1   # one-register addressing mode can micro-fuse
  4007cd:       c5 ed 58 51 c0          vaddpd -0x40(%rcx),%ymm2,%ymm2
  4007d2:       c5 fd 58 41 e0          vaddpd -0x20(%rcx),%ymm0,%ymm0
  4007d7:       4c 39 c1                cmp    %r8,%rcx  # compare with end with p
  4007da:       75 e4                   jne    4007c0 <main+0xb0>

(For more, see online compiler output at the godbolt compiler explorer. The -xc compiler option compiles as C, not C++. The inner loop is from .L3: to jne .L3. See the x86 tag wiki for x86 asm links. See also this q&a about micro-fusion not happening on SnB-family, which Agner Fog's guides don't cover.)

performance:

$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec 
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000

 Performance counter stats for './fl3-vec':

       1086.571078      task-clock (msec)         #    1.000 CPUs utilized          
     4,072,679,849      cycles                    #    3.748 GHz                    
     2,629,419,883      instructions              #    0.65  insns per cycle        
                                                  #    1.27  stalled cycles per insn
     4,028,715,968      r1b1                      # 3707.733 M/sec  # unfused uops
     2,257,875,023      r10e                      # 2077.982 M/sec  # fused uops.  lower than insns because of macro-fusion
     3,328,275,626      stalled-cycles-frontend   #   81.72% frontend cycles idle   
     1,648,011,059      stalled-cycles-backend    #   40.47% backend  cycles idle   
       751,736,741      L1-dcache-load-misses     #  691.843 M/sec                  
            18,772      cache-misses              #    0.017 M/sec                  

       1.086925466 seconds time elapsed

I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfer per cycle, which should keep up with the one 32B FP vector add per cycle.

32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).

Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.

0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/

update: I guess L2 bandwidth was the problem after all. There aren't enough line-fill buffers to keep enough misses in flight to sustain the peak throughput every cycle. L2 sustained bandwidth is less than peak on Intel SnB/Haswell/Skylake CPUs.

See also Single Threaded Memory Bandwidth on Sandy Bridge (an Intel forum thread, with much discussion about what limits throughput, and how latency * max_concurrency is one possible bottleneck). See also the "Latency Bound Platforms" part of the answer to Enhanced REP MOVSB for memcpy; limited memory concurrency is a bottleneck for loads as well as stores, but for loads, prefetch into L2 does mean you might not be limited purely by Line Fill buffers for outstanding L1D misses.

Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds.)


pax*_*blo 4

You may be on the right track, though you'll need to measure it to be certain (my normal advice to measure, not guess, seems a little superfluous here since the whole point of the assignment is to measure).

Optimising compilers will probably not see much of a difference, since they're pretty clever about that sort of thing, but, since we don't know what optimisation level it will be compiled at, you may get a substantial improvement.

To use pointers in the inner loop is a simple matter of first adding a pointer variable:

register double *pj;

then changing the loop to:

for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); pj++) {
        sum += *pj++;
        sum1 += *pj++;
        sum2 += *pj++;
        sum3 += *pj++;
        sum4 += *pj++;
        sum5 += *pj++;
        sum6 += *pj++;
        sum7 += *pj++;
        sum8 += *pj++;
        sum9 += *pj;
    }

This keeps the amount of additions within the loop the same (assuming you're counting += and ++ as addition operators, of course), but basically uses pointers rather than array indexes.

With no optimisation 1 on my system, this drops it from 9.868 seconds (CPU time) down to 4.84 seconds. Your mileage may vary.


1 At optimisation level -O3, both were reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given that you're seeing 5+ seconds, I'd suggest it wasn't being compiled with optimisation on.

As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of making it run faster. While my meagre attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)

  • Compilers are smart, but they don't have to be that good to throw away computations whose results are never used. Not a great homework assignment, IMO. (5 upvotes)
  • @pax: With a non-optimising compiler, keeping the end pointer in its own variable will make a difference. `double *endp = array + ARRAY_SIZE;` before the loop. Otherwise `gcc -O0` may emit code to compute `array+ARRAY_SIZE` every iteration. Just another example of why this is silly. Oh, and you'd probably also do better with `j[0]`, `j[1]`, etc., and then incrementing `j` by 10. Hopefully that generates asm with `[rsi]`, `[rsi + 8]`, `[rsi + 16]`, instead of loading `j`, incrementing, and storing it for every double. (2 upvotes)

Viewed: 5650 times

Last active: 6 years, 2 months ago