Why do these two loops run at the same speed when compiled with -O3, but not with -O2?

Question

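In the program below, I expected test1 to run slower because of its dependent instructions. A test run with -O2 seemed to confirm this. But then I tried -O3, and now the timings are more or less equal. How can that be?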

#include <iostream>
#include <vector>
#include <cstdint> // for int64_t
#include <chrono>

volatile int x = 0; // used for preventing certain optimizations


enum { size = 60 * 1000 * 1000 };
std::vector<unsigned> a(size + x); // `size + x` makes the vector size unknown by compiler 
std::vector<unsigned> b(size + x);


void test1()
{
    for (auto i = 1u; i != size; ++i)
    {
        a[i] = a[i] + a[i-1]; // data dependency hinders pipelining(?)
    }
}


void test2()
{
    for (auto i = 0u; i != size; ++i)
    {
        a[i] = a[i] + b[i]; // no data dependencies
    }
}


template<typename F>
int64_t benchmark(F&& f)
{
    auto start_time = std::chrono::high_resolution_clock::now();
    f();
    auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start_time);
    return elapsed_ms.count();
}


int main(int argc, char**)
{   
    // make sure the optimizer cannot make any assumptions
    // about the contents of the vectors:
    for (auto& el : a) el = x;
    for (auto& el : b) el = x;

    test1(); // warmup
    std::cout << "test1: " << benchmark(&test1) << '\n';

    test2(); // warmup        
    std::cout << "\ntest2: " << benchmark(&test2) << '\n';

    return a[x] * x; // prevent optimization and exit with code 0
}

I get these results (times in milliseconds):

g++-4.8 -std=c++11 -O2 main.cpp && ./a.out
test1: 115
test2: 48

g++-4.8 -std=c++11 -O3 main.cpp && ./a.out
test1: 29
test2: 38

Answer 1

sba*_*bbi 2

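Because at -O3, gcc effectively eliminates the data dependency by keeping the value of a[i] in a register and reusing it in the next iteration, instead of loading a[i-1] from memory.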

The result is more or less equivalent to:

void test1()
{
    auto x = a[0];
    auto end = a.begin() + size;
    for (auto it = std::next(a.begin()); it != end; ++it)
    {
        auto y = *it; // Load
        x = y + x;
        *it = x; // Store
    }
}

Compiled at -O2, this version yields exactly the same machine code as your original code compiled at -O3.
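
To check that carrying the running value in a local variable does not change the result, the two forms can be compared on a small input. This is only a sketch: the function names `prefix_sum_indexed` and `prefix_sum_carried` are made up for illustration, and the functions take the vector as a parameter instead of using the global `a`. GCC documents a pass, -fpredictive-commoning, that is enabled at -O3 and reuses loads and stores from earlier loop iterations. It is plausibly what performs this rewrite, but that is an assumption, not something verified against the gcc 4.8 output.

```cpp
#include <cstddef>
#include <vector>

// Original form: each iteration reloads a[i-1], the element that the
// previous iteration has just stored.
void prefix_sum_indexed(std::vector<unsigned>& a)
{
    for (std::size_t i = 1; i < a.size(); ++i)
    {
        a[i] = a[i] + a[i - 1];
    }
}

// Register-carried form (hypothetical name): the running sum lives in a
// local variable, so each iteration needs one load and one store.
void prefix_sum_carried(std::vector<unsigned>& a)
{
    if (a.empty()) return;
    unsigned sum = a[0];
    for (std::size_t i = 1; i < a.size(); ++i)
    {
        sum += a[i]; // load a[i] and add it to the carried value
        a[i] = sum;  // store the new running sum
    }
}
```

Both functions compute the same inclusive prefix sum. The additions still depend on each other from one iteration to the next, but in the second form that chain runs through a register rather than through a store followed by a load.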

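The second loop in your question is unrolled at -O3, hence the speedup. The two optimizations look unrelated to me: the first case is faster simply because gcc removed a load instruction, the second because the loop was unrolled.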
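
For readers unfamiliar with unrolling, here is a hand-written sketch of what the transformation amounts to for the second loop. The unroll factor of 4 and the function names `add_simple` and `add_unrolled4` are assumptions chosen for illustration. The code gcc actually emits may use a different factor, and may use SIMD instructions instead of scalar ones.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Straightforward loop, one element per iteration.
void add_simple(std::vector<unsigned>& a, const std::vector<unsigned>& b)
{
    assert(a.size() == b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        a[i] += b[i];
    }
}

// Manually unrolled by 4 (hypothetical factor): fewer loop-counter
// updates and branches per element processed.
void add_unrolled4(std::vector<unsigned>& a, const std::vector<unsigned>& b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
        a[i + 2] += b[i + 2];
        a[i + 3] += b[i + 3];
    }
    // Remainder loop for the last n % 4 elements.
    for (; i < n; ++i)
    {
        a[i] += b[i];
    }
}
```

The four additions in the unrolled body are independent of one another, which is why this transformation applies to test2 but not directly to the dependent loop in test1.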

In both cases, I don't think the optimizer did anything special to improve cache behavior; both memory access patterns are easy for the CPU to predict.
