0xE*_*xE6 27 c++ optimization performance gcc c++11
所以,用下面的代码,改变参数的类型x
从const ull
到const ull&
(有typedef unsigned long long ull
)在约25%的加速效果,当用gcc 4.7.2和标志编译-O3 -std=c++11 -g
,我想不通为什么会做出如此大的差异.
static void inline single_mult(const std::vector<ull>::iterator& data,
const std::vector<ull>::const_iterator& rbegin,
const std::vector<ull>::const_iterator& rend,
const ull x) {
ull tmp=0, carry=0, i=0;
for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
tmp = x*(*rhs_it) + data[i] + carry;
if (tmp >= imax) {
carry = tmp >> numbits;
tmp &= imax - 1;
} else {
carry = 0;
}
data[i++] = tmp;
}
data[i] += carry;
}
Run Code Online (Sandbox Code Playgroud)
它在以下函数中调用(用于进行教科书长乘法)
static void naive(std::vector<ull>::iterator data,
std::vector<ull>::const_iterator cbegin,
std::vector<ull>::const_iterator cend ,
std::vector<ull>::const_iterator rbegin,
std::vector<ull>::const_iterator rend) {
for (auto data_it = cbegin;
data_it != cend; ++data_it) {
if (*data_it != 0) {
single_mult(data, rbegin, rend, *data_it);
}
++data;
}
}
Run Code Online (Sandbox Code Playgroud)
定时是通过调用clock()
循环来测量它需要多长时间来完成的.不是最准确/最精确的方式,但我认为一致的25%差异意味着差异具有统计学意义.
完整的代码块:
#include <vector>
#include <limits>
#include <ctime>
#include <iostream>
typedef unsigned long long ull;
typedef unsigned int uint;
const ull numbits = (ull) std::numeric_limits<uint>::digits;
const ull imax = 1LL << numbits;
static void inline single_mult(const std::vector<ull>::iterator& data,
const std::vector<ull>::const_iterator& rbegin,
const std::vector<ull>::const_iterator& rend,
const ull x) {
ull tmp=0, carry=0, i=0;
for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
tmp = x*(*rhs_it) + data[i] + carry;
if (tmp >= imax) {
carry = tmp >> numbits;
tmp &= imax - 1;
} else {
carry = 0;
}
data[i++] = tmp;
}
data[i] += carry;
}
static void naive(std::vector<ull>::iterator data,
std::vector<ull>::const_iterator cbegin,
std::vector<ull>::const_iterator cend ,
std::vector<ull>::const_iterator rbegin,
std::vector<ull>::const_iterator rend) {
for (auto data_it = cbegin; data_it != cend; ++data_it) {
if (*data_it != 0) {
single_mult(data, rbegin, rend, *data_it);
}
++data;
}
}
int main() {
int N = 10000;
std::vector<ull> vec(N);
for (int i=0; i<N; ++i) {
vec[i] = i;
}
auto s1 = clock();
int num = 10;
std::vector<ull> result(2*N);
for (int i=0; i<num; ++i) {
//Need to zero out the vector or the result will be different.
std::fill(result.begin(), result.end(), 0);
naive(result.begin(), vec.cbegin(), vec.cend(), vec.cbegin(), vec.cend());
}
s1 = (clock() - s1);
std::cout << "Took " << float(s1)/CLOCKS_PER_SEC << "seconds total." << std::endl;
}
Run Code Online (Sandbox Code Playgroud)
和运行时(我命名传递值value.cpp
和引用的文件reference.cpp
)
$ g++ -O3 -std=c++11 -g -o reference reference.cpp
$ g++ -O3 -std=c++11 -g -o value value.cpp
$ ./reference
Took 1.05seconds total.
$ ./value
Took 1.83seconds total.
Run Code Online (Sandbox Code Playgroud)
amd*_*mdn 30
I was able to reproduce your observation of the speedup, it was even more noticeable for me (1.75x faster). The problem seems to be when you pass x by value it enables the compiler to perform optimizations that it otherwise doesn't do, and these optimizations backfire, they apparently introduce an unintended stall in the processor. The compiler generates a couple of conditional moves, back to back, rather than compare and branch. The compare and branch runs much faster than the back-to-back conditional moves.
I was able to avoid this by simplifying the code that the compiler had trouble with, namely this
if (tmp >= imax) {
carry = tmp >> numbits;
tmp &= imax - 1;
} else {
carry = 0;
}
Run Code Online (Sandbox Code Playgroud)
can be reduced to
carry = tmp >> numbits;
tmp &= imax - 1;
Run Code Online (Sandbox Code Playgroud)
Here's the version of gcc I'm using
g++ --version
g++ (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2)
Run Code Online (Sandbox Code Playgroud)
These are the commands I used, perf record
will profile your code and perf report
will annotate the source and disassembly with the results of the profile
g++ -std=gnu++0x -O3 -g single_mult.cpp -o single_mult
perf record ./single_mult
perf report
Run Code Online (Sandbox Code Playgroud)
Once in perf report press enter
on main
and select Annotate main
, you'll see a disassembly of your program along with source code and the percent of the time the profiler found your program running at each instruction in the function... actually these numbers must be taken as a hint only, often you will see instructions with big counts when in fact it was the previous instruction that stalled, or missed in the cache, or there was a mis-predicted branch, etc. So when you see a big count look back to see what may have caused it. The reason is the profile is statistical, it interrupts your program at a constant rate and looks to see where the instruction pointer is, and often the interrupt happens while the processor is stalled due to a cache miss or a mis-predicted branch or some internal data dependency.
我增加了迭代次数,以便为探查器留出更多时间
int N = 20000;
Run Code Online (Sandbox Code Playgroud)
当x通过值传递时Took 11.58seconds total
,这就是我所看到的,请注意cmovbe
说明
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
: tmp = x*(*rhs_it) + data[i] + carry;
11.10 : 400b40: 4c 89 c9 mov %r9,%rcx
0.00 : 400b43: 48 89 fa mov %rdi,%rdx
0.01 : 400b46: 48 03 14 c6 add (%rsi,%rax,8),%rdx
11.65 : 400b4a: 48 0f af 0c c3 imul (%rbx,%rax,8),%rcx
0.99 : 400b4f: 48 01 ca add %rcx,%rdx
: if (tmp >= imax) {
: carry = tmp >> numbits;
2.25 : 400b52: 48 89 d7 mov %rdx,%rdi
: tmp &= imax - 1;
10.99 : 400b55: 48 89 d1 mov %rdx,%rcx
: const ull x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
: tmp = x*(*rhs_it) + data[i] + carry;
: if (tmp >= imax) {
: carry = tmp >> numbits;
0.69 : 400b58: 48 c1 ef 20 shr $0x20,%rdi
: tmp &= imax - 1;
9.54 : 400b5c: 83 e1 ff and $0xffffffff,%ecx
9.05 : 400b5f: 4c 39 c2 cmp %r8,%rdx
10.78 : 400b62: 49 0f 46 fb cmovbe %r11,%rdi
: } else {
: carry = 0;
: }
: data[i++] = tmp;
20.73 : 400b66: 48 83 c0 01 add $0x1,%rax
0.02 : 400b6a: 4c 39 c2 cmp %r8,%rdx
0.17 : 400b6d: 48 0f 46 ca cmovbe %rdx,%rcx
: static void inline single_mult(const std::vector<ull>::iterator& data,
: const std::vector<ull>::const_iterator& rbegin,
: const std::vector<ull>::const_iterator& rend,
: const ull x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
11.47 : 400b71: 4c 39 d0 cmp %r10,%rax
: carry = tmp >> numbits;
: tmp &= imax - 1;
: } else {
: carry = 0;
: }
: data[i++] = tmp;
0.01 : 400b74: 48 89 4c c6 f8 mov %rcx,-0x8(%rsi,%rax,8)
: static void inline single_mult(const std::vector<ull>::iterator& data,
: const std::vector<ull>::const_iterator& rbegin,
: const std::vector<ull>::const_iterator& rend,
: const ull x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
0.53 : 400b79: 75 c5 jne 400b40 <main+0x250>
: } else {
: carry = 0;
: }
: data[i++] = tmp;
: }
: data[i] += carry;
0.00 : 400b7b: 4a 01 3c d6 add %rdi,(%rsi,%r10,8)
0.01 : 400b7f: 48 83 c5 08 add $0x8,%rbp
Run Code Online (Sandbox Code Playgroud)
当x通过引用传递时Took 6.59seconds total
,这就是我所看到的
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
20.90 : 400b30: 48 8b 17 mov (%rdi),%rdx
: tmp = x*(*rhs_it) + data[i] + carry;
1.38 : 400b33: 49 0f af 14 c1 imul (%r9,%rax,8),%rdx
4.82 : 400b38: 48 03 0c c6 add (%rsi,%rax,8),%rcx
22.41 : 400b3c: 48 01 ca add %rcx,%rdx
: if (tmp >= imax) {
: carry = tmp >> numbits;
: tmp &= imax - 1;
: } else {
: carry = 0;
2.95 : 400b3f: 31 c9 xor %ecx,%ecx
: const std::vector<ull>::const_iterator& rend,
: const ull &x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
: tmp = x*(*rhs_it) + data[i] + carry;
: if (tmp >= imax) {
0.23 : 400b41: 4c 39 d2 cmp %r10,%rdx
0.00 : 400b44: 76 0a jbe 400b50 <main+0x260>
: carry = tmp >> numbits;
2.27 : 400b46: 48 89 d1 mov %rdx,%rcx
: tmp &= imax - 1;
1.29 : 400b49: 83 e2 ff and $0xffffffff,%edx
: const ull &x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
: tmp = x*(*rhs_it) + data[i] + carry;
: if (tmp >= imax) {
: carry = tmp >> numbits;
0.26 : 400b4c: 48 c1 e9 20 shr $0x20,%rcx
: tmp &= imax - 1;
: } else {
: carry = 0;
: }
: data[i++] = tmp;
19.67 : 400b50: 48 83 c0 01 add $0x1,%rax
: static void inline single_mult(const std::vector<ull>::iterator& data,
: const std::vector<ull>::const_iterator& rbegin,
: const std::vector<ull>::const_iterator& rend,
: const ull &x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
0.53 : 400b54: 4c 39 c0 cmp %r8,%rax
: carry = tmp >> numbits;
: tmp &= imax - 1;
: } else {
: carry = 0;
: }
: data[i++] = tmp;
0.39 : 400b57: 48 89 54 c6 f8 mov %rdx,-0x8(%rsi,%rax,8)
: static void inline single_mult(const std::vector<ull>::iterator& data,
: const std::vector<ull>::const_iterator& rbegin,
: const std::vector<ull>::const_iterator& rend,
: const ull &x) {
: ull tmp=0, carry=0, i=0;
: for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
22.91 : 400b5c: 75 d2 jne 400b30 <main+0x240>
: } else {
: carry = 0;
: }
: data[i++] = tmp;
: }
: data[i] += carry;
0.00 : 400b5e: 4a 01 0c c6 add %rcx,(%rsi,%r8,8)
0.00 : 400b62: 48 83 c7 08 add $0x8,%rdi
Run Code Online (Sandbox Code Playgroud)
归档时间: |
|
查看次数: |
1401 次 |
最近记录: |