I am trying to optimize a piece of C code, and one instruction seems to take about 22% of the time.
The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g, together with -Wall -Wextra -Weffc++ -pthread -lrt.
509529.517218 task-clock (msec) # 0.999 CPUs utilized
6,234 context-switches # 0.012 K/sec
10 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,985,640,853,831 cycles # 3.897 GHz (30.76%)
1,897,574,410,921 instructions # 0.96 insn per cycle (38.46%)
229,365,727,020 branches # 450.152 M/sec (38.46%)
13,027,677,754 branch-misses # 5.68% of all branches (38.46%)
604,340,619,317 L1-dcache-loads # 1186.076 M/sec (38.46%)
47,749,307,910 L1-dcache-load-misses # 7.90% of all L1-dcache hits (38.47%)
19,724,956,845 LLC-loads # 38.712 M/sec (30.78%)
3,349,412,068 LLC-load-misses # 16.98% of all LL-cache hits (30.77%)
<not supported> L1-icache-loads
129,878,634 L1-icache-load-misses (30.77%)
604,482,046,140 dTLB-loads # 1186.353 M/sec (30.77%)
4,596,384,416 dTLB-load-misses # 0.76% of all dTLB cache hits (30.77%)
2,493,696 iTLB-loads # 0.005 M/sec (30.77%)
21,356,368 iTLB-load-misses # 856.41% of all iTLB cache hits (30.76%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
509.843595752 seconds time elapsed
507.706093000 seconds user
1.839848000 seconds sys
VTune Amplifier gives me a hint about the function: https://pasteboard.co/IagrLaF.png
The instruction cmpq seems to take 22% of the whole time, while the other instructions take negligible time.
perf gives me a somewhat different picture, but I think the results are consistent:
Percent│ bool mapFound = false;
0.00 │ movb $0x0,0x7(%rsp)
│ goDownBwt():
│ bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
0.00 │ lea 0x20(%rsp),%r12
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.00 │ lea (%rax,%rax,2),%rax
0.00 │ shl $0x3,%rax
0.00 │ mov %rax,0x18(%rsp)
0.01 │ movzwl %dx,%eax
0.00 │ mov %eax,(%rsp)
0.00 │ ↓ jmp d6
│ nop
│ if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
0.30 │ 88: mov (%rax),%rsi
8.38 │ mov 0x10(%rsi),%rcx
0.62 │ test %rcx,%rcx
0.15 │ ↓ je 1b0
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
2.05 │ add 0x18(%rsp),%rcx
│ ++stats->nDownPreprocessed;
0.25 │ addq $0x1,0x18(%rdx)
│ newState->trace = PREPROCESSED;
0.98 │ movb $0x10,0x30(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
43.36 │ mov 0x8(%rcx),%rax
2.61 │ cmp %rax,(%rcx)
│ newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
0.05 │ mov %rcx,0x20(%rsp)
│ return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
3.47 │ setbe %dl
The function is
inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
  ++stats->nDown;
  if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
    ++stats->nDownPreprocessed;
    newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
    newState->trace = PREPROCESSED;
    return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  }
  bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
  newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
  newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
  newState->trace = 0;
  return (newState->interval.k <= newState->interval.l);
}
state_t is defined as
struct state_t {
  union {
    bwtinterval_t interval;
    preprocessedInterval_t *preprocessedInterval;
  };
  unsigned char trace;
  struct state_t *previousState;
};
preprocessedInterval_t is:
struct preprocessedInterval_t {
  bwtinterval_t interval;
  preprocessedInterval_t *firstChild;
};
There are few (~1000) state_t structures, but there are many (350k) preprocessedInterval_t objects, allocated somewhere else.
The first if is true 15 billion times out of 19 billion.
Running perf record -e branches,branch-misses mytool and looking for the mispredicted branches in this function gives me:
Available samples
2M branches
1M branch-misses
Can I assume that branch misprediction is responsible for this slowdown? What would be the next step to optimize my code?
The code is available on GitHub.
Edit 1
valgrind --tool=cachegrind gives me:
I refs: 1,893,716,274,393
I1 misses: 4,702,494
LLi misses: 137,142
I1 miss rate: 0.00%
LLi miss rate: 0.00%
D refs: 756,774,557,235 (602,597,601,611 rd + 154,176,955,624 wr)
D1 misses: 39,489,866,187 ( 33,583,272,379 rd + 5,906,593,808 wr)
LLd misses: 3,483,920,786 ( 3,379,118,877 rd + 104,801,909 wr)
D1 miss rate: 5.2% ( 5.6% + 3.8% )
LLd miss rate: 0.5% ( 0.6% + 0.1% )
LL refs: 39,494,568,681 ( 33,587,974,873 rd + 5,906,593,808 wr)
LL misses: 3,484,057,928 ( 3,379,256,019 rd + 104,801,909 wr)
LL miss rate: 0.1% ( 0.1% + 0.1% )
编辑2
我使用编译-O3 -DNDEBUG -march=native -fprofile-use,并使用了命令perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out
508322.348021 task-clock (msec) # 0.998 CPUs utilized
21,592 context-switches # 0.042 K/sec
33 cpu-migrations # 0.000 K/sec
1,305,885 page-faults # 0.003 M/sec
1,978,382,746,597 cycles # 3.892 GHz (44.44%)
228,898,532,311 branches # 450.302 M/sec (44.45%)
12,816,920,039 branch-misses # 5.60% of all branches (44.45%)
1,867,947,557,739 instructions # 0.94 insn per cycle (55.56%)
2,957,085,686,275 uops_issued.any # 5817.343 M/sec (55.56%)
2,864,257,274,102 uops_executed.thread # 5634.726 M/sec (55.56%)
2,490,571,629 mem_load_uops_retired.l3_miss # 4.900 M/sec (55.55%)
12,482,683,638 mem_load_uops_retired.l2_miss # 24.557 M/sec (55.55%)
18,634,558,602 mem_load_uops_retired.l1_miss # 36.659 M/sec (44.44%)
509.210162391 seconds time elapsed
506.213075000 seconds user
2.147749000 seconds sys
编辑3
我选择了perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out提到我的函数的结果:
Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead Command Shared Object Symbol
49.76% srnaMapper srnaMapper [.] mapWithoutError
Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead Command Shared Object Symbol
49.36% srnaMapper srnaMapper [.] mapWithoutError
Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead Command Shared Object Symbol
43.01% srnaMapper srnaMapper [.] mapWithoutError
Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead Command Shared Object Symbol
50.32% srnaMapper srnaMapper [.] mapWithoutError
Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead Command Shared Object Symbol
34.85% srnaMapper srnaMapper [.] mapWithoutError
Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead Command Shared Object Symbol
42.19% srnaMapper srnaMapper [.] mapWithoutError
Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead Command Shared Object Symbol
48.14% srnaMapper srnaMapper [.] mapWithoutError
Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead Command Shared Object Symbol
33.24% srnaMapper srnaMapper [.] addState
31.00% srnaMapper srnaMapper [.] mapWithoutError
Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead Command Shared Object Symbol
40.79% srnaMapper srnaMapper [.] mapWithoutError
Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead Command Shared Object Symbol
39.11% srnaMapper srnaMapper [.] mapWithoutError
(Using perf record --period 10000 triggers Workload failed: No such file or directory.)
Was the sample-rate the same for branches and branch-misses? A 50% mispredict rate would be extremely bad.
https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to get enough samples even for rare events. But you can set the period (how many raw counts trigger a sample); I think that's what perf record --period 10000 does, but I haven't used that.
Use perf stat to get hard numbers. Update: yup, your perf stat results confirm your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive) but not insane.
Also check the cache-miss rate for L1d, and maybe mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see if it's really that load that's missing. e.g.
perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss ./a.out
You can use any of these events with perf record to get some statistical samples on which instructions are causing cache misses. Those are precise events (using PEBS), so should accurately map to the correct instruction (not like "cycles" where counts get attributed to some nearby instruction, often the one that stalls waiting for an input with the ROB full, instead of the one that was slow to produce it.)
And without any skew for non-PEBS events that should "blame" a single instruction but don't always interrupt at exactly the right place.
If you're optimizing for your local machine and don't need it to run anywhere else, you might use -O3 -march=native. Not that that will help with cache misses.
GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)
Can I assume that branch misprediction is responsible for this slow down?
No, cache misses might be more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.
What would be the next step to optimize my code?
Compact your data structures if possible so more of them fit in cache, e.g. 32-bit pointers (Linux x32 ABI: gcc -mx32) if you don't need more than 4GiB of virtual address space. Or maybe try using a 32-bit unsigned index into a large array instead of raw pointers, but that has slightly worse load-use latency (by a couple cycles on Sandybridge-family.)
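To make that concrete, here is a minimal sketch of the index-instead-of-pointer variant. The names pool, NO_CHILD and getChild are made up, and I'm assuming bwtinterval_t is two 64-bit values, which matches the 24-byte stride (lea + shl by 3) in your asm. Note that with two 8-byte interval members the struct stays 24 bytes because of alignment padding, so the 32-bit index only pays off if the interval fields can also be narrowed, or with -mx32 where every pointer shrinks:
#include <stdint.h>

typedef struct {
  uint64_t k, l;                      /* assumed layout of bwtinterval_t */
} bwtinterval_t;

#define NO_CHILD UINT32_MAX           /* sentinel replacing a NULL firstChild pointer */

typedef struct {
  bwtinterval_t interval;
  uint32_t      firstChild;           /* index into pool[] instead of an 8-byte pointer */
} preprocessedInterval_t;

extern preprocessedInterval_t pool[]; /* the ~350k nodes, allocated as one array */

static inline preprocessedInterval_t *getChild (const preprocessedInterval_t *p,
                                                unsigned short nucleotide) {
  return &pool[p->firstChild + nucleotide];
}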
And / or improve your access pattern, so you're mostly accessing them in sequential order. So hardware prefetch can bring them into cache before you need to read them.
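Along the same lines, here is a hedged sketch of carving all nodes out of one arena in construction order, so the four children of a node (firstChild + nucleotide already relies on them being adjacent) and nodes created around the same time end up contiguous. Whether the traversal then really becomes mostly sequential depends on how your tree is built and walked; arenaInit / newChildren are hypothetical names, and preprocessedInterval_t is as defined above:
#include <stdlib.h>

static preprocessedInterval_t *arena;   /* one contiguous block instead of scattered allocations */
static size_t arenaUsed, arenaCapacity;

int arenaInit (size_t nNodes) {
  arena = malloc(nNodes * sizeof *arena);
  arenaCapacity = nNodes;
  arenaUsed = 0;
  return arena != NULL;
}

/* Hands out the next four adjacent slots (one child per nucleotide A/C/G/T),
   or NULL if the arena was sized too small. */
preprocessedInterval_t *newChildren (void) {
  if (arenaUsed + 4 > arenaCapacity)
    return NULL;
  preprocessedInterval_t *children = &arena[arenaUsed];
  arenaUsed += 4;
  return children;
}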
I'm not familiar enough with the Burrows–Wheeler transform (https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) or its application in sequence alignment to know whether it can be made more efficient, but data compression is inherently problematic because you very often need data-dependent branches and accesses to scattered data. Still, that tradeoff is often worth it compared to even more cache misses.