How can I reduce the time spent on a single instruction?

una*_*ann 8 c optimization x86-64 perf

I am trying to optimize C code, and one instruction seems to take about 22% of the time.

The code is compiled with gcc 8.2.0. The flags are -O3 -DNDEBUG -g -Wall -Wextra -Weffc++ -pthread -lrt

    509529.517218      task-clock (msec)         #    0.999 CPUs utilized
            6,234      context-switches          #    0.012 K/sec
               10      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,985,640,853,831      cycles                    #    3.897 GHz                      (30.76%)
1,897,574,410,921      instructions              #    0.96  insn per cycle           (38.46%)
  229,365,727,020      branches                  #  450.152 M/sec                    (38.46%)
   13,027,677,754      branch-misses             #    5.68% of all branches          (38.46%)
  604,340,619,317      L1-dcache-loads           # 1186.076 M/sec                    (38.46%)
   47,749,307,910      L1-dcache-load-misses     #    7.90% of all L1-dcache hits    (38.47%)
   19,724,956,845      LLC-loads                 #   38.712 M/sec                    (30.78%)
    3,349,412,068      LLC-load-misses           #   16.98% of all LL-cache hits     (30.77%)
  <not supported>      L1-icache-loads                                          
      129,878,634      L1-icache-load-misses                                         (30.77%)
  604,482,046,140      dTLB-loads                # 1186.353 M/sec                    (30.77%)
    4,596,384,416      dTLB-load-misses          #    0.76% of all dTLB cache hits   (30.77%)
        2,493,696      iTLB-loads                #    0.005 M/sec                    (30.77%)
       21,356,368      iTLB-load-misses          #  856.41% of all iTLB cache hits   (30.76%)
  <not supported>      L1-dcache-prefetches                                     
  <not supported>      L1-dcache-prefetch-misses                                

    509.843595752 seconds time elapsed

    507.706093000 seconds user
      1.839848000 seconds sys

VTune Amplifier gives me a hint about the function: https://pasteboard.co/IagrLaF.png

The instruction cmpq seems to account for about 22% of the whole time. The other instructions, by contrast, take negligible time.

perf gives me a somewhat different picture, but I think the results are consistent:

 Percent│       bool mapFound = false;
   0.00 │       movb   $0x0,0x7(%rsp)
        │     goDownBwt():
        │       bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
   0.00 │       lea    0x20(%rsp),%r12
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.00 │       lea    (%rax,%rax,2),%rax
   0.00 │       shl    $0x3,%rax
   0.00 │       mov    %rax,0x18(%rsp)
   0.01 │       movzwl %dx,%eax
   0.00 │       mov    %eax,(%rsp)
   0.00 │     ↓ jmp    d6
        │       nop
        │       if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
   0.30 │ 88:   mov    (%rax),%rsi
   8.38 │       mov    0x10(%rsi),%rcx
   0.62 │       test   %rcx,%rcx
   0.15 │     ↓ je     1b0
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   2.05 │       add    0x18(%rsp),%rcx
        │         ++stats->nDownPreprocessed;
   0.25 │       addq   $0x1,0x18(%rdx)
        │         newState->trace                = PREPROCESSED;
   0.98 │       movb   $0x10,0x30(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  43.36 │       mov    0x8(%rcx),%rax
   2.61 │       cmp    %rax,(%rcx)
        │         newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
   0.05 │       mov    %rcx,0x20(%rsp)
        │         return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
   3.47 │       setbe  %dl

The function is:

inline bool goDownBwt (state_t *previousState, unsigned short nucleotide, state_t *newState) {
  ++stats->nDown;
  if ((previousState->trace & PREPROCESSED) && (previousState->preprocessedInterval->firstChild != NULL)) {
    ++stats->nDownPreprocessed;
    newState->preprocessedInterval = previousState->preprocessedInterval->firstChild + nucleotide;
    newState->trace                = PREPROCESSED;
    return (newState->preprocessedInterval->interval.k <= newState->preprocessedInterval->interval.l);
  }
  bwt_2occ(bwt, getStateInterval(previousState)->k-1, getStateInterval(previousState)->l, nucleotide, &newState->interval.k, &newState->interval.l);
  newState->interval.k = bwt->L2[nucleotide] + newState->interval.k + 1;
  newState->interval.l = bwt->L2[nucleotide] + newState->interval.l;
  newState->trace = 0;
  return (newState->interval.k <= newState->interval.l);
}

state_t is defined as:

struct state_t {
  union {
    bwtinterval_t interval;
    preprocessedInterval_t *preprocessedInterval;
  };
  unsigned char trace;
  struct state_t *previousState;
};

preprocessedInterval_t is:

struct preprocessedInterval_t {
  bwtinterval_t interval;
  preprocessedInterval_t *firstChild;
};

There are few (~1000) state_t structures. However, there are many (350k) preprocessedInterval_t objects, allocated somewhere else.

The first if is true 15 billion times out of 19 billion.

Looking for mispredicted branches with perf record -e branches,branch-misses mytool, restricted to the function, gives me:

Available samples
2M branches                                                                                                                                                                                                       
1M branch-misses  

Can I assume that branch misprediction is responsible for this slowdown? What would be the next step to optimize my code?

The code is available on GitHub.


编辑1

valgrind --tool=cachegrind gives me:

I   refs:      1,893,716,274,393
I1  misses:            4,702,494
LLi misses:              137,142
I1  miss rate:              0.00%
LLi miss rate:              0.00%

D   refs:        756,774,557,235  (602,597,601,611 rd   + 154,176,955,624 wr)
D1  misses:       39,489,866,187  ( 33,583,272,379 rd   +   5,906,593,808 wr)
LLd misses:        3,483,920,786  (  3,379,118,877 rd   +     104,801,909 wr)
D1  miss rate:               5.2% (            5.6%     +             3.8%  )
LLd miss rate:               0.5% (            0.6%     +             0.1%  )

LL refs:          39,494,568,681  ( 33,587,974,873 rd   +   5,906,593,808 wr)
LL misses:         3,484,057,928  (  3,379,256,019 rd   +     104,801,909 wr)
LL miss rate:                0.1% (            0.1%     +             0.1%  )

编辑2

I compiled with -O3 -DNDEBUG -march=native -fprofile-use, and used the command perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss ./a.out

    508322.348021      task-clock (msec)         #    0.998 CPUs utilized
           21,592      context-switches          #    0.042 K/sec
               33      cpu-migrations            #    0.000 K/sec
        1,305,885      page-faults               #    0.003 M/sec
1,978,382,746,597      cycles                    #    3.892 GHz                      (44.44%)
  228,898,532,311      branches                  #  450.302 M/sec                    (44.45%)
   12,816,920,039      branch-misses             #    5.60% of all branches          (44.45%)
1,867,947,557,739      instructions              #    0.94  insn per cycle           (55.56%)
2,957,085,686,275      uops_issued.any           # 5817.343 M/sec                    (55.56%)
2,864,257,274,102      uops_executed.thread      # 5634.726 M/sec                    (55.56%)
    2,490,571,629      mem_load_uops_retired.l3_miss #    4.900 M/sec                    (55.55%)
   12,482,683,638      mem_load_uops_retired.l2_miss #   24.557 M/sec                    (55.55%)
   18,634,558,602      mem_load_uops_retired.l1_miss #   36.659 M/sec                    (44.44%)

    509.210162391 seconds time elapsed

    506.213075000 seconds user
      2.147749000 seconds sys

编辑3

I picked out the results of perf record -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread,mem_load_uops_retired.l3_miss,mem_load_uops_retired.l2_miss,mem_load_uops_retired.l1_miss a.out that mention my function:

Samples: 2M of event 'task-clock', Event count (approx.): 517526250000
Overhead  Command     Shared Object       Symbol
  49.76%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 917K of event 'cycles', Event count (approx.): 891499601652
Overhead  Command     Shared Object       Symbol
  49.36%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 911K of event 'branches', Event count (approx.): 101918042567
Overhead  Command     Shared Object       Symbol
  43.01%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 877K of event 'branch-misses', Event count (approx.): 5689088740
Overhead  Command     Shared Object       Symbol
  50.32%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 1M of event 'instructions', Event count (approx.): 1036429973874
Overhead  Command     Shared Object       Symbol
  34.85%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 824K of event 'uops_issued.any', Event count (approx.): 1649042473560
Overhead  Command     Shared Object       Symbol
  42.19%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 802K of event 'uops_executed.thread', Event count (approx.): 1604052406075
Overhead  Command     Shared Object       Symbol
  48.14%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 13K of event 'mem_load_uops_retired.l3_miss', Event count (approx.): 1350194507
Overhead  Command     Shared Object      Symbol
  33.24%  srnaMapper  srnaMapper         [.] addState
  31.00%  srnaMapper  srnaMapper         [.] mapWithoutError

Samples: 142K of event 'mem_load_uops_retired.l2_miss', Event count (approx.): 7143448989
Overhead  Command     Shared Object       Symbol
  40.79%  srnaMapper  srnaMapper          [.] mapWithoutError

Samples: 84K of event 'mem_load_uops_retired.l1_miss', Event count (approx.): 8451553539
Overhead  Command     Shared Object       Symbol
  39.11%  srnaMapper  srnaMapper          [.] mapWithoutError

(Using perf record --period 10000 triggers Workload failed: No such file or directory.)

Pet*_*des 4

Was the sample-rate the same for branches and branch-misses? A 50% mispredict rate would be extremely bad.

https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to get enough samples even for rare events. But you can set the period (how many raw counts trigger a sample); I think that's what perf record --period 10000 does, but I haven't used it.

Use perf stat to get hard numbers. Update: yup, your perf stat results confirm your branch mispredict rate is "only" 5%, not 50%, at least for the program as a whole. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive) but not insane.


Also check the cache miss rate for L1d, and maybe mem_load_retired.l3_miss (and/or l2_miss and l1_miss), to see whether it's really that load that's missing. e.g.

perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,\
uops_issued.any,uops_executed.thread,\
mem_load_retired.l3_miss,mem_load_retired.l2_miss,mem_load_retired.l1_miss  ./a.out

You can use any of these events with perf record to get some statistical samples on which instructions are causing cache misses. Those are precise events (using PEBS), so should accurately map to the correct instruction (not like "cycles" where counts get attributed to some nearby instruction, often the one that stalls waiting for an input with the ROB full, instead of the one that was slow to produce it.)

And they do so without the skew of non-PEBS events, which should "blame" a single instruction but don't always interrupt at exactly the right place.


If you're optimizing for your local machine and don't need it to run anywhere else, you might use -O3 -march=native. Not that that will help with cache misses.

GCC profile-guided optimization can help it choose branchy vs. branchless. (gcc -O3 -march=native -fprofile-generate / run it with some realistic input data to generate profile outputs / gcc -O3 -march=native -fprofile-use)


Can I assume that branch misprediction is responsible for this slow down?

No, cache misses might be more likely. You have a significant number of L3 misses, and going all the way to DRAM costs hundreds of core clock cycles. Branch prediction can hide some of that if it predicts correctly.

What would be the next step to optimize my code?

Compact your data structures if possible so more of them fit in cache, e.g. 32-bit pointers (Linux x32 ABI: gcc -mx32) if you don't need more than 4GiB of virtual address space. Or maybe try using a 32-bit unsigned index into a large array instead of raw pointers, but that has slightly worse load-use latency (by a couple cycles on Sandybridge-family.)

And / or improve your access pattern, so you're mostly accessing them in sequential order. So hardware prefetch can bring them into cache before you need to read them.

I'm not familiar enough with https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform or its application in sequence alignment to know whether it can be made more efficient, but data compression is inherently problematic because you often need data-dependent branches and reads of scattered data. The trade-off is often still worth it versus even more cache misses, though.

  • @unamourdeswann: Well, for the program as a whole you're only getting a 5% branch mispredict rate. That's still higher than you'd like (branches are usually frequent and mispredicts are expensive), but not insane. https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate explains that the kernel dynamically adjusts the period for each counter so events fire often enough to get enough samples even for rare events, but you can set the period (how many raw counts trigger a sample); I think that's what perf record --period 10000 does, but I haven't used it. (2 upvotes)
  • Pretty impressive inference from a not-so-clear statement of the question. If you want to learn more about the BWT (Burrows-Wheeler transform) and the FM-index, you can read [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2705234/), which my work is based on. (2 upvotes)
  • @unamourdeswann: My clue was "nucleotide" in the C code -> this works on DNA sequences. I knew "bwt" might be Burrows-Wheeler (I'd heard of it in the context of bzip2 compression, which was the format of choice for Linux packages/tarballs before xz's LZMA took over). But I wasn't sure, so I just googled "bwt sequence", and [the bottom of the BWT Wikipedia page](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform#BWT_in_bioinformatics) says something similar to the abstract of the paper you linked. :) (2 upvotes)