Perf过度计算简单的CPU绑定循环:神秘的内核工作?

Bee*_*ope 9 performance x86 assembly linux-kernel perf

我已经使用Linux perf一段时间来进行应用程序分析.通常情况下,配置文件应用程序相当复杂,因此只要根据第一原则与您的预期没有任何严重差异,就可以简单地将报告的计数器值视为面值.

然而,最近,我已经描述了一些简单的64位汇编程序 - 足够的竞争,人们几乎可以计算各种计数器的预期值,而且似乎perf stat是过度计算.

以下面的循环为例:

.loop:
    nop
    dec rax
    nop
    jne .loop
Run Code Online (Sandbox Code Playgroud)

这将简单地循环n次数,其中n是初始值rax.循环的每次迭代都执行4条指令,因此您可以期望4 * n执行指令,加上一些用于进程启动和终止的固定开销,以及n在进入循环之前设置的一小段代码.

这是(典型)perf stat输出n = 1,000,000,000:

~/dev/perf-test$ perf stat ./perf-test-nop 1

 Performance counter stats for './perf-test-nop 1':

        301.795151      task-clock (msec)         #    0.998 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.007 K/sec                  
     1,003,144,430      cycles                    #    3.324 GHz                      
     4,000,410,032      instructions              #    3.99  insns per cycle        
     1,000,071,277      branches                  # 3313.742 M/sec                  
             1,649      branch-misses             #    0.00% of all branches        

       0.302318532 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)

呵呵.我们看到一个神秘的额外410,032指令和71,277个分支,而不是大约4,000,000,000个指令和1,000,000,000个分支.总有"额外"指令,但数量有所不同 - 例如,后续运行分别有421K,563K和464K 额外指令.您可以通过构建我的简单github项目在您的系统上自行运行.

好的,所以你可能会猜到这几十个额外的指令只是固定的应用程序设置和拆卸成本(用户区设置非常小,但可能有隐藏的东西).让我们试试吧n=10 billion:

~/dev/perf-test$ perf stat ./perf-test-nop 10

 Performance counter stats for './perf-test-nop 10':

       2907.748482      task-clock (msec)         #    1.000 CPUs utilized          
                 3      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.001 K/sec                  
    10,012,820,060      cycles                    #    3.443 GHz                    
    40,004,878,385      instructions              #    4.00  insns per cycle        
    10,001,036,040      branches                  # 3439.443 M/sec                  
             4,960      branch-misses             #    0.00% of all branches        

       2.908176097 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)

现在有大约490万条额外指令,比之前大约增加10倍,与循环计数增加10倍成比例.

您可以尝试各种计数器 - 所有与CPU相关的计数器都显示相似的比例增加.让我们关注指令计数以保持简单.分别使用:u:k后缀来衡量用户内核计数,显示几乎所有额外事件在内核帐户中产生的计数:

~/dev/perf-test$ perf stat -e instructions:u,instructions:k ./perf-test-nop 1

 Performance counter stats for './perf-test-nop 1':

     4,000,000,092      instructions:u                                              
           388,958      instructions:k                                              

       0.301323626 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)

答对了.在389,050条额外指令中,其中99.98%(388,958)完全在内核中发生.

好的,但是我们离开了哪里?这是一个简单的CPU绑定循环.它不进行任何系统调用,也不访问内存(可能通过页面错误机制间接调用内核).为什么内核代表我的应用程序执行指令?

它似乎不是由上下文切换或CPU迁移引起的,因为它们处于或接近于零,并且在任何情况下,额外指令计数与发生更多这些事件的运行不相关.

实际上,额外内核指令的数量与循环计数非常平滑.这是(数十亿)循环迭代与内核指令的图表:

循环计数与内核指令

你可以看到这种关系几乎完全是线性的 - 事实上直到15e9迭代只有一个异常值.在那之后,似乎有两条独立的线条,暗示某种形式的量化会导致超时.在任何情况下,对于主循环中执行的每个1e9指令,都会产生大约350K的内核指令.

最后,我注意到执行的内核指令的数量似乎与运行时1(或CPU时间)成比例,而不是执行的指令.为了测试这个,我使用了一个类似的程序,但是其中一个nop指令被替换为idiv具有大约40个周期的延迟(删除了一些不感兴趣的行):

~/dev/perf-test$ perf stat ./perf-test-div 10

 Performance counter stats for './perf-test-div 10':

    41,768,314,396      cycles                    #    3.430 GHz                       
     4,014,826,989      instructions              #    0.10  insns per cycle        
     1,002,957,543      branches                  #   82.369 M/sec                  

      12.177372636 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)

在这里,我们花了大约42e9个循环来完成1e9次迭代,我们有大约14,800,000个额外指令.相比之下,相同的1e9循环只有约400,000条额外指令nop.如果我们与nop大约相同数量cycles(40e9次迭代)的循环进行比较,我们会看到几乎完全相同的额外指令数量:

~/dev/perf-test$ perf stat ./perf-test-nop 41

 Performance counter stats for './perf-test-nop 41':

    41,145,332,629      cycles                    #    3.425 
   164,013,912,324      instructions              #    3.99  insns per cycle        
    41,002,424,948      branches                  # 3412.968 M/sec                  

      12.013355313 seconds time elapsed
Run Code Online (Sandbox Code Playgroud)

这个神秘的工作在内核中发生了什么?


1这里我在这里或多或少地交替使用术语"时间"和"周期".在这些测试期间,CPU会平稳运行,因此模拟一些与涡轮增压相关的热效应,周期与时间成正比.

Iwi*_*ist 7

TL; DR

答案很简单.您的计数器设置为在用户操作系统模式下都会计数,并且Linux的时间切片调度程序会定期对您的测量进行干扰,每秒计时几次.

幸运的你,而4天前调查与@PeterCordes一个不相关的问题,我发表了我自己的性能计数器访问软件的清理后的版本,libpfc.

libpfc

libpfc is a very low-level library and Linux Loadable Kernel Module that I coded myself using only the complete Intel Software Developers' Manual as reference. The performance counting facility is documented in Volume 3, Chapter 18-19 of the SDM. It is configured by writing specific values to specific MSRs (Model-Specific Registers) present in certain x86 processors.

There are two types of counters that can be configured on my Intel Haswell processor:

  • Fixed-Function Counters. They are restricted to counting a specific type of event. They can be enabled or disabled but what they track can't be changed. On the 2.4GHz Haswell i7-4700MQ there are 3:

    • Instructions Issued: What it says on the tin.
    • Unhalted Core Cycles:实际发生的时钟周期数.如果处理器的频率向上或向下扩展,则此计数器将在每单位时间内分别开始更快或更慢的计数.
    • Unhalted Reference Cycles:时钟节拍的数量,以不受动态频率调整影响的恒定频率滴答.在我的2.4GHz处理器上,它的精确度为2.4GHz.因此,Unhalted Reference / 2.4e9给出亚纳秒精度的时序,并Unhalted Core / Unhalted Reference给出从Turbo Boost获得的平均加速因子.
  • General-Purpose Counters. These can in general be configured to track any event (with only a few limitations) that is listed in the SDM for your particular processor. For Haswell, they're currently listed in the SDMs's Volume 3, §19.4, and my repository contains a demo, pfcdemo, that accesses a large subset of them. They're listed at pfcdemo.c:169. On my Haswell processor, when HyperThreading is enabled, each core has 4 such counters.

Configuring the counters properly

To configure the counters, I took upon the burden of programming every MSR myself in my LKM, pfc.ko, whose source code is included in my repository.

Programming MSRs must be done extremely carefully, else the processor will punish you with a kernel panic. For this reason I familiarized myself with every single bit of 5 different types of MSRs, in addition to the general-purpose and fixed-function counters themselves. My notes on these registers are at pfckmod.c:750, and are reproduced here:

/** 186+x IA32_PERFEVTSELx           -  Performance Event Selection, ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {................................################################}
 *                                                     |      ||||||||||      ||      |
 *     Counter Mask -----------------------------------^^^^^^^^|||||||||      ||      |
 *     Invert Counter Mask ------------------------------------^||||||||      ||      |
 *     Enable Counter ------------------------------------------^|||||||      ||      |
 *     AnyThread ------------------------------------------------^||||||      ||      |
 *     APIC Interrupt Enable -------------------------------------^|||||      ||      |
 *     Pin Control ------------------------------------------------^||||      ||      |
 *     Edge Detect -------------------------------------------------^|||      ||      |
 *     Operating System Mode ----------------------------------------^||      ||      |
 *     User Mode -----------------------------------------------------^|      ||      |
 *     Unit Mask (UMASK) ----------------------------------------------^^^^^^^^|      |
 *     Event Select -----------------------------------------------------------^^^^^^^^
 */
/** 309+x IA32_FIXED_CTRx            -  Fixed-Function Counter,      ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {????????????????????????????????????????????????????????????????}
 *                     |                                                              |
 *     Counter Value --^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *     
 *     NB: Number of FF counters determined by    CPUID.0x0A.EDX[ 4: 0]
 *     NB: ???? FF counter bitwidth determined by CPUID.0x0A.EDX[12: 5]
 */
/** 38D   IA32_FIXED_CTR_CTRL        -  Fixed Counter Controls,      ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {....................................................############}
 *                                                                         |  ||  |||||
 *     IA32_FIXED_CTR2 controls  ------------------------------------------^^^^|  |||||
 *     IA32_FIXED_CTR1 controls  ----------------------------------------------^^^^||||
 *                                                                                 ||||
 *     IA32_FIXED_CTR0 controls:                                                   ||||
 *     IA32_FIXED_CTR0 PMI --------------------------------------------------------^|||
 *     IA32_FIXED_CTR0 AnyThread ---------------------------------------------------^||
 *     IA32_FIXED_CTR0 enable (0:Disable 1:OS 2:User 3:All) -------------------------^^
 */
/** 38E   IA32_PERF_GLOBAL_STATUS    -  Global Overflow Status,      ArchPerfMon v3
 * 
 *                  /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                 {###..........................###............................####}
 *                  |||                          |||                            ||||
 *     CondChgd ----^||                          |||                            ||||
 *     OvfDSBuffer --^|                          |||                            ||||
 *     OvfUncore -----^                          |||                            ||||
 *     IA32_FIXED_CTR2 Overflow -----------------^||                            ||||
 *     IA32_FIXED_CTR1 Overflow ------------------^|                            ||||
 *     IA32_FIXED_CTR0 Overflow -------------------^                            ||||
 *     IA32_PMC(N-1)   Overflow ------------------------------------------------^|||
 *     ....                     -------------------------------------------------^||
 *     IA32_PMC1       Overflow --------------------------------------------------^|
 *     IA32_PMC0       Overflow ---------------------------------------------------^
 */
/** 38F   IA32_PERF_GLOBAL_CTRL      -  Global Enable Controls,      ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {.............................###............................####}
 *                                                  |||                            ||||
 *     IA32_FIXED_CTR2 enable ----------------------^||                            ||||
 *     IA32_FIXED_CTR1 enable -----------------------^|                            ||||
 *     IA32_FIXED_CTR0 enable ------------------------^                            ||||
 *     IA32_PMC(N-1)   enable -----------------------------------------------------^|||
 *     ....                   ------------------------------------------------------^||
 *     IA32_PMC1       enable -------------------------------------------------------^|
 *     IA32_PMC0       enable --------------------------------------------------------^
 */
/** 390   IA32_PERF_GLOBAL_OVF_CTRL  -  Global Overflow Control,     ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {###..........................###............................####}
 *                     |||                          |||                            ||||
 *     ClrCondChgd ----^||                          |||                            ||||
 *     ClrOvfDSBuffer --^|                          |||                            ||||
 *     ClrOvfUncore -----^                          |||                            ||||
 *     IA32_FIXED_CTR2 ClrOverflow -----------------^||                            ||||
 *     IA32_FIXED_CTR1 ClrOverflow ------------------^|                            ||||
 *     IA32_FIXED_CTR0 ClrOverflow -------------------^                            ||||
 *     IA32_PMC(N-1)   ClrOverflow ------------------------------------------------^|||
 *     ....                        -------------------------------------------------^||
 *     IA32_PMC1       ClrOverflow --------------------------------------------------^|
 *     IA32_PMC0       ClrOverflow ---------------------------------------------------^
 */
/** 4C1+x IA32_A_PMCx                -  General-Purpose Counter,     ArchPerfMon v3
 * 
 *                     /63/60 /56     /48     /40     /32     /24     /16     /08     /00
 *                    {????????????????????????????????????????????????????????????????}
 *                     |                                                              |
 *     Counter Value --^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 *     
 *     NB: Number of GP counters determined by    CPUID.0x0A.EAX[15: 8]
 *     NB: ???? GP counter bitwidth determined by CPUID.0x0A.EAX[23:16]
 */
Run Code Online (Sandbox Code Playgroud)

In particular, observe IA32_PERFEVTSELx, bits 16 (User Mode) and 17 (OS Mode) and IA32_FIXED_CTR_CTRL, bits IA32_FIXED_CTRx enable. IA32_PERFEVTSELx configures general-purpose counter x, while each group of 4 bits starting at bit 4*x counting from the LSB in IA32_FIXED_CTR_CTRL configures fixed-function counter x.

在MSR中IA32_PERFEVTSELx,如果在用户位置1时OS位清零,则计数器仅在用户模式下累积事件,并将排除内核模式事件.在MSR中IA32_FIXED_CTRL_CTRL,每组4位包含一个两位enable字段,如果设置为2(0b10)将启用用户模式中的事件计数,但不能用于内核模式.

我的LKM强制用户模式仅用于计数在两个固定功能和通用计数器pfckmod.c:296pfckmod.c:330分别.

用户空间

In user-space, the user configures the counters (example of the process, starting at pfcdemo.c:98), then sandwiches the code to be timed using the macros PFCSTART() and PFCEND(). These are very specific code sequences, but they have a cost, and thus produce a biased result that overcounts the number of events from the timers. Thus, you must also call pfcRemoveBias(), which times PFCSTART()/PFCEND() when they surround 0 instructions, and removes the bias from the accumulated count.

Your Code, under libpfc

I took your code and put it into pfcdemo.c:130, such that it now read

/************** Hot section **************/
PFCSTART(CNT);
asm volatile(
".intel_syntax noprefix\n\t"
"mov             rax, 1000000000\n\t"
".loop:\n\t"
"nop\n\t"
"dec             rax\n\t"
"nop\n\t"
"jne             .loop\n\t"
".att_syntax noprefix\n\t"
: /* No outputs we care about */
: /* No inputs we care about */
: "rax", "memory", "cc"
);
PFCEND  (CNT);
/************ End Hot section ************/
Run Code Online (Sandbox Code Playgroud)

. I got the following:

Instructions Issued                  :           4000000086
Unhalted core cycles                 :           1001668898
Unhalted reference cycles            :            735432000
uops_issued.any                      :           4000261487
uops_issued.any<1                    :              2445188
uops_issued.any>=1                   :           1000095148
uops_issued.any>=2                   :           1000070454
Instructions Issued                  :           4000000084
Unhalted core cycles                 :           1002792358
Unhalted reference cycles            :            741096720
uops_issued.any>=3                   :           1000057533
uops_issued.any>=4                   :           1000044117
uops_issued.any>=5                   :                    0
uops_issued.any>=6                   :                    0
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1011149969
Unhalted reference cycles            :            750048048
uops_executed_port.port_0            :            374577796
uops_executed_port.port_1            :            227762669
uops_executed_port.port_2            :                 1077
uops_executed_port.port_3            :                 2271
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1006474726
Unhalted reference cycles            :            749845800
uops_executed_port.port_4            :                 3436
uops_executed_port.port_5            :            438401716
uops_executed_port.port_6            :           1000083071
uops_executed_port.port_7            :                 1255
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1009164617
Unhalted reference cycles            :            756860736
resource_stalls.any                  :                 1365
resource_stalls.rs                   :                    0
resource_stalls.sb                   :                    0
resource_stalls.rob                  :                    0
Instructions Issued                  :           4000000083
Unhalted core cycles                 :           1007578976
Unhalted reference cycles            :            755945832
uops_retired.all                     :           4000097703
uops_retired.all<1                   :              8131817
uops_retired.all>=1                  :           1000053694
uops_retired.all>=2                  :           1000023800
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1015267723
Unhalted reference cycles            :            756582984
uops_retired.all>=3                  :           1000021575
uops_retired.all>=4                  :           1000011412
uops_retired.all>=5                  :                 1452
uops_retired.all>=6                  :                    0
Instructions Issued                  :           4000000086
Unhalted core cycles                 :           1013085918
Unhalted reference cycles            :            758116368
inst_retired.any_p                   :           4000000086
inst_retired.any_p<1                 :             13696825
inst_retired.any_p>=1                :           1000002589
inst_retired.any_p>=2                :           1000000132
Instructions Issued                  :           4000000083
Unhalted core cycles                 :           1004281248
Unhalted reference cycles            :            745031496
inst_retired.any_p>=3                :            999997926
inst_retired.any_p>=4                :            999997925
inst_retired.any_p>=5                :                    0
inst_retired.any_p>=6                :                    0
Instructions Issued                  :           4000000086
Unhalted core cycles                 :           1018752394
Unhalted reference cycles            :            764101152
idq_uops_not_delivered.core          :             71912269
idq_uops_not_delivered.core<1        :           1001512943
idq_uops_not_delivered.core>=1       :             17989688
idq_uops_not_delivered.core>=2       :             17982564
Instructions Issued                  :           4000000081
Unhalted core cycles                 :           1007166725
Unhalted reference cycles            :            755495952
idq_uops_not_delivered.core>=3       :              6848823
idq_uops_not_delivered.core>=4       :              6844506
rs_events.empty                      :                    0
idq.empty                            :              6940084
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1012633828
Unhalted reference cycles            :            758772576
idq.mite_uops                        :             87578573
idq.dsb_uops                         :                56640
idq.ms_dsb_uops                      :                    0
idq.ms_mite_uops                     :               168161
Instructions Issued                  :           4000000088
Unhalted core cycles                 :           1013799250
Unhalted reference cycles            :            758772144
idq.mite_all_uops                    :            101773478
idq.mite_all_uops<1                  :            988984583
idq.mite_all_uops>=1                 :             25470706
idq.mite_all_uops>=2                 :             25443691
Instructions Issued                  :           4000000087
Unhalted core cycles                 :           1009164246
Unhalted reference cycles            :            758774400
idq.mite_all_uops>=3                 :             16246335
idq.mite_all_uops>=4                 :             16239687
move_elimination.int_not_eliminated  :                    0
move_elimination.simd_not_eliminated :                    0
Instructions Issued                  :           4000000089
Unhalted core cycles                 :           1018530294
Unhalted reference cycles            :            763961712
lsd.uops                             :           3863703268
lsd.uops<1                           :             53262230
lsd.uops>=1                          :            965925817
lsd.uops>=2                          :            965925817
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1012124380
Unhalted reference cycles            :            759399384
lsd.uops>=3                          :            978583021
lsd.uops>=4                          :            978583021
ild_stall.lcp                        :                    0
ild_stall.iq_full                    :                  863
Instructions Issued                  :           4000000087
Unhalted core cycles                 :           1008976349
Unhalted reference cycles            :            758008488
br_inst_exec.all_branches            :           1000009401
br_inst_exec.0x81                    :           1000009400
br_inst_exec.0x82                    :                    0
icache.misses                        :                  168
Instructions Issued                  :           4000000084
Unhalted core cycles                 :           1010302763
Unhalted reference cycles            :            758333856
br_misp_exec.all_branches            :                    2
br_misp_exec.0x81                    :                    1
br_misp_exec.0x82                    :                    0
fp_assist.any                        :                    0
Instructions Issued                  :           4000000082
Unhalted core cycles                 :           1008514841
Unhalted reference cycles            :            757761792
cpu_clk_unhalted.core_clk            :           1008510494
cpu_clk_unhalted.ref_xclk            :             31573233
baclears.any                         :                    0
idq.ms_uops                          :               164093
Run Code Online (Sandbox Code Playgroud)

没有开销了!您可以从固定功能计数器(例如最后一组打印输出)中看到IPC 4000000082 / 1008514841,大约4个IPC,757761792 / 2.4e9代码从0.31573408秒开始,从1008514841 / 757761792= 1.330912763941521,核心是Turbo Boosting到133% 2.4GHz,或3.2GHz.

  • 是的,我确实发现 `:u` 标志消除了_大部分_开销,但仍然保留了一点 - 但可以想象它仍然是由于中断的开销 - 请参阅[此页](http://web.eece. maine.edu/~vweaver/projects/perf_counters/retired_instructions.html)对于一些剩余的过度计数问题(即使使用 `:u` 时),还有一整篇[论文](http://icl.cs.utk.edu/ news_pub/submissions/ispass2013_definistic.pdf)如果你真的想深入研究(facft中有几篇论文)。 (2认同)