Lost cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC

Bee*_*ope 27 performance x86 x86-64 cpu-architecture rdtsc

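On recent CPUs (at least the last decade or so), Intel has offered three fixed-function hardware performance counters in addition to the various configurable performance counters. The three fixed counters are: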

INST_RETIRED.ANY
CPU_CLK_UNHALTED.THREAD
CPU_CLK_UNHALTED.REF_TSC

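The first counts retired instructions, the second counts actual cycles, and the last is the one that interests us. Volume 3 of the Intel Software Developer's Manual describes it as follows:

This event counts the number of reference cycles at the TSC rate when the core is not in a halt state and not in a TM stop-clock state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (e.g., P-states) but counts at the same frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state and not in a TM stop-clock state.

So for a CPU-bound loop, I would expect this value to be the same as the free-running TSC value read from rdtsc, since the two should diverge only for halted cycles or for whatever the "TM stop-clock state" is.

I test this with the following loop (the entire standalone demo is available on GitHub):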

for (int i = 0; i < 100; i++) {
    PFC_CNT cnt[7] = {};

    int64_t start = nanos();
    PFCSTART(cnt);
    int64_t tsc =__rdtsc();
    busy_loop(CALIBRATION_LOOPS);
    PFCEND(cnt);
    int64_t tsc_delta   = __rdtsc() - tsc;
    int64_t nanos_delta = nanos() - start;

    printf(CPU_W "d" REF_W ".2f" TSC_W ".2f" MHZ_W ".2f" RAT_W ".6f\n",
            sched_getcpu(),
            1000.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC] / nanos_delta,
            1000.0 * tsc_delta / nanos_delta,
            1000.0 * CALIBRATION_LOOPS / nanos_delta,
            1.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC]/tsc_delta);
}

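The only important thing in the timed region is busy_loop(CALIBRATION_LOOPS);, which is simply a tight loop of volatile stores that, as compiled by gcc and clang, executes at one cycle per iteration on recent hardware: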

void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;
    } while (--iters > 0);
    (void)sink;
}

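The PFCSTART and PFCEND commands read the CPU_CLK_UNHALTED.REF_TSC counter using libpfc. __rdtsc() is an intrinsic that reads the TSC via the rdtsc instruction. Finally, we measure real time with nanos(), which is simply: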

int64_t nanos() {
    auto t = std::chrono::high_resolution_clock::now();
    return std::chrono::time_point_cast<std::chrono::nanoseconds>(t).time_since_epoch().count();
}

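Yes, I don't issue a cpuid, and things aren't interleaved in an exact manner, but the calibration loop runs for a full second, so such nanosecond-scale issues get diluted down to more or less nothing.

With TurboBoost enabled, here are the first few results from a typical run on my i7-6700HQ Skylake CPU: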

CPU# REF_TSC   rdtsc Eff Mhz     Ratio
   0 2392.05 2591.76 2981.30  0.922946
   0 2381.74 2591.79 3032.86  0.918955
   0 2399.12 2591.79 3032.50  0.925660
   0 2385.04 2591.79 3010.58  0.920230
   0 2378.39 2591.79 3010.21  0.917663
   0 2355.84 2591.77 2928.96  0.908970
   0 2364.99 2591.79 2942.32  0.912492
   0 2339.64 2591.77 2935.36  0.902720
   0 2366.43 2591.79 3022.08  0.913049
   0 2401.93 2591.79 3023.52  0.926747
   0 2452.87 2591.78 3070.91  0.946400
   0 2350.06 2591.79 2961.93  0.906733
   0 2340.44 2591.79 2897.58  0.903020
   0 2403.22 2591.79 2944.77  0.927246
   0 2394.10 2591.79 3059.58  0.923723
   0 2359.69 2591.78 2957.79  0.910449
   0 2353.33 2591.79 2916.39  0.907992
   0 2339.58 2591.79 2951.62  0.902690
   0 2395.82 2591.79 3017.59  0.924389
   0 2353.47 2591.79 2937.82  0.908047

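Here, REF_TSC is the fixed TSC performance counter described above, and rdtsc is the result from the rdtsc instruction. Eff Mhz is the effective true CPU frequency calculated over the interval; it is shown mostly out of curiosity and as a quick confirmation of how much turbo is kicking in. Ratio is the ratio of the REF_TSC and rdtsc columns. I would expect it to be very close to 1, but in practice it hovers around 0.90 to 0.92 with a lot of variance (I have seen it as low as 0.8 on other runs).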
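To make the column definitions concrete, here is a minimal sketch of the arithmetic behind one output row. The struct `Row` and the helper `make_row` are hypothetical names introduced for illustration; the formulas mirror the printf arguments in the loop above, and the Eff Mhz column relies on the stated assumption that busy_loop runs at one iteration per core cycle.

```cpp
#include <cassert>
#include <cstdint>

// One output row, derived from the raw deltas of a single measurement.
struct Row {
    double ref_tsc_mhz;  // CPU_CLK_UNHALTED.REF_TSC counts per microsecond
    double rdtsc_mhz;    // rdtsc ticks per microsecond
    double eff_mhz;      // busy-loop iterations per microsecond
    double ratio;        // REF_TSC counts divided by rdtsc ticks
};

// Hypothetical helper mirroring the printf arguments in the question's loop.
// A count per nanosecond, multiplied by 1000, is a rate in MHz.
inline Row make_row(std::int64_t ref_tsc_delta, std::int64_t tsc_delta,
                    std::int64_t loops, std::int64_t nanos_delta) {
    assert(nanos_delta > 0 && tsc_delta > 0);
    Row r;
    r.ref_tsc_mhz = 1000.0 * ref_tsc_delta / nanos_delta;
    r.rdtsc_mhz   = 1000.0 * tsc_delta / nanos_delta;
    r.eff_mhz     = 1000.0 * loops / nanos_delta;  // assumes 1 iteration per cycle
    r.ratio       = 1.0 * ref_tsc_delta / tsc_delta;
    return r;
}
```

If the two counters really ticked in lockstep, `ratio` would be 1 regardless of the turbo frequency, because neither the numerator nor the denominator depends on the core clock.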

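Graphically, it looks something like this [2]:

[Figure: PMU TSC vs rdtsc]

The rdtsc call returns nearly exact results [1], while the PMU TSC counter is all over the place, sometimes dropping almost as low as 2300 MHz.

If I turn off turbo, however, the results are much more consistent: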

CPU# REF_TSC   rdtsc Eff Mhz     Ratio
   0 2592.26 2592.25 2588.30  1.000000
   0 2592.26 2592.26 2591.11  1.000000
   0 2592.26 2592.26 2590.40  1.000000
   0 2592.25 2592.25 2590.43  1.000000
   0 2592.26 2592.26 2590.75  1.000000
   0 2592.26 2592.26 2590.05  1.000000
   0 2592.25 2592.25 2590.04  1.000000
   0 2592.24 2592.24 2590.86  1.000000
   0 2592.25 2592.25 2590.35  1.000000
   0 2592.25 2592.25 2591.32  1.000000
   0 2592.25 2592.25 2590.63  1.000000
   0 2592.25 2592.25 2590.87  1.000000
   0 2592.25 2592.25 2590.77  1.000000
   0 2592.25 2592.25 2590.64  1.000000
   0 2592.24 2592.24 2590.30  1.000000
   0 2592.23 2592.23 2589.64  1.000000
   0 2592.23 2592.23 2590.83  1.000000
   0 2592.23 2592.23 2590.49  1.000000
   0 2592.23 2592.23 2590.78  1.000000
   0 2592.23 2592.23 2590.84  1.000000
   0 2592.22 2592.22 2588.80  1.000000

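Basically, the ratio is 1.000000 to 6 decimal places.

Graphically (with the Y-axis scale forced to be the same as in the previous graph):

[Figure: PMU vs rdtsc (no turbo)]

Now, the code is just running a hot loop, and there should be no hlt or mwait instructions, certainly nothing that would explain a variation of more than 10%. I can't say for sure what "TM stop-clock cycles" are, but I'd bet they are "thermal management stop-clock cycles", a trick used to temporarily throttle the CPU when it reaches its maximum temperature. However, I looked at the integrated thermistor readings and never saw the CPU break 60C, far below the 90C-100C where thermal management kicks in (I think).

Any idea what this could be? Are there implicit "halt cycles" for transitioning between different turbo frequencies? Such transitions definitely happen: the box isn't quiet, so the turbo frequency jumps up and down as other cores start and stop working on background stuff (the max turbo frequency depends directly on the number of active cores: on my box it is 3.5, 3.3, 3.2 and 3.1 GHz for 1, 2, 3 or 4 active cores, respectively).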


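[1] In fact, for a while I really was getting exact results to two decimal places: 2591.97 MHz, iteration after iteration. Then something changed, I'm not sure what, and there is now a small variation of about 0.1% in the rdtsc results. One possibility is gradual clock adjustment by the Linux timing subsystem to bring the local crystal-derived time in line with the time determined by ntpd. Or perhaps it is just crystal drift: the last graph above shows a steady increase in the measured period of rdtsc each second.

[2] The graphs don't correspond to the same runs as the values shown in the text, because I'm not going to update the graphs each time I change the text output format. The qualitative behaviour is essentially the same on every run, however.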

Iwi*_*ist 17

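TL;DR

The discrepancy you are observing between RDTSC and REFTSC is caused by TurboBoost P-state transitions. During these transitions, most of the core, including the fixed-function performance counter REF_TSC, is halted for approximately 20000-21000 cycles (8.5us), but rdtsc continues to run at its invariant frequency. rdtsc is probably in an isolated power and clock domain because it is so important and because of its documented wall-clock-like behaviour.

The RDTSC-REFTSC Discrepancy

The discrepancy manifests itself as a tendency for RDTSC to overcount relative to REFTSC. The longer the program runs, the more positive the difference RDTSC-REFTSC becomes. Over very long stretches it can mount as high as 1%-2% or even higher.

Of course, you have already observed that the overcounting disappears when TurboBoost is disabled, which can be done as follows when using intel_pstate: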

echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

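But that does not tell us for sure that TurboBoost itself is at fault for the discrepancy; it could be that the higher P-states enabled by TurboBoost eat up the available headroom, causing thermal throttling and halts.

Possible Throttling?

TurboBoost is a dynamic frequency and voltage scaling solution that opportunistically takes advantage of headroom in the operating envelope (thermal or electrical). When possible, TurboBoost scales the processor's core frequency and voltage beyond their nominal values, improving performance at the expense of higher power consumption.

The higher power consumption of course raises the core temperature. Eventually some type of limit will be hit, and TurboBoost will have to crank performance back down.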

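TM1 Thermal Throttling?

I began by investigating whether the Thermal Control Circuitry (TCC) for Thermal Monitor 1 (TM1) or 2 (TM2) was causing thermal throttling. TM1 reduces power consumption by inserting TM stop-clock cycles, and these are one of the conditions documented to halt REFTSC. TM2, on the other hand, does not gate the clock; it only scales the frequency.

I modified libpfc to let me read select MSRs, specifically the IA32_PACKAGE_THERM_STATUS and IA32_THERM_STATUS MSRs. Both contain a read-only Status flag and a read-write, hardware-sticky Log flag for various thermal conditions:

[Figure: IA32_THERM_STATUS register contents (the IA32_PACKAGE_THERM_STATUS register is substantially the same)]

While some of these bits were set on occasion (especially when blocking the laptop's air vents!), they did not seem to correlate with RDTSC overcounting, which occurred reliably regardless of thermal status.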

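Hardware Duty Cycling? C-State Residency?

Digging elsewhere in the SDM for stop-clock-like hardware, I happened upon HDC (Hardware Duty Cycling), a mechanism by which the OS can manually request that the CPU operate only a fixed proportion of the time; the HDC hardware implements this by running the processor for 1-15 clock cycles out of every 16-clock-cycle period and force-idling it for the remaining 15-1 clock cycles of that period.

HDC offers very useful registers, in particular these MSRs:

  • IA32_THREAD_STALL: Counts the number of cycles stalled due to forced idling on this logical processor.
  • MSR_CORE_HDC_RESIDENCY: Same as above, but for the physical processor; counts cycles when one or more logical processors of this core are force-idling.
  • MSR_PKG_HDC_SHALLOW_RESIDENCY: Counts cycles in which the package was in the C2 state and at least one logical processor was force-idling.
  • MSR_PKG_HDC_DEEP_RESIDENCY: Counts cycles in which the package was in a deeper C-state (which one exactly is configurable) and at least one logical processor was force-idling.

For further details, see the Intel SDM Volume 3, Chapter 14, §14.5.1 Hardware Duty Cycling Programming Interface.

But my i7-4700MQ 2.4 GHz CPU doesn't support HDC, and so that was that for HDC.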

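Other Sources of Throttling?

Digging some more in the Intel SDM, I found a very, very juicy MSR: MSR_CORE_PERF_LIMIT_REASONS. This register reports a large number of very useful Status and sticky Log bits:

690H MSR_CORE_PERF_LIMIT_REASONS - Package - Indicator of Frequency Clipping in the Processor Core

  • Bit 0: PROCHOT Status
  • Bit 1: Thermal Status
  • Bit 4: Graphics Driver Status. When set, frequency is reduced below the operating system request due to Processor Graphics driver override.
  • Bit 5: Autonomous Utilization-Based Frequency Control Status. When set, frequency is reduced below the operating system request because the processor has detected that utilization is low.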
  • Bit 6: Voltage Regulator Thermal Alert Status. When set, frequency is reduced below the operating system request due to a thermal alert from the Voltage Regulator.
  • Bit 8: Electrical Design Point Status. When set, frequency is reduced below the operating system request due to electrical design point constraints (e.g. maximum electrical current consumption).
  • Bit 9: Core Power Limiting Status. When set, frequency is reduced below the operating system request due to domain-level power limiting.
  • Bit 10: Package-Level Power Limiting PL1 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL1.
  • Bit 11: Package-Level Power Limiting PL2 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL2.
  • Bit 12: Max Turbo Limit Status. When set, frequency is reduced below the operating system request due to multi-core turbo limits.
  • Bit 13: Turbo Transition Attenuation Status. When set, frequency is reduced below the operating system request due to Turbo transition attenuation. This prevents performance degradation due to frequent operating ratio changes.
  • Bit 16: PROCHOT Log
  • Bit 17: Thermal Log
  • Bit 20: Graphics Driver Log
  • Bit 21: Autonomous Utilization-Based Frequency Control Log
  • Bit 22: Voltage Regulator Thermal Alert Log
  • Bit 24: Electrical Design Point Log
  • Bit 25: Core Power Limiting Log
  • Bit 26: Package-Level Power Limiting PL1 Log
  • Bit 27: Package-Level Power Limiting PL2 Log
  • Bit 28: Max Turbo Limit Log
  • Bit 29: Turbo Transition Attenuation Log
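The bit list above can be turned into a small decoder. This is only a sketch under the layout listed here, not the code used in pfc.ko or its demo; `LimitReason`, `decode_reasons`, `active_status` and `active_logs` are hypothetical names. It relies on the observation that, in this list, every Log bit sits exactly 16 positions above its Status bit.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Status-bit positions and names as listed above. Each Log bit is the
// corresponding Status bit plus 16 (0 -> 16, 13 -> 29).
struct LimitReason {
    unsigned status_bit;
    const char* name;
};

static const LimitReason kLimitReasons[] = {
    {0,  "PROCHOT"},
    {1,  "Thermal"},
    {4,  "Graphics Driver"},
    {5,  "Autonomous Utilization-Based Frequency Control"},
    {6,  "Voltage Regulator Thermal Alert"},
    {8,  "Electrical Design Point"},
    {9,  "Core Power Limiting"},
    {10, "Package-Level PL1 Power Limiting"},
    {11, "Package-Level PL2 Power Limiting"},
    {12, "Max Turbo Limit"},
    {13, "Turbo Transition Attenuation"},
};

// Hypothetical helper: names of all reasons whose bit (status_bit + shift)
// is set in the raw MSR value. shift 0 selects Status, shift 16 selects Log.
inline std::vector<std::string> decode_reasons(std::uint64_t msr, unsigned shift) {
    assert(shift == 0 || shift == 16);
    std::vector<std::string> out;
    for (const LimitReason& r : kLimitReasons) {
        if ((msr >> (r.status_bit + shift)) & 1u) {
            out.push_back(r.name);
        }
    }
    return out;
}

inline std::vector<std::string> active_status(std::uint64_t msr) {
    return decode_reasons(msr, 0);
}

inline std::vector<std::string> active_logs(std::uint64_t msr) {
    return decode_reasons(msr, 16);
}
```

For the raw value 0000000018001000 that appears in the logs further down, `active_logs` yields the Package-Level PL2 and Max Turbo Limit entries (bits 27 and 28), and `active_status` yields Max Turbo Limit (bit 12).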

pfc.ko now supports this MSR, and a demo prints which of these log bits is active. The pfc.ko driver clears the sticky bits on every read.

I reran your experiments while printing the bits, and my CPU reports under very heavy load (all 4 cores/8 threads active) several limiting factors, including Electrical Design Point and Core Power Limiting. The Package-Level PL2 and Max Turbo Limit bits are always set on my CPU for reasons unknown to me. I also saw on occasion Turbo Transition Attenuation.

While none of these bits exactly correlated with the presence of the RDTSC-REFTSC discrepancy, the last bit gave me food for thought. The mere existence of Turbo Transition Attenuation implies that switching P-States has a substantial-enough cost that it must be rate-limited with some hysteresis mechanism. When I could not find an MSR that counted these transitions, I decided to do the next best thing - I'll use the magnitude of the RDTSC-REFTSC overcount to characterize the performance implications of a TurboBoost transition.

Experiment

The experiment setup is as follows. On my i7-4700MQ CPU, nominal speed 2.4 GHz and max Turbo speed 3.4 GHz, I offline all cores except 0 (the boot processor) and 3 (a convenient victim core not numbered 0 and not a logical sibling of 0). I then ask the intel_pstate driver to give us a package performance of no less than 98% and no higher than 100%; this constrains the processor to oscillate between the second-highest and highest P-states (3.3 GHz and 3.4 GHz). I do this as follows:

echo   0 > /sys/devices/system/cpu/cpu1/online
echo   0 > /sys/devices/system/cpu/cpu2/online
echo   0 > /sys/devices/system/cpu/cpu4/online
echo   0 > /sys/devices/system/cpu/cpu5/online
echo   0 > /sys/devices/system/cpu/cpu6/online
echo   0 > /sys/devices/system/cpu/cpu7/online
echo  98 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

I ran the demo application for 10000 samples at

1000,     1500,     2500,     4000,     6300,
10000,    15000,    25000,    40000,    63000,
100000,   150000,   250000,   400000,   630000,
1000000,  1500000,  2500000,  4000000,  6300000,
10000000, 15000000, 25000000, 40000000, 63000000

nanoseconds per add_calibration() executed at nominal CPU frequency (multiply the numbers above by 2.4 to get the actual argument to add_calibration()).
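As a worked example of that conversion, the sketch below maps a target duration in nanoseconds to the argument value. `calibration_arg` and `kNominalMHz` are hypothetical names; the only input taken from the text is the 2.4 GHz nominal frequency, expressed as 2400 MHz so that the arithmetic stays in integers.

```cpp
#include <cstdint>

// Assumption: nominal frequency of the i7-4700MQ from the text, 2.4 GHz.
constexpr std::uint64_t kNominalMHz = 2400;

// Hypothetical helper: nanoseconds at nominal frequency -> cycle count,
// i.e. "multiply by 2.4" done in integer arithmetic.
constexpr std::uint64_t calibration_arg(std::uint64_t nanos) {
    return nanos * kNominalMHz / 1000;
}

// The threshold case discussed in this answer: 250000 ns is 600000 cycles.
static_assert(calibration_arg(250000) == 600000, "250 us at 2.4 GHz");
static_assert(calibration_arg(1000) == 2400, "shortest sample length");
```

Every duration in the list above is a multiple of 100 ns, so the integer division is exact for all of them.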

Results

This produces logs that look like this (case of 250000 nanos):

CPU 0, measured CLK_REF_TSC MHz        :          2392.56
CPU 0, measured rdtsc MHz              :          2392.46
CPU 0, measured add   MHz              :          3286.30
CPU 0, measured XREF_CLK  time (s)     :       0.00018200
CPU 0, measured delta     time (s)     :       0.00018258
CPU 0, measured tsc_delta time (s)     :       0.00018200
CPU 0, ratio ref_tsc :ref_xclk         :      24.00131868
CPU 0, ratio ref_core:ref_xclk         :      33.00071429
CPU 0, ratio rdtsc   :ref_xclk         :      24.00032967
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :              -18
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018001000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz        :          2392.63
CPU 0, measured rdtsc MHz              :          2392.62
CPU 0, measured add   MHz              :          3288.03
CPU 0, measured XREF_CLK  time (s)     :       0.00018192
CPU 0, measured delta     time (s)     :       0.00018248
CPU 0, measured tsc_delta time (s)     :       0.00018192
CPU 0, ratio ref_tsc :ref_xclk         :      24.00000000
CPU 0, ratio ref_core:ref_xclk         :      32.99983509
CPU 0, ratio rdtsc   :ref_xclk         :      23.99989006
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :               -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018001000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz        :          2284.69
CPU 0, measured rdtsc MHz              :          2392.63
CPU 0, measured add   MHz              :          3151.99
CPU 0, measured XREF_CLK  time (s)     :       0.00018121
CPU 0, measured delta     time (s)     :       0.00019036
CPU 0, measured tsc_delta time (s)     :       0.00018977
CPU 0, ratio ref_tsc :ref_xclk         :      24.00000000
CPU 0, ratio ref_core:ref_xclk         :      33.38540919
CPU 0, ratio rdtsc   :ref_xclk         :      25.13393301
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :            20548
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018000000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz        :          2392.46
CPU 0, measured rdtsc MHz              :          2392.45
CPU 0, measured add   MHz              :          3287.80
CPU 0, measured XREF_CLK  time (s)     :       0.00018192
CPU 0, measured delta     time (s)     :       0.00018249
CPU 0, measured tsc_delta time (s)     :       0.00018192
CPU 0, ratio ref_tsc :ref_xclk         :      24.00000000
CPU 0, ratio ref_core:ref_xclk         :      32.99978012
CPU 0, ratio rdtsc   :ref_xclk         :      23.99989006
CPU 0, core CLK cycles in OS           :                0
CPU 0, User-OS transitions             :                0
CPU 0, rdtsc-reftsc overcount          :               -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS   : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS     : 0000000018001000
        PROCHOT
        Thermal
        Graphics Driver
        Autonomous Utilization-Based Frequency Control
        Voltage Regulator Thermal Alert
        Electrical Design Point (e.g. Current)
        Core Power Limiting
        Package-Level PL1 Power Limiting
      * Package-Level PL2 Power Limiting
      * Max Turbo Limit (Multi-Core Turbo)
        Turbo Transition Attenuation

I made several observations about the logs, but one stood out:

For nanos < ~250000, there is negligible RDTSC overcounting. For nanos > ~250000, one may reliably observe overcounting clock cycle quanta of just over 20000 clock cycles. But they are not due to User-OS transitions.
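The overcount line in the logs can be cross-checked against the two ratio lines. The sketch below is my reconstruction of that relationship, not the demo's actual code; `overcount_ticks` and `ticks_to_us` are hypothetical names, and it assumes REF_XCLK runs at 100 MHz and the TSC at 24 times that rate (2.4 GHz), as the logs suggest.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr double kRefXclkHz = 100e6;  // assumption: REF_XCLK at 100 MHz
constexpr double kTscHz     = 2.4e9;  // assumption: TSC at 24 x REF_XCLK

// Hypothetical helper: rdtsc-reftsc overcount in TSC ticks, rebuilt from the
// "XREF_CLK time" line and the two "ratio ...:ref_xclk" lines of one sample.
inline std::int64_t overcount_ticks(double xclk_seconds, double ratio_rdtsc,
                                    double ratio_ref_tsc) {
    assert(xclk_seconds > 0.0);
    const double xclk_cycles = xclk_seconds * kRefXclkHz;
    return std::llround((ratio_rdtsc - ratio_ref_tsc) * xclk_cycles);
}

// Converts a tick count at the TSC rate to microseconds.
inline double ticks_to_us(std::int64_t ticks) {
    return static_cast<double>(ticks) / kTscHz * 1e6;
}
```

Feeding in the third sample from the log (XREF_CLK time 0.00018121 s, ratios 25.13393301 and 24.00000000) gives 20548 ticks, the same number the log prints as the overcount, which corresponds to roughly 8.56 us at 2.4 GHz.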

Here is a visual plot:

[Figure: plot showing the quantized TurboBoost transition penalty]

Saturated Blue Dots: 0 standard deviations (close to mean)

Saturated Red Dots: +3 standard deviations (above mean)

Saturated Green Dots: -3 standard deviations (below mean)

There is a marked difference before, during and after roughly 250000 nanoseconds of sustained decrementing.

Nanos < 250000

Before the threshold, the CSV logs look like this:

24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,-4,3639,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-44,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,12,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,32,3171,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0

This indicates a TurboBoost ratio perfectly stable at 33x, an RDTSC that counts in synchrony with REFTSC at 24x the rate of REF_XCLK (100 MHz), negligible overcounting, and typically 0 cycles spent in the kernel and thus 0 transitions into the kernel. Kernel interrupts take approximately 3000 reference cycles to service.

Nanos == 250000

At the critical threshold, the log contains clumps of 20000 cycle overcounts, and the overcounts correlate very well with non-integer estimated multiplier values between 33x and 34x:

24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,2,0,0
24.00,33.00,24.00,22,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.05,25.11,20396,0,0
24.00,33.38,25.12,20212,0,0
24.00,33.39,25.12,20308,0,0
24.00,33.42,25.12,20296,0,0
24.00,33.43,25.11,20158,0,0
24.00,33.43,25.11,20178,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.00,24.00,20,3920,1
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.44,25.13,20396,0,0
24.00,33.46,25.11,20156,0,0
24.00,33.46,25.12,20268,0,0
24.00,33.41,25.12,20322,0,0
24.00,33.40,25.11,20216,0,0
24.00,33.46,25.12,20168,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,22,0,0

Nanos > 250000

The TurboBoost from 3.3 GHz to 3.4 GHz now happens reliably. As the nanos increase, the logs are filled with roughly integer multiples of 20000-cycle quanta. Eventually there are so many nanos that the Linux scheduler interrupts become permanent fixtures, but preemption is easily detected with the performance counters, and its effect is not at all similar to the TurboBoost halts.

24.00,33.75,24.45,20166,0,0
24.00,33.78,24.45,20302,0,0
24.00,33.78,24.45,20202,0,0
24.00,33.68,24.91,41082,0,0
24.00,33.31,24.90,40998,0,0
24.00,33.70,25.30,58986,3668,1
24.00,33.74,24.42,18798,0,0
24.00,33.74,24.45,20172,0,0
24.00,33.77,24.45,20156,0,0
24.00,33.78,24.45,20258,0,0
24.00,33.78,24.45,20240,0,0
24.00,33.77,24.42,18826,0,0
24.00,33.75,24.45,20372,0,0
24.00,33.76,24.42,18798,4081,1
24.00,33.74,24.41,18460,0,0
24.00,33.75,24.45,20234,0,0
24.00,33.77,24.45,20284,0,0
24.00,33.78,24.45,20150,0,0
24.00,33.78,24.45,20314,0,0
24.00,33.78,24.42,18766,0,0
24.00,33.71,25.36,61608,0,0
24.00,33.76,24.45,20336,0,0
24.00,33.78,24.45,20234,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.00,24.00,-10,0,0
24.00,33.00,24.00,4,0,0
24.00,33.00,24.00,18,0,0
24.00,33.00,24.00,2,4132,1
24.00,33.00,24.00,44,0,0
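One way to read the overcount column in these logs is as a count of transitions. The sketch below rounds an overcount to the nearest whole number of quanta; `estimated_transitions` is a hypothetical helper, and the quantum of 20500 ticks is an assumption taken as the midpoint of the 20000-21000 range quoted in the TL;DR.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumption: one transition costs about 20500 TSC ticks, the midpoint of the
// 20000-21000 range quoted in the TL;DR.
constexpr double kQuantumTicks = 20500.0;

// Hypothetical helper: nearest whole number of transition quanta contained in
// an overcount. Non-positive overcounts are treated as noise and map to 0.
inline long estimated_transitions(std::int64_t overcount,
                                  double quantum_ticks = kQuantumTicks) {
    assert(quantum_ticks > 0.0);
    if (overcount <= 0) {
        return 0;
    }
    return std::lround(static_cast<double>(overcount) / quantum_ticks);
}
```

Applied to samples from the log above, 20166 maps to one transition, 41082 to two and 61608 to three, while the small residuals such as -10 or 44 map to zero. The rounding is deliberately coarse: values such as 18798 also land on one transition, so this is a rough classifier and not a measurement of the exact cost.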

Conclusions

The TurboBoost machinery is responsible for the discrepancy in RDTSC-REFTSC. This discrepancy can be used to determine that a TurboBoost state transition from 3.3 GHz to 3.4 GHz takes approximately 20500 reference clock cycles (8.5us), and is triggered no later than about 250000 ns (250us; 600000 reference clock cycles) after entry into add_calibration(), when the processor decides that the workload is sufficiently intense to deserve frequency-voltage scaling.

Future Work

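More research needs to be done to determine how the transition cost varies with frequency, and whether the hardware that selects the power state can be tuned. Of particular interest to me are "Turbo Attenuation Units", hints of which I have seen in the far reaches of the web. Perhaps the Turbo hardware has a configurable time window? Currently the ratio of time spent deciding to time spent transitioning is about 30:1 (roughly 600000 : 20000 reference cycles, i.e. 250us : 8.5us). Can it be tuned?

  • @BeeOnRope From the documentation of the CORE_PERF_LIMIT_REASONS MSR: "Bit 13: Turbo Transition Attenuation Status. When set, frequency is reduced below the operating system request due to Turbo transition attenuation. This prevents performance degradation due to frequent operating ratio changes." To me this implies that the hardware is below where it could be given all the other envelope conditions, but that a hysteresis timer detected too many transitions in the recent past and is refusing to upscale while we are in the lower P-state. It acts as a sort of oscillation damper. (3 upvotes)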