Bee*_*ope 27 performance x86 x86-64 cpu-architecture rdtsc
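On recent CPUs (at least for the last decade or so), Intel has offered three fixed-function hardware performance counters in addition to the various configurable performance counters. The three fixed counters are: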
INST_RETIRED.ANY
CPU_CLK_UNHALTED.THREAD
CPU_CLK_UNHALTED.REF_TSC
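The first counts retired instructions, the second counts actual cycles, and the last is the one that interests us. The description in Volume 3 of the Intel Software Developer's Manual is:
This event counts the number of reference cycles at the TSC rate when the core is not in a halt state and not in a TM stop-clock state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (e.g., P states) but counts at the same frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state and not in a TM stopclock state.
So for a CPU-bound loop, I expect this value to be the same as the free-running TSC value read from rdtsc, since the two should diverge only for halted cycles or for whatever the "TM stopclock state" is.
I test this with the following loop (the entire standalone demo is available on github):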
for (int i = 0; i < 100; i++) {
    PFC_CNT cnt[7] = {};

    int64_t start = nanos();
    PFCSTART(cnt);
    int64_t tsc = __rdtsc();
    busy_loop(CALIBRATION_LOOPS);
    PFCEND(cnt);
    int64_t tsc_delta   = __rdtsc() - tsc;
    int64_t nanos_delta = nanos() - start;

    printf(CPU_W "d" REF_W ".2f" TSC_W ".2f" MHZ_W ".2f" RAT_W ".6f\n",
            sched_getcpu(),
            1000.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC] / nanos_delta,
            1000.0 * tsc_delta / nanos_delta,
            1000.0 * CALIBRATION_LOOPS / nanos_delta,
            1.0 * cnt[PFC_FIXEDCNT_CPU_CLK_REF_TSC] / tsc_delta);
}
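The only important thing in the timed region is busy_loop(CALIBRATION_LOOPS); which is just a tight loop of volatile stores that, as compiled by gcc and clang, executes at one cycle per iteration on recent hardware: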
void busy_loop(uint64_t iters) {
    volatile int sink;
    do {
        sink = 0;
    } while (--iters > 0);
    (void)sink;
}
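The PFCSTART and PFCEND commands read the CPU_CLK_UNHALTED.REF_TSC counter using libpfc. The __rdtsc() call is an intrinsic that reads the TSC via the rdtsc instruction. Finally, we measure real time with nanos(), which is simply: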
int64_t nanos() {
    auto t = std::chrono::high_resolution_clock::now();
    return std::chrono::time_point_cast<std::chrono::nanoseconds>(t).time_since_epoch().count();
}
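Yes, I don't issue a cpuid, and things aren't interleaved in an exact way, but the calibration loop is a full second, so such nanosecond-scale issues get diluted down to more or less nothing.
With TurboBoost enabled, here are the first few results from a typical run on my i7-6700HQ Skylake CPU: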
CPU# REF_TSC rdtsc Eff Mhz Ratio
0 2392.05 2591.76 2981.30 0.922946
0 2381.74 2591.79 3032.86 0.918955
0 2399.12 2591.79 3032.50 0.925660
0 2385.04 2591.79 3010.58 0.920230
0 2378.39 2591.79 3010.21 0.917663
0 2355.84 2591.77 2928.96 0.908970
0 2364.99 2591.79 2942.32 0.912492
0 2339.64 2591.77 2935.36 0.902720
0 2366.43 2591.79 3022.08 0.913049
0 2401.93 2591.79 3023.52 0.926747
0 2452.87 2591.78 3070.91 0.946400
0 2350.06 2591.79 2961.93 0.906733
0 2340.44 2591.79 2897.58 0.903020
0 2403.22 2591.79 2944.77 0.927246
0 2394.10 2591.79 3059.58 0.923723
0 2359.69 2591.78 2957.79 0.910449
0 2353.33 2591.79 2916.39 0.907992
0 2339.58 2591.79 2951.62 0.902690
0 2395.82 2591.79 3017.59 0.924389
0 2353.47 2591.79 2937.82 0.908047
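Here, REF_TSC is the fixed TSC performance counter as described above, and rdtsc is the result from the rdtsc instruction. Eff Mhz is the effective true CPU frequency calculated over the interval; it is shown mostly for curiosity and as a quick confirmation of how much turbo is kicking in. Ratio is the ratio of the REF_TSC and rdtsc columns. I would expect this to be very close to 1, but in practice we see it hover around 0.90 to 0.92 with a lot of variance (I've seen it as low as 0.8 on other runs).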
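As a minimal sketch of how the four numeric columns follow from the raw deltas in the measurement loop: counts per nanosecond multiplied by 1000 give counts per microsecond, which is MHz. The names Row and make_row are hypothetical and not part of the demo; the Eff Mhz figure relies on the stated assumption that busy_loop retires one iteration per core cycle.

```cpp
#include <cstdint>

// One output row, derived from the raw deltas of a single measurement.
// Row and make_row are illustrative names, not part of the original demo.
struct Row {
    double ref_tsc_mhz;  // REF_TSC column
    double rdtsc_mhz;    // rdtsc column
    double eff_mhz;      // Eff Mhz column (assumes 1 loop iteration per cycle)
    double ratio;        // Ratio column: REF_TSC counts per rdtsc count
};

Row make_row(int64_t ref_tsc_delta, int64_t tsc_delta,
             int64_t loop_iters, int64_t nanos_delta) {
    Row r;
    // counts / ns * 1000 == counts / us == MHz
    r.ref_tsc_mhz = 1000.0 * ref_tsc_delta / nanos_delta;
    r.rdtsc_mhz   = 1000.0 * tsc_delta / nanos_delta;
    r.eff_mhz     = 1000.0 * loop_iters / nanos_delta;
    // A value below 1 means REF_TSC stopped counting for part of the interval.
    r.ratio       = 1.0 * ref_tsc_delta / tsc_delta;
    return r;
}
```

A ratio of 1.0 means the fixed counter never paused relative to the free-running TSC during the interval.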
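Graphically, it looks something like this [2]:
The rdtsc calls return nearly exact results [1], while the PMU TSC counter is all over the place, sometimes almost as low as 2300 MHz.
If I turn off turbo, however, the results are much more consistent: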
CPU# REF_TSC rdtsc Eff Mhz Ratio
0 2592.26 2592.25 2588.30 1.000000
0 2592.26 2592.26 2591.11 1.000000
0 2592.26 2592.26 2590.40 1.000000
0 2592.25 2592.25 2590.43 1.000000
0 2592.26 2592.26 2590.75 1.000000
0 2592.26 2592.26 2590.05 1.000000
0 2592.25 2592.25 2590.04 1.000000
0 2592.24 2592.24 2590.86 1.000000
0 2592.25 2592.25 2590.35 1.000000
0 2592.25 2592.25 2591.32 1.000000
0 2592.25 2592.25 2590.63 1.000000
0 2592.25 2592.25 2590.87 1.000000
0 2592.25 2592.25 2590.77 1.000000
0 2592.25 2592.25 2590.64 1.000000
0 2592.24 2592.24 2590.30 1.000000
0 2592.23 2592.23 2589.64 1.000000
0 2592.23 2592.23 2590.83 1.000000
0 2592.23 2592.23 2590.49 1.000000
0 2592.23 2592.23 2590.78 1.000000
0 2592.23 2592.23 2590.84 1.000000
0 2592.22 2592.22 2588.80 1.000000
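Basically, the ratio is 1.000000 to 6 decimal places.
Graphically (with the Y axis scale forced to be the same as in the previous graph):
Now the code is just running a hot loop, and there should be no hlt or mwait instructions, certainly nothing that would imply a variation of more than 10%. I can't say for sure what "TM stop-clock cycles" are, but I'd bet they are "thermal management stop-clock cycles", a trick used to temporarily throttle the CPU when it reaches its maximum temperature. However, I looked at the integrated thermistor readings, and I never saw the CPU break 60C, way below the 90C-100C where thermal management kicks in (I think).
Any idea what this could be? Are there implicit "halt cycles" to transition between different turbo frequencies? Such transitions definitely happen, since the box is not quiet and so the turbo frequency jumps up and down as other cores start and stop working on background stuff (the max turbo frequency depends directly on the number of active cores: on my box it is 3.5, 3.3, 3.2, 3.1 GHz for 1, 2, 3 or 4 cores active, respectively).
[1] In fact, for a while I really was getting exact results to two decimal places: 2591.97 MHz, iteration after iteration. Then something changed, I'm not sure what, and there is now a small variation of about 0.1% in the rdtsc results. One possibility is gradual clock adjustment by the Linux timing subsystem to bring the local crystal-derived time in line with the time determined by ntpd. Or perhaps it is just crystal drift: the last graph above shows a steady increase in the measured rdtsc rate each second.
[2] The graphs don't correspond to the same runs as the values shown in the text, because I'm not going to update the graphs every time I change the text output format. The qualitative behavior is essentially the same on each run, however.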
Iwi*_*ist 17
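The discrepancy you observe between RDTSC and REFTSC is caused by TurboBoost P-state transitions. During these transitions, most of the core, including the fixed-function performance counter REF_TSC, is halted for approximately 20000-21000 cycles (8.5us), but rdtsc continues at its invariant frequency. rdtsc is probably in an isolated power and clock domain because it is so important and because of its documented wallclock-like behaviour.
The RDTSC-REFTSC Discrepancy
The discrepancy manifests itself as a tendency for RDTSC to overcount REFTSC. The longer the program runs, the more positive the difference RDTSC-REFTSC tends to be. Over very long stretches it can mount as high as 1%-2% or even higher.
Of course, you have already observed that the overcounting disappears when TurboBoost is disabled, which can be done as follows when using intel_pstate: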
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
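But that does not tell us for sure that TurboBoost is at fault for the discrepancy; it could be that the higher P-states enabled by TurboBoost eat up the available headroom, causing thermal throttling and halts.
TurboBoost is a dynamic frequency and voltage scaling solution that opportunistically takes advantage of headroom in the operating envelope (thermal or electrical). When possible, TurboBoost scales the core frequency and voltage of the processor above their nominal values, improving performance at the expense of higher power consumption.
The higher power consumption of course raises the core temperature. Eventually some type of limit will be hit, and TurboBoost will have to crank performance back down.
I began by investigating whether the Thermal Control Circuitry (TCC) for Thermal Monitor 1 (TM1) or 2 (TM2) was causing thermal throttling. TM1 reduces power consumption by inserting TM stop-clock cycles, and these are one of the conditions documented to lead to a halt of REFTSC. TM2, on the other hand, does not gate the clock; it only scales the frequency.
I modified libpfc to enable me to read select MSRs, specifically the IA32_PACKAGE_THERM_STATUS and IA32_THERM_STATUS MSRs. Both contain read-only Status flags and read-write, hardware-sticky Log flags for various thermal conditions.
While some of these bits were set on occasion (especially when blocking the laptop's air vents!), they did not seem to correlate with RDTSC overcounting, which would reliably occur regardless of thermal status.
Digging elsewhere in the SDM for stop-clock-like hardware, I happened upon HDC (Hardware Duty Cycle), a mechanism by which the OS can manually request that the CPU operate only a fixed proportion of the time; the HDC hardware implements this by running the processor for 1-15 clock cycles per 16-clock period, and force-idling it for the remaining 15-1 clock cycles of that period.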
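To make the duty-cycle arithmetic concrete, here is a small sketch of the 16-clock period described above. The names DutyCycle and hdc_duty_cycle are hypothetical and only model the arithmetic; they do not touch any HDC hardware or MSR.

```cpp
// Models one 16-clock HDC period: `active` cycles running, the rest force-idled.
// DutyCycle and hdc_duty_cycle are illustrative names, not a real API.
struct DutyCycle {
    int active;       // cycles the processor runs per period (1..15)
    int idle;         // cycles the processor is force-idled per period
    double fraction;  // share of time spent running
};

// Returns false if `active_cycles` is outside the 1..15 range given in the text.
bool hdc_duty_cycle(int active_cycles, DutyCycle &out) {
    const int period = 16;
    if (active_cycles < 1 || active_cycles > 15) {
        return false;
    }
    out.active = active_cycles;
    out.idle = period - active_cycles;
    out.fraction = static_cast<double>(active_cycles) / period;
    return true;
}
```

For example, 4 active cycles per period leave 12 cycles of forced idle, so the processor runs 25% of the time.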
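HDC offers very useful registers, in particular the MSRs:
- IA32_THREAD_STALL: Counts the number of cycles stalled due to forced idling on this logical processor.
- MSR_CORE_HDC_RESIDENCY: Same as above, but for the physical processor; counts cycles when one or more logical processors of this core are force-idling.
- MSR_PKG_HDC_SHALLOW_RESIDENCY: Counts cycles during which the package was in the C2 state and at least one logical processor was force-idling.
- MSR_PKG_HDC_DEEP_RESIDENCY: Counts cycles during which the package was in a deeper C-state (which one precisely is configurable) and at least one logical processor was force-idling.
For further details, refer to the Intel SDM Volume 3, Chapter 14, §14.5.1 Hardware Duty Cycle Programming Interface.
But my i7-4700MQ 2.4 GHz CPU doesn't support HDC, and so that was that for HDC.
Digging some more in the Intel SDM, I found a very, very juicy MSR: MSR_CORE_PERF_LIMIT_REASONS. This register reports a large number of very useful Status and sticky Log bits: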
690H MSR_CORE_PERF_LIMIT_REASONS - Package - Indicator of Frequency Clipping in Processor Cores
- Bit 0: PROCHOT Status
- Bit 1: Thermal Status
- Bit 4: Graphics Driver Status. When set, frequency is reduced below the operating system request due to Processor Graphics driver override.
- Bit 5: Autonomous Utilization-Based Frequency Control Status. When set, frequency is reduced below the operating system request because the processor has detected that utilization is low.
- Bit 6: Voltage Regulator Thermal Alert Status. When set, frequency is reduced below the operating system request due to a thermal alert from the Voltage Regulator.
- Bit 8: Electrical Design Point Status. When set, frequency is reduced below the operating system request due to electrical design point constraints (e.g. maximum electrical current consumption).
- Bit 9: Core Power Limiting Status. When set, frequency is reduced below the operating system request due to domain-level power limiting.
- Bit 10: Package-Level Power Limiting PL1 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL1.
- Bit 11: Package-Level Power Limiting PL2 Status. When set, frequency is reduced below the operating system request due to package-level power limiting PL2.
- Bit 12: Max Turbo Limit Status. When set, frequency is reduced below the operating system request due to multi-core turbo limits.
- Bit 13: Turbo Transition Attenuation Status. When set, frequency is reduced below the operating system request due to Turbo transition attenuation. This prevents performance degradation due to frequent operating ratio changes.
- Bit 16: PROCHOT Log
- Bit 17: Thermal Log
- Bit 20: Graphics Driver Log
- Bit 21: Autonomous Utilization-Based Frequency Control Log
- Bit 22: Voltage Regulator Thermal Alert Log
- Bit 24: Electrical Design Point Log
- Bit 25: Core Power Limiting Log
- Bit 26: Package-Level Power Limiting PL1 Log
- Bit 27: Package-Level Power Limiting PL2 Log
- Bit 28: Max Turbo Limit Log
- Bit 29: Turbo Transition Attenuation Log
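The bit layout above can be decoded mechanically, since each Log bit sits exactly 16 positions above its Status bit. The following is only a sketch of such a decoder: the names LimitReason, kReasons and decode_limit_reasons are hypothetical, and this is not the code used by pfc.ko or the demo.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One entry per limiting reason; the Log bit is the Status bit plus 16.
// All names here are illustrative, not taken from pfc.ko.
struct LimitReason {
    int status_bit;
    const char *name;
};

static const LimitReason kReasons[] = {
    {0,  "PROCHOT"},
    {1,  "Thermal"},
    {4,  "Graphics Driver"},
    {5,  "Autonomous Utilization-Based Frequency Control"},
    {6,  "Voltage Regulator Thermal Alert"},
    {8,  "Electrical Design Point"},
    {9,  "Core Power Limiting"},
    {10, "Package-Level PL1 Power Limiting"},
    {11, "Package-Level PL2 Power Limiting"},
    {12, "Max Turbo Limit"},
    {13, "Turbo Transition Attenuation"},
};

// Returns the names of all reasons whose Status bit (log == false)
// or sticky Log bit (log == true) is set in the raw MSR value.
std::vector<std::string> decode_limit_reasons(uint64_t msr, bool log) {
    std::vector<std::string> out;
    for (const LimitReason &r : kReasons) {
        const int bit = r.status_bit + (log ? 16 : 0);
        if ((msr >> bit) & 1u) {
            out.push_back(r.name);
        }
    }
    return out;
}
```

As a worked case, the raw value 0x18001000 has bits 12, 27 and 28 set, so it decodes to a Max Turbo Limit status plus the Package-Level PL2 and Max Turbo Limit log bits.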
pfc.ko now supports this MSR, and a demo prints which of these log bits is active. The pfc.ko driver clears the sticky bits on every read.
I reran your experiments while printing the bits, and my CPU reports under very heavy load (all 4 cores/8 threads active) several limiting factors, including Electrical Design Point and Core Power Limiting. The Package-Level PL2 and Max Turbo Limit bits are always set on my CPU for reasons unknown to me. I also saw on occasion Turbo Transition Attenuation.
While none of these bits exactly correlated with the presence of the RDTSC-REFTSC discrepancy, the last bit gave me food for thought. The mere existence of Turbo Transition Attenuation implies that switching P-States has a substantial-enough cost that it must be rate-limited with some hysteresis mechanism. When I could not find an MSR that counted these transitions, I decided to do the next best thing - I'll use the magnitude of the RDTSC-REFTSC overcount to characterize the performance implications of a TurboBoost transition.
The experiment setup is as follows. On my i7-4700MQ CPU, nominal speed 2.4GHz and max Turbo Speed 3.4 GHz, I'll offline all cores except 0 (the boot processor) and 3 (a convenient victim core not numbered 0 and not a logical sibling of 0). We will then ask the intel_pstate driver to give us a package performance of no less than 98% and no higher than 100%; This constrains the processor to oscillate between the second-highest and highest P-states (3.3 GHz and 3.4 GHz). I do this as follows:
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu4/online
echo 0 > /sys/devices/system/cpu/cpu5/online
echo 0 > /sys/devices/system/cpu/cpu6/online
echo 0 > /sys/devices/system/cpu/cpu7/online
echo 98 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 100 > /sys/devices/system/cpu/intel_pstate/max_perf_pct
I ran the demo application for 10000 samples at
1000, 1500, 2500, 4000, 6300,
10000, 15000, 25000, 40000, 63000,
100000, 150000, 250000, 400000, 630000,
1000000, 1500000, 2500000, 4000000, 6300000,
10000000, 15000000, 25000000, 40000000, 63000000
nanoseconds per add_calibration() executed at nominal CPU frequency (multiply the numbers above by 2.4 to get the actual argument to add_calibration()).
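The sample durations follow a 1, 1.5, 2.5, 4, 6.3 pattern per decade, and the conversion to a loop argument is a multiplication by 2.4 cycles per nanosecond. A sketch of both steps follows; sample_durations_ns and nanos_to_iterations are hypothetical helper names, and the assumption is that the argument is simply a cycle count at the nominal 2.4 GHz.

```cpp
#include <cstdint>
#include <vector>

// Builds the 25 sample durations: {1, 1.5, 2.5, 4, 6.3} x 10^k ns, k = 3..7.
// Function names are illustrative, not part of the demo application.
std::vector<uint64_t> sample_durations_ns() {
    // Mantissas scaled by 10 so that everything stays in integer arithmetic.
    static const uint64_t mantissa_x10[] = {10, 15, 25, 40, 63};
    std::vector<uint64_t> out;
    uint64_t scale = 100;  // 10 * 100 = 1000 ns for the first entry
    for (int decade = 0; decade < 5; ++decade) {
        for (uint64_t m : mantissa_x10) {
            out.push_back(m * scale);
        }
        scale *= 10;
    }
    return out;
}

// Converts a duration to a cycle count at the nominal 2.4 GHz (2.4 = 12/5).
uint64_t nanos_to_iterations(uint64_t ns) {
    return ns * 12 / 5;
}
```

With this conversion, the 250000 ns case corresponds to an argument of 600000.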
This produces logs that look like this (case of 250000 nanos):
CPU 0, measured CLK_REF_TSC MHz : 2392.56
CPU 0, measured rdtsc MHz : 2392.46
CPU 0, measured add MHz : 3286.30
CPU 0, measured XREF_CLK time (s) : 0.00018200
CPU 0, measured delta time (s) : 0.00018258
CPU 0, measured tsc_delta time (s) : 0.00018200
CPU 0, ratio ref_tsc :ref_xclk : 24.00131868
CPU 0, ratio ref_core:ref_xclk : 33.00071429
CPU 0, ratio rdtsc :ref_xclk : 24.00032967
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : -18
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018001000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz : 2392.63
CPU 0, measured rdtsc MHz : 2392.62
CPU 0, measured add MHz : 3288.03
CPU 0, measured XREF_CLK time (s) : 0.00018192
CPU 0, measured delta time (s) : 0.00018248
CPU 0, measured tsc_delta time (s) : 0.00018192
CPU 0, ratio ref_tsc :ref_xclk : 24.00000000
CPU 0, ratio ref_core:ref_xclk : 32.99983509
CPU 0, ratio rdtsc :ref_xclk : 23.99989006
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018001000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz : 2284.69
CPU 0, measured rdtsc MHz : 2392.63
CPU 0, measured add MHz : 3151.99
CPU 0, measured XREF_CLK time (s) : 0.00018121
CPU 0, measured delta time (s) : 0.00019036
CPU 0, measured tsc_delta time (s) : 0.00018977
CPU 0, ratio ref_tsc :ref_xclk : 24.00000000
CPU 0, ratio ref_core:ref_xclk : 33.38540919
CPU 0, ratio rdtsc :ref_xclk : 25.13393301
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : 20548
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018000000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
CPU 0, measured CLK_REF_TSC MHz : 2392.46
CPU 0, measured rdtsc MHz : 2392.45
CPU 0, measured add MHz : 3287.80
CPU 0, measured XREF_CLK time (s) : 0.00018192
CPU 0, measured delta time (s) : 0.00018249
CPU 0, measured tsc_delta time (s) : 0.00018192
CPU 0, ratio ref_tsc :ref_xclk : 24.00000000
CPU 0, ratio ref_core:ref_xclk : 32.99978012
CPU 0, ratio rdtsc :ref_xclk : 23.99989006
CPU 0, core CLK cycles in OS : 0
CPU 0, User-OS transitions : 0
CPU 0, rdtsc-reftsc overcount : -2
CPU 0, MSR_IA32_PACKAGE_THERM_STATUS : 000000008819080a
CPU 0, MSR_IA32_PACKAGE_THERM_INTERRUPT: 0000000000000003
CPU 0, MSR_CORE_PERF_LIMIT_REASONS : 0000000018001000
PROCHOT
Thermal
Graphics Driver
Autonomous Utilization-Based Frequency Control
Voltage Regulator Thermal Alert
Electrical Design Point (e.g. Current)
Core Power Limiting
Package-Level PL1 Power Limiting
* Package-Level PL2 Power Limiting
* Max Turbo Limit (Multi-Core Turbo)
Turbo Transition Attenuation
I made several observations about the logs, but one stood out:
For nanos < ~250000, there is negligible RDTSC overcounting. For nanos > ~250000, one may reliably observe overcounting clock cycle quanta of just over 20000 clock cycles. But they are not due to User-OS transitions.
Here is a visual plot:
Saturated Blue Dots: 0 standard deviations (close to mean)
Saturated Red Dots: +3 standard deviations (above mean)
Saturated Green Dots: -3 standard deviations (below mean)
There is a marked difference before, during and after roughly 250000 nanoseconds of sustained decrementing.
Before the threshold, the CSV logs look like this:
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,-4,3639,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-44,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-14,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,12,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,10,0,0
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,32,3171,1
24.00,33.00,24.00,-20,0,0
24.00,33.00,24.00,10,0,0
Indicating a TurboBoost ratio perfectly stable at 33x, an RDTSC that counts in synchrony with REFTSC at 24x the rate of REF_XCLK (100 MHz), negligible overcounting, typically 0 cycles spent in the kernel and thus 0 transitions into the kernel. Kernel interrupts take approximately 3000 reference cycles to service.
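For readers who want to post-process such CSV lines, here is a hedged parsing sketch. The column meanings are an assumption inferred from the labels in the verbose log (ref_tsc:ref_xclk, ref_core:ref_xclk, rdtsc:ref_xclk, overcount, cycles in OS, User-OS transitions); Sample, parse_sample and is_quiet are hypothetical names, and the 100-cycle noise bound is an arbitrary illustrative choice.

```cpp
#include <cstdio>

// One CSV record; the field order is assumed from the verbose log labels.
struct Sample {
    double ref_tsc_ratio;      // ref_tsc  : ref_xclk
    double core_ratio;         // ref_core : ref_xclk (estimated multiplier)
    double rdtsc_ratio;        // rdtsc    : ref_xclk
    long long overcount;       // rdtsc - reftsc, in reference cycles
    long long os_cycles;       // core clock cycles spent in the OS
    long long os_transitions;  // number of User-OS transitions
};

// Parses one line; returns true only if all six fields were read.
bool parse_sample(const char *line, Sample &s) {
    return std::sscanf(line, "%lf,%lf,%lf,%lld,%lld,%lld",
                       &s.ref_tsc_ratio, &s.core_ratio, &s.rdtsc_ratio,
                       &s.overcount, &s.os_cycles, &s.os_transitions) == 6;
}

// A "quiet" sample never entered the kernel and shows only a few cycles
// of overcount in either direction.
bool is_quiet(const Sample &s, long long noise_cycles = 100) {
    const long long mag = s.overcount < 0 ? -s.overcount : s.overcount;
    return s.os_transitions == 0 && mag <= noise_cycles;
}
```

Filtering on the transition count first keeps kernel interrupts from being mistaken for anything else when scanning the overcount column.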
At the critical threshold, the log contains clumps of 20000 cycle overcounts, and the overcounts correlate very well with non-integer estimated multiplier values between 33x and 34x:
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,2,0,0
24.00,33.00,24.00,22,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.05,25.11,20396,0,0
24.00,33.38,25.12,20212,0,0
24.00,33.39,25.12,20308,0,0
24.00,33.42,25.12,20296,0,0
24.00,33.43,25.11,20158,0,0
24.00,33.43,25.11,20178,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.00,24.00,20,3920,1
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-4,0,0
24.00,33.44,25.13,20396,0,0
24.00,33.46,25.11,20156,0,0
24.00,33.46,25.12,20268,0,0
24.00,33.41,25.12,20322,0,0
24.00,33.40,25.11,20216,0,0
24.00,33.46,25.12,20168,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,-2,0,0
24.00,33.00,24.00,22,0,0
The TurboBoost from 3.3 GHz to 3.4 GHz now happens reliably. As the nanos increase, the logs are filled with roughly integer multiples of 20000-cycle quanta. Eventually there are so many nanos that the Linux scheduler interrupts become permanent fixtures, but preemption is easily detected with the performance counters, and its effect is not at all similar to the TurboBoost halts.
24.00,33.75,24.45,20166,0,0
24.00,33.78,24.45,20302,0,0
24.00,33.78,24.45,20202,0,0
24.00,33.68,24.91,41082,0,0
24.00,33.31,24.90,40998,0,0
24.00,33.70,25.30,58986,3668,1
24.00,33.74,24.42,18798,0,0
24.00,33.74,24.45,20172,0,0
24.00,33.77,24.45,20156,0,0
24.00,33.78,24.45,20258,0,0
24.00,33.78,24.45,20240,0,0
24.00,33.77,24.42,18826,0,0
24.00,33.75,24.45,20372,0,0
24.00,33.76,24.42,18798,4081,1
24.00,33.74,24.41,18460,0,0
24.00,33.75,24.45,20234,0,0
24.00,33.77,24.45,20284,0,0
24.00,33.78,24.45,20150,0,0
24.00,33.78,24.45,20314,0,0
24.00,33.78,24.42,18766,0,0
24.00,33.71,25.36,61608,0,0
24.00,33.76,24.45,20336,0,0
24.00,33.78,24.45,20234,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.78,24.45,20210,0,0
24.00,33.00,24.00,-10,0,0
24.00,33.00,24.00,4,0,0
24.00,33.00,24.00,18,0,0
24.00,33.00,24.00,2,4132,1
24.00,33.00,24.00,44,0,0
The TurboBoost machinery is responsible for the discrepancy in RDTSC-REFTSC. This discrepancy can be used to determine that a TurboBoost state transition from 3.3 GHz to 3.4 GHz takes approximately 20500 reference clock cycles (8.5us), and is triggered no later than about 250000 ns (250us; 600000 reference clock cycles) after entry into add_reference(), when the processor decides that the workload is sufficiently intense as to deserve a frequency-voltage scaling.
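The two conversions behind these figures can be sketched as follows: reference cycles divided by the TSC rate in MHz give microseconds, and an overcount divided by the roughly 20500-cycle quantum, rounded to the nearest integer, estimates how many halts a sample contained. The names cycles_to_us and estimate_transitions are hypothetical, and treating every quantum as exactly 20500 cycles is a simplifying assumption.

```cpp
#include <cstdint>

// Reference cycles -> microseconds, given the TSC rate in MHz (cycles per us).
double cycles_to_us(int64_t cycles, double tsc_mhz) {
    return cycles / tsc_mhz;
}

// Estimates the number of transition halts in one sample by rounding the
// overcount to the nearest multiple of the quantum (assumed 20500 cycles).
int64_t estimate_transitions(int64_t overcount, int64_t quantum = 20500) {
    if (overcount <= 0) {
        return 0;  // small negative values are measurement noise
    }
    return (overcount + quantum / 2) / quantum;
}
```

At 2400 MHz, 20500 cycles come to about 8.5us and 600000 cycles to 250us; overcounts such as 41082 and 58986 from the logs round to two and three quanta, respectively.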
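More research needs to be done to determine how the transition cost varies with frequency, and whether the hardware that selects the power state can be tuned. Of particular interest to me are "Turbo Attenuation Units", hints of which I've seen in the far reaches of the web. Perhaps the Turbo hardware has a configurable time window? Currently, the ratio of time spent deciding to time spent transitioning is about 30:1 (roughly 600000:20000 reference cycles, i.e. about 250us:8.5us). Can it be tuned?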