Cow*_*gun 313 c++ optimization x86 intel cpu-architecture
I've been racking my brain for a week trying to complete this assignment, and I'm hoping someone can point me down the right path. Let me start with the instructor's directions:
Your assignment is the opposite of our first lab assignment, which was to optimize a prime-number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab PCs. You may not change the algorithm.
To de-optimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.
The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:
// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm> // Needed for the "max" function
#include <cmath>
#include <cstdlib>   // rand() and RAND_MAX
#include <iostream>
// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
  double x = 0.0;
  double y = 0.0;
  double euclid_sq = 0.0;

  // Continue generating two uniform random variables
  // until the square of their "euclidean distance"
  // is less than unity
  do {
    x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
    y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
    euclid_sq = x*x + y*y;
  } while (euclid_sq >= 1.0);

  return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}
// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(S_cur - K, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(K - S_cur, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}
int main(int argc, char **argv) {
  // First we create the parameter list
  int num_sims = 10000000;  // Number of simulated asset paths
  double S = 100.0;         // Underlying price
  double K = 100.0;         // Strike price
  double r = 0.05;          // Risk-free rate (5%)
  double v = 0.2;           // Volatility of the underlying (20%)
  double T = 1.0;           // One year until expiry

  // Then we calculate the call/put values via Monte Carlo
  double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
  double put = monte_carlo_put_price(num_sims, S, K, r, v, T);

  // Finally we output the parameters and prices
  std::cout << "Number of Paths: " << num_sims << std::endl;
  std::cout << "Underlying:      " << S << std::endl;
  std::cout << "Strike:          " << K << std::endl;
  std::cout << "Risk-Free Rate:  " << r << std::endl;
  std::cout << "Volatility:      " << v << std::endl;
  std::cout << "Maturity:        " << T << std::endl;
  std::cout << "Call Price:      " << call << std::endl;
  std::cout << "Put Price:       " << put << std::endl;

  return 0;
}
The changes I made seemed to increase the run time by about a second, but I'm not entirely sure what I could change to stall the pipeline without adding code. A point in the right direction would be awesome; I appreciate any responses.
The highlights were: the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction. Cowmoogun's comments on the meta thread indicate that it wasn't clear whether compiler optimizations could be part of this, assumed -O0, and that a 17% increase in run time was reasonable.
So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism or something along those lines, but it's not a bad thing that people have delved deeper and learned more.
Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.
Pet*_*des 399
Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the x86 tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.
Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.
Problems with the assignment wording and code:
The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.
I'd love to see an answer that attempted to get a slowdown from re-arranging the expressions to change the dependencies, to reduce ILP just from dependencies (hazards). I haven't attempted it.
Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.
WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (Except for popcnt/lzcnt/tzcnt, which have a false dependency on their destination on Intel CPUs, even though it's write-only; i.e. WAW being handled as a RAW hazard + a write.) For memory ordering, modern CPUs use store queues to delay commit into cache until retirement, also avoiding WAR and WAW hazards.
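As a quick, untested sketch of that popcnt quirk (the function and loop here are mine, not from the question's code), summing popcounts can become a latency chain through the "write-only" destination. Whether it actually bites depends on the compiler's register allocation; compilers often xor-zero the destination to break the dependency, so you'd have to check the asm:
#include <cstddef>
#include <cstdint>
#include <nmmintrin.h> // _mm_popcnt_u64 (build with -msse4.2 / -mpopcnt)

// On many Intel CPUs, popcnt falsely depends on its old output register,
// so reusing one register serializes otherwise-independent counts.
uint64_t popcnt_sum(const uint64_t* data, std::size_t n) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += _mm_popcnt_u64(data[i]);
    return total;
}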
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? has more about register renaming and hiding FMA latency in an FP dot product loop.
The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say "Core i7" when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls and ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)
Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.
Summary:
- Use a std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
- For FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
- Time each iteration with RDTSC, e.g. CPUID / RDTSC, or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
- Fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different/worse asm.
Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your monte-carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slower. (Each thread computes a partial payoff_sum, added at the end.) #omp parallel on that loop would probably be an optimization, not a pessimization.
Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock inc. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.
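An untested sketch of that shared-counter idea (the names and the empty loop body are mine): both threads pull iteration numbers from one static atomic counter, so every lock-prefixed increment ping-pongs the cache line between cores.
#include <atomic>
#include <cstdint>
#include <thread>

static std::atomic<uint64_t> iter{0};  // shared, contended loop counter

// Each thread does "its" iterations, but the counter line bounces between
// the cores' L1 caches on every fetch_add.
void worker(uint64_t num_sims, double* partial_sum) {
    while (iter.fetch_add(1, std::memory_order_relaxed) < num_sims) {
        // ... one Monte-Carlo iteration, accumulating into *partial_sum ...
    }
}

int main() {
    double sum0 = 0.0, sum1 = 0.0;
    std::thread t0(worker, uint64_t(10000000), &sum0);
    std::thread t1(worker, uint64_t(10000000), &sum1);
    t0.join(); t1.join();
}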
Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel tutorial about it, including perf counters to look at.) There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program order (unless you use pause). True sharing without locked instructions is usually a bug. Comparing a non-atomic shared loop counter with the atomic case would be interesting. To really pessimize, keep the shared atomic loop counter, and cause false sharing for some other variable in the same or a different cache line.
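An untested sketch of deliberate false sharing (the struct and names are mine): each thread only touches its own RNG state, but both states sit in one 64-byte line, so every write invalidates the line in the other core's cache.
#include <cstdint>
#include <thread>

// Two per-thread RNG states packed into the same cache line on purpose.
struct OneLine {
    uint32_t rng_state[2];  // thread 0 uses [0], thread 1 uses [1]
};
static OneLine shared_line;

void churn(int tid) {
    uint32_t s = 1;
    for (long i = 0; i < 100000000; ++i) {
        s = s * 1664525u + 1013904223u;   // LCG step (private work)
        shared_line.rng_state[tid] = s;   // false sharing on every store
    }
}

int main() {
    std::thread t0(churn, 0), t1(churn, 1);
    t0.join(); t1.join();
}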
If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have quite long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).
I think this was one of the intended parts of the assignment.
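One untested way to smuggle a branch like that into this code (the helper function is mine): split on the sign of the draw. Since exp(x) == 1/exp(-x), both arms compute the same payoff, but the sign is a coin flip the predictor can't learn.
#include <algorithm>
#include <cmath>

double gaussian_box_muller();  // from the question's code

// Same math as the call-pricing loop body, plus a ~50/50 branch and a slow
// divide on one arm.
double payoff_step(double S_adjust, double K, double v, double T) {
    double gauss_bm = gaussian_box_muller();
    double S_cur;
    if (gauss_bm > 0.0)                                   // unpredictable branch
        S_cur = S_adjust * exp(sqrt(v*v*T) * gauss_bm);
    else
        S_cur = S_adjust / exp(sqrt(v*v*T) * -gauss_bm);  // same value via 1/exp(-x)
    return std::max(S_cur - K, 0.0);
}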
Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).
To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: the loops as written have very short loop-carried dependency chains: just an FP add (3 cycles). Multiple iterations can have their calculations in-flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size = 192 fused-domain uops, scheduler size = 60 unfused-domain uops.) As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput).
The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs. not pipelined at all on Nehalem (7-22c).)
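Applied to this code, an untested one-line version (dividing by 2.0 is exact, like multiplying by 0.5, so the result is bit-identical):
#include <cmath>

// Was: S * exp(T*(r-0.5*v*v)). Same value, but swaps fully-pipelined mulsd
// for partially-pipelined divsd.
double adjust_spot(double S, double r, double v, double T) {
    return S * exp(T * (r - v * v / 2.0));
}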
The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div.)
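An untested sketch of that "obviously proper" version: since sqrt is monotonic on non-negative inputs, sqrt(d) >= 1 accepts and rejects exactly the same samples as d >= 1, so the results don't change; you just pay for a sqrt on every rejection-sampling trip.
#include <cmath>
#include <cstdlib>

// Same RNG as the question's gaussian_box_muller(), but "it's a distance,
// so we should sqrt() it" in the loop condition.
double gaussian_box_muller_slow() {
    double x, y, euclid_sq;
    do {
        x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
        y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
        euclid_sq = x*x + y*y;
    } while (sqrt(euclid_sq) >= 1.0);   // sqrt(d) >= 1  <=>  d >= 1
    return x * sqrt(-2 * log(euclid_sq) / euclid_sq);
}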
As @Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize): exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.
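Paul's rewrite as a drop-in for S_adjust (untested sketch; function name is mine):
#include <cmath>

// Equal on the reals, but FP isn't distributive, so the compiler must keep
// the extra multiply and the divide unless -ffast-math lets it re-associate.
double adjust_spot_distributed(double S, double r, double v, double T) {
    return S * exp(T*r - T*v*v/2.0);   // was: S * exp(T*(r - 0.5*v*v))
}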
If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.
Intrinsics (<immintrin.h>): use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.
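An untested x86-64 sketch of the movnti trick (names are mine): store the running sum with an NT store every iteration and reload it right away. The NT store bypasses/evicts the cache line, so each reload goes out toward memory.
#include <immintrin.h>
#include <cstring>

static double payoff_sum_mem = 0.0;

// _mm_stream_si64 compiles to movnti (needs a 64-bit target).
void accumulate_evicting(double payoff) {
    double sum = payoff_sum_mem + payoff;    // reload misses: the line was evicted
    long long bits;
    std::memcpy(&bits, &sum, sizeof bits);   // type-pun the double's bits
    _mm_stream_si64(reinterpret_cast<long long*>(&payoff_sum_mem), bits);
}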
Use integer shuffles between FP math operations to cause bypass delays.
Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.
exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so it will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks @ZBoson.)
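An untested sketch of that trap (compile with something like gcc -mavx -mno-vzeroupper, since compilers normally insert vzeroupper before calls for you): do 256b math, then call scalar libm while the upper YMM halves are still dirty.
#include <immintrin.h>
#include <cmath>

// 256b multiply dirties the upper lanes; the exp() calls then run libm's
// legacy-SSE code => AVX<->SSE transition stalls on pre-Skylake.
double sum_exp4(__m256d x) {
    alignas(32) double lane[4];
    _mm256_store_pd(lane, _mm256_mul_pd(x, x));
    double total = 0.0;
    for (int i = 0; i < 4; ++i)
        total += exp(lane[i]);   // scalar library call, no vzeroupper first
    return total;
}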
See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.
If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax.
With (inline) asm, you could break the uop cache: a 32B chunk of code that doesn't fit in three 6-uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.
Use self-modifying code to trigger pipeline clears (aka machine-nukes).
LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28-uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.
A common idiom for timing is CPUID (to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up.)
Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or read-modify-write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues.) e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A very good compiler targeting FP math in SSE registers may possibly compile this to an xorps with a constant in another xmm register, but the only way this isn't terrible for x87 is if the compiler realizes that it's negating the value and replaces the next add with a subtract.)
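An untested sketch of that sign-flip (function name is mine): the 1-byte store followed by an 8-byte reload can't be store-forwarded, so every negation eats a store-forwarding stall.
// "Integer is faster than FP", so flip a double's sign with integer XOR.
double negate_slowly(double d) {
    union { double d; unsigned char a[8]; } u;   // the union suggested above
    u.d = d;
    u.a[7] ^= 0x80;   // sign bit lives in the MSB; x86 is little-endian, so a[7]
    return u.d;       // wide load of narrowly-stored data => stall
}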
Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.
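Untested sketch: with volatile, even at -O3 the compiler must store and reload payoff_sum around every addition, adding a memory round-trip to the loop-carried chain (the function name is mine; the math is the question's call-pricing loop).
#include <algorithm>
#include <cmath>

double gaussian_box_muller();  // from the question's code

double call_price_volatile(int num_sims, double S_adjust, double K, double v, double T) {
    volatile double payoff_sum = 0.0;   // every += becomes store + reload
    for (int i = 0; i < num_sims; i++) {
        double gauss_bm = gaussian_box_muller();
        payoff_sum = payoff_sum + std::max(S_adjust * exp(sqrt(v*v*T)*gauss_bm) - K, 0.0);
    }
    return payoff_sum;
}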
Replace local vars with members of a big struct, so you can control the memory layout.
Use arrays in the struct for padding (and storing random numbers, to justify their existence).
Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.
Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false-positives that prevent loads from starting early. Probably they only check bits below the page offset, so the check can start before the TLB has translated the high bits from a virtual page to a physical page. As well as Agner's guide, see …
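An untested layout sketch combining the last two points (struct and names are mine): every hot variable lands in the same L1 set and at the same offset within a page, so loads also alias stores to the other members.
// Each member is exactly 4096 bytes from the next: same L1 set, same
// page offset => 4k-aliasing false dependencies between loads and stores.
struct PessimalLayout {
    double S_cur;
    char pad0[4096 - sizeof(double)];   // padding could "store random numbers"
    double payoff_sum;
    char pad1[4096 - sizeof(double)];
    double gauss_bm;
};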
Jes*_*uhl 34
A few things you can do to make things perform as badly as possible:
Compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.
Use std::atomic variables everywhere. This will make them very expensive, since the compiler is forced to insert memory barriers all over the place. And it's something an incompetent person might plausibly do to "ensure thread safety".
Make sure to access memory in the worst possible way for the prefetcher to predict (column-major vs. row-major).
To make your variables extra expensive, you can make sure they all have "dynamic storage duration" (heap-allocated) by allocating them with new, rather than letting them have "automatic storage duration" (stack-allocated).
Make sure that all the memory you allocate is very oddly aligned, and by all means avoid allocating huge pages, since doing so would make the TLB far too efficient.
Whatever you do, don't build your code with the compiler's optimizer enabled. And make sure to enable the most expressive debug symbols you can (this won't make the code run slower, but it will waste some extra disk space).
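An untested sketch combining a few of these (build with the equivalent of -m32 -O0; everything here is illustrative, not from the question's code): every scalar is a separate heap allocation and atomic, all in the name of "thread safety".
#include <atomic>

int main() {
    // Heap-allocated atomics: extra indirection, poor locality, and full
    // barriers around every access.
    std::atomic<double>* payoff_sum = new std::atomic<double>(0.0);
    std::atomic<int>*    i          = new std::atomic<int>(0);
    for (; i->load() < 10000000; i->store(i->load() + 1)) {
        double x = payoff_sum->load();
        payoff_sum->store(x + 1.0);   // non-atomic RMW, but "atomic" variables!
    }
    delete payoff_sum; delete i;
}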
Note: this answer basically just summarizes my comments that @Peter Cordes has already incorporated into his very good answer. Suggest he gets your upvote if you only have one to spare :)
Mic*_*has 10
You can use long double for computation. On x86 it should be the 80-bit format. Only the legacy x87 FPU supports this.
A few shortcomings of the x87 FPU: it's stack-based, which is awkward for a superscalar, pipelined core; it has no SIMD support; and 80-bit loads and stores are slow.
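A minimal untested sketch of the question's RNG helper converted to long double (the function name is mine): all the math, and the libm calls, now run through x87.
#include <cmath>
#include <cstdlib>

// long double keeps everything in 80-bit x87 instead of SSE2 scalar double.
long double gaussian_box_muller_x87() {
    long double x, y, euclid_sq;
    do {
        x = 2.0L * std::rand() / static_cast<long double>(RAND_MAX) - 1;
        y = 2.0L * std::rand() / static_cast<long double>(RAND_MAX) - 1;
        euclid_sq = x*x + y*y;
    } while (euclid_sq >= 1.0L);
    return x * std::sqrt(-2 * std::log(euclid_sq) / euclid_sq);
}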
A late answer, but I don't feel we have abused linked lists and the TLB enough.
Use mmap to allocate your nodes, so that you mostly use the most-significant bits of the address. This should result in long TLB lookup chains: a page is 12 bits, leaving 52 bits for the translation, i.e. about 5 levels that must be traversed each time. With luck, they must go to memory each time for the 5-level lookup, plus 1 memory access to get to your node; the top level will most likely be in cache somewhere, so we can hope for 5x memory accesses. Place the nodes so that they straddle the worst boundary, so that reading the next pointer causes another 3-4 translation lookups. This might also totally wreck the cache due to the massive number of translation lookups. Also, the size of the page tables might cause most of the user data to be paged out to disk for extra time.
When reading from the singly linked list, make sure to read from the start of the list each time, to cause maximum delay in reading a single number.
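An untested Linux/x86-64 sketch of the mmap idea (the Node struct, hash constant, and address mask are mine): ask mmap for each node at a hint address scattered across the 47-bit user address space, so walking node->next hops between pages whose translations share as few page-table levels (and TLB entries) as possible. Note the kernel is free to ignore hints without MAP_FIXED.
#include <sys/mman.h>
#include <cstdint>

struct Node { double value; Node* next; };

// One page per node, at a pseudo-random, page-aligned user-space address.
Node* alloc_scattered_node(uint64_t i) {
    uint64_t hint = (i * 0x9E3779B97F4A7C15ull) & 0x00007FFFFFFFF000ull;
    void* p = mmap(reinterpret_cast<void*>(hint), sizeof(Node),
                   PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : static_cast<Node*>(p);
}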