Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs

Cow*_*gun 313 c++ optimization x86 intel cpu-architecture

I've been racking my brain for a week trying to complete this assignment, and I'm hoping someone here can lead me toward the right path. Let me start with the instructor's instructions:

Your assignment is the opposite of our first lab assignment, which was to optimize a prime-number program. Your purpose in this assignment is to pessimize the program, i.e. make it run slower. Both of these are CPU-intensive programs. They take a few seconds to run on our lab computers. You may not change the algorithm.

To de-optimize the program, use your knowledge of how the Intel i7 pipeline operates. Imagine ways to re-order instruction paths to introduce WAR, RAW, and other hazards. Think of ways to minimize the effectiveness of the cache. Be diabolically incompetent.

The assignment gave a choice of Whetstone or Monte-Carlo programs. The cache-effectiveness comments are mostly only applicable to Whetstone, but I chose the Monte-Carlo simulation program:

// Un-modified baseline for pessimization, as given in the assignment
#include <algorithm>    // Needed for the "max" function
#include <cmath>
#include <iostream>

// A simple implementation of the Box-Muller algorithm, used to generate
// gaussian random numbers - necessary for the Monte Carlo method below
// Note that C++11 actually provides std::normal_distribution<> in 
// the <random> library, which can be used instead of this function
double gaussian_box_muller() {
  double x = 0.0;
  double y = 0.0;
  double euclid_sq = 0.0;

  // Continue generating two uniform random variables
  // until the square of their "euclidean distance" 
  // is less than unity
  do {
    x = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
    y = 2.0 * rand() / static_cast<double>(RAND_MAX)-1;
    euclid_sq = x*x + y*y;
  } while (euclid_sq >= 1.0);

  return x*sqrt(-2*log(euclid_sq)/euclid_sq);
}

// Pricing a European vanilla call option with a Monte Carlo method
double monte_carlo_call_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(S_cur - K, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}

// Pricing a European vanilla put option with a Monte Carlo method
double monte_carlo_put_price(const int& num_sims, const double& S, const double& K, const double& r, const double& v, const double& T) {
  double S_adjust = S * exp(T*(r-0.5*v*v));
  double S_cur = 0.0;
  double payoff_sum = 0.0;

  for (int i=0; i<num_sims; i++) {
    double gauss_bm = gaussian_box_muller();
    S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm);
    payoff_sum += std::max(K - S_cur, 0.0);
  }

  return (payoff_sum / static_cast<double>(num_sims)) * exp(-r*T);
}

int main(int argc, char **argv) {
  // First we create the parameter list                                                                               
  int num_sims = 10000000;   // Number of simulated asset paths                                                       
  double S = 100.0;  // Option price                                                                                  
  double K = 100.0;  // Strike price                                                                                  
  double r = 0.05;   // Risk-free rate (5%)                                                                           
  double v = 0.2;    // Volatility of the underlying (20%)                                                            
  double T = 1.0;    // One year until expiry                                                                         

  // Then we calculate the call/put values via Monte Carlo                                                                          
  double call = monte_carlo_call_price(num_sims, S, K, r, v, T);
  double put = monte_carlo_put_price(num_sims, S, K, r, v, T);

  // Finally we output the parameters and prices                                                                      
  std::cout << "Number of Paths: " << num_sims << std::endl;
  std::cout << "Underlying:      " << S << std::endl;
  std::cout << "Strike:          " << K << std::endl;
  std::cout << "Risk-Free Rate:  " << r << std::endl;
  std::cout << "Volatility:      " << v << std::endl;
  std::cout << "Maturity:        " << T << std::endl;

  std::cout << "Call Price:      " << call << std::endl;
  std::cout << "Put Price:       " << put << std::endl;

  return 0;
}

The changes I made seem to have increased the run time by about a second, but I'm not entirely sure what I can change to stall the pipeline without adding code. A point in the right direction would be awesome; I appreciate any responses.


Update: the professor who set this assignment has posted some details.

The highlights are:

  • It's a second-semester architecture class at a community college (using the Hennessy and Patterson textbook).
  • The lab computers have Haswell CPUs.
  • The students have been exposed to the CPUID instruction and how to determine cache size, as well as intrinsics and the CLFLUSH instruction.
  • Any compiler options are allowed, and so is inline asm.
  • Writing your own square-root algorithm was announced as being beyond the pale.

Cowmoogun's comments on the meta thread indicate that it wasn't clear whether compiler optimizations could be part of this; the professor assumed -O0, and said a 17% increase in run time was reasonable.

So it sounds like the goal of the assignment was to get students to re-order the existing work to reduce instruction-level parallelism, or things like that, but it's not a bad thing that people have dug deeper and learned more.


Keep in mind that this is a computer-architecture question, not a question about how to make C++ slow in general.

Pet*_*des 399

Important background reading: Agner Fog's microarch pdf, and probably also Ulrich Drepper's What Every Programmer Should Know About Memory. See also the other links in the tag wiki, especially Intel's optimization manuals, and David Kanter's analysis of the Haswell microarchitecture, with diagrams.

Very cool assignment; much better than the ones I've seen where students were asked to optimize some code for gcc -O0, learning a bunch of tricks that don't matter in real code. In this case, you're being asked to learn about the CPU pipeline and use that to guide your de-optimization efforts, not just blind guessing. The most fun part of this one is justifying each pessimization with "diabolical incompetence", not intentional malice.


Problems with the assignment wording and code:

The uarch-specific options for this code are limited. It doesn't use any arrays, and much of the cost is calls to exp/log library functions. There isn't an obvious way to have more or less instruction-level parallelism, and the loop-carried dependency chain is very short.

I'd love to see an answer that attempted to get a slowdown from re-arranging the expressions to change the dependencies, to reduce ILP just from dependencies (hazards). I haven't attempted it.

Intel Sandybridge-family CPUs are aggressive out-of-order designs that spend lots of transistors and power to find parallelism and avoid hazards (dependencies) that would trouble a classic RISC in-order pipeline. Usually the only traditional hazards that slow it down are RAW "true" dependencies that cause throughput to be limited by latency.

WAR and WAW hazards for registers are pretty much not an issue, thanks to register renaming. (Except for popcnt/lzcnt/tzcnt, which have a false dependency on their destination on Intel CPUs, even though the destination is write-only; i.e. WAW being handled as a RAW hazard + a write.) For memory ordering, modern CPUs use store queues to delay commit into cache until retirement, also avoiding WAR and WAW hazards.

Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? has more about register renaming and hiding FMA latency in an FP dot product loop.


The "i7" brand-name was introduced with Nehalem (successor to Core2), and some Intel manuals even say "Core i7" when they seem to mean Nehalem, but they kept the "i7" branding for Sandybridge and later microarchitectures. SnB is when the P6-family evolved into a new species, the SnB-family. In many ways, Nehalem has more in common with Pentium III than with Sandybridge (e.g. register read stalls and ROB-read stalls don't happen on SnB, because it changed to using a physical register file. Also a uop cache and a different internal uop format). The term "i7 architecture" is not useful, because it makes little sense to group the SnB-family with Nehalem but not Core2. (Nehalem did introduce the shared inclusive L3 cache architecture for connecting multiple cores together, though. And also integrated GPUs. So chip-level, the naming makes more sense.)


Summary of the good ideas that diabolical incompetence can justify

Even the diabolically incompetent are unlikely to add obviously useless work or an infinite loop, and making a mess with C++/Boost classes is beyond the scope of the assignment.

  • Multi-thread with a single shared std::atomic<uint64_t> loop counter, so the right total number of iterations happen. Atomic uint64_t is especially bad with -m32 -march=i586. For bonus points, arrange for it to be misaligned, and crossing a page boundary with an uneven split (not 4:4).
  • False sharing of some other non-atomic variable -> memory-order mis-speculation pipeline clears, as well as extra cache misses.
  • Instead of using - on FP variables, XOR the high byte with 0x80 to flip the sign bit, causing store-forwarding stalls.
  • Time each iteration independently, with something even heavier than RDTSC, e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
  • Change multiplies by constants to divides by their reciprocals ("for ease of reading"). div is slow and not fully pipelined.
  • Vectorize the multiply/sqrt with AVX (SIMD), but fail to use vzeroupper before calls to scalar math-library exp() and log() functions, causing AVX<->SSE transition stalls.
  • Store the RNG output in a linked list, or in arrays which you traverse out of order. Same for the result of each iteration, and sum at the end.

Also covered in this answer but excluded from the summary: suggestions that would be just as slow on a non-pipelined CPU, or that don't seem to be justifiable even with diabolical incompetence. e.g. many gimp-the-compiler ideas that produce obviously different/worse asm.


Multi-thread badly

Maybe use OpenMP to multi-thread loops with very few iterations, with way more overhead than speed gain. Your Monte-Carlo code has enough parallelism to actually get a speedup, though, esp. if we succeed at making each iteration slower (each thread computing a partial payoff_sum, added at the end). #omp parallel on that loop would probably be an optimization, not a pessimization.

Multi-thread but force both threads to share the same loop counter (with atomic increments so the total number of iterations is correct). This seems diabolically logical. This means using a static variable as a loop counter. This justifies use of atomic for loop counters, and creates actual cache-line ping-ponging (as long as the threads don't run on the same physical core with hyperthreading; that might not be as slow). Anyway, this is much slower than the un-contended case for lock inc. And lock cmpxchg8b to atomically increment a contended uint64_t on a 32bit system will have to retry in a loop instead of having the hardware arbitrate an atomic inc.
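
For concreteness, a minimal sketch of the shared-counter version (illustrative code; the worker/partial-sum names and the two-thread split are my own):

#include <atomic>
#include <cstdint>
#include <thread>

std::atomic<uint64_t> shared_i{0};   // one counter, fought over by both threads

void worker(double* partial_sum, uint64_t num_sims) {
  // Contended lock-ed RMW on every iteration; with -m32 -march=i586 the
  // 64-bit fetch_add becomes a lock cmpxchg8b retry loop, not one lock xadd.
  while (shared_i.fetch_add(1, std::memory_order_relaxed) < num_sims)
    *partial_sum += 1.0;             // stand-in for one Monte-Carlo iteration
}

int main() {
  double sum_a = 0.0, sum_b = 0.0;
  std::thread t1(worker, &sum_a, uint64_t(10000000));
  std::thread t2(worker, &sum_b, uint64_t(10000000));
  t1.join(); t2.join();
  return (sum_a + sum_b) > 0.0 ? 0 : 1;
}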

Also create false sharing, where multiple threads keep their private data (e.g. RNG state) in different bytes of the same cache line. (Intel has a tutorial about it, including perf counters to look at.) There's a microarchitecture-specific aspect to this: Intel CPUs speculate on memory mis-ordering not happening, and there's a memory-order machine-clear perf event to detect this, at least on P4. The penalty might not be as large on Haswell. As that link points out, a locked instruction assumes this will happen, avoiding mis-speculation. A normal load speculates that other cores won't invalidate a cache line between when the load executes and when it retires in program order (unless you use pause). True sharing without locked instructions is usually a bug. Comparing a non-atomic shared loop counter with the atomic case would be interesting. To really pessimize, keep the shared atomic loop counter, and cause false sharing in the same or a different cache line for some other variable.
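
A layout sketch of that false sharing (hypothetical names; pretend each thread got its own tiny PRNG instead of rand()):

#include <cstdint>

// Each thread gets its "own" slot, but all slots share one 64-byte line,
// so every write invalidates the line in the other core's L1d. The fix
// (which the diabolically incompetent version omits) would be alignas(64)
// padding per thread.
struct SharedRngStates {
  uint32_t state[16];   // thread i uses state[i]; all 16 fit in one line
};

inline uint32_t xorshift32(uint32_t& s) {  // tiny PRNG stand-in for rand()
  s ^= s << 13; s ^= s >> 17; s ^= s << 5;
  return s;
}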


Random uarch-specific ideas:

If you can introduce any unpredictable branches, that will pessimize the code substantially. Modern x86 CPUs have fairly long pipelines, so a mispredict costs ~15 cycles (when running from the uop cache).
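
For instance (a sketch; a good compiler can merge the two identical arms at -O2, so treat this as an -O0 illustration):

#include <algorithm>
#include <cstdlib>

// Branch on a random bit so the predictor is wrong about half the time.
double accumulate_payoff(double payoff_sum, double S_cur, double K) {
  if (std::rand() & 1)
    payoff_sum += std::max(S_cur - K, 0.0);
  else
    payoff_sum += std::max(0.0, S_cur - K);   // same value, other code path
  return payoff_sum;
}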


Dependency chains:

I think this was one of the intended parts of the assignment.

Defeat the CPU's ability to exploit instruction-level parallelism by choosing an order of operations that has one long dependency chain instead of multiple short dependency chains. Compilers aren't allowed to change the order of operations for FP calculations unless you use -ffast-math, because that can change the results (as discussed below).

To really make this effective, increase the length of a loop-carried dependency chain. Nothing leaps out as obvious, though: the loops as written have very short loop-carried dependency chains: just an FP add (3 cycles). Multiple iterations can have their calculations in flight at once, because they can start well before the payoff_sum += at the end of the previous iteration. (log() and exp take many instructions, but not a lot more than Haswell's out-of-order window for finding parallelism: ROB size = 192 fused-domain uops, scheduler size = 60 unfused-domain uops.) As soon as execution of the current iteration progresses far enough to make room for instructions from the next iteration to issue, any parts of it that have their inputs ready (i.e. an independent/separate dep chain) can start executing when older instructions leave the execution units free (e.g. because they're bottlenecked on latency, not throughput).

The RNG state will almost certainly be a longer loop-carried dependency chain than the addps.
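
One way to manufacture a long chain anyway, sketched as a rewrite of the call-price loop ("reusing the running sum for numerical stability" being the diabolically incompetent excuse):

for (int i = 0; i < num_sims; i++) {
  double gauss_bm = gaussian_box_muller();
  // payoff_sum * 1e-30 is ~0 mathematically, but it creates a RAW hazard:
  // this exp() can't start until the previous iteration's payoff_sum +=
  // has executed, serializing the whole loop onto one dependency chain.
  S_cur = S_adjust * exp(sqrt(v*v*T)*gauss_bm + payoff_sum * 1e-30);
  payoff_sum += std::max(S_cur - K, 0.0);
}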


Use slower/more FP operations (esp. more division):

Divide by 2.0 instead of multiplying by 0.5, and so on. FP multiply is heavily pipelined in Intel designs, and has one per 0.5c throughput on Haswell and later. FP divsd/divpd is only partially pipelined. (Although Skylake has an impressive one per 4c throughput for divpd xmm, with 13-14c latency, vs not pipelined at all on Nehalem (7-22c)).

The do { ...; euclid_sq = x*x + y*y; } while (euclid_sq >= 1.0); is clearly testing for a distance, so clearly it would be proper to sqrt() it. :P (sqrt is even slower than div).
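
i.e. the rejection loop becomes (sketch):

do {
  x = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
  y = 2.0 * rand() / static_cast<double>(RAND_MAX) - 1;
  euclid_sq = x*x + y*y;
} while (sqrt(euclid_sq) >= 1.0);  // "it's a distance, so take the square root"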

As @Paul Clayton suggests, rewriting expressions with associative/distributive equivalents can introduce more work (as long as you don't use -ffast-math to allow the compiler to re-optimize). (exp(T*(r-0.5*v*v)) could become exp(T*r - T*v*v/2.0). Note that while math on real numbers is associative, floating point math is not, even without considering overflow/NaN (which is why -ffast-math isn't on by default). See Paul's comment for a very hairy nested pow() suggestion.

If you can scale the calculations down to very small numbers, then FP math ops take ~120 extra cycles to trap to microcode when an operation on two normal numbers produces a denormal. See Agner Fog's microarch pdf for the exact numbers and details. This is unlikely since you have a lot of multiplies, so the scale factor would be squared and underflow all the way to 0.0. I don't see any way to justify the necessary scaling with incompetence (even diabolical), only intentional malice.


If you can use intrinsics (<immintrin.h>)

Use movnti to evict your data from cache. Diabolical: it's new and weakly-ordered, so that should let the CPU run it faster, right? Or see that linked question for a case where someone was in danger of doing exactly this (for scattered writes where only some of the locations were hot). clflush is probably impossible without malice.
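
A sketch with intrinsics (my helper name; _mm_stream_si64 is the movnti intrinsic, 64-bit mode only):

#include <immintrin.h>
#include <cstring>

// "Non-temporal stores are newer and weakly-ordered, so the CPU must run
// them faster, right?" movnti writes around the cache, so the next read of
// *dst misses all the way to DRAM.
inline void store_and_evict(double* dst, double val) {
  long long bits;
  std::memcpy(&bits, &val, sizeof bits);   // type-pun without UB
  _mm_stream_si64(reinterpret_cast<long long*>(dst), bits);
}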

Use integer shuffles between FP math operations to cause bypass delays.

Mixing SSE and AVX instructions without proper use of vzeroupper causes large stalls in pre-Skylake (and a different penalty in Skylake). Even without that, vectorizing badly can be worse than scalar (more cycles spent shuffling data into/out of vectors than saved by doing the add/sub/mul/div/sqrt operations for 4 Monte-Carlo iterations at once, with 256b vectors). add/sub/mul execution units are fully pipelined and full-width, but div and sqrt on 256b vectors aren't as fast as on 128b vectors (or scalars), so the speedup isn't dramatic for double.

exp() and log() don't have hardware support, so that part would require extracting vector elements back to scalar and calling the library function separately, then shuffling the results back into a vector. libm is typically compiled to only use SSE2, so will use the legacy-SSE encodings of scalar math instructions. If your code uses 256b vectors and calls exp without doing a vzeroupper first, then you stall. After returning, an AVX-128 instruction like vmovsd to set up the next vector element as an arg for exp will also stall. And then exp() will stall again when it runs an SSE instruction. This is exactly what happened in this question, causing a 10x slowdown. (Thanks @ZBoson).
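
A sketch of the pattern (illustrative; note that gcc normally emits vzeroupper before calls automatically, so you'd build with -mavx -mno-vzeroupper to preserve the incompetence):

#include <immintrin.h>
#include <cmath>

// 256-bit ops leave the upper YMM halves dirty; the SSE2-only libm exp()
// then pays an AVX<->SSE transition penalty on every call (pre-Skylake).
double avx_then_libm_exp(const double* x4) {
  __m256d v = _mm256_loadu_pd(x4);
  v = _mm256_mul_pd(v, v);                 // uppers now dirty
  // deliberately no _mm256_zeroupper() before the call:
  return std::exp(_mm256_cvtsd_f64(v));
}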

See also Nathan Kurz's experiments with Intel's math lib vs. glibc for this code. Future glibc will come with vectorized implementations of exp() and so on.


If targeting pre-IvB, or esp. Nehalem, try to get gcc to cause partial-register stalls with 16bit or 8bit operations followed by 32bit or 64bit operations. In most cases, gcc will use movzx after an 8 or 16bit operation, but here's a case where gcc modifies ah and then reads ax


With (inline) asm:

With (inline) asm, you could break the uop cache: A 32B chunk of code that doesn't fit in three 6uop cache lines forces a switch from the uop cache to the decoders. An incompetent ALIGN using many single-byte nops instead of a couple long nops on a branch target inside the inner loop might do the trick. Or put the alignment padding after the label, instead of before. :P This only matters if the frontend is a bottleneck, which it won't be if we succeeded at pessimizing the rest of the code.
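
A sketch with GNU inline asm (the .rept count is arbitrary, as is the trivial loop body):

// More than 18 uops in an aligned 32-byte chunk can't fit in its 3 uop-cache
// lines (6 uops each), so the loop body falls back to the legacy decoders
// every iteration (assuming the frontend is the bottleneck at all).
double nop_padded_loop(int num_sims) {
  double payoff_sum = 0.0;
  for (int i = 0; i < num_sims; i++) {
    asm volatile(".rept 24\n\tnop\n\t.endr");   // 24 single-byte 0x90 nops
    payoff_sum += 1.0;                          // stand-in for the real body
  }
  return payoff_sum;
}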

Use self-modifying code to trigger pipeline clears (aka machine-nukes).

LCP stalls from 16bit instructions with immediates too large to fit in 8 bits are unlikely to be useful. The uop cache on SnB and later means you only pay the decode penalty once. On Nehalem (the first i7), it might work for a loop that doesn't fit in the 28 uop loop buffer. gcc will sometimes generate such instructions, even with -mtune=intel and when it could have used a 32bit instruction.


A common idiom for timing is CPUID(to serialize) then RDTSC. Time every iteration separately with a CPUID/RDTSC to make sure the RDTSC isn't reordered with earlier instructions, which will slow things down a lot. (In real life, the smart way to time is to time all the iterations together, instead of timing each separately and adding them up).
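
A sketch of the serialized per-iteration timer (GNU inline asm, 64-bit):

#include <cstdint>

// cpuid drains the pipeline (everything earlier must complete) before rdtsc
// samples the TSC: the right thing once around the whole benchmark, and
// exactly wrong twice per iteration.
static inline uint64_t serialized_tsc() {
  uint32_t lo, hi;
  asm volatile("cpuid\n\trdtsc"
               : "=a"(lo), "=d"(hi)
               : "a"(0)               // cpuid leaf 0
               : "rbx", "rcx");       // cpuid also clobbers ebx/ecx
  return (uint64_t(hi) << 32) | lo;
}

Call it before and after every single iteration and accumulate the deltas, instead of timing the whole loop once.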


Cause lots of cache misses and other memory slowdowns

Use a union { double d; char a[8]; } for some of your variables. Cause a store-forwarding stall by doing a narrow store (or Read-Modify-Write) to just one of the bytes. (That wiki article also covers a lot of other microarchitectural stuff for load/store queues). e.g. flip the sign of a double using XOR 0x80 on just the high byte, instead of a - operator. The diabolically incompetent developer may have heard that FP is slower than integer, and thus try to do as much as possible using integer ops. (A very good compiler targeting FP math in SSE registers may possibly compile this to an xorps with a constant in another xmm register, but the only way this isn't terrible for x87 is if the compiler realizes that it's negating the value and replaces the next add with a subtract.)
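
A minimal sketch of that union trick (type-punning through a union is technically UB in ISO C++, but gcc and clang define it):

union SlowDouble { double d; unsigned char a[8]; };

// Byte store, then 8-byte reload: the wide load can't be store-forwarded
// from the narrow store, so it stalls (~10+ cycles) until the store commits.
inline double negate_the_hard_way(double x) {
  SlowDouble u;
  u.d = x;
  u.a[7] ^= 0x80;   // the sign bit lives in the top byte on little-endian x86
  return u.d;
}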


Use volatile if you're compiling with -O3 and not using std::atomic, to force the compiler to actually store/reload all over the place. Global variables (instead of locals) will also force some stores/reloads, but the C++ memory model's weak ordering doesn't require the compiler to spill/reload to memory all the time.

Replace local vars with members of a big struct, so you can control the memory layout.

Use arrays in the struct for padding (and storing random numbers, to justify their existence).

Choose your memory layout so everything goes into a different line in the same "set" in the L1 cache. It's only 8-way associative, i.e. each set has 8 "ways". Cache lines are 64B.

Even better, put things exactly 4096B apart, since loads have a false dependency on stores to different pages but with the same offset within a page. Aggressive out-of-order CPUs use Memory Disambiguation to figure out when loads and stores can be reordered without changing the results, and Intel's implementation has false positives that prevent loads from starting early. Probably they only check bits below the page offset, so the check can start before the TLB has translated the high bits from a virtual page to a physical page; Agner's guide covers this in more detail.
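
A layout sketch (the field names are arbitrary):

// Hot variables spaced exactly 4096 bytes apart: a load from one field
// falsely aliases recent stores to the others (same offset within a page),
// and they all map to the same set in L1d.
struct PessimalLayout {
  double payoff_sum;
  char pad1[4096 - sizeof(double)];
  double S_cur;
  char pad2[4096 - sizeof(double)];
  double gauss_bm;
};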

  • Some of these suggestions are so diabolically incompetent that I had to talk to the professor to see whether the now 7-minute run time was too much for him to want to sit through and verify the output. Still working through this; it's probably the most fun I've had with a project. (17 upvotes)
  • @JesperJuhl: Yeah, I'd buy that justification. "Diabolically incompetent" is such a wonderful phrase :) (10 upvotes)
  • What? No mutexes? Two million threads running simultaneously, with a mutex protecting each individual computation (just in case!), would bring the fastest supercomputer on the planet to its knees. That said, I do love this diabolically incompetent answer. (4 upvotes)
  • Changing multiplication by a constant into division by its reciprocal may modestly reduce performance (at least if one isn't trying to outsmart -O3 -ffast-math). Similarly, using associativity to increase the work: exp(T*(r-0.5*v*v)) becomes exp(T*r - T*v*v/2.0); exp(sqrt(v*v*T)*gauss_bm) becomes exp(sqrt(v)*sqrt(v)*sqrt(T)*gauss_bm). Associativity (and generalization) could also transform exp(T*r - T*v*v/2.0) into a hairy nest of pow() calls on e_value [or something like that]. Such math tricks don't really count as microarchitectural de-optimizations. (2 upvotes)
  • I really appreciate this response, and Agner Fog's guide has been a huge help. I'm going to let this digest and start working on it this afternoon. This has probably been the most useful assignment in terms of actually understanding what's going on. (2 upvotes)
  • @Nicholas: I basically went down the list of possible stalls (from Agner Fog's microarch pdf) and thought about how to diabolically justify introducing each of them into this code. (2 upvotes)
  • This is the greatest answer I've ever seen. (2 upvotes)
  • This post now comes up first when I google "diabolical incompetence" :) (2 upvotes)

Jes*_*uhl 34

You can do a few things to make things perform as badly as possible:

  • Compile the code for the i386 architecture. This will prevent the use of SSE and newer instructions and force the use of the x87 FPU.

  • Use std::atomic variables everywhere. This will make them very expensive, since the compiler is forced to insert memory barriers all over the place. And it's something an incompetent person might plausibly do to "ensure thread safety".

  • Make sure to access memory in the worst possible way for the prefetcher to predict (column-major vs. row-major).

  • To make your variables extra expensive, you can give them all "dynamic storage duration" (heap-allocated) by allocating them with new, rather than letting them have "automatic storage duration" (stack-allocated).

  • Make sure that all the memory you allocate is very oddly aligned, and by all means avoid allocating huge pages, since doing so would be far too TLB-efficient.

  • Whatever you do, don't build your code with the compiler's optimizer enabled. And make sure to enable the most expressive debug symbols you can (it won't make the code run slower, but it will waste some extra disk space). A sketch combining several of these follows below.
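
A minimal sketch combining a few of these (my combination, not code from this answer; build it per the first bullet with something like -m32 -O0):

#include <atomic>

int main() {
  // Heap-allocate everything, and make the single-threaded loop counter
  // std::atomic "to ensure thread safety": a fenced RMW on every iteration.
  std::atomic<long>* i = new std::atomic<long>(0);
  double* sum = new double(0.0);
  while (i->fetch_add(1) < 10000000)
    *sum += 1.0;
  double result = *sum;
  delete i;
  delete sum;
  return result > 0.0 ? 0 : 1;
}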

Note: this answer basically just summarizes my comments, which @Peter Cordes has already incorporated into his very good answer. Suggest he gets your upvote if you only have one to spare :)

  • My main objection to some of these is the wording of the question: To de-optimize the program, use your knowledge of how the Intel i7 pipeline operates. I don't feel there's anything uarch-specific about x87, or std::atomic, or an extra level of indirection from dynamic allocation. They'll be slow on an Atom or a K8, too. Still upvoted, but that's why I'm resisting some of your suggestions. (9 upvotes)

Mic*_*has 10

You can use long double for the computation. On x86 it should be the 80-bit format. Only the legacy x87 FPU supports it.

The x87 FPU has a few shortcomings:

1. Lack of SIMD; it may need more instructions.
2. Stack-based, which is problematic for superscalar and pipelined architectures.
3. A separate and quite small register set, which may require more conversions from other registers and more memory operations.
4. On the Core i7 there are 3 ports for SSE and only 2 for x87, so the processor can execute fewer instructions in parallel.

  • For scalar math, the x87 math instructions themselves are only slightly slower. Storing/loading 10-byte operands is significantly slower, though, and x87's stack-based design tends to require extra instructions (like fxch). With -ffast-math, a good compiler might vectorize the Monte-Carlo loops, and x87 would prevent that. (3 upvotes)
  • Also note that the Windows x86-64 ABI has a 64-bit long double, i.e. it's still just double. The SysV ABI does use 80-bit long double, though. Also, re: point 2: register renaming exposes the parallelism in the stack registers. Stack-based architectures need some extra instructions, like fxch, esp. when interleaving parallel computations. So it's more that it's hard to express parallelism without memory round-trips than that the uarch has trouble exploiting what's there. You don't need more conversions from other regs, though; I'm not sure what you mean by that. (2 upvotes)

Sur*_*urt 6

A late answer, but I don't feel we've abused linked lists and the TLB enough.

Use mmap to allocate your nodes such that you mostly use the most-significant bits of the address. This should result in long TLB lookup chains: a page is 12 bits, leaving 52 bits for the translation, i.e. about 5 levels that have to be traversed each time. With luck, they must go to memory each time for the 5-level lookup, plus 1 memory access to reach your node; the top level will most likely still be in cache somewhere, so we can hope for 5x memory accesses per node. Place each node so that it straddles the worst boundary, so that reading the next pointer causes another 3-4 translation lookups. This might also totally wreck the cache, due to the massive number of translation lookups. Also, the size of the virtual tables might cause most of the user data to be paged out to disk for extra time.

When reading from the singly-linked list, make sure to read from the start of the list each time, to cause maximum delay in reading a single number.
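
A sketch of that allocation pattern (Linux-specific; the hint addresses are arbitrary, and mmap is free to ignore them):

#include <sys/mman.h>
#include <cstdint>

struct Node { double value; Node* next; };

// Ask mmap for pages scattered across the 48-bit address space, so each
// node sits under a different branch of the page-table tree and every
// 'next' hop walks a mostly-cold multi-level translation.
Node* alloc_far_node(unsigned idx) {
  void* hint = reinterpret_cast<void*>(0x10000000000ull * (idx + 1));
  void* p = mmap(hint, sizeof(Node), PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  return p == MAP_FAILED ? nullptr : static_cast<Node*>(p);
}

// Read element n by walking from the head every time: O(n^2) dependent
// pointer chases over the whole traversal, each a TLB and cache miss.
double read_nth(const Node* head, long n) {
  const Node* p = head;
  for (long i = 0; i < n; i++) p = p->next;
  return p->value;
}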

