Prefetch instructions

Kar*_*uru 19 embedded assembly arm mips prefetch

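It looks like the general logic for using prefetch is that a prefetch can be added as long as the code has other work to keep it busy until the prefetch instruction completes. However, it seems that using too many prefetch instructions hurts system performance. I find that we first need working code without any prefetch instructions. Later we need to try various combinations of prefetch instructions at various places in the code, and profile to determine which locations can actually improve because of prefetch. Is there a better way to determine the exact locations where prefetch instructions should be used?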

Pau*_*l R 17

In the majority of cases prefetch instructions are of little or no benefit, and can even be counter-productive in some cases. Most modern CPUs have an automatic prefetch mechanism which works well enough that adding software prefetch hints achieves little, or even interferes with automatic prefetch, and can actually reduce performance.

In some rare cases, such as when you are streaming large blocks of data on which you are doing very little actual processing, you may manage to hide some latency with software-initiated prefetching, but it's very hard to get it right - you need to start the prefetch several hundred cycles before you are going to be using the data - do it too late and you still get a cache miss, do it too early and your data may get evicted from cache before you are ready to use it. Often this will put the prefetch in some unrelated part of the code, which is bad for modularity and software maintenance. Worse still, if your architecture changes (new CPU, different clock speed, etc), such that DRAM access latency increases or decreases, you may need to move your prefetch instructions to another part of the code to keep them effective.
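
As a rough illustration of the timing trade-off above, here is a minimal sketch of how a prefetch distance might be estimated for a loop. The function name and both inputs are hypothetical; the miss latency and per-iteration cost are assumptions you would have to measure on your own target, not figures from this answer.

```c
#include <stddef.h>

/* Rough prefetch distance in loop iterations: how far ahead a prefetch
 * would need to be issued so the data arrives shortly before it is used.
 * Both inputs are assumed, measured-by-you values (hypothetical names). */
static size_t prefetch_distance(size_t miss_latency_cycles,
                                size_t cycles_per_iteration)
{
    if (cycles_per_iteration == 0)
        return 0; /* no meaningful answer without a real per-iteration cost */

    /* Round up, so the estimate errs on the side of issuing the prefetch
     * slightly early rather than too late. */
    return (miss_latency_cycles + cycles_per_iteration - 1)
           / cycles_per_iteration;
}
```

For example, with an assumed 300-cycle miss latency and 20 cycles of work per iteration, this gives a distance of 15 iterations. If either number changes with a new CPU or clock speed, so does the distance, which is the maintenance problem described above.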

Anyway, if you feel you really must use prefetch, I recommend #ifdefs around any prefetch instructions so that you can compile your code with and without prefetch and see if it is actually helping (or hindering) performance, e.g.

#ifdef USE_PREFETCH
    // prefetch instruction(s)
#endif
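
To make the `#ifdef` idea concrete, here is a minimal sketch assuming GCC or Clang, whose `__builtin_prefetch` builtin emits the target's prefetch instruction where one exists. The names `PREFETCH_READ`, `PREFETCH_AHEAD` and `sum_stream` are hypothetical, and the distance of 64 elements is an arbitrary starting point rather than a recommended value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many elements ahead to prefetch. */
#ifndef PREFETCH_AHEAD
#define PREFETCH_AHEAD 64
#endif

#ifdef USE_PREFETCH
/* GCC/Clang builtin: address, 0 = prefetch for read, 0 = low temporal locality. */
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 0)
#else
/* Compiles to nothing, so the same source builds with and without prefetch. */
#define PREFETCH_READ(addr) ((void)0)
#endif

/* Sum a large array, optionally hinting at data that will be needed soon. */
uint64_t sum_stream(const uint32_t *data, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            PREFETCH_READ(&data[i + PREFETCH_AHEAD]);
        total += data[i];
    }
    return total;
}
```

Building once with `-DUSE_PREFETCH` and once without gives two binaries to time against each other. A sequential loop like this one is exactly the kind of access the hardware prefetcher tends to handle well, so it would not be surprising if the two builds perform the same or the prefetch build is slower.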

In general though, I would recommend leaving software prefetch on the back burner as a last resort micro-optimisation after you've done all the more productive and obvious stuff.

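  • In my experience there is no easy way, and in most cases the effort isn't justified. You get more optimisation "bang per buck" by improving the algorithm and its implementation, paying attention to cache usage and memory access patterns, using SIMD, etc. (2 upvotes)
  • @Paul R: Both Core2 and Itanium benefit from prefetch. I didn't eliminate the fetch latency, but it did reduce the number of wait cycles in a tree search. (2 upvotes)
  • @Zan Lynx: There may be some cases that benefit, but it's important to remember that you can only take advantage of prefetch if you have memory bandwidth to spare - in many cases (probably most cases) what you are doing with manual prefetch is taking bandwidth away from other parts of your program. But if it works for you in your particular case then great. (2 upvotes)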

cam*_*ccc 6

To even be considering prefetch, code performance must already be a problem.

1: Use a code profiler. Trying to use prefetch without a profiler is a waste of time.

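2: Whenever you find an unusually slow instruction in a critical place, you have a candidate for a prefetch. Often the actual problem is the memory access on the line before the slow one, rather than the slow line the profiler points at. Figure out which memory access is causing the problem (not always easy) and prefetch it.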

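3: Run your profiler again and see whether it made any difference. If it didn't, take the prefetch out. On occasion I have sped up loops by more than 300% this way. It is usually most effective if you have a loop accessing memory in a non-sequential manner.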

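I completely disagree that it is of less use on modern CPUs; I have found the complete opposite. On older CPUs prefetching about 100 instructions ahead was optimal, whereas these days I'd put that number at more like 500.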
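
As an illustration of the non-sequential case, here is a minimal sketch of a loop that reads a table through an index array, assuming GCC or Clang for `__builtin_prefetch`. The names `sum_indexed` and `LOOKAHEAD` are hypothetical, and the lookahead of 16 iterations is only a placeholder to be tuned with the profiler; it is not derived from the instruction counts quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookahead, in loop iterations; tune it with a profiler. */
#ifndef LOOKAHEAD
#define LOOKAHEAD 16
#endif

/* Sum table entries in the order given by idx. The accesses to table are
 * data-dependent, so a hardware prefetcher is unlikely to predict them. */
uint64_t sum_indexed(const uint32_t *table, const uint32_t *idx, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
#ifdef USE_PREFETCH
        /* idx itself is read sequentially, so the future index is cheap to
         * load; use it to hint at the table entry needed LOOKAHEAD
         * iterations from now (0 = read, 1 = low temporal locality). */
        if (i + LOOKAHEAD < n)
            __builtin_prefetch(&table[idx[i + LOOKAHEAD]], 0, 1);
#endif
        total += table[idx[i]];
    }
    return total;
}
```

The result is the same with or without `-DUSE_PREFETCH`; only the timing can differ, so each change to `LOOKAHEAD` should be followed by another profiler run, as in step 3.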