Enhanced REP MOVSB for memcpy

Z b*_*son 56 c x86 assembly gcc memcpy

I would like to use Enhanced REP MOVSB (ERMSB) to get high bandwidth for a custom memcpy.

ERMSB was introduced with the Ivy Bridge microarchitecture. See the section "Enhanced REP MOVSB and STOSB operation (ERMSB)" in the Intel optimization manual if you don't know what ERMSB is.

The only way I know to do this directly is with inline assembly. I got the following function from https://groups.google.com/forum/#!topic/gnu.gcc.help/-Bmlm_EG_fE

static inline void *__movsb(void *d, const void *s, size_t n) {
  asm volatile ("rep movsb"
                : "=D" (d),
                  "=S" (s),
                  "=c" (n)
                : "0" (d),
                  "1" (s),
                  "2" (n)
                : "memory");
  return d;
}

However, when I use it, the bandwidth is much lower than with memcpy. On my i7-6700HQ (Skylake) system, Ubuntu 16.10, dual-channel DDR4 @ 2400 MHz, 32 GB, GCC 6.2, __movsb gets 15 GB/s while memcpy gets 26 GB/s.

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

Here is the code I used to test it.

//gcc -O3 -march=native -fopenmp foo.c
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stddef.h>
#include <omp.h>
#include <x86intrin.h>

static inline void *__movsb(void *d, const void *s, size_t n) {
  asm volatile ("rep movsb"
                : "=D" (d),
                  "=S" (s),
                  "=c" (n)
                : "0" (d),
                  "1" (s),
                  "2" (n)
                : "memory");
  return d;
}

int main(void) {
  int n = 1<<30;

  //char *a = malloc(n), *b = malloc(n);

  char *a = _mm_malloc(n,4096), *b = _mm_malloc(n,4096);
  memset(a,2,n), memset(b,1,n);

  __movsb(b,a,n);
  printf("%d\n", memcmp(b,a,n));

  double dtime;

  dtime = -omp_get_wtime();
  for(int i=0; i<10; i++) __movsb(b,a,n);
  dtime += omp_get_wtime();
  printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);

  dtime = -omp_get_wtime();
  for(int i=0; i<10; i++) memcpy(b,a,n);
  dtime += omp_get_wtime();
  printf("dtime %f, %.2f GB/s\n", dtime, 2.0*10*1E-9*n/dtime);  
}

The reason I am interested in rep movsb is based on these comments:

Note that on Ivy Bridge and Haswell, with buffers too large to fit in the MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into the LLC, rep movsb does not... rep movsb is significantly faster than movntdqa when streaming to memory on Ivy Bridge and Haswell (but be aware that pre-Ivy-Bridge it is slow!)

What is missing/suboptimal in this memcpy implementation?


Here are my results from tinymembench on the same system.

 C copy backwards                                     :   7910.6 MB/s (1.4%)
 C copy backwards (32 byte blocks)                    :   7696.6 MB/s (0.9%)
 C copy backwards (64 byte blocks)                    :   7679.5 MB/s (0.7%)
 C copy                                               :   8811.0 MB/s (1.2%)
 C copy prefetched (32 bytes step)                    :   9328.4 MB/s (0.5%)
 C copy prefetched (64 bytes step)                    :   9355.1 MB/s (0.6%)
 C 2-pass copy                                        :   6474.3 MB/s (1.3%)
 C 2-pass copy prefetched (32 bytes step)             :   7072.9 MB/s (1.2%)
 C 2-pass copy prefetched (64 bytes step)             :   7065.2 MB/s (0.8%)
 C fill                                               :  14426.0 MB/s (1.5%)
 C fill (shuffle within 16 byte blocks)               :  14198.0 MB/s (1.1%)
 C fill (shuffle within 32 byte blocks)               :  14422.0 MB/s (1.7%)
 C fill (shuffle within 64 byte blocks)               :  14178.3 MB/s (1.0%)
 ---
 standard memcpy                                      :  12784.4 MB/s (1.9%)
 standard memset                                      :  30630.3 MB/s (1.1%)
 ---
 MOVSB copy                                           :   8712.0 MB/s (2.0%)
 MOVSD copy                                           :   8712.7 MB/s (1.9%)
 SSE2 copy                                            :   8952.2 MB/s (0.7%)
 SSE2 nontemporal copy                                :  12538.2 MB/s (0.8%)
 SSE2 copy prefetched (32 bytes step)                 :   9553.6 MB/s (0.8%)
 SSE2 copy prefetched (64 bytes step)                 :   9458.5 MB/s (0.5%)
 SSE2 nontemporal copy prefetched (32 bytes step)     :  13103.2 MB/s (0.7%)
 SSE2 nontemporal copy prefetched (64 bytes step)     :  13179.1 MB/s (0.9%)
 SSE2 2-pass copy                                     :   7250.6 MB/s (0.7%)
 SSE2 2-pass copy prefetched (32 bytes step)          :   7437.8 MB/s (0.6%)
 SSE2 2-pass copy prefetched (64 bytes step)          :   7498.2 MB/s (0.9%)
 SSE2 2-pass nontemporal copy                         :   3776.6 MB/s (1.4%)
 SSE2 fill                                            :  14701.3 MB/s (1.6%)
 SSE2 nontemporal fill                                :  34188.3 MB/s (0.8%)

Note that on my system SSE2 copy prefetched is also faster than MOVSB copy.


In my original tests I did not disable turbo. I disabled turbo and tested again and it does not appear to make much of a difference. However, changing the power management does make a big difference.

When I do

sudo cpufreq-set -r -g performance

I sometimes see over 20 GB/s with rep movsb. With

sudo cpufreq-set -r -g powersave

the best I see is about 17 GB/s. But memcpy does not seem to be sensitive to the power management.


I checked the frequency (using turbostat), with and without SpeedStep, with performance and with powersave, for idle, a 1-core load and a 4-core load. I ran Intel's MKL dense matrix multiplication to create a load and set the number of threads using OMP_SET_NUM_THREADS. Here is a table of the results (numbers in GHz).

              SpeedStep     idle      1 core    4 core
powersave     OFF           0.8       2.6       2.6
performance   OFF           2.6       2.6       2.6
powersave     ON            0.8       3.5       3.1
performance   ON            3.5       3.5       3.1

This shows that with powersave, even with SpeedStep disabled, the CPU still clocks down to the idle frequency of 0.8 GHz. Only performance without SpeedStep makes the CPU run at a constant frequency.

I used e.g. sudo cpufreq-set -r performance (because cpufreq-set gave odd results) to change the power settings. This turns turbo back on, so I had to disable turbo afterwards.

Bee*_*ope 73

This is a topic quite near to my heart and recent investigations, so I'll look at it from a few angles: history, some technical notes (mostly academic), test results on my box, and finally an attempt to answer your actual question of when and where rep movsb might make sense.

Partly, this is a call to share results - if you can run tinymembench and share the results along with details of your CPU and RAM configuration, that would be great. Especially if you have a 4-channel setup, an Ivy Bridge box, a server box, etc.

History and Official Advice

The performance history of the fast string copy instructions has been a bit of a stair-step affair - i.e., periods of stagnant performance alternating with big upgrades that brought them in line with, or even faster than, competing approaches. For example, there was a jump in performance in Nehalem (mostly targeting startup overheads) and again in Ivy Bridge (mostly targeting total throughput for large copies). You can find decade-old insight on the difficulties of implementing the rep movs instructions from an Intel engineer in this thread.

For example, in guides preceding the introduction of Ivy Bridge, the typical advice was to avoid them or to use them very carefully1.

The current (well, June 2016) guide has a variety of confusing and somewhat inconsistent advice, such as2:

The specific variant of the implementation is chosen at execution time based on data layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with counter value less than or equal to three for best performance.

So for copies of 3 or less bytes? You don't need a rep prefix for that in the first place, since with a claimed startup latency of ~9 cycles you are almost certainly better off with a simple DWORD or QWORD mov with a bit of bit-twiddling to mask off the unused bytes (or perhaps with 2 explicit byte, word movs if you know the size is exactly three).
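
For the record, here is a minimal sketch (mine, not from the manual) of the second option described above, a rep-free small copy that handles sizes of three bytes or less with at most one word move and one byte move; memcpy_small3 is a hypothetical helper name:

#include <stdint.h>
#include <string.h>

static inline void memcpy_small3(void *dst, const void *src, size_t n) {
  unsigned char *d = dst;
  const unsigned char *s = src;
  if (n >= 2) {              /* a single 16-bit move covers bytes 0..1 */
    uint16_t w;
    memcpy(&w, s, 2);        /* unaligned-safe load */
    memcpy(d, &w, 2);
  }
  if (n & 1)                 /* for odd sizes, copy the final byte */
    d[n - 1] = s[n - 1];
}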

They go on to say:

String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of double words plus single byte moves with a count value less than or equal to 3.

This certainly seems wrong on current hardware with ERMSB where rep movsb is at least as fast, or faster, than the movd or movq variants for large copies.

In general, that section (3.7.5) of the current guide contains a mix of reasonable and badly obsolete advice. This is common throughout the Intel manuals, since they are updated in an incremental fashion for each architecture (and purport to cover nearly two decades worth of architectures even in the current manual), and old sections are often not updated to replace or make conditional advice that doesn't apply to the current architecture.

They then go on to cover ERMSB explicitly in section 3.7.6.

I won't go over the remaining advice exhaustively, but I'll summarize the good parts in the "why use it" below.

Other important claims from the guide are that on Haswell, rep movsb has been enhanced to use 256-bit operations internally.

Technical Considerations

This is just a quick summary of the potential advantages and disadvantages that the rep instructions have from an implementation point of view.

Advantages of rep movs

  1. When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:

    • Avoiding the RFO request when it knows the entire cache line will be overwritten.
    • Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows the region size exactly and can prefetch exactly.
  2. Apparently, there is no guarantee of ordering among the stores within3 a single rep movs, which can help simplify coherency traffic and other aspects of the block move, compared to plain mov instructions which have to obey rather strict memory ordering4.

  3. In principle, the rep movs instruction could take advantage of various architectural tricks that aren't exposed in the ISA. For example, the architecture may have wider internal data paths than are exposed in the ISA5, and rep movs could use them internally.

Disadvantages

  1. rep movsb must implement a specific semantic which may be stronger than the underlying software requirement. In particular, memcpy forbids overlapping regions and so may ignore that possibility, but rep movsb allows them and must produce the expected result. On current implementations this mostly affects startup overhead, but probably not large-block throughput. Similarly, rep movsb must support byte-granular copies even if you are actually using it to copy large blocks which are a multiple of some large power of 2.

  2. The software may have information about alignment, copy size and possible aliasing that cannot be communicated to the hardware if using rep movsb. Compilers can often determine the alignment of memory blocks6 and so can avoid much of the startup work that rep movs must do on every invocation.

Test Results

Here are test results for many different copy methods from tinymembench on my i7-6700HQ at 2.6 GHz (too bad I have the identical CPU so we aren't getting a new data point...):

 C copy backwards                                     :   8284.8 MB/s (0.3%)
 C copy backwards (32 byte blocks)                    :   8273.9 MB/s (0.4%)
 C copy backwards (64 byte blocks)                    :   8321.9 MB/s (0.8%)
 C copy                                               :   8863.1 MB/s (0.3%)
 C copy prefetched (32 bytes step)                    :   8900.8 MB/s (0.3%)
 C copy prefetched (64 bytes step)                    :   8817.5 MB/s (0.5%)
 C 2-pass copy                                        :   6492.3 MB/s (0.3%)
 C 2-pass copy prefetched (32 bytes step)             :   6516.0 MB/s (2.4%)
 C 2-pass copy prefetched (64 bytes step)             :   6520.5 MB/s (1.2%)
 ---
 standard memcpy                                      :  12169.8 MB/s (3.4%)
 standard memset                                      :  23479.9 MB/s (4.2%)
 ---
 MOVSB copy                                           :  10197.7 MB/s (1.6%)
 MOVSD copy                                           :  10177.6 MB/s (1.6%)
 SSE2 copy                                            :   8973.3 MB/s (2.5%)
 SSE2 nontemporal copy                                :  12924.0 MB/s (1.7%)
 SSE2 copy prefetched (32 bytes step)                 :   9014.2 MB/s (2.7%)
 SSE2 copy prefetched (64 bytes step)                 :   8964.5 MB/s (2.3%)
 SSE2 nontemporal copy prefetched (32 bytes step)     :  11777.2 MB/s (5.6%)
 SSE2 nontemporal copy prefetched (64 bytes step)     :  11826.8 MB/s (3.2%)
 SSE2 2-pass copy                                     :   7529.5 MB/s (1.8%)
 SSE2 2-pass copy prefetched (32 bytes step)          :   7122.5 MB/s (1.0%)
 SSE2 2-pass copy prefetched (64 bytes step)          :   7214.9 MB/s (1.4%)
 SSE2 2-pass nontemporal copy                         :   4987.0 MB/s

Some key takeaways:

  • The rep movs methods are faster than all the other methods which aren't "non-temporal"7, and considerably faster than the "C" approaches which copy 8 bytes at a time.
  • The "non-temporal" methods are faster, by up to about 26% than the rep movs ones - but that's a much smaller delta than the one you reported (26 GB/s vs 15 GB/s = ~73%).
  • If you are not using non-temporal stores, using 8-byte copies from C is pretty much just as good as 128-bit wide SSE load/stores. That's because a good copy loop can generate enough memory pressure to saturate the bandwidth (e.g., 2.6 GHz * 1 store/cycle * 8 bytes = ~21 GB/s for stores); a minimal sketch of such a loop follows this list.
  • There are no explicit 256-bit algorithms in tinymembench (except probably the "standard" memcpy) but it probably doesn't matter due to the above note.
  • The increased throughput of the non-temporal store approaches over the temporal ones is about 1.45x, which is very close to the 1.5x you would expect if NT eliminates 1 out of 3 transfers (i.e., 1 read, 1 write for NT vs 2 reads, 1 write). The rep movs approaches lie in the middle.
  • The combination of fairly low memory latency and modest 2-channel bandwidth means this particular chip happens to be able to saturate its memory bandwidth from a single-thread, which changes the behavior dramatically.
  • rep movsd seems to use the same magic as rep movsb on this chip. That's interesting because ERMSB only explicitly targets movsb and earlier tests on earlier archs with ERMSB show movsb performing much faster than movsd. This is mostly academic since movsb is more general than movsd anyway.
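
As a point of reference, a minimal sketch (mine, not tinymembench's code) of the kind of 8-byte-at-a-time "C copy" loop referred to above might look like this; it assumes 8-byte aligned buffers and a size that is a multiple of 8:

#include <stdint.h>
#include <stddef.h>

static void copy8(void *dst, const void *src, size_t n) {
  uint64_t *d = dst;
  const uint64_t *s = src;
  for (size_t i = 0; i < n / 8; i++)
    d[i] = s[i];             /* one 8-byte load and one 8-byte store per iteration */
}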

Haswell

Looking at the Haswell results kindly provided by iwillnotexist in the comments, we see the same general trends (most relevant results extracted):

 C copy                                               :   6777.8 MB/s (0.4%)
 standard memcpy                                      :  10487.3 MB/s (0.5%)
 MOVSB copy                                           :   9393.9 MB/s (0.2%)
 MOVSD copy                                           :   9155.0 MB/s (1.6%)
 SSE2 copy                                            :   6780.5 MB/s (0.4%)
 SSE2 nontemporal copy                                :  10688.2 MB/s (0.3%)

The rep movsb approach is still slower than the non-temporal memcpy, but only by about 14% here (compared to ~26% in the Skylake test). The advantage of the NT techniques above their temporal cousins is now ~57%, even a bit more than the theoretical benefit of the bandwidth reduction.

When should you use rep movs?

Finally, a stab at your actual question: when or why should you use it? It draws on the above and introduces a few new ideas. Unfortunately there is no simple answer: you'll have to trade off various factors, including some which you probably can't even know exactly, such as future developments.

Note that the alternative to rep movsb may be the optimized libc memcpy (including copies inlined by the compiler), or it may be a hand-rolled memcpy version. Some of the benefits below apply only in comparison to one or the other of these alternatives (e.g., "simplicity" helps against a hand-rolled version, but not against built-in memcpy), but some apply to both.

Restrictions on available instructions

In some environments there is a restriction on certain instructions or using certain registers. For example, in the Linux kernel, use of SSE/AVX or FP registers is generally disallowed. Therefore most of the optimized memcpy variants cannot be used as they rely on SSE or AVX registers, and a plain 64-bit mov-based copy is used on x86. For these platforms, using rep movsb allows most of the performance of an optimized memcpy without breaking the restriction on SIMD code.

A more general example might be code that has to target many generations of hardware, and which doesn't use hardware-specific dispatching (e.g., using cpuid). Here you might be forced to use only older instruction sets, which rules out any AVX, etc. rep movsb might be a good approach here since it allows "hidden" access to wider loads and stores without using new instructions. If you target pre-ERMSB hardware you'd have to see if rep movsb performance is acceptable there, though...

Future Proofing

A nice aspect of rep movsb is that it can, in theory, take advantage of architectural improvements on future architectures, without source changes, in a way that explicit moves cannot. For example, when 256-bit data paths were introduced, rep movsb was able to take advantage of them (as claimed by Intel) without any changes needed to the software. Software using 128-bit moves (which was optimal prior to Haswell) would have to be modified and recompiled.

So it is both a software maintenance benefit (no need to change source) and a benefit for existing binaries (no need to deploy new binaries to take advantage of the improvement).

How important this is depends on your maintenance model (e.g., how often new binaries are deployed in practice) and a very difficult to make judgement of how fast these instructions are likely to be in the future. At least Intel is kind of guiding uses in this direction though, by committing to at least reasonable performance in the future (15.3.3.6):

REP MOVSB and REP STOSB will continue to perform reasonably well on future processors.

Overlapping with subsequent work

This benefit won't show up in a plain memcpy benchmark of course, which by definition doesn't have subsequent work to overlap, so the magnitude of the benefit would have to be carefully measured in a real-world scenario. Taking maximum advantage might require re-organization of the code surrounding the memcpy.

This benefit is pointed out by Intel in their optimization manual (section 11.16.3.4) and in their words:

When the count is known to be at least a thousand byte or more, using enhanced REP MOVSB/STOSB can provide another advantage to amortize the cost of the non-consuming code. The heuristic can be understood using a value of Cnt = 4096 and memset() as example:

• A 256-bit SIMD implementation of memset() will need to issue/execute retire 128 instances of 32-byte store operation with VMOVDQA, before the non-consuming instruction sequences can make their way to retirement.

• An instance of enhanced REP STOSB with ECX= 4096 is decoded as a long micro-op flow provided by hardware, but retires as one instruction. There are many store_data operation that must complete before the result of memset() can be consumed. Because the completion of store data operation is de-coupled from program-order retirement, a substantial part of the non-consuming code stream can process through the issue/execute and retirement, essentially cost-free if the non-consuming sequence does not compete for store buffer resources.

So Intel is saying that, once rep movsb has issued all of its uops but while lots of its stores are still in flight and the rep movsb as a whole hasn't retired yet, uops from following instructions can make more progress through the out-of-order machinery than they could if that code came after a copy loop.

The uops from an explicit load and store loop all have to actually retire separately in program order. That has to happen to make room in the ROB for following uops.

There doesn't seem to be much detailed information about how very long microcoded instruction like rep movsb work, exactly. We don't know exactly how micro-code branches request a different stream of uops from the microcode sequencer, or how the uops retire. If the individual uops don't have to retire separately, perhaps the whole instruction only takes up one slot in the ROB?

When the front-end that feeds the OoO machinery sees a rep movsb instruction in the uop cache, it activates the Microcode Sequencer ROM (MS-ROM) to send microcode uops into the queue that feeds the issue/rename stage. It's probably not possible for any other uops to mix in with that and issue/execute8 while rep movsb is still issuing, but subsequent instructions can be fetched/decoded and issue right after the last rep movsb uop does, while some of the copy hasn't executed yet. This is only useful if at least some of your subsequent code doesn't depend on the result of the memcpy (which isn't unusual).
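
A hedged illustration of that point (my sketch, not Intel's code): the call below the copy does not read the destination, so its uops can flow into the out-of-order machinery while the rep movsb stores are still draining:

#include <stddef.h>

extern void compute_unrelated(int *out);   /* hypothetical work that does not read dst */

void copy_then_work(void *dst, const void *src, size_t n, int *out) {
  /* same rep movsb wrapper as in the question */
  __asm__ __volatile__ ("rep movsb"
                        : "+D" (dst), "+S" (src), "+c" (n)
                        :
                        : "memory");
  compute_unrelated(out);    /* independent of the copy result, so it can overlap */
}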

Now, the size of this benefit is limited: at most you can execute N instructions (uops actually) beyond the slow rep movsb instruction, at which point you'll stall, where N is the ROB size. With current ROB sizes of ~200 (192 on Haswell, 224 on Skylake), that's a maximum benefit of ~200 cycles of free work for subsequent code with an IPC of 1. In 200 cycles you can copy somewhere around 800 bytes at 10 GB/s, so for copies of that size you may get free work close to the cost of the copy (in a way making the copy free).

As copy sizes get much larger, however, the relative importance of this diminishes rapidly (e.g., if you are copying 80 KB instead, the free work is only 1% of the copy cost). Still, it is quite interesting for modest-sized copies.

Copy loops don't totally block subsequent instructions from executing, either. Intel does not go into detail on the size of the benefit, or on what kind of copies or surrounding code there is most benefit. (Hot or cold destination or source, high ILP or low ILP high-latency code after).

Code Size

The executed code size (a few bytes) is microscopic compared to a typical optimized memcpy routine. If performance is at all limited by i-cache (including uop cache) misses, the reduced code size might be of benefit.

Again, we can bound the magnitude of this benefit based on the size of the copy. I won't actually work it out numerically, but the intuition is that reducing the dynamic code size by B bytes can save at most C * B cache-misses, for some constant C. Every call to memcpy incurs the cache miss cost (or benefit) once, but the advantage of higher throughput scales with the number of bytes copied. So for large transfers, higher throughput will dominate the cache effects.

Again, this is not something that will show up in a plain benchmark, where the entire loop will no doubt fit in the uop cache. You'll need a real-world, in-place test to evaluate this effect.

Architecture Specific Optimization

You reported that on your hardware, rep movsb was considerably slower than the platform memcpy. However, even here there are reports of the opposite result on earlier hardware (like Ivy Bridge).

That's entirely plausible, since it seems that the string move operations get love periodically - but not every generation, so it may well be faster or at least tied (at which point it may win based on other advantages) on the architectures where it has been brought up to date, only to fall behind in subsequent hardware.

Quoting Andy Glew, who should know a thing or two about this after implementing these on the P6:

the big weakness of doing fast strings in microcode was [...] the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library memcpy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on.

In that case, it can be seen as just another "platform specific" optimization to apply in the typical every-trick-in-the-book memcpy routines you find in standard libraries and JIT compilers: but only for use on architectures where it is better. For JIT or AOT-compiled stuff this is easy, but for statically compiled binaries this does require platform specific dispatch, but that often already exists (sometimes implemented at link time), or the mtune argument can be used to make a static decision.

Simplicity

Even on Skylake, where it seems like it has fallen behind the absolute fastest non-temporal techniques, it is still faster than most approaches and is very simple. This means less time in validation, fewer mystery bugs, less time tuning and updating a monster memcpy implementation (or, conversely, less dependency on the whims of the standard library implementors if you rely on that).

Latency Bound Platforms

Memory throughput bound algorithms9 can actually be operating in two main overall regimes: DRAM bandwidth bound or concurrency/latency bound.

The first mode is the one that you are probably familiar with: the DRAM subsystem has a certain theoretic bandwidth that you can calculate pretty easily based on the number of channels, data rate/width and frequency. For example, my DDR4-2133 system with 2 channels has a max bandwidth of 2.133*8*2 = 34.1 GB/s, same as reported on ARK.

You won't sustain more than that rate from DRAM (and usually somewhat less due to various inefficiencies) added across all cores on the socket (i.e., it is a global limit for single-socket systems).

The other limit is imposed by how many concurrent requests a core can actually issue to the memory subsystem. Imagine if a core could only have 1 request in progress at once, for a 64-byte cache line - when the request completed, you could issue another. Assume also very fast 50ns memory latency. Then despite the large 34.1 GB/s DRAM bandwidth, you'd actually only get 64 bytes/50 ns = 1.28 GB/s, or less than 4% of the max bandwidth.

In practice, cores can issue more than one request at a time, but not an unlimited number. It is usually understood that there are only 10 line fill buffers per core between the L1 and the rest of the memory hierarchy, and perhaps 16 or so fill buffers between L2 and DRAM. Prefetching competes for the same resources, but at least helps reduce the effective latency. For more details look at any of the great posts Dr. Bandwidth has written on the topic, mostly on the Intel forums.

Still, most recent CPUs

  • @BeeOnRope: Here are [my results](http://nominal-animal.net/answers/tinymembench-i5-6200U-HP-EliteBook-820-G3.txt); the file includes system and compiler information. It has ERMS support, but the results indicate it just isn't competitive on this system, which explains the difficulty I had finding a winning test for it. Also.. would you mind adding a comment to your answer that tinymembench only does 64-bit aligned copies and fills? While fully applicable to the question asked here, that is strictly a subset of typical use cases in real-world applications. (3 upvotes)
  • @MaximMasiutin - the discussion of branch prediction is probably worth an entirely separate question on SO, but the short answer is that the exact techniques for the latest chips haven't been disclosed, but you are probably looking at something very similar to [TAGE](https://scholar.google.com/scholar?q=TAGE+branch+prediction) on Intel and [perceptrons](https://scholar.google.com/scholar?q=perceptron+branch+prediction) on AMD. More generally, I recommend reading guides 1, 2 and 3 by [Agner](http://www.agner.org/optimize) in full. (3 upvotes)
  • @PeterCordes - yes, I seem to have been inconsistent about `movsd` vs `movsb`, in some places claiming they have the same performance on `erms` platforms, but above I said that _earlier tests on earlier archs with ERMSB show `movsb` performing much faster than `movsd`_. That's specific enough that I must have seen the data, but I can't find it in this thread. It may have come from one of [these](http://www.realworldtech.com/forum/?threadid=168200) [two](http://www.realworldtech.com/forum/?threadid=147985) big threads on RWT, or perhaps from the examples in the Intel manual. (3 upvotes)
  • The precise behavior usually doesn't really matter: assume that unless your branch sequence follows some simple(ish) repeating pattern, the predictor will simply predict the direction it sees most often, and hence you'll pay a ~20-cycle penalty every time a branch goes the "other" way. You can easily examine the actual performance of every branch in your application on Linux with `perf stat` and `perf record -e branch-misses:pp` (and the equivalent on Windows). (2 upvotes)
  • For example, the Intel manual has _Figure 3-4. Memcpy Performance Comparison for Lengths up to 2KB_, which shows that `rep movsd` (plus a trailing `movsb` for the last three bytes) on Ivy Bridge scales considerably worse than `movsb` up to 256 bytes, at which point the slopes appear to be the same. There are some Ivy Bridge results [here](http://users.atw.hu/instlatx64/GenuineIntel00306A9_IvyBridge_InstLatX64.txt), which show `rep movsd` about 3% slower than `rep movsb`, but perhaps that's within measurement error and not large even if not. (2 upvotes)
  • Well, it isn't necessarily a guarantee - but someone implementing a dedicated custom memcpy probably knows very well that everything before IVB is largely irrelevant. In any case, scanning http://instlatx64.atw.hu I couldn't find even a semi-modern chip before IVB where `rep movsb` was worse. This only covers L1, but generally the trend is that larger copies reduce the differences further. I did find this [old Athlon](http://users.atw.hu/instlatx64/AuthenticAMD0020FB1_K8_Manchester_InstLatX64.txt) where `rep movsb` was 4x worse (newer AMD chips seem much better). (2 upvotes)
  • It's worth noting that the change from 27G cycles -> 30G cycles for 16G more instructions implies an IPC of 16/3 = 5.33 for those instructions, so there is certainly some overlap going on (but it is in no way clear that it is better overlap than you'd get from an explicit copy loop, as Intel claims). (2 upvotes)

Max*_*tin 10

Enhanced REP MOVSB (Ivy Bridge and later)

The Ivy Bridge microarchitecture (processors released in 2012 and 2013) introduced Enhanced REP MOVSB (we still need to check the corresponding bit) and allows us to copy memory fast.
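
As a side note, a minimal sketch of checking that bit (CPUID leaf 7, sub-leaf 0, EBX bit 9, the ERMS flag) using GCC's <cpuid.h> might look like this; has_erms is my own helper name:

#include <cpuid.h>

static int has_erms(void) {
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) == 0 || eax < 7)
    return 0;                            /* CPUID leaf 7 not available */
  __cpuid_count(7, 0, eax, ebx, ecx, edx);
  return (ebx >> 9) & 1;                 /* CPUID.(EAX=07H,ECX=0):EBX bit 9 = ERMS */
}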

Even the cheapest versions of later processors - the Kaby Lake Celeron and Pentium released in 2017, which don't have AVX that could have been used for fast memory copy - still have Enhanced REP MOVSB.

REP MOVSB (ERMSB) is only faster than an AVX copy or a general-purpose-register copy if the block size is at least 256 bytes. For blocks below 64 bytes it is much slower, because of the high internal startup in ERMSB - about 35 cycles.

See section 3.7.6 "Enhanced REP MOVSB and STOSB operation (ERMSB)" of the Intel Optimization Manual: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

  • The startup cost is 35 cycles;
  • Both the source and destination addresses have to be aligned to a 16-byte boundary;
  • The source region should not overlap with the destination region;
  • The length has to be a multiple of 64 to produce higher performance;
  • The direction has to be forward (CLD).

As I said earlier, REP MOVSB starts to outperform other methods when the length is at least 256 bytes, but to see a clear benefit over an AVX copy the length has to be more than 2048 bytes.
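
A hedged sketch of the size-based dispatch these numbers suggest (the 256-byte threshold is the one quoted above, not a tuned value; copy_dispatch is a hypothetical name) could look like this:

#include <stddef.h>
#include <string.h>

void *copy_dispatch(void *dst, const void *src, size_t n) {
  if (n < 256)
    return memcpy(dst, src, n);   /* the startup cost makes rep movsb a poor fit here */
  void *d = dst;
  const void *s = src;
  size_t c = n;
  __asm__ __volatile__ ("rep movsb"
                        : "+D" (d), "+S" (s), "+c" (c)
                        :
                        : "memory");
  return dst;
}

A real implementation would also pick its thresholds per microarchitecture rather than hard-coding them.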

On the effect of alignment for REP MOVSB versus AVX copies, the Intel Manual gives the following information:

  • if the source buffer is not aligned, the impact on the ERMSB implementation versus 128-bit AVX is similar;
  • if the destination buffer is not aligned, the impact on the ERMSB implementation can be a 25% degradation, while the 128-bit AVX implementation of memcpy may degrade only 5%, relative to the 16-byte aligned scenario.

I have run tests on an Intel Core i5-6600, under 64-bit, and I have compared REP MOVSB memcpy() with a simple MOV RAX, [SRC]; MOV [DST], RAX implementation when the data fits the L1 cache:

REP MOVSB memcpy():

 - 1622400000 data blocks of  32 bytes took 17.9337 seconds to copy;  2760.8205 MB/s
 - 1622400000 data blocks of  64 bytes took 17.8364 seconds to copy;  5551.7463 MB/s
 - 811200000 data blocks of  128 bytes took 10.8098 seconds to copy;  9160.5659 MB/s
 - 405600000 data blocks of  256 bytes took  5.8616 seconds to copy; 16893.5527 MB/s
 - 202800000 data blocks of  512 bytes took  3.9315 seconds to copy; 25187.2976 MB/s
 - 101400000 data blocks of 1024 bytes took  2.1648 seconds to copy; 45743.4214 MB/s
 - 50700000 data blocks of  2048 bytes took  1.5301 seconds to copy; 64717.0642 MB/s
 - 25350000 data blocks of  4096 bytes took  1.3346 seconds to copy; 74198.4030 MB/s
 - 12675000 data blocks of  8192 bytes took  1.1069 seconds to copy; 89456.2119 MB/s
 - 6337500 data blocks of  16384 bytes took  1.1120 seconds to copy; 89053.2094 MB/s

MOV RAX ... memcpy():

 - 1622400000 data blocks of  32 bytes took  7.3536 seconds to copy;  6733.0256 MB/s
 - 1622400000 data blocks of  64 bytes took 10.7727 seconds to copy;  9192.1090 MB/s
 - 811200000 data blocks of  128 bytes took  8.9408 seconds to copy; 11075.4480 MB/s
 - 405600000 data blocks of  256 bytes took  8.4956 seconds to copy; 11655.8805 MB/s
 - 202800000 data blocks of  512 bytes took  9.1032 seconds to copy; 10877.8248 MB/s
 - 101400000 data blocks of 1024 bytes took  8.2539 seconds to copy; 11997.1185 MB/s
 - 50700000 data blocks of  2048 bytes took  7.7909 seconds to copy; 12710.1252 MB/s
 - 25350000 data blocks of  4096 bytes took  7.5992 seconds to copy; 13030.7062 MB/s
 - 12675000 data blocks of  8192 bytes took  7.4679 seconds to copy; 13259.9384 MB/s

So, even on 128-byte blocks, REP MOVSB is slower than a simple MOV RAX copy in a loop (not unrolled). The ERMSB implementation begins to outperform the MOV RAX loop only starting from 256-byte blocks.

Normal (not enhanced) REP MOVS on Nehalem and later

Surprisingly, previous architectures (Nehalem and later), which did not yet have Enhanced REP MOVSB, had quite fast REP MOVSD/MOVSQ (but not REP MOVSB/MOVSW) implementations for large blocks, but not large enough to outsize the L1 cache.

The Intel Optimization Manual (2.5.6 REP String Enhancement) gives the following information related to the Nehalem microarchitecture - Intel Core i5, i7 and Xeon processors released in 2009 and 2010.

REP MOVSB

The latency of MOVSB is 9 cycles if ECX < 4; otherwise REP MOVSB with ECX > 9 has a 50-cycle startup cost.

  • tiny strings (ECX < 4): the latency of REP MOVSB is 9 cycles;
  • small strings (ECX between 4 and 9): no official information in the Intel manual, probably more than 9 cycles but fewer than 50;
  • long strings (ECX > 9): 50-cycle startup cost.

My conclusion: REP MOVSB is almost useless on Nehalem.

MOVSW/MOVSD/MOVSQ

Quoting the Intel Optimization Manual (2.5.6 REP String Enhancement):

  • Short strings (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
  • Fast strings (ECX >= 76, excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of the REP string varies if one of the 16-byte data transfers spans across a cache line boundary: = Split-free: the latency consists of a startup cost of about 40 cycles, and each 64 bytes of data adds 4 cycles. = Cache splits: the latency consists of a startup cost of about 35 cycles, and each 64 bytes of data adds 6 cycles.
  • Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.

Intel does not seem to be correct here. From the above quote we would understand that for very large memory blocks REP MOVSW is as fast as REP MOVSD/MOVSQ, but testing has shown that only REP MOVSD/MOVSQ are fast, while REP MOVSW is even slower than REP MOVSB on Nehalem and Westmere.

According to the information Intel provides in the manual, on earlier Intel microarchitectures (before 2008) the startup costs are even higher.

Conclusion: if you just need to copy data that fits the L1 cache, just 4 cycles to copy 64 bytes of data is excellent, and you don't need to use XMM registers!
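
For illustration only (this is my sketch, not the code behind the tables below), a REP MOVSQ-based copy with a REP MOVSB tail for the last 0-7 bytes could look like this:

#include <stddef.h>

static void movsq_copy(void *dst, const void *src, size_t n) {
  size_t q = n / 8;                /* number of 8-byte quadwords */
  size_t r = n % 8;                /* remaining 0-7 bytes */
  __asm__ __volatile__ ("rep movsq\n\t"
                        "mov %3, %%rcx\n\t"
                        "rep movsb"
                        : "+D" (dst), "+S" (src), "+c" (q)
                        : "r" (r)
                        : "memory");
}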

REP MOVSD/MOVSQ is the universal solution that works well on all Intel processors (no ERMSB required) if the data fits the L1 cache

Here are tests of REP MOVS* when the source and destination are in the L1 cache, with blocks large enough not to be seriously affected by startup costs, but not so large as to exceed the L1 cache size. Source: http://users.atw.hu/instlatx64/

Yonah (2006-2008)

    REP MOVSB 10.91 B/c
    REP MOVSW 10.85 B/c
    REP MOVSD 11.05 B/c

Nehalem (2009-2010)

    REP MOVSB 25.32 B/c
    REP MOVSW 19.72 B/c
    REP MOVSD 27.56 B/c
    REP MOVSQ 27.54 B/c

Westmere (2010-2011)

    REP MOVSB 21.14 B/c
    REP MOVSW 19.11 B/c
    REP MOVSD 24.27 B/c

Ivy Bridge (2012-2013) - Enhanced REP MOVSB

    REP MOVSB 28.72 B/c
    REP MOVSW 19.40 B/c
    REP MOVSD 27.96 B/c
    REP MOVSQ 27.89 B/c

SkyLake (2015-2016) - Enhanced REP MOVSB

    REP MOVSB 57.59 B/c
    REP MOVSW 58.20 B/c
    REP MOVSD 58.10 B/c
    REP MOVSQ 57.59 B/c

Kaby Lake (2016-2017) - Enhanced REP MOVSB

    REP MOVSB 58.00 B/c
    REP MOVSW 57.69 B/c
    REP MOVSD 58.00 B/c
    REP MOVSQ 57.89 B/c

As you can see, the implementation of REP MOVS* differs significantly from one microarchitecture to another. On some processors, like Ivy Bridge, REP MOVSB is the fastest, albeit just slightly faster than REP MOVSD/MOVSQ; but on all processors since Nehalem, REP MOVSD/MOVSQ works very well - you don't even need "Enhanced REP MOVSB", since on Ivy Bridge (2013), with Enhanced REP MOVSB, REP MOVSD shows the same bytes-per-clock figure as on Nehalem (2010) without Enhanced REP MOVSB, while REP MOVSB in fact became really fast only since SkyLake (2015) - twice as fast as on Ivy Bridge. So this Enhanced REP MOVSB bit in CPUID may be confusing - it only shows that REP MOVSB per se is OK, not that any REP MOVS* is faster.

The most confusing ERMSB implementation is on the Ivy Bridge microarchitecture. Yes, on very old processors, before ERMSB, REP MOVS* for large blocks did use a cache protocol feature not available to regular code (no-RFO). But this protocol is no longer used on Ivy Bridge, which has ERMSB. According to Andy Glew's comments on an answer to "Why are complicated memcpy/memset superior?", answered by Peter Cordes, a cache protocol feature not available to regular code was once used on older processors, but no longer on Ivy Bridge. And there comes an explanation of why the startup costs of REP MOVS* are so high: "the large overhead for selecting and setting up the right method is mainly due to the lack of microcode branch prediction". There is also an interesting note that Pentium Pro (P6) in 1996 implemented REP MOVS* with 64-bit microcode loads and stores and a no-RFO cache protocol - it did not violate memory ordering, unlike ERMSB in Ivy Bridge.

Disclaimer

  1. This answer is only relevant for the cases where the source and destination data fit the L1 cache. Depending on the circumstances, the particulars of memory access (cache, etc.) should be taken into consideration. Prefetch and NTI may give better results in certain cases, especially on processors that don't yet have Enhanced REP MOVSB. Even on these older processors, REP MOVSD might have used a cache protocol feature not available to regular code.
  2. The information in this answer relates only to Intel processors and not to processors by other manufacturers like AMD, which may have better or worse implementations of the REP MOVS* instructions.
  3. I have presented test results for both SkyLake and Kaby Lake just for the sake of confirmation - these architectures show the same cycle-per-instruction data.
  4. All product names, trademarks and registered trademarks are property of their respective owners.

  • If I understand correctly, you extracted only the L1D-only numbers from the instlatx64 results. So the conclusion really is that all of `movsb`, `movsd`, `movsq` perform approximately the same on all recent _Intel_ platforms. The most interesting takeaway is probably "don't use `movsw`". You don't compare against an explicit loop of `mov` instructions (including 16-byte moves on 64-bit platforms, which are guaranteed to be available), which is likely to be faster in many cases. You don't show what happens on AMD platforms, nor when the size exceeds the L1 size. (3 upvotes)
  • Interesting L1D medium-sized-buffer data. It's probably not the whole story, though. Some of the benefits of ERMSB (like weaker ordering of stores) will only show up with larger buffers that don't fit in cache. Even regular fast-strings `rep movs` is supposed to use a no-RFO protocol, even on pre-ERMSB CPUs. (2 upvotes)
  • Finally, you should note that nothing other than `rep movsb` actually implements `memcpy` (and none of them implement `memmove`), so you need extra code for the other variants. This only matters at small sizes though. (2 upvotes)

Dav*_*erd 7

You say that you want:

an answer that shows when ERMSB is useful

But I'm not sure it means what you think it means. Looking at the 3.7.6.1 docs you link to, it explicitly says:

implementing memcpy using ERMSB might not reach the same level of throughput as using 256-bit or 128-bit AVX alternatives, depending on length and alignment factors.

So just because CPUID indicates support for ERMSB, that isn't a guarantee that REP MOVSB will be the fastest way to copy memory. It just means it won't perform as badly as it did on some previous CPUs.

However, just because there may be alternatives that can, under certain conditions, run faster doesn't mean that REP MOVSB is useless. Now that the performance penalties this instruction used to incur are gone, it is potentially a useful instruction again.

Remember, it is a tiny bit of code (2 bytes!) compared to some of the more involved memcpy routines I have seen. Since loading and running big chunks of code also comes with a penalty (throwing some of your other code out of the CPU's cache), sometimes the 'benefit' of AVX et al. is offset by the impact it has on the rest of your code. Depends on what you are doing.

You also ask:

Why is the bandwidth so much lower with REP MOVSB? What can I do to improve it?

It isn't possible to "do something" to make REP MOVSB run any faster. It does what it does.

If you want the higher speed you are seeing from memcpy, you can dig up the source for it. It's out there somewhere. Or you can trace into it from a debugger and see the actual code path being taken. My expectation is that it's using some of those AVX instructions to work with 128 or 256 bits at a time.

Or you could just... well, you asked us not to say it.


Nom*_*mal 6

This is not an answer to the stated question(s), just my results (and personal conclusions) when trying to find out.

In summary: GCC already optimizes memset()/memmove()/memcpy() (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file for the architecture-dependent variants). So there is no reason to expect massive gains by using your own variant with GCC (unless you've forgotten important details, like alignment attributes for your aligned data, or do not enable sufficiently specific optimizations like -O2 -march= -mtune=). If you agree, then the answers to the stated question are more or less irrelevant in practice.

(I only wish there was a memrepeat(), the opposite of memcpy() compared to memmove(), that would repeat the initial part of a buffer to fill the entire buffer.)


I currently have an Ivy Bridge machine in use (Core i5-6200U laptop, Linux 4.4.0 x86-64 kernel, with erms in the /proc/cpuinfo flags). Because I wanted to find out whether I can find a case where a custom memcpy() variant based on rep movsb would outperform a straightforward memcpy(), I wrote an overly complicated benchmark.

The core idea is that the main program allocates three large memory areas: original, current, and correct, each exactly the same size and at least page-aligned. The copy operations are grouped into sets, with each set having distinct properties, like all sources and targets being aligned (to some number of bytes), or all lengths being within the same range. Each set is described using an array of src, dst, n triplets, where all src to src+n-1 and dst to dst+n-1 ranges lie completely within the current area.

A Xorshift* PRNG is used to initialize original to random data. (As I warned above, this is overly complicated, but I wanted to make sure I'm not leaving any easy shortcuts for the compiler.) The correct area is obtained by starting with the original data in current, applying all the triplets in the current set using the memcpy() provided by the C library, and copying the current area to correct. This allows each benchmarked function to be verified to behave correctly.

Each set of copy operations is timed a large number of times using the same function, and the median of these is used for comparison. (In my opinion, the median makes the most sense in benchmarking and provides sensible semantics - the function is at least that fast at least half of the time.)

To avoid compiler optimizations, I have the program load the functions and benchmarks dynamically, at run time. The functions all have the same form, void function(void *, const void *, size_t) - note that unlike memcpy() and memmove(), they return nothing. The benchmarks (named sets of copy operations) are generated dynamically by a function call (which takes the pointer to the current area, its size, and so on, as parameters).
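
For illustration, a hedged sketch of that dynamic-loading pattern (my reconstruction, not the actual benchmark code; "./copies.so" and "rep_movsb_copy" are hypothetical names) might look like this:

#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

typedef void (*copy_fn)(void *dst, const void *src, size_t n);

int main(void) {
  void *lib = dlopen("./copies.so", RTLD_NOW);        /* link with -ldl */
  if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }
  copy_fn f = (copy_fn)dlsym(lib, "rep_movsb_copy");  /* resolved only at run time */
  if (!f) { fprintf(stderr, "%s\n", dlerror()); return 1; }
  /* ... time f(dst, src, n) over each set of copy operations ... */
  dlclose(lib);
  return 0;
}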

Unfortunately, I have not yet found any set where

static void rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ __volatile__ ( "rep movsb\n\t"
                         : "+D" (dst), "+S" (src), "+c" (n)
                         :
                         : "memory" );
}

would beat

static void normal_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

using gcc -Wall -O2 -march=ivybridge -mtune=ivybridge with GCC 5.4.0 on the aforementioned Core i5-6200U laptop running a 64-bit linux-4.4.0 kernel. Copying 4096-byte aligned and sized chunks does come close, however.

This means that, at least thus far, I have not found a case where using a rep movsb memcpy variant would make sense. It does not mean there is no such case; I just haven't found one.

(At this point the code is a spaghetti mess I'm more ashamed of than proud of, so I'll omit posting the sources unless someone asks. The description above should be enough to write a better one, though.)


That does not surprise me much, however. The C compiler can infer a lot of information about the alignment of the operand pointers, and about whether the number of bytes to copy is a compile-time constant or a multiple of a suitable power of 2. This information can, and will/should, be used by the compiler to replace the C library memcpy()/memmove() functions with its own.

GCC does exactly this (see e.g. gcc/config/i386/i386.c:expand_set_or_movmem_via_rep() in the GCC sources; also look for stringop_algs in the same file for the architecture-dependent variants). Indeed, memcpy()/memset()/memmove() have already been separately optimized for quite a few x86 processor variants; it would quite surprise me if the GCC developers had not already included erms support.

GCC provides several function attributes that developers can use to ensure good generated code. For example, alloc_align (n) tells GCC that the function returns memory aligned to at least n bytes. An application or a library can choose which implementation of a function to use at run time by creating a "resolver function" (which returns a function pointer) and defining the function using the ifunc (resolver) attribute.
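
A hedged sketch of that ifunc mechanism (GCC on Linux; the names my_copy, copy_erms and copy_plain are mine, not from any library) might look like this:

#include <cpuid.h>
#include <stddef.h>
#include <string.h>

static void *copy_erms(void *dst, const void *src, size_t n) {
  void *d = dst;
  __asm__ __volatile__ ("rep movsb"
                        : "+D" (d), "+S" (src), "+c" (n)
                        :
                        : "memory");
  return dst;
}

static void *copy_plain(void *dst, const void *src, size_t n) {
  return memcpy(dst, src, n);
}

/* the resolver runs once, at load time, and picks an implementation */
static void *(*resolve_copy(void))(void *, const void *, size_t) {
  unsigned int eax, ebx, ecx, edx;
  if (__get_cpuid(0, &eax, &ebx, &ecx, &edx) && eax >= 7) {
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    if ((ebx >> 9) & 1)          /* ERMS feature bit */
      return copy_erms;
  }
  return copy_plain;
}

void *my_copy(void *dst, const void *src, size_t n)
    __attribute__((ifunc("resolve_copy")));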

One of the most common patterns I use in my own code is

some_type *pointer = __builtin_assume_aligned(ptr, alignment);

where ptr is some pointer and alignment is the number of bytes it is aligned to; GCC then knows/assumes that pointer is aligned to alignment bytes.

Another useful built-in, albeit much harder to use correctly, is __builtin_prefetch(). To maximize overall bandwidth/efficiency, I have found that minimizing the latencies in each sub-operation yields the best results. (For copying scattered elements to consecutive temporary storage this is difficult, because prefetching typically involves a full cache line; if too many elements are prefetched, most of the cache is wasted by storing unused items.)
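
For example, a minimal sketch of a prefetching copy loop (mine; the one-cache-line prefetch distance is an arbitrary example, not a tuned value) might be:

#include <stdint.h>
#include <stddef.h>

static void copy_prefetch(void *dst, const void *src, size_t n) {
  uint64_t *d = dst;
  const uint64_t *s = src;
  /* assumes 8-byte aligned buffers and n being a multiple of 8 */
  for (size_t i = 0; i < n / 8; i++) {
    if ((i % 8) == 0)        /* once per 64-byte cache line */
      __builtin_prefetch((const char *)(s + i) + 64, 0, 0);
    d[i] = s[i];
  }
}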