Unexpectedly and inexplicably slow (and unusual) memory performance with Xeon Skylake SMP

Mar*_*zzi 31 windows performance intel x86 numa

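We've been testing a server with 2 Xeon Gold 6154 CPUs, a Supermicro X11DPH-I motherboard, and 96GB of RAM, and found some very strange memory performance issues compared with running the same board with only 1 CPU (one socket empty), with a similar dual-CPU Haswell Xeon E5-2687Wv3 system (used for this series of tests; other Broadwells perform similarly), and with Broadwell-E i7s and Skylake-X i9s (for comparison).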

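You would expect the Skylake Xeon processors, with their faster memory, to outperform the Haswell in various memcpy functions and even in memory allocation (not covered in the tests below, as we found a workaround). Instead, with both CPUs installed, the Skylake Xeons run at almost half the speed of the Haswell Xeons, and even slower compared to an i7-6800k. What's even stranger is what happens when using Windows VirtualAllocExNuma to choose the NUMA node for memory allocation: plain memory copy functions perform worse on the remote node than on the local node, as expected, but memory copy functions using the SSE, MMX and AVX registers perform much faster on the remote NUMA node than on the local node (what?). As noted above, with the Skylake Xeons,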

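I'm not sure whether this is a bug in the motherboard or the CPU, or something to do with UPI vs QPI, or none of the above, but no combination of BIOS settings seems to help. Disabling NUMA in the BIOS (not included in the test results) does improve the performance of all copy functions that use the SSE, MMX and AVX registers, but all the other plain memory copy functions then suffer large losses as well.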

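For our test program, we tested both inline assembly functions and _mm intrinsics. We used Windows 10 with Visual Studio 2017 for everything except the assembly functions: since msvc++ won't compile inline asm for x64, we used gcc from mingw/msys to compile them into an obj file with the -c -O2 flags, which we then passed to the msvc++ linker.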

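If the system uses NUMA nodes, we test memory allocated with operator new as well as with VirtualAllocExNuma for each NUMA node. For each memory copy function we take a cumulative average over copies of 100 memory buffers of 16MB each, and we rotate which memory allocation we are using between each set of tests.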
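
As a rough illustration of the per-node allocation described above, a buffer can be requested with a preferred NUMA node roughly as follows. This is a minimal sketch, not the code from the test repository: `alloc_on_node` and `free_on_node` are hypothetical helper names, and the non-Windows branch is only a fallback so the sketch builds elsewhere (it does no NUMA placement at all).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: allocate `bytes` with a preferred NUMA node.
void* alloc_on_node(std::size_t bytes, unsigned node) {
    assert(bytes > 0);
#ifdef _WIN32
    // Reserve and commit in one call; `node` is a preferred node,
    // not a hard placement guarantee.
    return VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
#else
    // Fallback for other platforms: plain allocation, node is ignored.
    (void)node;
    return std::malloc(bytes);
#endif
}

// Hypothetical helper: release memory obtained from alloc_on_node.
void free_on_node(void* p) {
#ifdef _WIN32
    if (p != nullptr) VirtualFree(p, 0, MEM_RELEASE);
#else
    std::free(p);
#endif
}
```

A test run would then allocate the source and destination buffers once per node (for example `alloc_on_node(16u * 1024 * 1024, 1)` for node 1) and run every copy function against each allocation.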

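All 100 source and 100 destination buffers are 64-byte aligned (for compatibility with everything up to AVX512 using streaming functions) and initialized once: incremental data for the source buffers and 0xff for the destination buffers.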
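
The buffer setup can be sketched as below. The names (`AlignedBuffer`, `make_aligned_buffer`, `fill_source`, `fill_destination`) are hypothetical, "16MB" is assumed to mean 16 MiB, and "incremental data" is assumed to be an incrementing byte pattern; the actual test code may differ on all three points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>

constexpr std::size_t kAlign = 64;                    // one cache line / one ZMM register
constexpr std::size_t kBufBytes = 16u * 1024 * 1024;  // assumption: 16MB == 16 MiB

struct AlignedBuffer {
    std::unique_ptr<unsigned char[]> storage;  // owns the padded allocation
    unsigned char* data;                       // 64-byte aligned pointer inside storage
    std::size_t size;                          // usable bytes starting at data
};

AlignedBuffer make_aligned_buffer(std::size_t bytes) {
    // Over-allocate so an aligned block of `bytes` always fits.
    std::size_t space = bytes + kAlign - 1;
    std::unique_ptr<unsigned char[]> storage(new unsigned char[space]);
    void* p = storage.get();
    void* aligned = std::align(kAlign, bytes, p, space);
    assert(aligned != nullptr);
    return AlignedBuffer{std::move(storage), static_cast<unsigned char*>(aligned), bytes};
}

// Incrementing byte pattern: 0, 1, ..., 255, 0, 1, ...
void fill_source(unsigned char* src, std::size_t bytes) {
    for (std::size_t i = 0; i < bytes; ++i) src[i] = static_cast<unsigned char>(i);
}

void fill_destination(unsigned char* dst, std::size_t bytes) {
    std::memset(dst, 0xff, bytes);
}

bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}
```

Filling the destination with 0xff makes an incomplete copy easy to spot, since that pattern differs from the source at almost every byte.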

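The number of copies averaged on each machine and configuration varied, because the test ran much faster on some machines and much slower on others.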
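
The averaging itself can be sketched like this. It is an illustrative harness under stated assumptions, not the one from the repository: `CopyFn`, `copy_std` and `average_copy_us` are hypothetical names, and plain `std::vector` buffers stand in for the aligned allocations.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <ratio>
#include <vector>

using CopyFn = void (*)(void* dst, const void* src, std::size_t bytes);
using Buffers = std::vector<std::vector<unsigned char>>;

// Thin wrapper so std::memcpy fits the CopyFn signature.
void copy_std(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// Average microseconds per copy over `rounds` passes across all buffer pairs.
double average_copy_us(CopyFn fn, Buffers& dst, const Buffers& src, int rounds) {
    assert(!src.empty() && src.size() == dst.size() && rounds > 0);
    using clock = std::chrono::steady_clock;
    double total_us = 0.0;
    std::size_t copies = 0;
    for (int r = 0; r < rounds; ++r) {
        // Walking every pair each round keeps the working set far larger than L3.
        for (std::size_t i = 0; i < src.size(); ++i) {
            assert(src[i].size() == dst[i].size());
            const auto t0 = clock::now();
            fn(dst[i].data(), src[i].data(), src[i].size());
            const auto t1 = clock::now();
            total_us += std::chrono::duration<double, std::micro>(t1 - t0).count();
            ++copies;
        }
    }
    return total_us / static_cast<double>(copies);
}
```

With 100 buffer pairs, a figure such as "7000 copies" would correspond to `rounds = 70` in this sketch.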

Results were as follows:

Haswell Xeon E5-2687Wv3 1 CPU (1 empty socket) on Supermicro X10DAi with 32GB DDR4-2400 (10c/20t, 25 MB of L3 cache). Keep in mind that the benchmark rotates through 100 pairs of 16MB buffers, so we probably aren't getting any L3 cache hits.

---------------------------------------------------------------------------
Averaging 7000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2264.48 microseconds
asm_memcpy (asm)                 averaging 2322.71 microseconds
sse_memcpy (intrinsic)           averaging 1569.67 microseconds
sse_memcpy (asm)                 averaging 1589.31 microseconds
sse2_memcpy (intrinsic)          averaging 1561.19 microseconds
sse2_memcpy (asm)                averaging 1664.18 microseconds
mmx_memcpy (asm)                 averaging 2497.73 microseconds
mmx2_memcpy (asm)                averaging 1626.68 microseconds
avx_memcpy (intrinsic)           averaging 1625.12 microseconds
avx_memcpy (asm)                 averaging 1592.58 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2260.6 microseconds
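
For readers who prefer bandwidth to microseconds, the averages in the table above can be converted with a small helper. This is only a sketch: `copy_gb_per_s` is a hypothetical name, it assumes 16MB means 16 MiB, and it counts the copied bytes once rather than read plus write traffic.

```cpp
#include <cassert>

// Assumption: each copy moves 16 MiB.
constexpr double kCopyBytes = 16.0 * 1024.0 * 1024.0;

// Converts an average copy time in microseconds to decimal GB/s of data copied.
double copy_gb_per_s(double microseconds) {
    assert(microseconds > 0.0);
    const double seconds = microseconds * 1e-6;
    return kCopyBytes / seconds / 1e9;
}
```

Under those assumptions, 2264.48 microseconds for std::memcpy comes to roughly 7.4 GB/s, and 1561.19 microseconds for sse2_memcpy (intrinsic) to roughly 10.7 GB/s.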

Haswell Dual Xeon E5-2687Wv3 2 cpu on Supermicro X10DAi with 64GB ram

---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3179.8 microseconds
asm_memcpy (asm)                 averaging 3177.15 microseconds
sse_memcpy (intrinsic)           averaging 1633.87 microseconds
sse_memcpy (asm)                 averaging 1663.8 microseconds
sse2_memcpy (intrinsic)          averaging 1620.86 microseconds
sse2_memcpy (asm)                averaging 1727.36 microseconds
mmx_memcpy (asm)                 averaging 2623.07 microseconds
mmx2_memcpy (asm)                averaging 1691.1 microseconds
avx_memcpy (intrinsic)           averaging 1704.33 microseconds
avx_memcpy (asm)                 averaging 1692.69 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3185.84 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 3992.46 microseconds
asm_memcpy (asm)                 averaging 4039.11 microseconds
sse_memcpy (intrinsic)           averaging 3174.69 microseconds
sse_memcpy (asm)                 averaging 3129.18 microseconds
sse2_memcpy (intrinsic)          averaging 3161.9 microseconds
sse2_memcpy (asm)                averaging 3141.33 microseconds
mmx_memcpy (asm)                 averaging 4010.17 microseconds
mmx2_memcpy (asm)                averaging 3211.75 microseconds
avx_memcpy (intrinsic)           averaging 3003.14 microseconds
avx_memcpy (asm)                 averaging 2980.97 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3987.91 microseconds
---------------------------------------------------------------------------
Averaging 6900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3172.95 microseconds
asm_memcpy (asm)                 averaging 3173.5 microseconds
sse_memcpy (intrinsic)           averaging 1623.84 microseconds
sse_memcpy (asm)                 averaging 1657.07 microseconds
sse2_memcpy (intrinsic)          averaging 1616.95 microseconds
sse2_memcpy (asm)                averaging 1739.05 microseconds
mmx_memcpy (asm)                 averaging 2623.71 microseconds
mmx2_memcpy (asm)                averaging 1699.33 microseconds
avx_memcpy (intrinsic)           averaging 1710.09 microseconds
avx_memcpy (asm)                 averaging 1688.34 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 3175.14 microseconds

Skylake Xeon Gold 6154 1 CPU (1 empty socket) on Supermicro X11DPH-I with 48GB DDR4-2666 (18c/36t, 24.75 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 5000 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1832.42 microseconds
asm_memcpy (asm)                 averaging 1837.62 microseconds
sse_memcpy (intrinsic)           averaging 1647.84 microseconds
sse_memcpy (asm)                 averaging 1710.53 microseconds
sse2_memcpy (intrinsic)          averaging 1645.54 microseconds
sse2_memcpy (asm)                averaging 1794.36 microseconds
mmx_memcpy (asm)                 averaging 2030.51 microseconds
mmx2_memcpy (asm)                averaging 1816.82 microseconds
avx_memcpy (intrinsic)           averaging 1686.49 microseconds
avx_memcpy (asm)                 averaging 1716.15 microseconds
avx512_memcpy (intrinsic)        averaging 1761.6 microseconds
rep movsb (asm)                  averaging 1977.6 microseconds

Skylake Xeon Gold 6154 2 CPU on Supermicro X11DPH-I with 96GB DDR4-2666

---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 0(local)
---------------------------------------------------------------------------
std::memcpy                      averaging 3131.6 microseconds
asm_memcpy (asm)                 averaging 3070.57 microseconds
sse_memcpy (intrinsic)           averaging 3297.72 microseconds
sse_memcpy (asm)                 averaging 3423.38 microseconds
sse2_memcpy (intrinsic)          averaging 3274.31 microseconds
sse2_memcpy (asm)                averaging 3413.48 microseconds
mmx_memcpy (asm)                 averaging 2069.53 microseconds
mmx2_memcpy (asm)                averaging 3694.91 microseconds
avx_memcpy (intrinsic)           averaging 3118.75 microseconds
avx_memcpy (asm)                 averaging 3224.36 microseconds
avx512_memcpy (intrinsic)        averaging 3156.56 microseconds
rep movsb (asm)                  averaging 3155.36 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for VirtualAllocExNuma to NUMA node 1
---------------------------------------------------------------------------
std::memcpy                      averaging 5309.77 microseconds
asm_memcpy (asm)                 averaging 5330.78 microseconds
sse_memcpy (intrinsic)           averaging 2350.61 microseconds
sse_memcpy (asm)                 averaging 2402.57 microseconds
sse2_memcpy (intrinsic)          averaging 2338.61 microseconds
sse2_memcpy (asm)                averaging 2475.51 microseconds
mmx_memcpy (asm)                 averaging 2883.97 microseconds
mmx2_memcpy (asm)                averaging 2517.69 microseconds
avx_memcpy (intrinsic)           averaging 2356.07 microseconds
avx_memcpy (asm)                 averaging 2415.22 microseconds
avx512_memcpy (intrinsic)        averaging 2487.01 microseconds
rep movsb (asm)                  averaging 5372.98 microseconds
---------------------------------------------------------------------------
Averaging 4100 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 3075.1 microseconds
asm_memcpy (asm)                 averaging 3061.97 microseconds
sse_memcpy (intrinsic)           averaging 3281.17 microseconds
sse_memcpy (asm)                 averaging 3421.38 microseconds
sse2_memcpy (intrinsic)          averaging 3268.79 microseconds
sse2_memcpy (asm)                averaging 3435.76 microseconds
mmx_memcpy (asm)                 averaging 2061.27 microseconds
mmx2_memcpy (asm)                averaging 3694.48 microseconds
avx_memcpy (intrinsic)           averaging 3111.16 microseconds
avx_memcpy (asm)                 averaging 3227.45 microseconds
avx512_memcpy (intrinsic)        averaging 3148.65 microseconds
rep movsb (asm)                  averaging 2967.45 microseconds

Skylake-X i9-7940X on ASUS ROG Rampage VI Extreme with 32GB DDR4-4266 (14c/28t, 19.25 MB of L3 cache) (overclocked to 3.8GHz/4.4GHz turbo, DDR at 4040MHz, target AVX frequency 3737MHz, target AVX-512 frequency 3535MHz, target cache frequency 2424MHz)

---------------------------------------------------------------------------
Averaging 6500 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 1750.87 microseconds
asm_memcpy (asm)                 averaging 1748.22 microseconds
sse_memcpy (intrinsic)           averaging 1743.39 microseconds
sse_memcpy (asm)                 averaging 3120.18 microseconds
sse2_memcpy (intrinsic)          averaging 1743.37 microseconds
sse2_memcpy (asm)                averaging 2868.52 microseconds
mmx_memcpy (asm)                 averaging 2255.17 microseconds
mmx2_memcpy (asm)                averaging 3434.58 microseconds
avx_memcpy (intrinsic)           averaging 1698.49 microseconds
avx_memcpy (asm)                 averaging 2840.65 microseconds
avx512_memcpy (intrinsic)        averaging 1670.05 microseconds
rep movsb (asm)                  averaging 1718.77 microseconds

Broadwell i7-6800k on ASUS X99 with 24GB DDR4-2400 (6c/12t, 15 MB of L3 cache)

---------------------------------------------------------------------------
Averaging 64900 copies of 16MB of data per function for operator new
---------------------------------------------------------------------------
std::memcpy                      averaging 2522.1 microseconds
asm_memcpy (asm)                 averaging 2615.92 microseconds
sse_memcpy (intrinsic)           averaging 1621.81 microseconds
sse_memcpy (asm)                 averaging 1669.39 microseconds
sse2_memcpy (intrinsic)          averaging 1617.04 microseconds
sse2_memcpy (asm)                averaging 1719.06 microseconds
mmx_memcpy (asm)                 averaging 3021.02 microseconds
mmx2_memcpy (asm)                averaging 1691.68 microseconds
avx_memcpy (intrinsic)           averaging 1654.41 microseconds
avx_memcpy (asm)                 averaging 1666.84 microseconds
avx512_memcpy (intrinsic)        unsupported on this CPU
rep movsb (asm)                  averaging 2520.13 microseconds

The assembly functions are derived from fast_memcpy in xine-libs, and are mostly there for comparison with msvc++'s optimizer.

Source code for the test is available at https://github.com/marcmicalizzi/memcpy_test (it's a bit long to put in the post)

Has anyone else run into this, or does anyone have any insight into why this might be happening?


Update 2018-05-15 13:40EST

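So, as suggested by Peter Cordes, I've updated the test to compare prefetched vs non-prefetched copies and NT stores vs regular stores, and I've tuned the prefetching done in each function. (I don't have any meaningful experience with writing prefetching, so if I'm making any mistakes here, please let me know and I'll adjust the tests accordingly. The prefetching does have an impact, so at the very least it's doing something.) These changes are reflected in the latest revision at the GitHub link above, for anyone looking for the source code.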
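
For readers unfamiliar with the distinction, the difference between regular and NT (non-temporal) stores can be sketched with SSE2 intrinsics as follows. This is an illustrative sketch, not the code from the repository: `copy_regular_stores` and `copy_nt_stores` are hypothetical names, both assume 16-byte aligned pointers and a size that is a multiple of 16, and the `std::memcpy` branch is only a fallback for non-x86-64 builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_load_si128, _mm_store_si128, _mm_stream_si128
#define HAVE_SSE2_SKETCH 1
#endif

// Regular stores: written lines are allocated in the cache hierarchy.
void copy_regular_stores(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
#ifdef HAVE_SSE2_SKETCH
    auto* d = static_cast<__m128i*>(dst);
    const auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
#else
    std::memcpy(dst, src, bytes);  // fallback, no intrinsics available
#endif
}

// NT stores: write-combining stores that bypass the cache on the way out.
void copy_nt_stores(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % 16 == 0);
#ifdef HAVE_SSE2_SKETCH
    auto* d = static_cast<__m128i*>(dst);
    const auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();  // order the weakly ordered NT stores before returning
#else
    std::memcpy(dst, src, bytes);  // fallback, no intrinsics available
#endif
}
```

Both functions produce the same destination contents; only the way the stores interact with the cache differs, which is what the updated test compares.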

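I've also added an SSE4.1 memcpy. Before SSE4.1 I can't find any _mm_stream_load SSE functions (I specifically used _mm_stream_load_si128), so sse_memcpy and sse2_memcpy can't be entirely non-temporal (their loads are regular loads). Likewise, the avx_memcpy function uses AVX2 functions for its stream loads.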

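I opted not to test pure-store and pure-load access patterns yet, because I'm not sure a pure store would be meaningful: without a load into the registers being stored, the data would be meaningless and unverifiable.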

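The interesting result of the new test is that on the dual-socket Xeon Skylake setup, and only on that setup, the regular store functions were actually significantly faster than the NT streaming functions for 16MB memory copies. Also only on that setup (and only with LLC prefetch enabled in the BIOS), prefetchnta outperforms both prefetcht0 and no prefetch in some tests (SSE, SSE4.1).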
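
The three prefetch variants compared above (none, prefetcht0, prefetchnta) can be sketched as a single copy loop with a selectable hint. This is a hedged illustration rather than the repository's code: `Prefetch`, `copy_with_prefetch` and the 512-byte `kPrefetchAhead` distance are hypothetical and untuned, pointers are assumed 16-byte aligned, and the size a multiple of 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // also provides _mm_prefetch and the _MM_HINT_* constants
#define HAVE_X86_PREFETCH_SKETCH 1
#endif

enum class Prefetch { None, T0, NTA };

// Hypothetical, untuned prefetch distance in bytes.
constexpr std::size_t kPrefetchAhead = 512;

// Copies one 64-byte line per iteration with regular stores,
// optionally prefetching the source ahead of the loads.
void copy_with_prefetch(void* dst, const void* src, std::size_t bytes, Prefetch hint) {
    assert(bytes % 64 == 0);
#ifdef HAVE_X86_PREFETCH_SKETCH
    const char* s = static_cast<const char*>(src);
    char* d = static_cast<char*>(dst);
    for (std::size_t off = 0; off < bytes; off += 64) {
        // Only prefetch addresses that are still inside the source buffer.
        if (off + kPrefetchAhead < bytes) {
            if (hint == Prefetch::T0) {
                _mm_prefetch(s + off + kPrefetchAhead, _MM_HINT_T0);
            } else if (hint == Prefetch::NTA) {
                _mm_prefetch(s + off + kPrefetchAhead, _MM_HINT_NTA);
            }
        }
        for (std::size_t k = 0; k < 64; k += 16) {
            const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(s + off + k));
            _mm_store_si128(reinterpret_cast<__m128i*>(d + off + k), v);
        }
    }
#else
    (void)hint;  // fallback, no prefetch intrinsics available
    std::memcpy(dst, src, bytes);
#endif
}
```

The hint never changes the copied data, only how the prefetched lines are placed in the cache hierarchy, so any performance difference between the variants comes from the memory system.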

The raw results of this new test are too long to add to the post, so they are posted in the same git repository as the source code, under results-2018-05-15

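I still don't understand why, for streaming NT stores, the remote NUMA node is faster under the Skylake SMP setup, although using regular stores on the local NUMA node is still faster than that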

the*_*ger 0

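Is your memory the incorrect rank? Perhaps your board does something weird with memory ranking when you add that second CPU? I know that quad-CPU machines do all kinds of weird things to make the memory work properly, and if you have incorrectly ranked memory it will sometimes work but clock back to something like 1/4 or 1/2 of the speed. Perhaps SuperMicro did something on that board to turn DDR4 and dual CPUs into quad channel, and it's using similar math. Incorrect rank == 1/2 speed.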