Memory bandwidth on a 24-core AMD server

Question

nth*_*ing 8 performance memory central-processing-unit hp numa

I need some help determining whether the memory bandwidth I am seeing under Linux on my server is normal. Here are the server specs:

HP ProLiant DL165 G7
2x AMD Opteron 6164 HE 12-Core
40 GB RAM (10 x 4GB DDR1333)
Debian 6.0

Running mbw on this server gives me the following numbers:

foo1:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.58047    MiB: 1024.00000 Copy: 1764.082 MiB/s
1   Method: MEMCPY  Elapsed: 0.58012    MiB: 1024.00000 Copy: 1765.152 MiB/s
2   Method: MEMCPY  Elapsed: 0.58010    MiB: 1024.00000 Copy: 1765.201 MiB/s
AVG Method: MEMCPY  Elapsed: 0.58023    MiB: 1024.00000 Copy: 1764.811 MiB/s
0   Method: DUMB    Elapsed: 0.36174    MiB: 1024.00000 Copy: 2830.778 MiB/s
1   Method: DUMB    Elapsed: 0.35869    MiB: 1024.00000 Copy: 2854.817 MiB/s
2   Method: DUMB    Elapsed: 0.35848    MiB: 1024.00000 Copy: 2856.481 MiB/s
AVG Method: DUMB    Elapsed: 0.35964    MiB: 1024.00000 Copy: 2847.310 MiB/s
0   Method: MCBLOCK Elapsed: 0.23546    MiB: 1024.00000 Copy: 4348.860 MiB/s
1   Method: MCBLOCK Elapsed: 0.23544    MiB: 1024.00000 Copy: 4349.230 MiB/s
2   Method: MCBLOCK Elapsed: 0.23544    MiB: 1024.00000 Copy: 4349.359 MiB/s
AVG Method: MCBLOCK Elapsed: 0.23545    MiB: 1024.00000 Copy: 4349.149 MiB/s

On another server of mine (based on an Intel Xeon E3-1270):

foo2:~# mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.18960    MiB: 1024.00000 Copy: 5400.901 MiB/s
1   Method: MEMCPY  Elapsed: 0.18922    MiB: 1024.00000 Copy: 5411.690 MiB/s
2   Method: MEMCPY  Elapsed: 0.18944    MiB: 1024.00000 Copy: 5405.491 MiB/s
AVG Method: MEMCPY  Elapsed: 0.18942    MiB: 1024.00000 Copy: 5406.024 MiB/s
0   Method: DUMB    Elapsed: 0.14838    MiB: 1024.00000 Copy: 6901.200 MiB/s
1   Method: DUMB    Elapsed: 0.14818    MiB: 1024.00000 Copy: 6910.561 MiB/s
2   Method: DUMB    Elapsed: 0.14820    MiB: 1024.00000 Copy: 6909.628 MiB/s
AVG Method: DUMB    Elapsed: 0.14825    MiB: 1024.00000 Copy: 6907.127 MiB/s
0   Method: MCBLOCK Elapsed: 0.04362    MiB: 1024.00000 Copy: 23477.623 MiB/s
1   Method: MCBLOCK Elapsed: 0.04262    MiB: 1024.00000 Copy: 24025.151 MiB/s
2   Method: MCBLOCK Elapsed: 0.04258    MiB: 1024.00000 Copy: 24048.849 MiB/s
AVG Method: MCBLOCK Elapsed: 0.04294    MiB: 1024.00000 Copy: 23847.599 MiB/s

For reference, here is what I get on my Intel-based laptop:

laptop:~$ mbw -n 3 1024
Long uses 8 bytes. Allocating 2*134217728 elements = 2147483648 bytes of memory.
Using 262144 bytes as blocks for memcpy block copy test.
Getting down to business... Doing 3 runs per test.
0   Method: MEMCPY  Elapsed: 0.40566    MiB: 1024.00000 Copy: 2524.269 MiB/s
1   Method: MEMCPY  Elapsed: 0.38458    MiB: 1024.00000 Copy: 2662.638 MiB/s
2   Method: MEMCPY  Elapsed: 0.38876    MiB: 1024.00000 Copy: 2634.043 MiB/s
AVG Method: MEMCPY  Elapsed: 0.39300    MiB: 1024.00000 Copy: 2605.600 MiB/s
0   Method: DUMB    Elapsed: 0.30707    MiB: 1024.00000 Copy: 3334.745 MiB/s
1   Method: DUMB    Elapsed: 0.30425    MiB: 1024.00000 Copy: 3365.653 MiB/s
2   Method: DUMB    Elapsed: 0.30342    MiB: 1024.00000 Copy: 3374.849 MiB/s
AVG Method: DUMB    Elapsed: 0.30491    MiB: 1024.00000 Copy: 3358.328 MiB/s
0   Method: MCBLOCK Elapsed: 0.07875    MiB: 1024.00000 Copy: 13003.670 MiB/s
1   Method: MCBLOCK Elapsed: 0.08374    MiB: 1024.00000 Copy: 12228.034 MiB/s
2   Method: MCBLOCK Elapsed: 0.07635    MiB: 1024.00000 Copy: 13411.216 MiB/s
AVG Method: MCBLOCK Elapsed: 0.07961    MiB: 1024.00000 Copy: 12862.006 MiB/s

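So according to mbw, my laptop is 3 times faster than the server! Please help me make sense of this. I also tried mounting a RAM disk and benchmarking it with dd, and I saw a similar difference, so I don't think mbw is the culprit.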
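To put a number on the "3 times faster" impression, the AVG lines from the three runs above can be compared method by method. This is only a rough sketch: the host labels and helper names (`parse_avg`, `speedup`) are made up for illustration, and the input is just the averages pasted from the output above.

```python
import re

# AVG lines copied verbatim from the three mbw runs above.
RUNS = {
    "foo1": """\
AVG Method: MEMCPY  Elapsed: 0.58023    MiB: 1024.00000 Copy: 1764.811 MiB/s
AVG Method: DUMB    Elapsed: 0.35964    MiB: 1024.00000 Copy: 2847.310 MiB/s
AVG Method: MCBLOCK Elapsed: 0.23545    MiB: 1024.00000 Copy: 4349.149 MiB/s
""",
    "foo2": """\
AVG Method: MEMCPY  Elapsed: 0.18942    MiB: 1024.00000 Copy: 5406.024 MiB/s
AVG Method: DUMB    Elapsed: 0.14825    MiB: 1024.00000 Copy: 6907.127 MiB/s
AVG Method: MCBLOCK Elapsed: 0.04294    MiB: 1024.00000 Copy: 23847.599 MiB/s
""",
    "laptop": """\
AVG Method: MEMCPY  Elapsed: 0.39300    MiB: 1024.00000 Copy: 2605.600 MiB/s
AVG Method: DUMB    Elapsed: 0.30491    MiB: 1024.00000 Copy: 3358.328 MiB/s
AVG Method: MCBLOCK Elapsed: 0.07961    MiB: 1024.00000 Copy: 12862.006 MiB/s
""",
}

# Captures the method name and the average copy rate of each AVG line.
AVG_RE = re.compile(
    r"^AVG\s+Method:\s+(\w+).*?Copy:\s+([\d.]+)\s+MiB/s", re.MULTILINE
)


def parse_avg(text):
    """Return {method: average copy rate in MiB/s} for one mbw run."""
    return {method: float(rate) for method, rate in AVG_RE.findall(text)}


def speedup(fast, slow):
    """Per-method ratio of two parsed runs (fast / slow)."""
    return {method: fast[method] / slow[method] for method in fast}


rates = {host: parse_avg(text) for host, text in RUNS.items()}
for host in ("foo2", "laptop"):
    ratios = speedup(rates[host], rates["foo1"])
    print(host, {method: round(r, 2) for method, r in ratios.items()})
```

By this comparison the laptop is close to 3x faster only in the MCBLOCK test; for MEMCPY and DUMB the gap is well under 2x. The Xeon server is roughly 2.4x to 5.5x faster depending on the method.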

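I checked the BIOS settings, and the memory appears to be running at full speed. According to the hosting company, there is nothing wrong with any of the modules.

Could this be related to NUMA? Node interleaving appears to be disabled on this server. Would enabling it (and thereby turning off NUMA) make a difference?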

foo1:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8190 MB
node 0 free: 7898 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 12288 MB
node 1 free: 12073 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 12288 MB
node 2 free: 12034 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8192 MB
node 3 free: 8032 MB
node distances:
node   0   1   2   3 
  0:  10  20  20  20 
  1:  20  10  20  20 
  2:  20  20  10  20 
  3:  20  20  20  10
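
The node sizes in this output are uneven, which is worth making explicit. The sketch below parses the `size` lines and estimates how many DIMMs sit behind each NUMA node. It assumes every module is 4 GB (as stated in the question) and that the reported node size is close to the installed capacity; `node_sizes_mb` and `DIMM_MB` are illustrative names, not part of numactl.

```python
import re

# "size" lines copied from the numactl --hardware output above.
NUMACTL_OUTPUT = """\
node 0 size: 8190 MB
node 1 size: 12288 MB
node 2 size: 12288 MB
node 3 size: 8192 MB
"""

DIMM_MB = 4096  # assumption: all modules are 4 GB, as stated in the question


def node_sizes_mb(text):
    """Return {node id: size in MB} parsed from numactl --hardware output."""
    pattern = re.compile(r"^node (\d+) size: (\d+) MB", re.MULTILINE)
    return {int(node): int(size) for node, size in pattern.findall(text)}


sizes = node_sizes_mb(NUMACTL_OUTPUT)
# Round to the nearest whole module, since the kernel reports slightly
# less than the installed capacity on node 0.
dimms = {node: round(size / DIMM_MB) for node, size in sizes.items()}

print("total MB:", sum(sizes.values()))
print("estimated DIMMs per node:", dimms)
```

Under those assumptions the four nodes hold 2, 3, 3 and 2 modules, so the memory is not spread evenly across the nodes.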

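Update:

I disabled NUMA (numa=off on the Linux boot command line) and disabled ECC in the BIOS. No change; the numbers are still the same as above.

Update 2:

Here is the memory layout according to dmidecode: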

PROC 1 DIMM 1
PROC 1 DIMM 4
PROC 1 DIMM 7
PROC 1 DIMM 10
PROC 1 DIMM 12

PROC 2 DIMM 1
PROC 2 DIMM 4
PROC 2 DIMM 7
PROC 2 DIMM 10
PROC 2 DIMM 12

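These are all 4 GB Samsung modules (part number M393B5270CH0-CH9).

I looked at the HP documentation on how to populate memory in this server, and if I understand it correctly, the modules currently in the DIMM 12 slots should have been placed in the DIMM 3 slots. Could a misconfiguration like this explain the results I am getting?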

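Update 3:

I have now removed 2 modules so that each side has 4 x 4 GB in slots 1-4-7-10 (4-4). Unfortunately, I don't see any difference in the benchmarks. Shouldn't the server now be able to use all four channels? I also tried the multithreaded stream benchmark, and the results were very disappointing. The only thing I can think of is asking the hosting company to replace the whole server...

Update 4:

I must have done something wrong when I tested the last setup (32 GB) with stream yesterday, because today I see good results: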

foo1:~# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2000000, Offset = 0
Total memory required = 45.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 24
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 703 microseconds.
   (= 703 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       36873.0022       0.0009       0.0009       0.0010
Scale:      34699.5160       0.0009       0.0009       0.0010
Add:        30868.8427       0.0016       0.0016       0.0017
Triad:      25558.7904       0.0019       0.0019       0.0020
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
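
For readers unfamiliar with how STREAM arrives at these rates: each kernel's rate is the number of bytes it moves per pass divided by its best time. The sketch below recomputes the implied best times from the reported rates; the dictionary and function names are illustrative, and the values are copied from the output above.

```python
WORD_BYTES = 8          # "This system uses 8 bytes per DOUBLE PRECISION word."
ARRAY_SIZE = 2_000_000  # "Array size = 2000000"

# Number of arrays each kernel reads or writes per pass, as STREAM counts them:
# Copy and Scale touch two arrays, Add and Triad touch three.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# Rates in MB/s (1 MB = 10**6 bytes) copied from the output above.
REPORTED_MB_S = {
    "Copy": 36873.0022,
    "Scale": 34699.5160,
    "Add": 30868.8427,
    "Triad": 25558.7904,
}


def bytes_moved(kernel):
    """Bytes transferred by one pass of the given kernel."""
    return ARRAYS_TOUCHED[kernel] * WORD_BYTES * ARRAY_SIZE


def implied_min_time(kernel):
    """Best time in seconds implied by the reported rate."""
    return bytes_moved(kernel) / (REPORTED_MB_S[kernel] * 1e6)


for kernel in REPORTED_MB_S:
    print(f"{kernel:6s} {bytes_moved(kernel):>10d} bytes "
          f"{implied_min_time(kernel):.4f} s")
```

The implied times round to the values in the "Min time" column, and three arrays of 2,000,000 doubles come to 48,000,000 bytes, which matches the "45.8 MB" total when expressed in MiB.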

(I have given up on mbw because it only runs single-threaded. It still gives the same poor results on this server.)

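So the problem must have been that the last two 4 GB modules forced the server to run in single-channel mode, just as @chx points out below. The only remaining question is whether it is possible to use 40 GB and still get full bandwidth. Could I use 2 x 8 GB + 6 x 4 GB? Does it matter which channels I put the larger modules in?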

Answer 1

chx*_*chx 8

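By using 5-5 modules per CPU instead of 4-4 or 8-8, you are forcing the system to run in single-channel (!) mode. That is the cause. Try removing one module from each side (1-1) and report back.

The 6164 is a G34-socket CPU and is capable of quad-channel operation if the memory modules are populated correctly. Your setup is the worst possible one.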
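
As a back-of-the-envelope reference for what quad-channel operation could deliver, here is a sketch of the theoretical peak. It assumes the "DDR1333" in the question means DDR3-1333 (about 1333 million transfers per second) on a 64-bit channel, with four channels per socket and two sockets. These are theoretical ceilings, not figures measured on this server.

```python
TRANSFERS_PER_SEC = 1333.33e6  # assumption: DDR3-1333
BUS_BYTES = 8                  # 64-bit data path per memory channel
CHANNELS_PER_SOCKET = 4        # G34 quad-channel, per the answer above
SOCKETS = 2                    # 2x Opteron 6164 HE, per the question


def peak_gb_s(channels):
    """Theoretical peak bandwidth in GB/s (10**9 bytes) for N channels."""
    return TRANSFERS_PER_SEC * BUS_BYTES * channels / 1e9


per_channel = peak_gb_s(1)
per_socket = peak_gb_s(CHANNELS_PER_SOCKET)
system = peak_gb_s(CHANNELS_PER_SOCKET * SOCKETS)

# STREAM Copy rate reported in Update 4 of the question, converted to GB/s.
stream_copy = 36873.0022 / 1000

print(f"per channel: {per_channel:.2f} GB/s")
print(f"per socket:  {per_socket:.2f} GB/s")
print(f"system:      {system:.2f} GB/s")
print(f"STREAM Copy is {stream_copy / system:.0%} of the system peak")
```

Under these assumptions the 24-thread STREAM Copy result from the question lands at a bit over 40% of the combined theoretical peak. A single-threaded test such as mbw cannot be expected to approach that figure.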

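Good catch on the DIMM population! :) (2 upvotes)
Well, you may be right. But the nerd in me can't leave it at that! :-) If you aren't tired of me yet, please help me explain this: they have now put back the 4 GB modules that were taken out, so it is 40 GB in total and 5-5 again. stream gave poor results again. But then I tried removing the numa=off boot option, and after rebooting, stream gives results close to the excellent ones I saw with 32 GB in the last update. (2 upvotes)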
