Jam*_*ock 33 c++ linux performance linux-kernel numa
I'm trying to measure the asymmetric memory-access effects of NUMA, and so far I've failed.
The machine is an Intel Xeon X5570 @ 2.93 GHz, 2 CPUs, 8 cores.
On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. I then iterate over array x 50 times, reading and writing every byte, and measure the time the 50 iterations take.
Then, on every other core in my server, I pin a new thread and again measure the time it takes to read and write every byte of array x over 50 iterations.
Array x is large in order to minimize cache effects: I want to measure the speed when the CPU has to go all the way to RAM for loads and stores, not the speed when the caches are helping.
There are two NUMA nodes in my server, so I expected the cores with affinity to the node on which array x was allocated to show faster read/write speeds. I don't see that.
Why?
Maybe NUMA is only relevant to systems with more than 8-12 cores, as I have seen suggested elsewhere?
http://lse.sourceforge.net/numa/faq/
#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

// Pin the calling thread to the given core.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Print a NUMA cpu bitmask as a string of 0s and 1s.
std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for (size_t i = 0; i < bm.size; ++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

// Pin to `core`, allocate N bytes on that core's NUMA node, read and write
// every byte M times, and report the elapsed time. The buffer is returned via *x.
void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

// Pin to `core`, then read and write every byte of the previously allocated
// buffer x, M times, and report the elapsed time.
void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0); i < M; ++i)
        for (size_t j(0); j < N; ++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    // Print which cores and how much memory belong to each NUMA node.
    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i = 0; i <= numa_max_node(); ++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    // Allocate and touch the array from core 0, then time the same loop from every core.
    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (size_t i(0); i < (size_t)numcpus; ++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);
    return 0;
}
g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp
./numatest
numa_available() 0 <-- NUMA is available on this system
numa node 0 10101010 12884901888 <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064 <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0
Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928
No matter which core is doing the reading and writing, 50 iterations of reading and writing over array x take about 1.7 seconds.
The cache size on my CPUs is 8 MB, so a 10 MB array x may not be big enough to eliminate cache effects. I tried a 100 MB array x, and I tried issuing a full memory fence with __sync_synchronize() inside the innermost loop. It still didn't reveal any asymmetry between the NUMA nodes.
I also tried reading and writing array x with __sync_fetch_and_add(). Still nothing.
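In case it helps, the two variants look roughly like this in the inner loop (a sketch, not the exact code that was run):
for (size_t i(0); i < M; ++i)
    for (size_t j(0); j < N; ++j)
    {
        // Variant 1: plain byte read/write followed by a full memory fence.
        char c = ((char*)x)[j];
        ((char*)x)[j] = c;
        __sync_synchronize();

        // Variant 2: atomic read-modify-write instead of the plain accesses.
        // c = __sync_fetch_and_add(((char*)x) + j, 1);
    }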
Mys*_*ial 19
The first thing I want to point out is that you may want to double-check which cores are on each node. I don't remember cores and nodes being interleaved like that. Also, you should have 16 threads because of HT (unless you have it disabled).
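One way to double-check the mapping is to ask libnuma directly; a minimal sketch (numactl --hardware reports the same information from the command line):
#include <numa.h>
#include <cstdio>

int main()
{
    if (numa_available() < 0) return 1;
    int ncpus = numa_num_configured_cpus();
    for (int cpu = 0; cpu < ncpus; ++cpu)
        std::printf("cpu %d -> numa node %d\n", cpu, numa_node_of_cpu(cpu));
    return 0;
}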
Another thing:
Socket 1366 Xeon machines are only mildly NUMA, so it's hard to see the difference. The NUMA effect is much more pronounced on 4P Opterons.
On a system like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you get the full bandwidth whether or not the data is local. The better thing to measure is latency: try random-accessing a 1 GB block instead of streaming it sequentially.
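A latency-oriented test could look roughly like this (a sketch, not code from this answer: it walks a random permutation, so every load depends on the previous one and the hardware prefetcher cannot help):
#include <vector>
#include <cstdlib>
#include <algorithm>
#include <iostream>
#include <boost/date_time/posix_time/posix_time.hpp>

int main()
{
    const size_t N = (1ull << 30) / sizeof(size_t);   // ~1 GB of pointer slots

    // Sattolo's algorithm: a random permutation forming a single cycle, so a
    // walk starting at slot 0 visits every slot exactly once.
    std::vector<size_t> next(N);
    for (size_t i = 0; i < N; ++i) next[i] = i;
    for (size_t i = N - 1; i > 0; --i)
        std::swap(next[i], next[std::rand() % i]);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();
    size_t p = 0;
    for (size_t hops = 0; hops < N; ++hops)
        p = next[p];                                  // dependent load chain
    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << (double)(t2 - t1).total_nanoseconds() / N
              << " ns per dependent load (" << p << ")" << std::endl;
    return 0;
}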
One last thing:
Depending on how aggressively the compiler optimizes, your loop might be optimized out because it doesn't do anything:
c = ((char*)x)[j];
((char*)x)[j] = c;
Something like this will guarantee that the compiler doesn't eliminate it:
((char*)x)[j] += 1;
Jam*_*ock 15
Ah ha! Mysticial is right! Somehow, hardware prefetching is optimizing my reads/writes.
If it were a cache optimization, then forcing a memory barrier would defeat the optimization:
c = __sync_fetch_and_add(((char*)x) + j, 1);
But that didn't make any difference. What did make a difference was multiplying my iterator index by the prime 1009 to defeat the prefetch optimization:
*(((char*)x) + ((j * 1009) % N)) += 1;
With that change, the NUMA asymmetry is clearly revealed:
numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114
At least I think that's what's going on.
Thanks Mysticial!
EDIT: CONCLUSION ~133%
For anyone who is just glancing at this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:
Memory access to a non-local NUMA node has about 1.33 times the latency of memory access to the local node.
And*_*ner 10
Thanks for the benchmark code. I've taken your "fixed" version and changed it to pure C + OpenMP, and added a few tests of how the memory system behaves under contention. You can find the new code here.
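The linked code isn't reproduced here, but the idea behind the "sequential" and "all-contention" measurements can be sketched roughly as follows (buffer size, pass count and output format are illustrative, not the actual benchmark):
// Sketch only: one buffer is allocated on NUMA node 0, then each core walks
// it alone ("sequential") and all cores walk it at the same time
// ("all-contention"). Build with: g++ -O2 -fopenmp sketch.cpp -lnuma -lpthread
#include <numa.h>
#include <omp.h>
#include <pthread.h>
#include <cstdio>

static void pin_to_core(int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Read-modify-write every byte `passes` times with the prefetch-defeating
// stride from the question; returns elapsed seconds.
static double touch(char* x, size_t N, int passes)
{
    double t0 = omp_get_wtime();
    for (int p = 0; p < passes; ++p)
        for (size_t j = 0; j < N; ++j)
            x[(j * 1009) % N] += 1;
    return omp_get_wtime() - t0;
}

int main()
{
    const size_t N = 100 * 1000 * 1000;   // 100 MB (illustrative)
    const int passes = 10;                // (illustrative)
    char* x = static_cast<char*>(numa_alloc_onnode(N, 0));
    if (!x) return 1;
    const int ncores = omp_get_num_procs();

    for (int c = 0; c < ncores; ++c)      // one core at a time
    {
        pin_to_core(c);
        double s = touch(x, N, passes);
        std::printf("sequential     core %2d -> node 0 : %8.1f MB/s\n",
                    c, passes * (double)N / s / 1e6);
    }

    #pragma omp parallel num_threads(ncores)   // all cores at once
    {
        int c = omp_get_thread_num();
        pin_to_core(c);
        double s = touch(x, N, passes);
        #pragma omp critical
        std::printf("all-contention core %2d -> node 0 : %8.1f MB/s\n",
                    c, passes * (double)N / s / 1e6);
    }

    numa_free(x, N);
    return 0;
}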
Here are some example results from a quad Opteron:
num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB
sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s
all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s
two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s
two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s
two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s
two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s
two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s
two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s
two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s
two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s
two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s
two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s
two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s
two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s
two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s
two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s
two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s
two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s
two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s
two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s
two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s
two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s
two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s
two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s
two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s
two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s
two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s
two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s
two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s
two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s
two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s
two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s
two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s
If anyone has further improvements, I would be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (they are probably off by a, hopefully constant, integer factor).
A few comments:
- Use the lstopo utility to get a graphical overview of the topology. In particular, you will see which core numbers are members of which NUMA node (processor socket).
- char is probably not the ideal data type for measuring maximum RAM throughput. I suspect that with a 32-bit or 64-bit data type you can push more data through with the same number of CPU cycles (see the sketch after this list).
- More generally, you should also check that your measurement is not limited by CPU speed but by RAM speed. The ramspeed utility, for example, unrolls the inner loop explicitly in the source code:
for(i = 0; i < blk/sizeof(UTL); i += 32) {
b[i] = a[i]; b[i+1] = a[i+1];
...
b[i+30] = a[i+30]; b[i+31] = a[i+31];
}
EDIT: on supported architectures ramsmp actually even uses "hand-written" assembly code for these loops.
- L1/L2/L3 cache effects: It is instructive to measure the bandwidth in GByte/s as a function of block size. As you increase the block size, you should see roughly four different speeds, corresponding to where the data is being read from (cache or main memory). Your processor seems to have 8 MB of Level 3 (?) cache, so your 10 million bytes may mostly stay in the L3 cache (which is shared among all cores of one processor).
- Memory channels: Your processor has 3 memory channels. If your memory banks are installed so that you can exploit them all (see e.g. the motherboard manual), you may want to run more than one thread at the same time. I saw the effect that when reading with one thread only, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running multiple threads the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.
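As a sketch of the data-type point above (my illustration, not part of the linked benchmark): copying 64-bit words instead of single chars moves 8x the data per iteration, and a little manual unrolling in the spirit of ramspeed makes it more likely that the loop is limited by RAM rather than by the CPU.
#include <stdint.h>
#include <stddef.h>

// Copy nwords 64-bit words from src to dst, four at a time.
void copy64(uint64_t* dst, const uint64_t* src, size_t nwords)
{
    for (size_t i = 0; i + 4 <= nwords; i += 4)
    {
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
}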
You can also use numactl to choose which node the process runs on and where it allocates its memory:
numactl --cpubind=0 --membind=1 <process>
I've combined this with LMbench to get memory latency numbers:
numactl --cpubind=0 --membind=0 ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1 ./lat_mem_rd -t 512