Measuring NUMA (non-uniform memory access). No observable asymmetry. Why?

Jam*_*ock 33 c++ linux performance linux-kernel numa

I've tried to measure the asymmetric memory access effects of NUMA, and failed.

The experiment

Performed on an Intel Xeon X5570 @ 2.93GHz, 2 CPUs, 8 cores.

On a thread pinned to core 0, I allocate an array x of 10,000,000 bytes on core 0's NUMA node with numa_alloc_local. Then I iterate over array x 50 times, reading and writing each byte in the array, and measure the elapsed time for the 50 iterations.

Then, on each of the other cores in my server, I pin a new thread and again measure the elapsed time for 50 iterations of reading and writing every byte of array x.

Array x is large to minimize cache effects: we want to measure the speed when the CPU has to go all the way to RAM to load and store, not when the caches are helping.

There are two NUMA nodes in my server, so I would expect the cores that have affinity to the node on which array x is allocated to have faster read/write speeds. I'm not seeing that.

Why?

Perhaps NUMA is only relevant on systems with 8-12+ cores, as I have seen suggested elsewhere?

http://lse.sourceforge.net/numa/faq/

numatest.cpp

#include <numa.h>
#include <iostream>
#include <boost/thread/thread.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <pthread.h>

void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

std::ostream& operator<<(std::ostream& os, const bitmask& bm)
{
    for(size_t i=0;i<bm.size;++i)
    {
        os << numa_bitmask_isbitset(&bm, i);
    }
    return os;
}

void thread1(void** x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    void* y = numa_alloc_local(N);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)y)[j];
            ((char*)y)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by same thread that allocated on core " << core << ": " << (t2 - t1) << std::endl;

    *x = y;
}

void thread2(void* x, size_t core, size_t N, size_t M)
{
    pin_to_core(core);

    boost::posix_time::ptime t1 = boost::posix_time::microsec_clock::universal_time();

    char c;
    for (size_t i(0);i<M;++i)
        for(size_t j(0);j<N;++j)
        {
            c = ((char*)x)[j];
            ((char*)x)[j] = c;
        }

    boost::posix_time::ptime t2 = boost::posix_time::microsec_clock::universal_time();

    std::cout << "Elapsed read/write by thread on core " << core << ": " << (t2 - t1) << std::endl;
}

int main(int argc, const char **argv)
{
    int numcpus = numa_num_task_cpus();
    std::cout << "numa_available() " << numa_available() << std::endl;
    numa_set_localalloc();

    bitmask* bm = numa_bitmask_alloc(numcpus);
    for (int i=0;i<=numa_max_node();++i)
    {
        numa_node_to_cpus(i, bm);
        std::cout << "numa node " << i << " " << *bm << " " << numa_node_size(i, 0) << std::endl;
    }
    numa_bitmask_free(bm);

    void* x;
    size_t N(10000000);
    size_t M(50);

    boost::thread t1(boost::bind(&thread1, &x, 0, N, M));
    t1.join();

    for (int i(0);i<numcpus;++i)
    {
        boost::thread t2(boost::bind(&thread2, x, i, N, M));
        t2.join();
    }

    numa_free(x, N);

    return 0;
}

Output

g++ -o numatest -pthread -lboost_thread -lnuma -O0 numatest.cpp

./numatest

numa_available() 0                    <-- NUMA is available on this system
numa node 0 10101010 12884901888      <-- cores 0,2,4,6 are on NUMA node 0, which is about 12 Gb
numa node 1 01010101 12874584064      <-- cores 1,3,5,7 are on NUMA node 1, which is slightly smaller than node 0

Elapsed read/write by same thread that allocated on core 0: 00:00:01.767428
Elapsed read/write by thread on core 0: 00:00:01.760554
Elapsed read/write by thread on core 1: 00:00:01.719686
Elapsed read/write by thread on core 2: 00:00:01.708830
Elapsed read/write by thread on core 3: 00:00:01.691560
Elapsed read/write by thread on core 4: 00:00:01.686912
Elapsed read/write by thread on core 5: 00:00:01.691917
Elapsed read/write by thread on core 6: 00:00:01.686509
Elapsed read/write by thread on core 7: 00:00:01.689928

No matter which core is doing the reading and writing, 50 iterations of reading and writing over array x take about 1.7 seconds.

Update:

The cache size on my CPUs is 8MB, so a 10MB array x may not be big enough to eliminate cache effects. I tried a 100MB array x, and I tried issuing a full memory fence with __sync_synchronize() inside the innermost loop. It still didn't reveal any asymmetry between the NUMA nodes.

Update 2:

I tried reading and writing array x with __sync_fetch_and_add(). Still nothing.

Mys*_*ial 19

The first thing I want to point out is that you might want to double-check which cores are on each node. I don't recall cores and nodes being interleaved like that. Also, you should have 16 threads due to HT (unless you have it disabled).

Another thing:

The socket 1366 Xeon machines are only slightly NUMA, so it's hard to see the difference. The NUMA effect is much more noticeable on 4P Opterons.

On a system like yours, the node-to-node bandwidth is actually faster than the CPU-to-memory bandwidth. Since your access pattern is completely sequential, you are getting the full bandwidth regardless of whether or not the data is local. A better thing to measure is the latency: try randomly accessing a 1 GB block instead of streaming it sequentially.
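
(Not part of the original answer: a minimal pointer-chasing sketch of such a latency test, with arbitrary table size and RNG seed; pin the thread and place the table per node with the question's pin_to_core/numa_alloc_local or with numactl. Each load depends on the previous one, so prefetching and out-of-order execution cannot hide the latency:)

#include <chrono>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main()
{
    const size_t N = size_t(1) << 27;       // 2^27 entries * 8 bytes = 1 GiB table
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t(0));

    // Sattolo's algorithm: an in-place random permutation with a single cycle,
    // so the chase below visits every entry exactly once.
    std::mt19937_64 rng(42);
    for (size_t i = N - 1; i > 0; --i)
        std::swap(next[i], next[std::uniform_int_distribution<size_t>(0, i - 1)(rng)]);

    size_t idx = 0;
    auto t1 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < N; ++i)
        idx = next[idx];                    // serialized, cache-hostile loads
    auto t2 = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration<double, std::nano>(t2 - t1).count() / N
              << " ns per load (checksum " << idx << ")" << std::endl; // checksum defeats dead-code elimination
    return 0;
}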

Last thing:

Depending on how aggressively your compiler optimizes, your loop might be optimized out, since it doesn't do anything:

c = ((char*)x)[j];
((char*)x)[j] = c;

Something like this will guarantee that the compiler doesn't eliminate it:

((char*)x)[j] += 1;
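
(An added aside, not part of the answer: marking the pointer volatile is another common way to keep the loads and stores alive, without changing the data the way += 1 does:)

volatile char* p = (volatile char*)x;
for (size_t j = 0; j < N; ++j)
{
    char c = p[j];   // volatile read: the compiler must perform it
    p[j] = c;        // volatile write: the compiler must perform it
}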


Jam*_*ock 15

Ah hah! Mysticial is right! Somehow, hardware prefetching was optimizing my reads/writes.

If it were a cache optimization, then forcing a memory barrier would defeat the optimization:

c = __sync_fetch_and_add(((char*)x) + j, 1);

But that made no difference. What did make a difference was multiplying my iterator index by the prime 1009 to defeat the prefetch optimization (1009 is coprime to N, so (j * 1009) % N still visits every byte of the array, just in an order the hardware prefetcher cannot follow):

*(((char*)x) + ((j * 1009) % N)) += 1;

With that change, the NUMA asymmetry is clearly revealed:

numa_available() 0
numa node 0 10101010 12884901888
numa node 1 01010101 12874584064
Elapsed read/write by same thread that allocated on core 0: 00:00:00.961725
Elapsed read/write by thread on core 0: 00:00:00.942300
Elapsed read/write by thread on core 1: 00:00:01.216286
Elapsed read/write by thread on core 2: 00:00:00.909353
Elapsed read/write by thread on core 3: 00:00:01.218935
Elapsed read/write by thread on core 4: 00:00:00.898107
Elapsed read/write by thread on core 5: 00:00:01.211413
Elapsed read/write by thread on core 6: 00:00:00.898021
Elapsed read/write by thread on core 7: 00:00:01.207114

At least I think that's what's going on.

Thanks Mysticial!

EDIT: CONCLUSION ~133%

For anyone just skimming this post to get a rough idea of the performance characteristics of NUMA, here is the bottom line according to my tests:

Memory access to a non-local NUMA node has about 1.33x the latency of memory access to a local node (above: roughly 1.21 s on the remote cores 1,3,5,7 versus roughly 0.90 s on the local cores 0,2,4,6).


And*_*ner 10

Thanks for this benchmark code. I've taken your "fixed" version and changed it to pure C + OpenMP, and added a few tests for how the memory system behaves under contention. You can find the new code here.
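
(The linked code is not reproduced here. As a rough sketch, under assumptions of my own — buffer size, pinning via sched_setaffinity, placement via numa_alloc_onnode — the "all-contention" case has every core sweep memory on node 0 at the same time:)

// All-contention sketch (assumed details, not the linked code): every thread
// pins itself to one core, allocates a buffer on NUMA node 0, and all threads
// sweep their buffers simultaneously.
#include <numa.h>
#include <omp.h>
#include <sched.h>
#include <cstdio>
#include <cstring>

int main()
{
    const size_t N = 100 * 1000 * 1000;             // 100 MB per thread (arbitrary)
    #pragma omp parallel
    {
        int core = omp_get_thread_num();
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        sched_setaffinity(0, sizeof(set), &set);    // pin this thread to one core

        char* buf = (char*)numa_alloc_onnode(N, 0); // memory on node 0
        memset(buf, 1, N);                          // touch pages to commit them

        #pragma omp barrier                         // start all sweeps together
        double t1 = omp_get_wtime();
        unsigned long acc = 0;
        for (size_t j = 0; j < N; ++j)
            acc += buf[j];
        double t2 = omp_get_wtime();

        #pragma omp critical
        printf("core %d -> node 0 : BW %.1f MB/s (acc %lu)\n",
               core, N / (t2 - t1) / 1e6, acc);     // printing acc keeps the loop alive

        numa_free(buf, N);
    }
    return 0;
}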

Here are some sample results from a Quad Opteron:

num cpus: 32
numa available: 0
numa node 0 10001000100010000000000000000000 - 15.9904 GiB
numa node 1 00000000000000001000100010001000 - 16 GiB
numa node 2 00010001000100010000000000000000 - 16 GiB
numa node 3 00000000000000000001000100010001 - 16 GiB
numa node 4 00100010001000100000000000000000 - 16 GiB
numa node 5 00000000000000000010001000100010 - 16 GiB
numa node 6 01000100010001000000000000000000 - 16 GiB
numa node 7 00000000000000000100010001000100 - 16 GiB

sequential core 0 -> core 0 : BW 4189.87 MB/s
sequential core 1 -> core 0 : BW 2409.1 MB/s
sequential core 2 -> core 0 : BW 2495.61 MB/s
sequential core 3 -> core 0 : BW 2474.62 MB/s
sequential core 4 -> core 0 : BW 4244.45 MB/s
sequential core 5 -> core 0 : BW 2378.34 MB/s
sequential core 6 -> core 0 : BW 2442.93 MB/s
sequential core 7 -> core 0 : BW 2468.61 MB/s
sequential core 8 -> core 0 : BW 4220.48 MB/s
sequential core 9 -> core 0 : BW 2442.88 MB/s
sequential core 10 -> core 0 : BW 2388.11 MB/s
sequential core 11 -> core 0 : BW 2481.87 MB/s
sequential core 12 -> core 0 : BW 4273.42 MB/s
sequential core 13 -> core 0 : BW 2381.28 MB/s
sequential core 14 -> core 0 : BW 2449.87 MB/s
sequential core 15 -> core 0 : BW 2485.48 MB/s
sequential core 16 -> core 0 : BW 2938.08 MB/s
sequential core 17 -> core 0 : BW 2082.12 MB/s
sequential core 18 -> core 0 : BW 2041.84 MB/s
sequential core 19 -> core 0 : BW 2060.47 MB/s
sequential core 20 -> core 0 : BW 2944.13 MB/s
sequential core 21 -> core 0 : BW 2111.06 MB/s
sequential core 22 -> core 0 : BW 2063.37 MB/s
sequential core 23 -> core 0 : BW 2082.75 MB/s
sequential core 24 -> core 0 : BW 2958.05 MB/s
sequential core 25 -> core 0 : BW 2091.85 MB/s
sequential core 26 -> core 0 : BW 2098.73 MB/s
sequential core 27 -> core 0 : BW 2083.7 MB/s
sequential core 28 -> core 0 : BW 2934.43 MB/s
sequential core 29 -> core 0 : BW 2048.68 MB/s
sequential core 30 -> core 0 : BW 2087.6 MB/s
sequential core 31 -> core 0 : BW 2014.68 MB/s

all-contention core 0 -> core 0 : BW 1081.85 MB/s
all-contention core 1 -> core 0 : BW 299.177 MB/s
all-contention core 2 -> core 0 : BW 298.853 MB/s
all-contention core 3 -> core 0 : BW 263.735 MB/s
all-contention core 4 -> core 0 : BW 1081.93 MB/s
all-contention core 5 -> core 0 : BW 299.177 MB/s
all-contention core 6 -> core 0 : BW 299.63 MB/s
all-contention core 7 -> core 0 : BW 263.795 MB/s
all-contention core 8 -> core 0 : BW 1081.98 MB/s
all-contention core 9 -> core 0 : BW 299.177 MB/s
all-contention core 10 -> core 0 : BW 300.149 MB/s
all-contention core 11 -> core 0 : BW 262.905 MB/s
all-contention core 12 -> core 0 : BW 1081.89 MB/s
all-contention core 13 -> core 0 : BW 299.173 MB/s
all-contention core 14 -> core 0 : BW 299.025 MB/s
all-contention core 15 -> core 0 : BW 263.865 MB/s
all-contention core 16 -> core 0 : BW 432.156 MB/s
all-contention core 17 -> core 0 : BW 233.12 MB/s
all-contention core 18 -> core 0 : BW 232.889 MB/s
all-contention core 19 -> core 0 : BW 202.48 MB/s
all-contention core 20 -> core 0 : BW 434.299 MB/s
all-contention core 21 -> core 0 : BW 233.274 MB/s
all-contention core 22 -> core 0 : BW 233.144 MB/s
all-contention core 23 -> core 0 : BW 202.505 MB/s
all-contention core 24 -> core 0 : BW 434.295 MB/s
all-contention core 25 -> core 0 : BW 233.274 MB/s
all-contention core 26 -> core 0 : BW 233.169 MB/s
all-contention core 27 -> core 0 : BW 202.49 MB/s
all-contention core 28 -> core 0 : BW 434.295 MB/s
all-contention core 29 -> core 0 : BW 233.309 MB/s
all-contention core 30 -> core 0 : BW 233.169 MB/s
all-contention core 31 -> core 0 : BW 202.526 MB/s

two-contention core 0 -> core 0 : BW 3306.11 MB/s
two-contention core 1 -> core 0 : BW 2199.7 MB/s

two-contention core 0 -> core 0 : BW 3286.21 MB/s
two-contention core 2 -> core 0 : BW 2220.73 MB/s

two-contention core 0 -> core 0 : BW 3302.24 MB/s
two-contention core 3 -> core 0 : BW 2182.81 MB/s

two-contention core 0 -> core 0 : BW 3605.88 MB/s
two-contention core 4 -> core 0 : BW 3605.88 MB/s

two-contention core 0 -> core 0 : BW 3297.08 MB/s
two-contention core 5 -> core 0 : BW 2217.82 MB/s

two-contention core 0 -> core 0 : BW 3312.69 MB/s
two-contention core 6 -> core 0 : BW 2227.04 MB/s

two-contention core 0 -> core 0 : BW 3287.93 MB/s
two-contention core 7 -> core 0 : BW 2209.48 MB/s

two-contention core 0 -> core 0 : BW 3660.05 MB/s
two-contention core 8 -> core 0 : BW 3660.05 MB/s

two-contention core 0 -> core 0 : BW 3339.63 MB/s
two-contention core 9 -> core 0 : BW 2223.84 MB/s

two-contention core 0 -> core 0 : BW 3303.77 MB/s
two-contention core 10 -> core 0 : BW 2197.99 MB/s

two-contention core 0 -> core 0 : BW 3323.19 MB/s
two-contention core 11 -> core 0 : BW 2196.08 MB/s

two-contention core 0 -> core 0 : BW 3582.23 MB/s
two-contention core 12 -> core 0 : BW 3582.22 MB/s

two-contention core 0 -> core 0 : BW 3324.9 MB/s
two-contention core 13 -> core 0 : BW 2250.74 MB/s

two-contention core 0 -> core 0 : BW 3305.66 MB/s
two-contention core 14 -> core 0 : BW 2209.5 MB/s

two-contention core 0 -> core 0 : BW 3303.52 MB/s
two-contention core 15 -> core 0 : BW 2182.43 MB/s

two-contention core 0 -> core 0 : BW 3352.74 MB/s
two-contention core 16 -> core 0 : BW 2607.73 MB/s

two-contention core 0 -> core 0 : BW 3092.65 MB/s
two-contention core 17 -> core 0 : BW 1911.98 MB/s

two-contention core 0 -> core 0 : BW 3025.91 MB/s
two-contention core 18 -> core 0 : BW 1918.06 MB/s

two-contention core 0 -> core 0 : BW 3257.56 MB/s
two-contention core 19 -> core 0 : BW 1885.03 MB/s

two-contention core 0 -> core 0 : BW 3339.64 MB/s
two-contention core 20 -> core 0 : BW 2603.06 MB/s

two-contention core 0 -> core 0 : BW 3119.29 MB/s
two-contention core 21 -> core 0 : BW 1918.6 MB/s

two-contention core 0 -> core 0 : BW 3054.14 MB/s
two-contention core 22 -> core 0 : BW 1910.61 MB/s

two-contention core 0 -> core 0 : BW 3214.44 MB/s
two-contention core 23 -> core 0 : BW 1881.69 MB/s

two-contention core 0 -> core 0 : BW 3332.3 MB/s
two-contention core 24 -> core 0 : BW 2611.8 MB/s

two-contention core 0 -> core 0 : BW 3111.94 MB/s
two-contention core 25 -> core 0 : BW 1922.11 MB/s

two-contention core 0 -> core 0 : BW 3049.02 MB/s
two-contention core 26 -> core 0 : BW 1912.85 MB/s

two-contention core 0 -> core 0 : BW 3251.88 MB/s
two-contention core 27 -> core 0 : BW 1881.82 MB/s

two-contention core 0 -> core 0 : BW 3345.6 MB/s
two-contention core 28 -> core 0 : BW 2598.82 MB/s

two-contention core 0 -> core 0 : BW 3109.04 MB/s
two-contention core 29 -> core 0 : BW 1923.81 MB/s

two-contention core 0 -> core 0 : BW 3062.94 MB/s
two-contention core 30 -> core 0 : BW 1921.3 MB/s

two-contention core 0 -> core 0 : BW 3220.8 MB/s
two-contention core 31 -> core 0 : BW 1901.76 MB/s

If anybody has further improvements, I'd be happy to hear about them. For example, these are obviously not perfect bandwidth measurements in real-world units (likely off by a, hopefully constant, integer factor).


And*_*ner 5

A few remarks:

  • To see your system's NUMA structure (on Linux), you can get a graphical overview with the lstopo utility from the hwloc library. In particular, you will see which core numbers are members of which NUMA node (processor socket).
  • char is probably not the ideal data type for measuring maximum RAM throughput. I suspect that with a 32-bit or 64-bit data type you can get more data through with the same number of CPU cycles (see the sketch after this list).
  • More generally, you should also check that your measurement is limited not by the CPU speed but by the RAM speed. The ramspeed utility, for example, explicitly unrolls the inner loop in its source code:

    for(i = 0; i < blk/sizeof(UTL); i += 32) {
        b[i] = a[i];        b[i+1] = a[i+1];
        ...
        b[i+30] = a[i+30];  b[i+31] = a[i+31];
    }
    

    EDIT: on supported architectures, ramsmp actually even uses "hand-written" assembly code for these loops

  • L1/L2/L3 cache effects: it is instructive to measure the bandwidth in GByte/s as a function of block size. As the block size increases you should see roughly four different speeds, corresponding to where the data is being read from (caches or main memory); the sketch after this list measures exactly this. Your processor seems to have 8 MB of Level 3 (?) cache, so your 10 million bytes might mostly stay in the L3 cache (which is shared among all cores of one processor).

  • Memory channels: your processor has 3 memory channels. If your memory banks are installed such that you can exploit them all (see e.g. the motherboard's manual), you may want to run more than one thread at the same time. I have seen the effect that when reading with only one thread, the asymptotic bandwidth is close to that of a single memory module (e.g. 12.8 GByte/s for DDR-1600), while when running several threads the asymptotic bandwidth is close to the number of memory channels times the bandwidth of a single memory module.
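
(A sketch for the wider-data-type and cache-effect remarks above, not from the original answer: the block-size range and the 1 GiB of traffic per size are arbitrary. Sweeping with 64-bit loads, the printed MB/s should step down as the block outgrows each cache level:)

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sweep a block of the given size repeatedly with 64-bit loads; return MB/s.
static double sweep_mbs(size_t bytes, size_t total = size_t(1) << 30)
{
    std::vector<uint64_t> buf(bytes / 8, 1);   // construction also touches the pages
    size_t passes = total / bytes;
    uint64_t acc = 0;
    auto t1 = std::chrono::steady_clock::now();
    for (size_t p = 0; p < passes; ++p)
        for (size_t i = 0; i < buf.size(); ++i)
            acc += buf[i];
    auto t2 = std::chrono::steady_clock::now();
    if (acc == 42) printf("!");                // side effect keeps the loads alive
    return passes * bytes / std::chrono::duration<double>(t2 - t1).count() / 1e6;
}

int main()
{
    // Block sizes from 32 KB to 256 MB: expect plateaus for L1, L2, L3 and RAM.
    for (size_t kb = 32; kb <= 256 * 1024; kb *= 2)
        printf("%8zu KB : %8.1f MB/s\n", kb, sweep_mbs(kb * 1024));
    return 0;
}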


Ris*_*hiD 5

You can also use numactl to choose which node to run the process on and where to allocate memory from:

numactl --cpubind=0 --membind=1 <process>

I use this combined with LMbench to get memory latency numbers:

numactl --cpubind=0 --membind=0  ./lat_mem_rd -t 512
numactl --cpubind=0 --membind=1  ./lat_mem_rd -t 512