How to set the NUMA node for an (NVIDIA) GPU persistently?

nor*_*ius 7 nvidia gpu drivers 20.04 gpu-driver

I am running a workstation with an AMD CPU (EPYC 7H12) and an Nvidia GPU (RTX 3090) on Ubuntu 20.04. When using TensorFlow, I repeatedly get the following warning, as described in a related SO question:

    I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

The answers there suggest identifying the GPU's PCI bus ID and then setting that device's numa_node to 0. In my case, the following worked, after identifying the PCI ID with lspci | grep NVIDIA:

    # 1) Identify the PCI ID of the GPU (with domain ID)
    #    In my case: PCI_ID="0000:81:00.0"
    lspci -D | grep NVIDIA
    # 2) Write the NUMA affinity to the device's numa_node file.
    echo 0 | sudo tee -a "/sys/bus/pci/devices/<PCI_ID>/numa_node"

However, this is only a superficial fix. First, the numa_node setting is reset (to -1) every time the system reboots. Second, the Nvidia driver seems to ignore this flag, because nvidia-smi (Nvidia's driver management tool) still shows:

    nvidia-smi topo -m
    #
    #       GPU0  CPU Affinity    NUMA Affinity
    # GPU0     X  0-127           N/A

How can I specify the NUMA affinity of a GPU persistently? Is this a configuration of the Nvidia driver, of Ubuntu, or of the BIOS? I know that the Linux kernel supports NUMA, but I find it hard to locate resources on how to configure it.

Update: I added a crontab entry as root, which solves the problem more persistently. However, the fix is still "superficial", because the Nvidia driver is not aware of it.
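For anyone wanting to reproduce the boot-time workaround from the update, here is a minimal sketch of a helper that could be run as root at startup. It is not the exact crontab entry used above. The function name `set_nvidia_numa_node` and the `SYSFS_PCI` override are hypothetical, introduced only for illustration; `0x10de` is NVIDIA's PCI vendor ID, so the loop also touches non-GPU NVIDIA functions such as the card's audio device.

```shell
#!/bin/sh
# Sketch only: write a NUMA node into numa_node for every NVIDIA PCI device
# that currently reports -1. SYSFS_PCI is a hypothetical override that lets
# the function be tried against a scratch directory instead of the real sysfs.
set_nvidia_numa_node() {
    node="${1:-0}"
    root="${SYSFS_PCI:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries that lack the attributes we need.
        [ -r "$dev/vendor" ] || continue
        [ -e "$dev/numa_node" ] || continue
        # 0x10de is the PCI vendor ID assigned to NVIDIA.
        [ "$(cat "$dev/vendor")" = "0x10de" ] || continue
        # Leave devices alone if the firmware already reported a node.
        [ "$(cat "$dev/numa_node")" = "-1" ] || continue
        echo "$node" > "$dev/numa_node" && echo "set $(basename "$dev") -> node $node"
    done
}
```

One way to use it, assuming the function is saved together with a final `set_nvidia_numa_node 0` line in a script at a path of your choosing, is to reference that script from an `@reboot` line in root's crontab. This only silences the sysfs value of -1 after each boot; as noted above, `nvidia-smi topo -m` may still report `N/A`.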
