无法运行查询NVML的CUDA代码-有关libnvidia-ml.so的错误

Bri*_*n R 5 cuda nvcc tesla nvml

最近,一位同事需要使用NVML查询设备信息,因此我下载了Tesla开发工具包3.304.5,并将文件nvml.h复制到了/ usr / include。为了进行测试,我在tdk_3.304.5 / nvml / example中编译了示例代码,并且工作正常。

整个周末,系统中发生了某些更改(我无法确定更改的内容,而且我不是唯一有权访问计算机的更改),现在使用nvml.h的任何代码(例如示例代码)都会失败,并出现以下错误:

Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:

You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Run Code Online (Sandbox Code Playgroud)

但是,我仍然可以运行nvidia-smi并读取有关我的K20m状态的信息,据我所知,nvidia-smi只是对nvml.h的一组调用。我收到的错误消息有些含糊,但我相信它告诉我nvidia-ml.so文件需要与我在系统上安装的Tesla驱动程序匹配。为了确保一切正确,我重新下载了CUDA 5.0并安装了驱动程序,CUDA运行时和测试文件。我确定nvidia-ml.so文件与驱动程序匹配(均为304.54),所以对于可能出了什么问题我感到很困惑。我可以使用nvcc编译和运行测试代码,也可以运行自己的CUDA代码,只要它不包含nvml.h。

有没有人遇到此错误或对纠正此问题有任何想法?

$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54

$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root     17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root     22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54

$ cat /proc/driver/nvidia/version 
NVRM version: NVIDIA UNIX x86_64 Kernel Module  304.54  Sat Sep 29 00:05:49 PDT 2012
GCC version:  gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) 

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

$ whereis nvml.h
nvml: /usr/include/nvml.h

$ ldd example
        linux-vdso.so.1 =>  (0x00007fff2da66000)
        libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
        libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
        /lib64/ld-linux-x86-64.so.2 (0x000000300e000000)
Run Code Online (Sandbox Code Playgroud)

编辑:解决方案是删除libnvidia-ml.so的所有其他实例。由于某种原因,其中有很多。

$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1
Run Code Online (Sandbox Code Playgroud)

Rob*_*lla 5

您收到此错误是因为尝试使用nvml的应用程序正在加载位于以下位置的存根库:

...tdk_install_path/lib64/libnvidia-ml.so
Run Code Online (Sandbox Code Playgroud)

而不是:

/usr/lib64/libnvidia-ml.so
Run Code Online (Sandbox Code Playgroud)

当我将存根库路径添加到LD_LIBRARY_PATH环境变量时,我能够重现您的错误。因此,如果有人将tdk分发随附的存根库的路径添加到您的LD_LIBRARY_PATH环境变量中,那么这可能是错误的来源,但可能并非唯一的可能。如果有人以不寻常的方式将存根库复制到某个系统路径,那也可能是一个问题。

您需要尝试弄清楚为什么系统要加载该存根库来代替in中正确的存根库/usr/lib64。另外,对于发现的目的,你可以尝试删除系统上的存根库的任何地方的所有实例(留下正确的图书馆/usr/lib/usr/lib64单独的),你应该能够观察到正确的行为。