NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

dbl*_*001 19 gpu

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS. I'd like to observe GPU utilization while training my TensorFlow models. I get an error when I try to run 'nvidia-smi'.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$  lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
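
A note on reading the `lspci -k` output above: the Cirrus entry has a `Kernel driver in use` line, while the GRID K520 entry has none, which suggests that no kernel module is bound to the GPU. Below is a minimal Python sketch of that check. The function name `nvidia_kernel_drivers` and the `SAMPLE` constant are illustrative assumptions, not part of any NVIDIA tool.

```python
import re

# Sample text copied from the `lspci -k | grep -A 2 -E "(VGA|3D)"` output above.
SAMPLE = (
    "00:02.0 VGA compatible controller: Cirrus Logic GD 5446\n"
    "    Subsystem: XenSource, Inc. Device 0001\n"
    "    Kernel driver in use: cirrus\n"
    "00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)\n"
    "    Subsystem: NVIDIA Corporation Device 1014\n"
    "00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)\n"
)


def nvidia_kernel_drivers(lspci_k_output):
    """Map each NVIDIA device header line to its bound kernel driver, or None."""
    result = {}
    current = None  # header of the NVIDIA device whose indented details follow
    for line in lspci_k_output.splitlines():
        if line and not line[0].isspace():
            # A non-indented line starts a new device entry.
            current = line if "nvidia" in line.lower() else None
            if current is not None:
                result[current] = None
        elif current is not None:
            match = re.match(r"\s*Kernel driver in use:\s*(\S+)", line)
            if match:
                result[current] = match.group(1)
    return result


if __name__ == "__main__":
    for device, driver in nvidia_kernel_drivers(SAMPLE).items():
        print(device, "->", driver or "no kernel driver bound")
```

To check a live system, pass the text captured from `lspci -k` instead of `SAMPLE`. A value of `None` for the GPU means the driver module is not attached to it.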

I installed CUDA 7 and cuDNN by following these instructions:

$ sudo apt-get -q2 update
$ sudo apt-get upgrade
$ sudo reboot

========================================================================

After rebooting, update the initramfs by running '$ sudo update-initramfs -u'.

Now edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and exit the file.

Now install the build-essential tools, update the initramfs, and reboot as follows:

$ sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$ sudo update-initramfs -u
$ sudo reboot

========================================================================

After rebooting, run the following commands to install NVIDIA CUDA.

$ sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$ sudo chmod 700 ./cuda_7.0.28_linux.run
$ sudo ./cuda_7.0.28_linux.run
$ sudo update-initramfs -u
$ sudo reboot

========================================================================

Now that the system is up, verify the installation by running the following commands.

$ sudo modprobe nvidia
$ sudo nvidia-smi -q | head

You should see output like 'nvidia.png'.

Now run the following commands.

$ cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$ make
$ ./deviceQuery

However, 'nvidia-smi' still shows no GPU activity while TensorFlow is training a model:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W / 125W |     10MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nui*_*cca 25

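I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.

  • This worked for me. Now I can use CUDA again. (2 upvotes)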
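
Before changing BIOS settings, it may help to confirm that Secure Boot is actually on. The sketch below assumes the `mokutil` utility is installed and that `mokutil --sb-state` prints a line starting with `SecureBoot enabled` or `SecureBoot disabled`. The function names are illustrative.

```python
import shutil
import subprocess


def parse_sb_state(text):
    """Return True/False for 'SecureBoot enabled'/'SecureBoot disabled', else None."""
    for line in text.splitlines():
        words = line.strip().lower().split()
        if len(words) >= 2 and words[0] == "secureboot":
            if words[1] == "enabled":
                return True
            if words[1] == "disabled":
                return False
    return None


def secure_boot_state():
    """Run `mokutil --sb-state` if available; None when the state is unknown."""
    if shutil.which("mokutil") is None:
        return None  # assumption: mokutil is the tool used to query the state
    proc = subprocess.run(["mokutil", "--sb-state"], capture_output=True, text=True)
    return parse_sb_state(proc.stdout + proc.stderr)


if __name__ == "__main__":
    print("Secure Boot enabled:", secure_boot_state())
```

If this reports `True`, an unsigned NVIDIA kernel module may be refused at load time, which would be consistent with the fix described in this answer.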

小智 9

Run the following command to find the right NVIDIA driver:

sudo ubuntu-drivers devices

Then select the right one and run:

sudo apt install <version>


ken*_*orb 9

Double-check that you have the right permissions on the device /dev/nvidiactl, and that it actually exists.

$ strace nvidia-smi
...
openat(AT_FDCWD, "/dev/nvidiactl", O_RDONLY) = -1 ENOENT (No such file or directory)
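
The same existence and permission check can be done without `strace`. This is a minimal sketch. The default paths `/dev/nvidiactl` and `/dev/nvidia0` are the device nodes named in this answer, and the function name `check_device_nodes` is an illustrative assumption.

```python
import os


def check_device_nodes(paths=("/dev/nvidiactl", "/dev/nvidia0")):
    """Report whether each path exists and is readable/writable by the current user."""
    report = {}
    for path in paths:
        exists = os.path.exists(path)
        report[path] = {
            "exists": exists,
            "readable": exists and os.access(path, os.R_OK),
            "writable": exists and os.access(path, os.W_OK),
        }
    return report


if __name__ == "__main__":
    for path, status in check_device_nodes().items():
        print(path, status)
```

A missing node points to the kernel module not being loaded or the device files not having been created. A node that exists but is not readable or writable points to a permissions problem.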

Make sure the nvidia-persistenced service is installed, started, and running:

nvidia-persistenced --version
sudo systemctl start nvidia-persistenced
sudo systemctl status nvidia-persistenced
tail /var/log/syslog # When failed.
journalctl -xeu nvidia-persistenced.service

See: Who creates /dev/nvidia0 and /dev/nvidiactl?

You can try to create the devices manually with:

sudo modprobe -abq nvidia
sudo nvidia-modprobe -c 0 -u
nvidia-smi -L

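In my case, after restarting the nvidia-persistenced service, the following error appeared in the syslog:

NVRM: The NVIDIA probe routine was not called for X device(s). This can occur when a driver such as nouveau, rivafb, nvidiafb or rivatv was loaded and obtained ownership of the NVIDIA device(s). Try unloading the conflicting kernel module (and/or reconfigure your kernel without the conflicting driver(s)).

The solution was to blacklist the nouveau driver by adding the following lines to the /etc/modprobe.d/blacklist.conf file: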

# Blacklist nouveau.
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Then reboot the system.

See: How to remove the Nouveau kernel driver (fix Nvidia install error)
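
To confirm that the blacklist took effect after the reboot, one can check both the configuration file and the list of loaded modules. This is a minimal sketch. It assumes the blacklist lives in `/etc/modprobe.d/blacklist.conf` as in this answer, and the helper names are illustrative.

```python
def blacklists_nouveau(conf_text):
    """True if the modprobe.d text has an uncommented 'blacklist nouveau' line."""
    for raw in conf_text.splitlines():
        tokens = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if tokens[:2] == ["blacklist", "nouveau"]:
            return True
    return False


def loaded_modules(proc_modules_text):
    """Names of loaded kernel modules, taken from the text of /proc/modules."""
    return {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}


if __name__ == "__main__":
    with open("/etc/modprobe.d/blacklist.conf") as f:
        print("nouveau blacklisted:", blacklists_nouveau(f.read()))
    with open("/proc/modules") as f:
        print("nouveau still loaded:", "nouveau" in loaded_modules(f.read()))
```

If nouveau is blacklisted but still loaded, the initramfs may still contain it. The question above rebuilds it with `sudo update-initramfs -u` before rebooting.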


Rab*_*ndi 8

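I was using an AWS DeepAMI P2 instance and suddenly found that the Nvidia driver commands no longer worked and that the torch and tensorflow libraries could not find the GPU. I then resolved the problem as follows.

Run nvcc --version. If it does not work, run the following command: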

apt install nvidia-cuda-toolkit

Hopefully this solves the problem.

  • This worked for me. In my case, a reboot was needed before nvidia-smi worked again. (2 upvotes)

Hea*_*ify 7

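I got the same error on Ubuntu 16.04 (Linux 4.14 kernel) on Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and the problem was solved. Here is how I upgraded the Linux kernel from 4.14 to 4.15: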

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have 
been released. At the time of this writing, the latest stable release 
of Ubuntu kernel is 4.15. If you go to this 
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will 
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 
bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

You should see that your kernel has been upgraded, and hopefully nvidia-smi will work.


Vad*_*imK 6

In my case, none of the solutions above helped:

Root cause: incompatible gcc version

Solution:

1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3. sudo apt install nvidia-driver-450
4. sudo reboot

System: AWS EC2 Ubuntu 18.04 instance

Solution source: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4

  • My machine suddenly stopped showing the NVIDIA card after an update. This helped me fix it. Thanks. (2 upvotes)
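
The answer does not say which gcc versions were incompatible. One plausible reading, stated here as an assumption, is that the installed gcc differs from the gcc that built the running kernel, which can break the driver's kernel-module build. The sketch below compares the two by parsing the text of `/proc/version` and the first line of `gcc --version`. The function names are illustrative.

```python
import re


def gcc_major_minor(text):
    """Return (major, minor) of the first dotted version after 'gcc', or None."""
    match = re.search(r"gcc.*?(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def same_gcc_major(proc_version_text, gcc_version_text):
    """True/False if both versions parse and the majors match/differ, else None."""
    kernel = gcc_major_minor(proc_version_text)
    installed = gcc_major_minor(gcc_version_text)
    if kernel is None or installed is None:
        return None
    return kernel[0] == installed[0]


if __name__ == "__main__":
    import subprocess

    with open("/proc/version") as f:
        kernel_text = f.read()
    gcc_text = subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout
    print("kernel built with gcc:", gcc_major_minor(kernel_text))
    print("installed gcc:", gcc_major_minor(gcc_text))
    print("same major version:", same_gcc_major(kernel_text, gcc_text))
```

A mismatch does not prove this is the cause. It only indicates that reinstalling gcc and the driver, as in the steps above, is worth trying.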

小智 5

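My system version: Ubuntu 20.04 LTS.

  • I solved this by generating a new MOK and enrolling it in shim.

  • There is no need to disable Secure Boot, although that also worked for me.

  • Just run this command and follow its prompts: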

    sudo update-secureboot-policy --enroll-key
    

According to the Ubuntu wiki: How can I do non-automatic signing of drivers

  • It says no MOK was found. (2 upvotes)

use*_*160 5

For Ubuntu 20.04 or later, try installing the NVIDIA driver:

sudo ubuntu-drivers autoinstall

Then

sudo reboot

as described in these instructions:

https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-20-04-focal-fossa-linux

If you get an error like the following:

sudo: ubuntu-drivers: command not found

then you may need to install this first:

sudo apt-get install ubuntu-drivers-common


dbl*_*001 0

I had to install the NVIDIA 367.57 driver and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance, e.g. nvidia-graphics-drivers-367_367.57.orig.tar

Now the GRID K520 GPU is working while I train TensorFlow models:

ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr  1 18:03:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   39C    P8    43W / 125W |   3800MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2254    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version / Runtime Version          8.0 / 7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS