Docker only works with the NVIDIA driver after being reinstalled

ccl*_*l13 6 nvidia cuda docker 20.04

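Ubuntu version: 20.04 LTS

The NVIDIA driver, CUDA, and the related packages are all installed correctly. Running nvidia-smi and CUDA code works fine.

The NVIDIA packages for Docker (NVIDIA Container Toolkit) are installed as well. The original problem was that when I tried to verify NVIDIA support inside docker, I got the following error message: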

$ sudo docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

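After finding some online discussions, I tried reinstalling docker by following the instructions at https://docs.docker.com/engine/install/ubuntu/ and that worked for me. NVIDIA now works under docker.

However, after a reboot it stops working. I then have to run something like: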

$ sudo apt-get reinstall docker-ce docker-ce-cli containerd.io

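to get NVIDIA working under docker again. I can confirm that every reboot causes this problem.

How can I make it work so that I don't have to reinstall after every reboot?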

小智 7

In my case, docker had been installed twice: once via snap and once via the apt package manager.

After a reboot I had:

$ docker images
REPOSITORY              TAG                  IMAGE ID            CREATED             SIZE
ubuntu                  latest               4e2eef94cd6b        3 weeks ago         73.9MB
tensorflow/tensorflow   latest-gpu-jupyter   f0b0261fec71        6 weeks ago         3.3GB
nvidia/cuda             10.0-base            841d44dd4b3c        9 months ago        110MB

If I restarted the docker service:

$ sudo service docker restart

I got a different set of images:

$ docker images
REPOSITORY              TAG                  IMAGE ID            CREATED             SIZE
jupyter/r-notebook      latest               14611e3d9838        2 weeks ago         2.59GB
ubuntu                  latest               4e2eef94cd6b        3 weeks ago         73.9MB
tensorflow/tensorflow   latest-gpu-jupyter   f0b0261fec71        6 weeks ago         3.3GB

$ dpkg -l | grep docker
ii  docker-ce                                  5:19.03.12~3-0~ubuntu-focal           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                              5:19.03.12~3-0~ubuntu-focal           amd64        Docker CLI: the open-source application container engine

$ snap list | grep docker
docker     19.03.11     471    latest/stable  canonical*          -    
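The dpkg and snap listings above can be combined into a single check. The following is only a rough sketch: `docker_sources` and `report_docker_sources` are hypothetical helper names, and it assumes the apt package is called docker-ce and the snap is called docker, as in my output.

```shell
#!/bin/sh
# Sketch: report which package managers currently provide Docker.
# docker_sources and report_docker_sources are hypothetical helpers,
# not part of Docker, dpkg, or snap.

# $1 = text of `dpkg -l`, $2 = text of `snap list`.
# Prints "apt" and/or "snap", one per line.
docker_sources() {
    # Installed apt packages are listed with status "ii" by dpkg -l.
    if printf '%s\n' "$1" | grep -E -q '^ii[[:space:]]+docker-ce[[:space:]]'; then
        echo "apt"
    fi
    # snap list prints the snap name in the first column.
    if printf '%s\n' "$2" | grep -E -q '^docker[[:space:]]'; then
        echo "snap"
    fi
}

# Collects the live listings (skipping tools that are not installed)
# and returns non-zero when Docker is provided more than once.
report_docker_sources() {
    dpkg_text=""
    snap_text=""
    if command -v dpkg >/dev/null 2>&1; then dpkg_text="$(dpkg -l 2>/dev/null)"; fi
    if command -v snap >/dev/null 2>&1; then snap_text="$(snap list 2>/dev/null)"; fi
    found="$(docker_sources "$dpkg_text" "$snap_text")"
    count="$(printf '%s' "$found" | grep -c .)"
    echo "Docker provided by: $(printf '%s' "$found" | tr '\n' ' ')"
    [ "$count" -le 1 ]
}
```

Calling `report_docker_sources` on a machine in the state shown above should list both apt and snap and return a non-zero status, which is the situation that caused the problem for me.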

I rebooted the operating system:

$ sudo init 6

I removed all the images that had been created through the snap docker:

$ docker rmi $(docker images -q)

After that I removed the snap docker:

$ sudo snap remove docker
$ sudo init 6
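To confirm which docker client is left on the PATH after removing the snap, something like the sketch below may help. `docker_origin` is a hypothetical helper; it relies only on the fact that snap exposes its commands under /snap/bin, while the docker-ce-cli package installs /usr/bin/docker.

```shell
#!/bin/sh
# Sketch: classify the docker client by the directory it lives in.
# docker_origin is a hypothetical helper, not a Docker command.
docker_origin() {
    case "$1" in
        "")         echo "none" ;;    # no docker client on PATH
        /snap/*)    echo "snap" ;;    # snap exposes commands under /snap/bin
        /usr/bin/*) echo "system" ;;  # where the docker-ce-cli package installs it
        *)          echo "other" ;;
    esac
}

# Resolve the first docker on PATH; empty if none is installed.
docker_path="$(command -v docker 2>/dev/null || true)"
echo "docker client: ${docker_path:-not found} ($(docker_origin "$docker_path"))"
```

If this still reports snap after the reboot, the snap package is probably still installed.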

Now I have a working docker service:

$ docker run --gpus all -p 8888:8888 -v /tf:/tf -w /tf --name tfgpu --rm tensorflow/tensorflow:latest-gpu-jupyter
[I 07:52:52.707 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 07:52:52.967 NotebookApp] Serving notebooks from local directory: /tf
[I 07:52:52.967 NotebookApp] The Jupyter Notebook is running at:
[I 07:52:52.967 NotebookApp] http://a1d1932a7004:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
[I 07:52:52.967 NotebookApp]  or http://127.0.0.1:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
[I 07:52:52.967 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 07:52:52.972 NotebookApp] 

    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
    Or copy and paste one of these URLs:
        http://a1d1932a7004:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
     or http://127.0.0.1:8888/?token=74b0b061e2a1818b865c1f344be904758f9f0dba73b742d3
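To check after each reboot that GPU support survived, the test command from the question can be wrapped in a small script. This is a minimal sketch: `is_gpu_driver_error` and `check_docker_gpu` are hypothetical helper names, and the image tag is simply the one used in the question.

```shell
#!/bin/sh
# Sketch: rerun the GPU test from the question and recognise its error.
# is_gpu_driver_error and check_docker_gpu are hypothetical helpers.

# Succeeds if the given text contains the daemon error from the question.
is_gpu_driver_error() {
    printf '%s\n' "$1" | grep -q 'could not select device driver'
}

# Runs nvidia-smi in a container; returns 1 on the known error,
# otherwise prints the output and returns docker's own exit status.
check_docker_gpu() {
    out="$(docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi 2>&1)"
    status=$?
    if is_gpu_driver_error "$out"; then
        echo "GPU support missing: the daemon could not select a GPU device driver"
        return 1
    fi
    printf '%s\n' "$out"
    return "$status"
}
```

Run `check_docker_gpu` (with sudo if your user is not in the docker group) after a reboot; if it prints the usual nvidia-smi table, the fix has held.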