How to run tensorflow with gpu support in docker-compose?

Kev*_*sen 21 gpu nvidia docker docker-compose tensorflow

I want to create some neural network in tensorflow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming that this is actually possible for now). As far as I know, in order to train a tensorflow model on a GPU, I need the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between tensorflow, CUDA and the NVIDIA driver. So I was trying to find a way to create a docker-compose file that contains a service each for tensorflow, CUDA and the NVIDIA driver, but I am getting the following error:

# Start the services
sudo docker-compose -f docker-compose-test.yml up --build

Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1   ... done
Recreating vw_image_cls_tensorflow_1  ... error

ERROR: for vw_image_cls_tensorflow_1  Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown

ERROR: for tensorflow  Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.

My docker-compose file looks as follows:

# version 2.3 is required for NVIDIA runtime
version: '2.3'

services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidia/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net

  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidia/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service. Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net

  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu  # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidia/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net

volumes:
  nvidia_driver:

networks:
  net:
    driver: bridge

And my /etc/docker/daemon.json file looks as follows:

{"default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}


So, it seems like the error is somehow related to configuring the nvidia runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. My questions are:

  1. Is it actually possible to do what I am trying to do?
  2. If yes, did I set up my docker-compose file correctly (see comments in docker-compose.yml)?
  3. How do I fix the error message I received above?

Thank you very much for your help; I highly appreciate it.

ane*_*yte 8

I agree that installing all the tensorflow-gpu dependencies is rather painful. Fortunately, it is quite simple with Docker, as you only need the NVIDIA driver and the NVIDIA Container Toolkit (a sort of plugin) on the host. The rest (CUDA, cuDNN) is already shipped inside the Tensorflow images, so you don't need them on the Docker host.
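If the toolkit is not installed yet: at the time of writing it was distributed through NVIDIA's apt repository. A minimal sketch for Ubuntu 18.04, assuming the repository layout from that period (check the current install guide before copying):

# Add NVIDIA's package repository (layout as of ~2020; may have moved since)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the toolkit and restart Docker so it picks up the nvidia runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker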

The driver can also be deployed as a container, but I do not recommend that for a workstation. It is meant for servers that run no GUI (X server etc.). The containerized driver is covered at the end of this answer; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same whether the driver runs in a container or not.

How to start Tensorflow-GPU with docker-compose

Prerequisites:

To enable GPU support for a container, you need to create it with the NVIDIA Container Toolkit. There are two ways to do that:

  1. You can configure Docker to always use the nvidia container runtime. This is fine, because it behaves just like the default runtime unless some NVIDIA-specific environment variables are present (more on that later). This is done by putting "default-runtime": "nvidia" into Docker's daemon.json:

/etc/docker/daemon.json

{
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
      }
  },
  "default-runtime": "nvidia"
}
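Note that Docker only reads daemon.json at startup, so restart the daemon after editing it:

sudo systemctl restart docker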
  2. You can choose the runtime when the container is created. With docker-compose this is only possible with file format version 2.3.

Here is a sample docker-compose.yml that starts Tensorflow with a GPU:

version: "2.3"  # the only version where 'runtime' option is supported

services:
  test:
    image: tensorflow/tensorflow:2.3.0-gpu
    # Make Docker create the container with NVIDIA Container Toolkit
    # You don't need it if you set 'nvidia' as the default runtime in
    # daemon.json.
    runtime: nvidia
    # the lines below are here just to test that TF can see GPUs
    entrypoint:
      - /usr/local/bin/python
      - -c
    command:
      - "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"

Run it with docker-compose up and you should see a line with the GPU specs. It appears at the end and looks like this:

test_1  | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)

That is all you need to start the official Tensorflow images with a GPU.
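If you want to test the GPU plumbing independently of Tensorflow, a CUDA base image running nvidia-smi makes a quick smoke test (the --gpus flag needs Docker 19.03+; with "default-runtime": "nvidia" set it is redundant but harmless):

docker run --rm --gpus all nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi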

NVIDIA environment variables and custom images

As I mentioned, the NVIDIA Container Toolkit behaves like the default runtime unless certain variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want GPU support in it. The official Tensorflow images with GPU support inherit from the CUDA base images, which set these variables, so you only have to start the image with the right runtime, as in the example above.
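For a custom image that does not inherit from the CUDA base images, the two variables that matter most are NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. A minimal sketch of setting them from the compose file instead of baking them into the image (my-custom-image is a placeholder):

version: "2.3"
services:
  app:
    image: my-custom-image  # placeholder: your own image without CUDA base layers
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all                  # which GPUs to expose
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility  # CUDA plus nvidia-smi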

If you are interested in custom Tensorflow images, I wrote another article about that.

Host configuration for the NVIDIA driver in a container

As mentioned at the beginning, this is not something you want on a workstation. The process requires you to start the driver container while no other display driver is loaded (e.g. over SSH). Also, at the time of writing only Ubuntu 16.04, Ubuntu 18.04 and Centos 7 are supported.

There is an official guide; below is an excerpt for Ubuntu 18.04.

  1. Edit the 'root' option in the NVIDIA Container Toolkit settings:
sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
  2. Disable the Nouveau driver modules:
sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

If you are using an AWS kernel, make sure the i2c_core kernel module is enabled:

sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
  3. Update the initramfs:
sudo update-initramfs -u

Now it is time to reboot for the changes to take effect. After the reboot, check that no nouveau or nvidia modules are loaded. The following commands should return nothing:

lsmod | grep nouveau
lsmod | grep nvidia

Starting the driver in a container

The guide provides the commands to run the driver, but I prefer docker-compose. Save the following as driver.yml:

version: "3.0"
services:
  driver:
    image: nvidia/driver:450.80.02-ubuntu18.04
    privileged: true
    restart: unless-stopped
    volumes:
    - /run/nvidia:/run/nvidia:shared
    - /var/log:/var/log
    pid: "host"
    container_name: nvidia-driver

Use docker-compose -f driver.yml up -d to start the driver container. It takes a couple of minutes to compile the modules for your kernel. You can use docker logs nvidia-driver -f to follow the process; wait for the 'Done, now waiting for signal' line to appear. Alternatively, use lsmod | grep nvidia to see whether the driver modules are loaded. When it is ready, you should see something like this:

nvidia_modeset       1183744  0
nvidia_uvm            970752  0
nvidia              19722240  17 nvidia_uvm,nvidia_modeset
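From here the compose file from the first section works unchanged; assuming the 'root' option from step 1 points the toolkit at the containerized driver, starting everything is just:

docker-compose -f driver.yml up -d   # driver first, wait until the modules are loaded
docker-compose up                    # then the tensorflow-gpu service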


ven*_*iac 6

Docker Compose v1.27.0+

As of 2022, this works with Compose file format version 3.x:

version: "3.6"
services:

  jupyter-8888:
    image: "tensorflow/tensorflow:latest-gpu-jupyter"
    env_file: "env-file"
    deploy:
      resources:
        reservations:
          devices:
          - driver: "nvidia"
            device_ids: ["0"]
            capabilities: [gpu]
    ports:
      - 8880:8888
    volumes:
      - workspace:/workspace
      - data:/data

If you want to target specific GPU ids, e.g. 0 and 3:

device_ids: ['0', '3']
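If you don't care which physical GPUs you get, the Compose specification also accepts a count instead of explicit ids (count and device_ids are mutually exclusive):

    deploy:
      resources:
        reservations:
          devices:
          - driver: "nvidia"
            count: 1          # or "all"
            capabilities: [gpu]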


Kev*_*sen 1

I managed to get this working by installing WSL2 on my Windows machine and using VS Code together with the Remote-Containers extension. A collection of articles on installing WSL2 and using VS Code inside it helped a great deal along the way.

With VS Code's Remote-Containers extension you can set up a devcontainer based on a docker-compose file (or, as in my case, just a Dockerfile), which is probably explained better in the third link above. One thing I had to keep in mind myself is that when defining the .devcontainer.json file, you need to make sure to set

// Optional arguments passed to ``docker run ... ``
    "runArgs": [
        "--gpus", "all"
    ]
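For context, a minimal devcontainer.json built directly on the official image might look like this (the name is a placeholder; a docker-compose- or Dockerfile-based setup uses other keys instead of image):

{
    // devcontainer.json allows comments (JSONC)
    "name": "tf-gpu-dev",
    "image": "tensorflow/tensorflow:latest-gpu",
    "runArgs": ["--gpus", "all"]
}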

Before using VS Code I had been using PyCharm for a long time, so switching to VS Code was rather painful at first, but VS Code together with WSL2, the Remote-Containers extension and Pylance makes it quite comfortable to develop inside a container with GPU support. As far as I know, PyCharm does not support debugging inside a container on WSL at the moment.