How to run tensorflow with gpu support in docker-compose?

Kev*_*sen 21 gpu nvidia docker docker-compose tensorflow

I want to create some neural network in tensorflow 2.x that trains on a GPU, and I want to set up all the necessary infrastructure inside a docker-compose network (assuming that this is actually possible for now). As far as I know, in order to train a tensorflow model on a GPU, I need the CUDA toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between tensorflow, CUDA and the NVIDIA driver. So I was trying to find a way to create a docker-compose file that contains a service each for tensorflow, CUDA and the NVIDIA driver, but I am getting the following error:

# Start the services
sudo docker-compose -f docker-compose-test.yml up --build

Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1   ... done
Recreating vw_image_cls_tensorflow_1  ... error

ERROR: for vw_image_cls_tensorflow_1  Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown

ERROR: for tensorflow  Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.

My docker-compose file looks as follows:

# version 2.3 is required for NVIDIA runtime
version: '2.3'

services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidia/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net

  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidia/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service. Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net

  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu  # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidia/:ro  # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net

volumes:
  nvidia_driver:

networks:
  net:
    driver: bridge

And my /etc/docker/daemon.json file looks as follows:

{"default-runtime":"nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}


So, it seems like the error is somehow related to configuring the nvidia runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. My questions are:

  1. Is it actually possible to do what I am trying to do?
  2. If yes, did I set up my docker-compose file correctly (see comments in docker-compose.yml)?
  3. How do I fix the error message I received above?

Thank you very much for your help; I highly appreciate it.

ane*_*yte 8

I agree that installing all the tensorflow-gpu dependencies is rather painful. Fortunately, it is quite simple with Docker, as you only need the NVIDIA driver and the NVIDIA Container Toolkit (a sort of plugin) on the host. The rest (CUDA, cuDNN) is already shipped inside the Tensorflow images, so you don't need them on the Docker host.
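If the toolkit is not installed yet: at the time of writing it was distributed through NVIDIA's apt repository. A minimal sketch for Ubuntu 18.04, assuming the repository layout from that period (check the current install guide before copying):

# Add NVIDIA's package repository (layout as of ~2020; may have moved since)
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list \
  | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the toolkit and restart Docker so it picks up the nvidia runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker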

The driver can also be deployed as a container, but I do not recommend that for a workstation. It is meant for servers that run no GUI (X server etc.). The containerized driver is covered at the end of this answer; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same whether the driver runs in a container or not.

How to start Tensorflow-GPU with docker-compose

Prerequisites:

To enable GPU support for a container, you need to create it with the NVIDIA Container Toolkit. There are two ways to do that:

  1. You can configure Docker to always use the nvidia container runtime. This is fine, because it behaves just like the default runtime unless some NVIDIA-specific environment variables are present (more on that later). This is done by putting "default-runtime": "nvidia" into Docker's daemon.json:

/etc/docker/daemon.json

{
  "runtimes": {
      "nvidia": {
          "path": "/usr/bin/nvidia-container-runtime",
          "runtimeArgs": []
      }
  },
  "default-runtime": "nvidia"
}
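Note that Docker only reads daemon.json at startup, so restart the daemon after editing it:

sudo systemctl restart docker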
  2. You can choose the runtime when the container is created. With docker-compose this is only possible with file format version 2.3.

Here is a sample docker-compose.yml that starts Tensorflow with a GPU:

version: "2.3"  # the only version where 'runtime' option is supported

services:
  test:
    image: tensorflow/tensorflow:2.3.0-gpu
    # Make Docker create the container with NVIDIA Container Toolkit
    # You don't need it if you set 'nvidia' as the default runtime in
    # daemon.json.
    runtime: nvidia
    # the lines below are here just to test that TF can see GPUs
    entrypoint:
      - /usr/local/bin/python
      - -c
    command:
      - "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"

Run it with docker-compose up and you should see a line with the GPU specs. It appears at the end and looks like this:

test_1  | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)

That is all you need to start the official Tensorflow images with a GPU.
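If you want to test the GPU plumbing independently of Tensorflow, a CUDA base image running nvidia-smi makes a quick smoke test (the --gpus flag needs Docker 19.03+; with "default-runtime": "nvidia" set it is redundant but harmless):

docker run --rm --gpus all nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi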

NVIDIA environment variables and custom images

As I mentioned, the NVIDIA Container Toolkit behaves like the default runtime unless certain variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want GPU support in it. The official Tensorflow images with GPU support inherit from the CUDA base images, which set these variables, so you only have to start the image with the right runtime, as in the example above.
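For a custom image that does not inherit from the CUDA base images, the two variables that matter most are NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. A minimal sketch of setting them from the compose file instead of baking them into the image (my-custom-image is a placeholder):

version: "2.3"
services:
  app:
    image: my-custom-image  # placeholder: your own image without CUDA base layers
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all                  # which GPUs to expose
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility  # CUDA plus nvidia-smi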

If you are interested in custom Tensorflow images, I wrote another article about that.

Host configuration for the NVIDIA driver in a container

As mentioned at the beginning, this is not something you want on a workstation. The process requires you to start the driver container while no other display driver is loaded (e.g. over SSH). Also, at the time of writing only Ubuntu 16.04, Ubuntu 18.04 and Centos 7 are supported.

There is an official guide; below is an excerpt for Ubuntu 18.04.

  1. Edit the 'root' option in the NVIDIA Container Toolkit settings:
sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
  2. Disable the Nouveau driver modules:
sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"

If you are using an AWS kernel, make sure the i2c_core kernel module is enabled:

sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
  3. Update the initramfs:
sudo update-initramfs -u

Now it is time to reboot for the changes to take effect. After the reboot, check that no nouveau or nvidia modules are loaded. The following commands should return nothing:

lsmod | grep nouveau
lsmod | grep nvidia

Starting the driver in a container

The guide provides the commands to run the driver, but I prefer docker-compose. Save the following as driver.yml:

version: "3.0"
services:
  driver:
    image: nvidia/driver:450.80.02-ubuntu18.04
    privileged: true
    restart: unless-stopped
    volumes:
    - /run/nvidia:/run/nvidia:shared
    - /var/log:/var/log
    pid: "host"
    container_name: nvidia-driver

Use docker-compose -f driver.yml up -d to start the driver container. It takes a couple of minutes to compile the modules for your kernel. You can use docker logs nvidia-driver -f to follow the process; wait for the 'Done, now waiting for signal' line to appear. Alternatively, use lsmod | grep nvidia to see whether the driver modules are loaded. When it is ready, you should see something like this:

nvidia_modeset       1183744  0
nvidia_uvm            970752  0
nvidia              19722240  17 nvidia_uvm,nvidia_modeset
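From here the compose file from the first section works unchanged; assuming the 'root' option from step 1 points the toolkit at the containerized driver, starting everything is just:

docker-compose -f driver.yml up -d   # driver first, wait until the modules are loaded
docker-compose up                    # then the tensorflow-gpu service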


ven*_*iac 6

Docker Compose v1.27.0+

As of 2022, this works with Compose file format version 3.x:

version: "3.6"
services:

  jupyter-8888:
    image: "tensorflow/tensorflow:latest-gpu-jupyter"
    env_file: "env-file"
    deploy:
      resources:
        reservations:
          devices:
          - driver: "nvidia"
            device_ids: ["0"]
            capabilities: [gpu]
    ports:
      - 8880:8888
    volumes:
      - workspace:/workspace
      - data:/data

If you want to target specific GPU ids, e.g. 0 and 3:

device_ids: ['0', '3']
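If you don't care which physical GPUs you get, the Compose specification also accepts a count instead of explicit ids (count and device_ids are mutually exclusive):

    deploy:
      resources:
        reservations:
          devices:
          - driver: "nvidia"
            count: 1          # or "all"
            capabilities: [gpu]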


Kev*_*sen 1

I managed to get this working by installing WSL2 on my Windows machine and using VS Code together with the Remote-Containers extension. A collection of articles on installing WSL2 and using VS Code inside it helped a great deal along the way.

With VS Code's Remote-Containers extension you can set up a devcontainer based on a docker-compose file (or, as in my case, just a Dockerfile), which is probably explained better in the third link above. One thing I had to keep in mind myself is that when defining the .devcontainer.json file, you need to make sure to set

// Optional arguments passed to ``docker run ... ``
    "runArgs": [
        "--gpus", "all"
    ]
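For context, a minimal devcontainer.json built directly on the official image might look like this (the name is a placeholder; a docker-compose- or Dockerfile-based setup uses other keys instead of image):

{
    // devcontainer.json allows comments (JSONC)
    "name": "tf-gpu-dev",
    "image": "tensorflow/tensorflow:latest-gpu",
    "runArgs": ["--gpus", "all"]
}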

Before using VS Code I had been using PyCharm for a long time, so switching to VS Code was rather painful at first, but VS Code together with WSL2, the Remote-Containers extension and Pylance makes it quite comfortable to develop inside a container with GPU support. As far as I know, PyCharm does not support debugging inside a container on WSL at the moment.