Kev*_*sen (21) · tags: gpu, nvidia, docker, docker-compose, tensorflow
I want to create a neural network in TensorFlow 2.x that trains on a GPU, and I want to set up all of the necessary infrastructure inside a docker-compose network (assuming for now that this is actually possible). As far as I know, in order to train a TensorFlow model on a GPU, I need the CUDA Toolkit and the NVIDIA driver. Installing these dependencies natively on my computer (OS: Ubuntu 18.04) is always quite a pain, as there are many version dependencies between TensorFlow, CUDA and the NVIDIA driver. So I have been trying to find a way to create a docker-compose file that contains a service each for TensorFlow, CUDA and the NVIDIA driver, but I am getting the following error:
# Start the services
sudo docker-compose -f docker-compose-test.yml up --build
Starting vw_image_cls_nvidia-driver_1 ... done
Starting vw_image_cls_nvidia-cuda_1 ... done
Recreating vw_image_cls_tensorflow_1 ... error
ERROR: for vw_image_cls_tensorflow_1 Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: for tensorflow Cannot start service tensorflow: OCI runtime create failed: container_linux.go:346: starting container process caused "exec: \"import\": executable file not found in $PATH": unknown
ERROR: Encountered errors while bringing up the project.
My docker-compose file looks as follows:
# version 2.3 is required for NVIDIA runtime
version: '2.3'

services:
  nvidia-driver:
    # NVIDIA GPU driver used by the CUDA Toolkit
    image: nvidia/driver:440.33.01-ubuntu18.04
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Do we need this volume to make the driver accessible by other containers in the network?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
    networks:
      - net

  nvidia-cuda:
    depends_on:
      - nvidia-driver
    image: nvidia/cuda:10.1-base-ubuntu18.04
    volumes:
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need to create an additional volume for this service to be accessible by the tensorflow service?
    devices:
      # Do we need to list the devices here, or only in the tensorflow service? Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
    networks:
      - net

  tensorflow:
    image: tensorflow/tensorflow:2.0.1-gpu # Does this ship with cuda10.0 installed or do I need a separate container for it?
    runtime: nvidia
    restart: always
    privileged: true
    depends_on:
      - nvidia-cuda
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      # Volumes related to source code and config files
      - ./src:/src
      - ./configs:/configs
      # Do we need the driver volume here?
      - nvidia_driver:/usr/local/nvidai/:ro # Taken from here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      # Do we need an additional volume from the nvidia-cuda service?
    command: import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000]))); print("SUCCESS")
    devices:
      # Devices listed here: http://collabnix.com/deploying-application-in-the-gpu-accelerated-data-center-using-docker/
      - /dev/nvidiactl
      - /dev/nvidia-uvm
      - /dev/nvidia0
      - /dev/nvidia-uvm-tools
    networks:
      - net

volumes:
  nvidia_driver:

networks:
  net:
    driver: bridge
And my /etc/docker/daemon.json file looks as follows:
{"default-runtime":"nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
So, it seems like the error is somehow related to configuring the nvidia runtime, but more importantly, I am almost certain that I didn't set up my docker-compose file correctly. So, my questions are:
How do I set up this kind of infrastructure correctly in a docker-compose.yml (which services, volumes and devices are actually needed)? Thank you very much for your help, I highly appreciate it.
I agree that installing all of the tensorflow-gpu dependencies is rather painful. Fortunately, it is quite simple with Docker, as you only need the NVIDIA driver and the NVIDIA Container Toolkit (a plugin). The rest (CUDA, cuDNN) already ships inside the Tensorflow image, so you don't need any of it on the Docker host.
The driver can also be deployed as a container, but I do not recommend that for a workstation; it is meant for servers that run no GUI (X server etc.). The containerized driver is covered at the end of this answer; for now, let's see how to start tensorflow-gpu with docker-compose. The process is the same whether the driver runs in a container or not.
Prerequisites:
To enable GPU support for a container, it has to be created with the NVIDIA Container Toolkit. There are two ways to do that:
- Pass runtime: nvidia for the service in the compose file (shown in the example below).
- Set nvidia as the default container runtime. This is convenient because it then behaves just like the default runtime, unless some NVIDIA-specific environment variables are present (more on that later). It is done by putting "default-runtime": "nvidia" into Docker's /etc/docker/daemon.json:
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
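After editing daemon.json, restart the Docker daemon so the new default runtime takes effect. On a systemd-based system such as Ubuntu 18.04 that would typically be:
# Restart Docker, then verify that the 'nvidia' runtime is registered (and default)
sudo systemctl restart docker
docker info | grep -i runtime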
The runtime option is only supported by docker-compose file format version 2.3. Here is an example docker-compose.yml that starts Tensorflow with a GPU:
version: "2.3" # the only version where 'runtime' option is supported
services:
test:
image: tensorflow/tensorflow:2.3.0-gpu
# Make Docker create the container with NVIDIA Container Toolkit
# You don't need it if you set 'nvidia' as the default runtime in
# daemon.json.
runtime: nvidia
# the lines below are here just to test that TF can see GPUs
entrypoint:
- /usr/local/bin/python
- -c
command:
- "import tensorflow as tf; tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)"
When you run this with docker-compose up, you should see a line with the GPU specs. It appears at the end and looks like this:
test_1 | 2021-01-23 11:02:46.500189: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 1624 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
That is all you need to launch the official Tensorflow image with GPU support.
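As a quick sanity check outside of compose, you can also run nvidia-smi in a throwaway CUDA container; a minimal sketch, assuming Docker 19.03+ (which provides the --gpus flag) and the toolkit installed:
# Confirm that Docker can see the GPU at all
docker run --rm --gpus all nvidia/cuda:10.1-base-ubuntu18.04 nvidia-smi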
As I mentioned, the NVIDIA Container Toolkit behaves like the default runtime unless certain variables are present. These are listed and explained here. You only need to care about them if you build a custom image and want to enable GPU support in it. The official Tensorflow GPU images inherit from the CUDA base images, so you only need to start them with the right runtime, as in the example above.
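For illustration, this is roughly what those variables look like on a compose service; a sketch only, where my-custom-image is a hypothetical image that is not derived from nvidia/cuda:
version: "2.3"
services:
  custom:
    image: my-custom-image  # hypothetical image, not based on nvidia/cuda
    runtime: nvidia
    environment:
      # which GPUs the NVIDIA runtime should expose to the container
      - NVIDIA_VISIBLE_DEVICES=all
      # which driver libraries to inject (compute = CUDA, utility = nvidia-smi)
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility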
If you are interested in custom Tensorflow images, I wrote another answer about that.
As mentioned at the beginning, running the driver as a container is not something you want on a workstation. The procedure requires you to start the driver container while no other display driver is loaded (e.g. over SSH). Moreover, at the time of writing only Ubuntu 16.04, Ubuntu 18.04 and Centos 7 are supported.
There is an official guide; below is an excerpt for Ubuntu 18.04.
# Point the NVIDIA container runtime at the containerized driver's rootfs
sudo sed -i 's/^#root/root/' /etc/nvidia-container-runtime/config.toml
# Load ipmi_msghandler at boot and blacklist the nouveau driver
sudo tee /etc/modules-load.d/ipmi.conf <<< "ipmi_msghandler" \
  && sudo tee /etc/modprobe.d/blacklist-nouveau.conf <<< "blacklist nouveau" \
  && sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf <<< "options nouveau modeset=0"
If you are using an AWS kernel, make sure the i2c_core kernel module is enabled:
sudo tee /etc/modules-load.d/ipmi.conf <<< "i2c_core"
Update the initramfs:
sudo update-initramfs -u
Now reboot for the changes to take effect. After the reboot, check that neither the nouveau nor the nvidia module is loaded. The following commands should return nothing:
lsmod | grep nouveau
lsmod | grep nvidia
The guide provides a command to run the driver, but I prefer docker-compose. Save the following as driver.yml:
version: "3.0"
services:
driver:
image: nvidia/driver:450.80.02-ubuntu18.04
privileged: true
restart: unless-stopped
volumes:
- /run/nvidia:/run/nvidia:shared
- /var/log:/var/log
pid: "host"
container_name: nvidia-driver
Run Code Online (Sandbox Code Playgroud)
Start the driver container with docker-compose -f driver.yml up -d. It takes a few minutes to compile the modules for your kernel. You can follow the process with docker logs nvidia-driver -f and wait for the line "Done, now waiting for signal" to appear, or use lsmod | grep nvidia to see whether the driver modules are loaded. Once it is ready, you should see something like this:
nvidia_modeset 1183744 0
nvidia_uvm 970752 0
nvidia 19722240 17 nvidia_uvm,nvidia_modeset
Since 2022, compose file format 3.x can also request GPUs, via device reservations:
version: "3.6"
services:
jupyter-8888:
image: "tensorflow/tensorflow:latest-gpu-jupyter"
env_file: "env-file"
deploy:
resources:
reservations:
devices:
- driver: "nvidia"
device_ids: ["0"]
capabilities: [gpu]
ports:
- 8880:8888
volumes:
- workspace:/workspace
- data:/data
Run Code Online (Sandbox Code Playgroud)
If you want to specify particular GPU ids, e.g. 0 and 3:
device_ids: ['0', '3']
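For reference, the equivalent on the plain docker CLI is the --gpus flag with a device list; a sketch, assuming Docker 19.03+:
# Expose only GPUs 0 and 3 to the container
docker run --rm --gpus '"device=0,3"' tensorflow/tensorflow:latest-gpu nvidia-smi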
I managed to get this working by installing WSL2 on my Windows machine and using VS Code together with the Remote-Containers extension. Several articles helped a lot with installing WSL2 and using VS Code within it.
With VS Code's Remote-Containers extension you can set up a devcontainer based on a docker-compose file (or, as in my case, just a Dockerfile). One thing I have to remind myself of is that, when defining the .devcontainer.json file, you need to make sure to set:
// Optional arguments passed to `docker run ...`
"runArgs": [
"--gpus", "all"
]
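For context, a minimal sketch of what such a .devcontainer.json could look like; the name, Dockerfile path and extension list here are illustrative assumptions, not from the original setup:
{
    // build the dev container from a local Dockerfile (hypothetical path)
    "name": "tf-gpu-dev",
    "build": { "dockerfile": "Dockerfile" },
    // pass GPU access through to `docker run`
    "runArgs": ["--gpus", "all"],
    // example extension to install inside the container
    "extensions": ["ms-python.python"]
}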
I had used Pycharm for a long time before VS Code, so the switch was quite painful at first, but VS Code together with WSL2, the Remote-Containers extension and the Pylance extension makes developing inside a container with GPU support very pleasant. As far as I know, Pycharm does not currently support debugging inside a container under WSL.