Azure ML experiment using a custom GPU CUDA environment

Question


car*_*d27 3 python pytorch azure-machine-learning-service

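For the past week I have been trying to create a Python experiment in Azure ML studio. The job trains a PyTorch (1.12.1) neural network in a custom environment with CUDA 11.6 for GPU acceleration. However, any attempt to move a tensor to the GPU raises a runtime error: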

import torch

device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device="cpu")
test_tensor = test_tensor.to(device)  # .to() returns a new tensor; the error is raised here


CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.


I tried setting CUDA_LAUNCH_BLOCKING=1, but that did not change the result.
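For reference, a minimal sketch of setting the variable from inside the script rather than in the job configuration. It assumes the lines run at the very top of the training script, before anything initializes CUDA; how the variable was actually set in the failing run is not stated above.

```python
import os

# Assumption: this runs before torch is imported and before any CUDA call,
# so the setting is in place when CUDA is first initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
# `import torch` and the rest of the training code would follow here.
```

With the variable set, kernel launches are synchronous, which only makes the stack trace more precise; it is not expected to make an unavailable device usable.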

I also checked whether CUDA is available:

print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")


The output looks completely normal:

Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80


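I also tried downgrading and changing the CUDA, Torch, and Python versions, but that did not seem to affect the error.

As far as I can tell, the error only occurs with a custom environment. With a curated environment, the script runs without problems. However, because the script needs some libraries (such as OpenCV), I am forced to use a custom Dockerfile to create my environment, which is reproduced here for reference: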

FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1


USER root
RUN apt update
# Necessary dependencies for OpenCV
RUN apt install ffmpeg libsm6 libxext6 libgl1-mesa-glx -y 

RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
                'azureml-core' \
                'azureml-dataset-runtime' \
                'azureml-defaults' \
                'azure-ml' \
                'azure-ml-component' \
                'azureml-mlflow' \
                'azureml-telemetry' \
                'azureml-contrib-services'

COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
    cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
    cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
    ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
    rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888


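The code in the COPY statement is copied from one of Azure's predefined curated environments. I want to stress that I also tried the Dockerfile from one of those environments without any modification and got the same result.

So my question is: how can I run a CUDA job with a custom environment? Is it even possible?

I tried to find a solution, but I could not find anyone with the same problem, nor anywhere in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that someone here can help.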

Answer 1

Tim*_*lin 5

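This problem is indeed delicate and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the custom Docker container itself and its dependencies.

Since you have a Tesla K80, I suspect the environment is deployed on NC-series GPUs.

As of the time of writing (10 February 2023), the following note applies ( https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments ):

Note

Currently, due to underlying cuda and cluster incompatibilities, on NC series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3 can be used.

So, in my view, this comes down to the supported versions of CUDA + PyTorch and Python.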
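To see which versions are actually in play on the compute node, a small diagnostic along the following lines could be run as the job script. This is only a sketch: the function name `cuda_smoke_test` and the report keys are made up for illustration. Unlike the availability queries in the question, it also attempts a real allocation on the GPU, which is the step that failed there.

```python
import importlib.util


def cuda_smoke_test():
    """Collect version details and try a real GPU allocation (hypothetical helper)."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    # CUDA version the PyTorch build was compiled against (None for CPU-only builds).
    report["built_for_cuda"] = torch.version.cuda
    report["is_available"] = torch.cuda.is_available()
    if not report["is_available"]:
        return report

    report["device_name"] = torch.cuda.get_device_name(0)
    report["compute_capability"] = torch.cuda.get_device_capability(0)
    # GPU architectures this PyTorch build ships kernels for.
    report["compiled_archs"] = torch.cuda.get_arch_list()
    try:
        # Allocates memory and runs a kernel on the GPU, then waits for it.
        torch.rand((3, 4), device="cuda")
        torch.cuda.synchronize()
        report["allocation_ok"] = True
    except RuntimeError as err:
        report["allocation_ok"] = False
        report["error"] = str(err)
    return report


if __name__ == "__main__":
    for key, value in cuda_smoke_test().items():
        print(f"{key}: {value}")
```

Comparing `built_for_cuda` and `compiled_archs` between a curated environment that works and the custom one that fails should show whether the two really differ in their CUDA and PyTorch builds.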

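In my case, I simply installed my dependencies through a .yaml dependency file when creating the environment, starting from this base image:

Azure container registry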

mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9


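You can build your Docker container starting from this URI as the base image so that it works properly on Tesla K80s.

Important: using this base image did work in my case, and I was able to train PyTorch models.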
