RuntimeError: CUDA error: no kernel image is available for execution on the device (Raster Vision)

gui*_*s93 4 ubuntu cuda runtime pytorch nvidia-docker

Hello, I am trying to run a Raster Vision pipeline on an NVIDIA GeForce RTX 3050 GPU.

  • Ubuntu 22.04
  • PyTorch version: 1.12.0+cu116
  • CUDA: 12

But when I run the Docker container like this: `sudo docker run --rm --runtime=nvidia --gpus all -it -v ${RV_QUICKSTART_CODE_DIR}:/opt/src/code -v ${RV_QUICKSTART_OUT_DIR}:/opt/data/output quay.io/azavea/raster-vision:pytorch-0.20 /bin/bash`

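The model does not train and outputs this error: RuntimeError: CUDA error: no kernel image is available for execution on the device. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.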

P.S.: Running nvidia-smi prints the GPU's details, which means the GPU is recognized. I would really appreciate some help with this issue. Thanks!

This is the output I get:

`Skipping 'analyze' command...
python -m rastervision.pipeline.cli run_command /opt/data/output/pipeline-config.json train
Running train command...
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Building datasets ...
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.core.data.raster_source.rasterio_source: WARNING - Raster block size (2, 650) is too non-square. This can slow down reading. Consider re-tiling using GDAL.
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Physical CPUs: 12
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Logical CPUs: 16
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Total memory:  15.30 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of /opt/data volume:  445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Size of / volume:  445.44 GB
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Python version: 3.9.16 (main, Jan 11 2023, 16:05:54) 
[GCC 11.2.0]
/bin/sh: 1: nvcc: not found
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - 
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Thu Mar  9 08:53:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   37C    P3    14W /  30W |    262MiB /  4096MiB |      7%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Devices:
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - index, name, driver_version, memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, NVIDIA GeForce RTX 3050 Ti Laptop GPU, 525.89.02, 4096 MiB, 262 MiB, 3639 MiB

2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - PyTorch version: 1.12.1+cu102
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA available: True
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDA version: 10.2
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - CUDNN version: 7605
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Number of CUDA devices: 1
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Active CUDA Device: GPU 0
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - model=SemanticSegmentationModelConfig(backbone=<Backbone.resnet50: 'resnet50'>, pretrained=True, init_weights=None, load_strict=True, external_def=None) solver=SolverConfig(lr=0.0001, num_epochs=1, test_num_epochs=2, test_batch_sz=4, overfit_num_steps=1, sync_interval=1, batch_sz=2, one_cycle=True, multi_stage=[], class_loss_weights=None, ignore_class_index=None, external_loss_def=None) data=SemanticSegmentationGeoDataConfig(scene_dataset='<1 train_scenes, 1 validation_scenes, 0 test_scenes>', window_opts="method=<GeoDataWindowMethod.random: 'random'> size=300 stride=None padding=None pad_direction='end' size_lims=(300, 301) h_lims=None w_lims=None max_windows=10 max_sample_attempts=100 efficient_aoi_sampling=True") predict_mode=False test_mode=False overfit_mode=False eval_train=False save_model_bundle=True log_tensorboard=True run_tensorboard=False output_uri='/opt/data/output/train'
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Using device: cuda
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - train_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - valid_ds: 10 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - test_ds: 0 items
2023-03-09 08:53:29:rastervision.pytorch_learner.learner: INFO - Plotting sample training batch.
2023-03-09 08:53:30:rastervision.pytorch_learner.learner: INFO - Plotting sample validation batch.
2023-03-09 08:53:31:rastervision.pytorch_learner.learner: INFO - epoch: 0
Training:   0%|                                                                   | 0/5 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 251, in <module>
    _main()
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 247, in _main
    main()
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 236, in run_command
    _run_command(
  File "/opt/src/rastervision_pipeline/rastervision/pipeline/cli.py", line 218, in _run_command
    command_fn()
  File "/opt/src/rastervision_core/rastervision/core/rv_pipeline/rv_pipeline.py", line 154, in train
    backend.train(source_bundle_uri=self.config.source_bundle_uri)
  File "/opt/src/rastervision_pytorch_backend/rastervision/pytorch_backend/pytorch_learner_backend.py", line 120, in train
    learner.main()
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 267, in main
    self.train()
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1265, in train
    train_metrics = self.train_epoch(
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/learner.py", line 1188, in train_epoch
    output = self.train_step(batch, batch_ind)
  File "/opt/src/rastervision_pytorch_learner/rastervision/pytorch_learner/semantic_segmentation_learner.py", line 26, in train_step
    out = self.post_forward(self.model(x))
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/segmentation/_utils.py", line 23, in forward
    features = self.backbone(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torchvision/models/_utils.py", line 69, in forward
    x = module(x)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 148, in forward
    self.num_batches_tracked.add_(1)  # type: ignore[has-type]
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
make: *** [/opt/data/output/Makefile:6: 0] Error 1`

Ani*_*ram 12

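This error occurs when the CUDA code has not been compiled for your GPU architecture. Here, the PyTorch build used by the Raster Vision Docker image does not include CUDA code compiled for sm_86 (Ampere GeForce).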
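To see the mismatch for yourself, something like the following may help. It is only a rough sketch: `parse_arch` and `build_supports_device` are hypothetical helper names, and the compatibility rule (same major version with an equal or lower minor for `sm_` binaries, any equal or lower `compute_` PTX target) is a simplification I am assuming rather than PyTorch's exact logic. The only PyTorch calls used are `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
import re


def parse_arch(arch):
    """Split a name such as 'sm_86' or 'compute_70' into (kind, major, minor)."""
    match = re.fullmatch(r"(sm|compute)_(\d+)[a-z]?", arch)
    if match is None:
        return None
    number = int(match.group(2))
    return match.group(1), number // 10, number % 10


def build_supports_device(capability, arch_list):
    """Return True if any compiled architecture can serve the device.

    Assumed (simplified) rule: an sm_XY binary runs on a device with the
    same major version and an equal or higher minor version; compute_XY
    (PTX) can be JIT-compiled for any equal or newer device.
    """
    major, minor = capability
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue  # ignore names this sketch does not understand
        kind, arch_major, arch_minor = parsed
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print("device capability:", capability)
        print("compiled architectures:", arch_list)
        print("supported:", build_supports_device(capability, arch_list))
    else:
        print("PyTorch with a visible CUDA device is required for the live check.")
```

Run inside the container, this prints the device capability next to the architectures the installed build was compiled for. If no entry covers capability (8, 6), the "no kernel image" error is expected.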

As a workaround, you can force-reinstall a PyTorch build that includes sm_86. After starting the container with docker run, run the following command:

pip install --force-reinstall torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/
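After the reinstall, a quick smoke test along these lines could confirm that kernels actually launch. This is a minimal sketch; `cuda_smoke_test` is a hypothetical helper name, not part of PyTorch or Raster Vision. It performs the same kind of in-place add that failed in the traceback, then synchronizes so that any asynchronous CUDA error surfaces immediately.

```python
def cuda_smoke_test():
    """Run one tiny CUDA kernel and report the outcome as a short string."""
    try:
        import torch
    except ImportError:
        return "torch is not installed"

    if not torch.cuda.is_available():
        return "CUDA is not available"

    try:
        x = torch.ones(8, device="cuda")
        x.add_(1)  # in-place add, like the batchnorm counter in the traceback
        torch.cuda.synchronize()  # force pending CUDA errors to be raised here
    except RuntimeError as exc:
        return f"failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(cuda_smoke_test())
```

If this prints "ok" inside the container, the installed build has kernels for the GPU. A "failed: ..." line containing the original message suggests the wheel still lacks the right architecture.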

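  • This works like a charm. Thank you so much Anis, you really solved my problem! I was also wondering: since I have to reinstall PyTorch every time I create a container (by running the command above), is there any way to reinstall PyTorch outside the container (for example, in my virtual environment) so that I don't need to run the command each time I create a container? Thanks again :) (2 upvotes)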