CUDA 12 + tf-nightly 2.12: "Could not find cuda drivers on your machine, GPU will not be used", although every check passes and the GPU works in torch

Jai*_*ton 47 python gpu tensorflow

  • tf-nightly version = 2.12.0-dev20230203
  • Python version = 3.10.6
  • CUDA driver version = 525.85.12
  • CUDA version = 12.0
  • cuDNN version = 8.5.0
  • I am using Linux (x86_64, Ubuntu 22.04)
  • I am coding in Visual Studio Code inside a venv virtual environment

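I am trying to run some models on the GPU (NVIDIA GeForce RTX 3050) using TensorFlow nightly 2.12 (to be able to use CUDA 12.0). The problem is that every check I run appears to pass, yet in the end the script cannot detect the GPU. I have spent a lot of time trying to understand what is happening, but nothing seems to work, so any advice or solution is welcome. The GPU does work with torch, as you can see at the very end of the question.

Here are some of the most common CUDA checks I ran (from the Visual Studio Code terminal); I hope you find them useful:

  1. Check the CUDA version: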

    $ nvcc --version

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Fri_Jan__6_16:45:21_PST_2023
    Cuda compilation tools, release 12.0, V12.0.140
    Build cuda_12.0.r12.0/compiler.32267302_0
    
  2. Check that the link to the CUDA libraries is correct:

    $ echo $LD_LIBRARY_PATH

    /usr/cuda/lib
    
  3. Check the NVIDIA driver for the GPU and check that the GPU is readable from the venv:

    $ nvidia-smi

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
    | N/A   40C    P5     6W /  20W |     46MiB /  4096MiB |     22%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A      1356      G   /usr/lib/xorg/Xorg                 45MiB |
    +-----------------------------------------------------------------------------+
    
  4. Add the cuda/bin path and check it:

    $ export PATH="/usr/local/cuda/bin:$PATH"

    $ echo $PATH

    /usr/local/cuda-12.0/bin:/home/victus-linux/Escritorio/MasterThesis_CODE/to_share/venv_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
    
  5. Custom function to check that CUDA is correctly installed: [function by Sherlock]

    libcudart.so.12 -> libcudart.so.12.0.146
            libcuda.so.1 -> libcuda.so.525.85.12
            libcuda.so.1 -> libcuda.so.525.85.12
            libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
    libcuda is installed
            libcudart.so.12 -> libcudart.so.12.0.146
    libcudart is installed
    
  6. Custom function to check that cuDNN is correctly installed: [function by Sherlock]

            libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.8.0
            libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.8.0
            libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.8.0
            libcudnn.so.8 -> libcudnn.so.8.8.0
            libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.8.0
            libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.8.0
            libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.8.0
    libcudnn is installed
    
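The checks above show that the libraries exist on disk, but not that a process can actually load them under the sonames a given TensorFlow build asks for. The following is a minimal sketch of such a probe using `ctypes`. The function name `probe_libraries` and the list `CANDIDATE_LIBS` are hypothetical, and the sonames are simply the ones that appear in this thread; which of them a particular TensorFlow build needs is an assumption to verify.

```python
import ctypes

# Sonames taken from the outputs and comments in this thread. Which ones a
# given TensorFlow build really needs is an assumption, not a known fact.
CANDIDATE_LIBS = [
    "libcuda.so.1",
    "libcudart.so.12",
    "libcudart.so.11.0",
    "libcudnn.so.8",
]


def probe_libraries(names):
    """Return {soname: bool}, True when the dynamic loader can open it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # same search rules as any other dlopen call
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in probe_libraries(CANDIDATE_LIBS).items():
        print(f"{name}: {'loadable' if ok else 'NOT loadable'}")
```

If `libcudart.so.12` loads but `libcudart.so.11.0` does not, that would suggest the toolkit is fine and the installed TensorFlow build is looking for a different CUDA major version.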

Once I had finished the previous checks, I ran a script to evaluate whether everything finally worked, and the following error appeared:

2023-03-02 12:05:09.463343: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.489911: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.490522: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:05:10.066759: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Tensorflow version = 2.12.0-dev20230203

2023-03-02 12:05:10.748675: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-03-02 12:05:10.771263: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[]
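The script that produced this log is not reproduced in this copy of the question. A minimal sketch of an equivalent check is shown here, assuming only the public `tf.config.list_physical_devices` call; the function name `tensorflow_gpu_report` is hypothetical. An empty `gpus` list corresponds to the `[]` at the end of the log.

```python
def tensorflow_gpu_report():
    """Return TensorFlow's version and visible GPUs, or None if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # Each entry is a PhysicalDevice; keep only its name for printing.
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    report = tensorflow_gpu_report()
    if report is None:
        print("TensorFlow is not installed in this environment")
    else:
        print("Tensorflow version =", report["version"])
        print(report["gpus"])
```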

Extra check: I ran a similar check script with torch, and there it worked, so I guess the problem is related to tensorflow/tf-nightly.

Available cuda = True

GPUs availables = 1

Current device = 0

Current Device location = <torch.cuda.device object at 0x7fbe26fd2ec0>

Name of the device = NVIDIA GeForce RTX 3050 Laptop GPU
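A working torch does not by itself prove that the system CUDA install is what TensorFlow needs: PyTorch's pip wheels typically bring their own CUDA runtime libraries, so torch can succeed independently of the system toolkit. The sketch below, with the hypothetical name `torch_cuda_report`, reports the CUDA version torch was built against next to the usual availability checks. It assumes only `torch.version.cuda` and the standard `torch.cuda` helpers.

```python
def torch_cuda_report():
    """Return torch's CUDA build version and device info, or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    available = torch.cuda.is_available()
    return {
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "available": available,
        "device_count": torch.cuda.device_count(),
        "device_name": torch.cuda.get_device_name(0) if available else None,
    }


if __name__ == "__main__":
    print(torch_cuda_report())
```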

If you know anything that might help solve this issue, please don't hesitate to tell me.

ari*_*ero 31

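I think that, as of March 2023, the only TensorFlow distribution built for CUDA 12 is NVIDIA's Docker package.

A TF package built for CUDA 12 should show the following build info: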

>>> tf.sysconfig.get_build_info() 
OrderedDict([('cpu_compiler', '/usr/bin/x86_64-linux-gnu-gcc-11'), 
('cuda_compute_capabilities', ['compute_86']), 
('cuda_version', '12.0'), ('cudnn_version', '8'), 
('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])

However, if we run tf.sysconfig.get_build_info() on any TensorFlow package installed via pip, it still reports cuda_version as 11.x.
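The comparison described here can be sketched as a small helper that takes the build-info mapping and the toolkit version reported by `nvcc`. The names `cuda_major` and `build_matches_toolkit` are hypothetical, and treating a missing `cuda_version` key as a mismatch is an assumption of this sketch.

```python
def cuda_major(version):
    """Return the major component of a version string such as '12.0'."""
    return int(str(version).split(".")[0])


def build_matches_toolkit(build_info, toolkit_version):
    """True if the build-time CUDA major version equals the toolkit's."""
    built = build_info.get("cuda_version")
    if built is None:
        # Assumption: no 'cuda_version' entry is treated as a mismatch.
        return False
    return cuda_major(built) == cuda_major(toolkit_version)


if __name__ == "__main__":
    # Hypothetical build info for a pip wheel built against CUDA 11.x.
    pip_build = {"cuda_version": "11.8", "is_cuda_build": True}
    print(build_matches_toolkit(pip_build, "12.0"))
```

In practice `build_info` would be `dict(tf.sysconfig.get_build_info())`, and `toolkit_version` the release printed by `nvcc --version`.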

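So your options are:

  • Install Docker following NVIDIA's cloud instructions and run one of the recent containers
  • Compile TensorFlow from source, either the nightly or the latest release. Note that it needs a lot of RAM and some time, like all good compilations, and that errors occasionally show up that have to be fixed along the way. In my case, it was the definition of kFP8, the new 8-bit float.
  • Wait

  • One month later, I was able to switch from pip to the Arch repository version by first uninstalling the pip tensorflow package and then running "sudo pacman -S tensorflow-cuda python-tensorflow-cuda". This gives me 'cuda_version', '12.1' and fixes the "libcudart.so.11.0" load error. (4 upvotes)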

小智 8

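I ran into the same thing, and installing TensorRT solved it.

  1. pip3 install nvidia-tensorrt
  2. Check the libnvinfer.* file links again and make sure LD_LIBRARY_PATH points to the install directory.
  3. Reference: Could not load dynamic library 'libnvinfer.so.7'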
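Step 2 can be sketched in Python as a scan of every directory on `LD_LIBRARY_PATH` for files matching `libnvinfer*`, printing where each one resolves. The function name `find_libraries` is hypothetical, and this only inspects `LD_LIBRARY_PATH`, not the system loader cache.

```python
import glob
import os


def find_libraries(pattern, search_path):
    """List files matching `pattern` in each directory of a ':'-separated path."""
    matches = []
    for directory in search_path.split(":"):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        matches.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return matches


if __name__ == "__main__":
    path = os.environ.get("LD_LIBRARY_PATH", "")
    for lib in find_libraries("libnvinfer*", path):
        # realpath follows symlinks, showing which file each link points to
        print(lib, "->", os.path.realpath(lib))
```

An empty result means no directory on `LD_LIBRARY_PATH` contains the TensorRT libraries, which would be consistent with the "Could not find TensorRT" warning in the question's log.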

Once all the libraries are fixed, the GPU shows up in the output:

[image: GPU visible]


小智 5

A simpler, more up-to-date solution: just install with the following command:

pip3 install tensorflow[and-cuda]

This installed the cuda-11 libraries and TensorFlow without any problems for me (Ubuntu 22.04, RTX-4090).
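To confirm what that command brought in, one option is to list the CUDA-related wheels in the active environment. The sketch below assumes those wheels are published under names starting with `nvidia-`; the function name `installed_nvidia_packages` is hypothetical.

```python
from importlib import metadata


def installed_nvidia_packages():
    """Return sorted (name, version) pairs for installed 'nvidia-*' distributions."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith("nvidia-"):
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    packages = installed_nvidia_packages()
    if not packages:
        print("No nvidia-* packages found in this environment")
    for name, version in packages:
        print(f"{name}=={version}")
```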