Cuda 12 + tf-nightly 2.12: "Could not find cuda drivers on your machine, GPU will not be used", while every check passes and torch works fine

Question


tf-nightly version = 2.12.0-dev20230203
Python version = 3.10.6
CUDA driver version = 525.85.12
CUDA version = 12.0
Cudnn version = 8.5.0
I am using Linux (x86_64, Ubuntu 22.04)
I am coding in Visual Studio Code inside a venv virtual environment

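I am trying to run some models on the GPU (NVIDIA GeForce RTX 3050) using tensorflow nightly 2.12 (to be able to use Cuda 12.0). The problem is that every check I run appears to pass, yet in the end the script cannot detect the GPU. I have spent a lot of time trying to understand what is happening and nothing seems to work, so any advice or solution is more than welcome. The GPU does work with torch, as you can see at the very end of the question.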

Here are some of the most common CUDA checks I ran (in the Visual Studio Code terminal); I hope you find them useful:

Check the CUDA version:

$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Check that the CUDA libraries are linked correctly:

$ echo $LD_LIBRARY_PATH

/usr/cuda/lib

Check the NVIDIA driver for the GPU and check that the GPU is readable from the venv:

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   40C    P5     6W /  20W |     46MiB /  4096MiB |     22%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1356      G   /usr/lib/xorg/Xorg                 45MiB |
+-----------------------------------------------------------------------------+

Add the cuda/bin directory to PATH and check it:

$ export PATH="/usr/local/cuda/bin:$PATH"

$ echo $PATH

/usr/local/cuda-12.0/bin:/home/victus-linux/Escritorio/MasterThesis_CODE/to_share/venv_master/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin

Custom function to check that CUDA is installed correctly: [function by Sherlock]

libcudart.so.12 -> libcudart.so.12.0.146
libcuda.so.1 -> libcuda.so.525.85.12
libcuda.so.1 -> libcuda.so.525.85.12
libcudadebugger.so.1 -> libcudadebugger.so.525.85.12
libcuda is installed
libcudart.so.12 -> libcudart.so.12.0.146
libcudart is installed

Custom function to check that Cudnn is installed correctly: [function by Sherlock]

libcudnn_cnn_train.so.8 -> libcudnn_cnn_train.so.8.8.0
libcudnn_cnn_infer.so.8 -> libcudnn_cnn_infer.so.8.8.0
libcudnn_adv_train.so.8 -> libcudnn_adv_train.so.8.8.0
libcudnn.so.8 -> libcudnn.so.8.8.0
libcudnn_ops_train.so.8 -> libcudnn_ops_train.so.8.8.0
libcudnn_adv_infer.so.8 -> libcudnn_adv_infer.so.8.8.0
libcudnn_ops_infer.so.8 -> libcudnn_ops_infer.so.8.8.0
libcudnn is installed
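The checks above confirm that the libraries are registered with the linker cache, which is not quite the same as confirming that a Python process can load them. As a rough cross-check, here is a minimal sketch that asks the dynamic loader directly via ctypes. The helper names `can_dlopen` and `report` and the list `CANDIDATES` are hypothetical, and the sonames a particular TensorFlow build actually asks for depend on the CUDA/cuDNN versions it was compiled against.

```python
import ctypes


def can_dlopen(soname: str) -> bool:
    """Return True if the dynamic loader can open `soname` from this process."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        # Raised when the library (or one of its dependencies) cannot be found.
        return False


# Assumption: an illustrative list only. A CUDA 11 build of TensorFlow would
# look for libcudart.so.11.0, a CUDA 12 build for libcudart.so.12.
CANDIDATES = ["libcuda.so.1", "libcudart.so.12", "libcudart.so.11.0", "libcudnn.so.8"]


def report(sonames=CANDIDATES) -> dict:
    """Map each soname to whether it could be loaded."""
    return {name: can_dlopen(name) for name in sonames}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'loadable' if ok else 'NOT loadable'}")
```

Running it inside the same venv and terminal as the failing script makes it more likely that the loader sees the same LD_LIBRARY_PATH that TensorFlow does.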

So, once the previous checks were done, I ran a script to verify that everything was finally working, and the following error appeared:

2023-03-02 12:05:09.463343: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.489911: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-03-02 12:05:09.490522: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-02 12:05:10.066759: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

Tensorflow version = 2.12.0-dev20230203

2023-03-02 12:05:10.748675: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:996] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-03-02 12:05:10.771263: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1956] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

[]
Extra check: I tried running a check script with torch, and there it worked, so I guess the problem is related to tensorflow/tf-nightly

Available cuda = True
GPUs availables = 1
Current device = 0
Current Device location = <torch.cuda.device object at 0x7fbe26fd2ec0>
Name of the device = NVIDIA GeForce RTX 3050 Laptop GPU
If you know anything that might help solve this issue, please don't hesitate to tell me.

Answer 1

ari*_*ero 31

I think that, as of March 2023, the only tensorflow distribution for cuda 12 is the docker package from NVIDIA.

A tf package built for cuda 12 should show the following info:

>>> tf.sysconfig.get_build_info() 
OrderedDict([('cpu_compiler', '/usr/bin/x86_64-linux-gnu-gcc-11'), 
('cuda_compute_capabilities', ['compute_86']), 
('cuda_version', '12.0'), ('cudnn_version', '8'), 
('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)])


But if we run tf.sysconfig.get_build_info() on any tensorflow package installed via pip, it still reports that cuda_version is 11.x
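That comparison can be scripted. The following is a minimal sketch, assuming the build info behaves like a dict with a 'cuda_version' key as in the output above (the key may be absent on CPU-only builds). The function names `cuda_major`, `build_matches_system` and `check_installed_tensorflow` are hypothetical, and comparing only the major version is a simplification.

```python
def cuda_major(version):
    """Extract the major version, e.g. '12.0' -> 12; None if missing or unparsable."""
    if not version:
        return None
    try:
        return int(str(version).split(".")[0])
    except ValueError:
        return None


def build_matches_system(build_info, system_cuda_version):
    """True if the TensorFlow build and the system toolkit share a CUDA major version."""
    built = cuda_major(build_info.get("cuda_version"))
    system = cuda_major(system_cuda_version)
    return built is not None and built == system


def check_installed_tensorflow(system_cuda_version):
    """Apply the comparison to the TensorFlow installed in the current environment."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    return build_matches_system(tf.sysconfig.get_build_info(), system_cuda_version)
```

For the setup in the question, `check_installed_tensorflow("12.0")` would be expected to return False if the pip wheel was indeed built against CUDA 11.x.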

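So your options are:

Install docker following the NVIDIA cloud instructions and run one of the recent containers
Compile tensorflow from source, either the nightly or the latest release. Be warned that it takes a lot of RAM and some time, like all good compilations, and the occasional error that has to be fixed on the fly. In my case, it was the definition of kFP8, the new 8-bit float.
Wait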

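A month later, I was able to switch from pip to the Arch repository version by first uninstalling the pip tensorflow packages and then running "sudo pacman -S tensorflow-cuda python-tensorflow-cuda". This gave me 'cuda_version', '12.1' and fixed the "libcudart.so.11.0" loading error. (4 upvotes)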

Answer 2

小智 8

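I ran into the same thing; installing TensorRT fixes it.

pip3 install nvidia-tensorrt
Then check the libnvinfer.* file links again and make sure LD_LIBRARY_PATH points to the installation directory.
Reference: Could not load dynamic library 'libnvinfer.so.7'

Once all the libraries are fixed, the GPU shows up in the output.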
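To see which libnvinfer files are actually reachable, a small sketch like the one below may help. It only walks the directories listed in LD_LIBRARY_PATH; the function name `find_libs` and its parameters are hypothetical, and libraries found through the system linker cache would not show up here.

```python
import glob
import os


def find_libs(pattern="libnvinfer*", search_path=None):
    """List files matching `pattern` in each directory of a colon-separated path.

    Defaults to the LD_LIBRARY_PATH of the current process.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory:  # skip empty entries such as a trailing ':'
            hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return hits


if __name__ == "__main__":
    for path in find_libs():
        # realpath shows where a versioned symlink finally points.
        print(path, "->", os.path.realpath(path))
```

An empty result suggests that LD_LIBRARY_PATH does not cover the directory where TensorRT was installed.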

Answer 3

小智 5

A simpler and more up-to-date solution is to install with the following command:

pip3 install tensorflow[and-cuda]


This installed the cuda-11 libraries and tensorflow without any problems for me (Ubuntu 22.04, RTX-4090).
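One way to confirm what that command pulled in is to list the NVIDIA packages present in the environment. This is a sketch under the assumption that the CUDA libraries arrive as pip distributions whose names start with "nvidia-"; the function name `installed_nvidia_wheels` is hypothetical.

```python
from importlib import metadata


def installed_nvidia_wheels():
    """Return sorted 'name==version' strings for distributions named nvidia-*."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # A broken install can yield a distribution without a name; skip it.
        if name and name.lower().startswith("nvidia-"):
            found.add(f"{name}=={dist.version}")
    return sorted(found)


if __name__ == "__main__":
    wheels = installed_nvidia_wheels()
    print("\n".join(wheels) if wheels else "no nvidia-* distributions found")
```

An empty list inside the venv would suggest the CUDA libraries were not installed by pip, in which case TensorFlow falls back to whatever the system provides.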
