Setting up TensorFlow 2.4 with GPU on Ubuntu 20.04 without sudo

Question

phe*_*ath 2 python ubuntu tensorflow ubuntu-20.04

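I have access to a virtual machine with Ubuntu 20.04 and GPUs. The sysadmin has already installed the latest CUDA drivers, but unfortunately that is not enough to use the GPUs in TensorFlow, because each TF version can be very picky about the specific CUDA Toolkit + cuDNN versions it works with. I don't have sudo privileges, so I need to install everything locally.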

nvidia-smi


returns Driver Version: 465.19.01, CUDA Version: 11.3

python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU');"


returns

2021-05-11 10:56:26.737279: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:26.737338: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-05-11 10:56:28.313896: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-11 10:56:28.315540: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-05-11 10:56:28.324232: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.324707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:00:05.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.324867: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-05-11 10:56:28.325293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties:
pciBusID: 0000:00:06.0 name: NVIDIA Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-05-11 10:56:28.325438: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325563: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325706: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325820: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.325931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326028: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326117: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326215: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2021-05-11 10:56:28.326230: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

This shows that the GPUs will not be used in TF applications.
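The same check can be run without importing TensorFlow at all. The following is only a sketch: it asks the dynamic loader for each library named in the warnings above and reports the ones it cannot open. The function name `find_missing` and the `loader` parameter are illustrative and not part of any library.

```python
import ctypes

# Library names copied from the "Could not load dynamic library" warnings in the log.
REQUIRED_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def find_missing(names, loader=ctypes.CDLL):
    """Return the names that the dynamic loader fails to open."""
    missing = []
    for name in names:
        try:
            loader(name)  # uses the normal search path, including LD_LIBRARY_PATH
        except OSError:
            missing.append(name)
    return missing


for name in find_missing(REQUIRED_LIBS):
    print("cannot load", name)
```

An empty output means every library was found on the loader's search path, which is the state the answer below aims for.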

It took me some time to get the VM set up, so I am posting my solution step by step below.

Answer 1

phe*_*ath 6

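Instructions for setting up TensorFlow 2.4.x (tested with 2.4.1) on Ubuntu 20.04 without admin rights. They assume that the sysadmin has already installed the latest CUDA drivers, and consist of installing the CUDA 11.0 toolkit and cuDNN 8.2.0.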
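Whether the existing driver is recent enough can be checked before downloading anything. The sketch below compares the driver reported by nvidia-smi in the question (465.19.01) with the driver version embedded in the name of the patch installer used later (cuda_11.0.3_450.51.06_linux.run). Treating the version in the file name as the minimum the toolkit expects is an assumption; NVIDIA's CUDA release notes are the authoritative source. Both helper names are illustrative.

```python
def parse_version(text):
    """Turn a dotted version such as '465.19.01' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, bundled):
    """True when the installed driver is at least the version bundled with the toolkit."""
    return parse_version(installed) >= parse_version(bundled)


# 465.19.01 comes from nvidia-smi; 450.51.06 from the installer file name (assumed minimum).
print(driver_is_new_enough("465.19.01", "450.51.06"))
```

If this prints False, a local toolkit install will not help and the driver itself has to be updated, which requires sudo (see the end of this answer).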

The following instructions install CUDA 11.0 (tested to work with TensorFlow 2.4.1) under the directory /home/pherath/cuda_toolkits/cuda-11.0, with no sudo privileges required.

Step 1. Download CUDA 11.0

wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
chmod +x cuda_11.0.2_450.51.05_linux.run

Step 2, Option 1: For a quick, automated install, use the following command

./cuda_11.0.2_450.51.05_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

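Step 2, Option 2: Interactive installer, step by step

./cuda_11.0.2_450.51.05_linux.run

Continue, then accept the EULA.

Check only CUDA Toolkit and uncheck everything else. Then go to Options.

Go to Toolkit Options.

Uncheck everything, then go to Change Toolkit Install Path and replace it with /home/pherath/cuda_toolkits/cuda-11.0. After this step, proceed with the installation.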

Step 3. Download the CUDA 11.0 patch

wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda_11.0.3_450.51.06_linux.run
chmod +x cuda_11.0.3_450.51.06_linux.run

Step 4, Option 1: Quick silent mode

./cuda_11.0.3_450.51.06_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-11.0

Step 4, Option 2: GUI mode
Repeat the exact steps of Step 2, Option 2.

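The installation may exit with an error.
On inspecting the logs, the error I saw suggests a possible bug in the installation script. The only offending item was the symbolic link for a single file.

[ERROR]: boost::filesystem::create_symlink: File exists: "libcuinj64.so.11.0", "/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib/libcuinj64.so"

Across attempts on various distributions (e.g. on Ubuntu 16.04) I hit the same single error for a few other files:
libcuinj64.so.11.0, libaccinj64.so.11.0, libnvrtc-builtins.so.11.0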

This error can be fixed with the following two lines (substitute the library name reported in your own log)

cd /home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib # move to the dir of the offending file
ln -s libaccinj64.so.11.0 libaccinj64.so # create the link in the correct direction (we need libaccinj64.so -> libaccinj64.so.11.0)
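Because the failing file differed between attempts, the fix may need repeating for more than one library. The following is a minimal sketch of the same repair as a re-runnable helper. It assumes the versioned file (for example libaccinj64.so.11.0) already exists in the directory and only the unversioned link is missing or wrong. `ensure_symlink` is a hypothetical name, not something shipped with CUDA.

```python
import os


def ensure_symlink(lib_dir, target_name, link_name):
    """Make lib_dir/link_name a relative symlink to target_name.

    Returns True if the link was created or replaced, False if it was already correct.
    """
    target_path = os.path.join(lib_dir, target_name)
    link_path = os.path.join(lib_dir, link_name)
    if not os.path.exists(target_path):
        raise FileNotFoundError(target_path)
    if os.path.islink(link_path):
        if os.readlink(link_path) == target_name:
            return False  # already points the right way
        os.remove(link_path)  # wrong or dangling link: replace it
    elif os.path.exists(link_path):
        # A regular file with the link's name is unexpected; do not overwrite it.
        raise FileExistsError(link_path)
    os.symlink(target_name, link_path)
    return True
```

It could be called once per name from the list above, for example `ensure_symlink("/home/pherath/cuda_toolkits/cuda-11.0/targets/x86_64-linux/lib", "libaccinj64.so.11.0", "libaccinj64.so")`.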

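Step 5. Download cuDNN 8.2.0

cd /home/pherath/cuda_toolkits # go to the parent of the CUDA install directory

You need to download the cuDNN .tgz file from the cuDNN Archive; I used v8.2.0. This step requires you to create an account and download through the web interface. If the machine where you are setting up TensorFlow doesn't have a GUI, I recommend the "Link Redirect Trace" extension (available for Google Chrome) to trace the exact link of the downloaded file. You can trace the link using the GUI of your local machine, then download the traced link on the VM with wget. Note that the traced link has a relatively short lifetime.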

After downloading, the file still has a garbled name; rename it back to a .tgz:

mv "$some_ambiguous_name" cudnn-11.3-linux-x64-v8.2.0.53.tgz

Now extract it in the parent directory of the CUDA installation

tar -xvzf cudnn-11.3-linux-x64-v8.2.0.53.tgz # this will extract things under a dir called 'cuda'

Now copy everything from lib64 and include into the corresponding directories under the CUDA toolkit installation

cp -fv cuda/lib64/*.* cuda-11.0/lib64/.
cp -fv cuda/include/*.* cuda-11.0/include/.
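To confirm that the copy picked up the intended cuDNN release, the version macros can be read back from the copied headers. This is a sketch under the assumption that the archive ships cudnn_version.h defining CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as cuDNN 8 releases do. `parse_cudnn_version` is an illustrative helper.

```python
import os
import re

_DEFINE = re.compile(
    r"^\s*#\s*define\s+CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", re.MULTILINE
)


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    found = {name: int(value) for name, value in _DEFINE.findall(header_text)}
    try:
        return (found["MAJOR"], found["MINOR"], found["PATCHLEVEL"])
    except KeyError as exc:
        raise ValueError("version macro not found: CUDNN_%s" % exc.args[0]) from exc


# Path assumed from the install location used in this answer.
header = "/home/pherath/cuda_toolkits/cuda-11.0/include/cudnn_version.h"
if os.path.exists(header):
    with open(header) as fh:
        print("cuDNN version:", parse_cudnn_version(fh.read()))
```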

Step 6. Create/append/prepend the PATH and LD_LIBRARY_PATH environment variables.

Add the following lines to the end of ~/.bashrc (otherwise, make sure you extend the environment variables accordingly in every bash session where you will run TF scripts).

export CUDA11=/home/pherath/cuda_toolkits/cuda-11.0
export PATH=$CUDA11/bin:$PATH
export LD_LIBRARY_PATH=$CUDA11/lib64:$CUDA11/extras/CUPTI/lib64:$LD_LIBRARY_PATH

Start a new terminal, or run

source ~/.bashrc

in every existing terminal.
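A quick way to see whether the current shell actually picked up these variables is to inspect them from Python. The sketch below assumes the directory layout from Step 6 and falls back to the path used in this answer when CUDA11 is not set. `missing_entries` is an illustrative name.

```python
import os


def missing_entries(path_value, required):
    """Return the required directories that do not appear in a PATH-style string."""
    present = {os.path.normpath(p) for p in path_value.split(os.pathsep) if p}
    return [r for r in required if os.path.normpath(r) not in present]


cuda_home = os.environ.get("CUDA11", "/home/pherath/cuda_toolkits/cuda-11.0")
checks = {
    "PATH": [os.path.join(cuda_home, "bin")],
    "LD_LIBRARY_PATH": [
        os.path.join(cuda_home, "lib64"),
        os.path.join(cuda_home, "extras/CUPTI/lib64"),
    ],
}
for var, required in checks.items():
    for entry in missing_entries(os.environ.get(var, ""), required):
        print(f"{var} is missing {entry}")
```

No output means all three directories are present in the environment of the process that ran the script.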

Verifying that the installation works

You can run the following lines to test whether TF 2.4.1 and the profiler work:

conda create -n tf python=3.7 -y  # create a python environment
conda activate tf # activate the virtual environment (here conda)
pip install tensorflow==2.4.1 # install tf 2.4.1
python -c "import tensorflow as tf, logging; logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s'); tf.config.list_physical_devices('GPU'); tf.profiler.experimental.start('.'); tf.profiler.experimental.stop()" # test to see if TF with GPU works
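The one-liner above relies on reading TensorFlow's log output. For a more explicit result, a small script can print the devices that TensorFlow registered. This is only a sketch: `gpu_names` is an illustrative helper, and it returns None when TensorFlow is not importable so that it can run in any environment.

```python
def gpu_names():
    """Return the names of GPUs visible to TensorFlow, or None if TF is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return [device.name for device in tf.config.list_physical_devices("GPU")]


names = gpu_names()
if names is None:
    print("TensorFlow is not importable in this environment")
elif not names:
    print("TensorFlow imported, but no GPU was registered")
else:
    print("GPUs visible to TensorFlow:", names)
```

On the machine from the question, a working setup should list two devices, one per Tesla P100.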

###########################################################################

If you want to install CUDA Toolkit 10.2 on Ubuntu 20.04 LTS, the one-line install command changes accordingly (you need to add the library path and override the complaint about the gcc version mismatch).

./cuda_10.2.89_440.33.01_linux.run --silent --tmpdir=. --toolkit --toolkitpath=/home/pherath/cuda_toolkits/cuda-10.2 --librarypath=/home/pherath/cuda_toolkits/cuda-10.2 --override

Remember that you also need to repeat this process for the CUDA Toolkit 10.2 patches. After that, download the corresponding cuDNN and copy lib64 and include into the CUDA toolkit directory (same as the instructions above).

###########################################################################

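If you still get errors, most likely you do not have the right CUDA/NVIDIA drivers installed. To fix this you will need sudo privileges!

1. First, purge everything related to CUDA/NVIDIA (I cannot add the reference because of limited reputation); basically run the line below with sudo privileges.
apt clean; apt update; apt purge cuda; apt purge nvidia-*; apt autoremove; apt install cuda

2. Follow Google's instructions at https://cloud.google.com/compute/docs/gpus/install-drivers-gpu#ubuntu-driver-steps

3. Reboot the machine.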
