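I am trying to run TensorFlow on my GPU, following the instructions at this link. After running the command in step 6, I got the correct output.
Then, when I tried to run the actual model I am building, I got the following error.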
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node 'StatefulPartitionedCall_10' defined at (most recent call last):
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 42, in <module>
main(sys.argv[1:])
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_Main.py", line 27, in main
model.train()
File "/home/jerry/Woodburn/Woodburn_Model/model/main/Model_V5.py", line 99, in train
history = self.model.fit(x, y, batch_size = batchSize, epochs = epochs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/home/jerry/miniconda3/envs/tensorflow_gpu/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_10'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_10}}]] [Op:__inference_train_function_8591]
After some research, I found that the relevant part of the error is the following:
2023-01-06 18:39:14.692537: W tensorflow/compiler/xla/service/gpu/nvptx_helper.cc:56] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
./cuda_sdk_lib
/usr/local/cuda-11.2
/usr/local/cuda
.
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-01-06 18:39:14.693094: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.693196: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
2023-01-06 18:39:14.693275: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-01-06 18:39:14.704458: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:326] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-01-06 18:39:14.704603: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:446 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
For context, this is running on Ubuntu 20.04 with Python 3.9. Any ideas on how to fix it?
小智 1
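If you installed Cudatoolkit with Conda, you can fix the problem by setting XLA_FLAGS. You can set the environment variable every time the environment is activated, as described here: https://conda.io/docs/user-guide/tasks/manage-environments.html#macos-and-linux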
export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX
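To make the setting persist across activations, one option is conda's activate.d / deactivate.d hook directories, as described on the linked page. The following is only a sketch under some assumptions: write_xla_hooks is a hypothetical helper name (not part of conda or TensorFlow), the file name env_vars.sh follows the conda documentation's example, and any existing env_vars.sh in those directories would be overwritten. It also prints a note if nvvm/libdevice is missing under the prefix, since the warning in the log says XLA looks for ${CUDA_DIR}/nvvm/libdevice.

```shell
# Hypothetical helper (not part of conda or TensorFlow): writes activation
# hooks into the conda environment located at the given prefix.
write_xla_hooks() {
    prefix="$1"
    mkdir -p "$prefix/etc/conda/activate.d" "$prefix/etc/conda/deactivate.d"

    # Single quotes keep $CONDA_PREFIX literal, so it is expanded at activation time.
    printf '%s\n' 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX' \
        > "$prefix/etc/conda/activate.d/env_vars.sh"

    # Clear the variable again when the environment is deactivated.
    printf '%s\n' 'unset XLA_FLAGS' \
        > "$prefix/etc/conda/deactivate.d/env_vars.sh"

    # The log warning says XLA searches for nvvm/libdevice under the CUDA directory.
    if [ ! -d "$prefix/nvvm/libdevice" ]; then
        echo "note: $prefix/nvvm/libdevice does not exist; the flag alone may not be enough" >&2
    fi
}

# Usage, from inside the activated environment:
#   write_xla_hooks "$CONDA_PREFIX"
```

After running it, deactivate and re-activate the environment so the hook takes effect, then check the result with echo $XLA_FLAGS.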