Tags: python-3.x, python-multiprocessing, tensorflow, tensorflow2.0
I am working on a project where I have a Python module that implements an iterative process, with some of the computations performed on the GPU using TensorFlow 2.0. The module works fine when used standalone in a single process.
Since I have to perform several runs with different parameters, I would like to parallelize the calls, but when I call the module (which imports TensorFlow) from different processes, I get CUDA_ERROR_OUT_OF_MEMORY followed by an endless loop of CUDA_ERROR_NOT_INITIALIZED, so the spawned processes hang forever.
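As a side note, the behaviour may depend on how the worker processes are started; the platform default can be checked with a quick diagnostic (a sketch added for context, not part of my pipeline):

```python
import multiprocessing as mp

# On Linux the default start method is "fork": children inherit the
# parent's memory, including any already-initialized CUDA state, which
# the CUDA driver generally does not support across fork boundaries.
print(mp.get_start_method())
```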
Of course I tried to limit the GPU memory, which works fine if I run two different Python scripts from separate interpreters, but it does not seem to work in my case.
In particular, if I use:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
I get an endless loop of CUDA_ERROR_NOT_INITIALIZED, whereas if I use:
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
else:
    print("No GPU found, model running on CPU")
the processes hang as well, but each spawned one raises the error.
Reading the TensorFlow console output, the first spawned process seems to allocate memory on the GPU, but it hangs just like the others, which complain about the memory being exhausted. Strangely, according to nvidia-smi the GPU memory does not appear to be exhausted at all:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN V Off | 00000000:03:00.0 On | N/A |
| 29% 42C P8 28W / 250W | 755MiB / 12035MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I managed to write a minimal reproducible example of the problem:
File tf_module.py:
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
else:
    print("Running on CPU")


def run(x, y):
    return tf.add(x, y).numpy()
File run.py:
from multiprocessing import Pool

import tf_module as experiment


def run_exp(params):
    a, b = params
    return experiment.run(a, b)


if __name__ == "__main__":
    pool = Pool(2)
    params = [(a, b) for a in range(3) for b in range(3)]
    results = pool.map(run_exp, params)
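For comparison, here is a sketch of a restructuring I am considering (my assumption about a possible workaround, not a confirmed fix): each worker is started with the "spawn" method as a fresh interpreter, and the TensorFlow-importing module would only be imported inside the worker, so every child initializes CUDA on its own. The arithmetic below is a stand-in for the call to experiment.run:

```python
from multiprocessing import get_context


def run_exp(params):
    # Deferred import: in the real pipeline this would be
    #   import tf_module as experiment
    #   return experiment.run(a, b)
    # so that TensorFlow (and CUDA) is initialized inside each freshly
    # spawned child, never in the parent.
    a, b = params
    return a + b  # stand-in for the GPU computation


def run_all(n_workers=2):
    # "spawn" starts clean interpreters instead of forking the parent,
    # so no CUDA state is inherited across processes.
    ctx = get_context("spawn")
    with ctx.Pool(n_workers) as pool:
        params = [(a, b) for a in range(3) for b in range(3)]
        return pool.map(run_exp, params)


if __name__ == "__main__":
    print(run_all())
```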
Moving the TF computation out of the module is not feasible, because it is part of a complex pipeline that also involves numpy, so I need to parallelize this code as it is.
Am I missing something?
Thanks in advance.
Viewed: 708 times