MirroredStrategy does not use the GPUs

Question

cra*_*aft 9 tensorflow tensorflow-estimator

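I want to use tf.contrib.distribute.MirroredStrategy() on my multi-GPU system, but it does not use the GPUs for training (see the output below). I am running tensorflow-gpu 1.12.

I also tried specifying the GPUs directly in MirroredStrategy, but the same problem occurred.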

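# Assumed imports (not shown in the original post).
import tensorflow as tf
from tensorflow.keras import models

# input, y_output, lossFunc and LEARNING_RATE are defined elsewhere (not shown).
model = models.Model(inputs=input, outputs=y_output)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
model.compile(loss=lossFunc, optimizer=optimizer)

# Mirror the model across two GPUs and pass the strategy to the Estimator.
NUM_GPUS = 2
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model, config=config)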

These are the results I get:

INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:1
WARNING:tensorflow:Not all devices in DistributionStrategy are visible to TensorFlow session.

The expected result is, of course, that training runs on multiple GPUs. Is this a known issue?

Answer 1

小智 9

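I have been facing a similar problem: MirroredStrategy fails on tensorflow 1.13.1 with 2x RTX2080 when running an Estimator.

The failure appears to be in the NCCL all_reduce method (error message: no OpKernel registered for NCCL AllReduce).

I got it running by switching from NCCL to hierarchical_copy, which meant using the contrib cross_device_ops method as follows:

Command that failed: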

mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])

Command that succeeded:

mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy"))
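
To connect this back to the Estimator setup in the question, here is a minimal sketch that combines the two snippets. It assumes TensorFlow 1.13, where tf.contrib is still available, and a Keras model that has already been compiled. The helper names gpu_device_names and build_estimator are hypothetical and only for illustration; the sketch has not been verified on the hardware described above.

```python
def gpu_device_names(num_gpus):
    """Return device strings such as "/gpu:0" for the first num_gpus GPUs."""
    return ["/gpu:%d" % i for i in range(num_gpus)]


def build_estimator(keras_model, num_gpus=2):
    """Wrap a compiled Keras model in an Estimator mirrored across GPUs.

    Assumes TensorFlow 1.13 (tf.contrib must be importable).
    """
    # Imported lazily so gpu_device_names can be used without TensorFlow.
    import tensorflow as tf

    # Use hierarchical_copy rather than NCCL for the cross-device reduction,
    # as in the command that succeeded above.
    cross_device_ops = tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy")
    strategy = tf.distribute.MirroredStrategy(
        devices=gpu_device_names(num_gpus),
        cross_device_ops=cross_device_ops)

    # Hand the strategy to the Estimator through RunConfig, as in the question.
    config = tf.estimator.RunConfig(train_distribute=strategy)
    return tf.keras.estimator.model_to_estimator(keras_model, config=config)
```

With the model from the question, this would be called as estimator = build_estimator(model, num_gpus=2).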
