Distributed TensorFlow with multiple GPUs

nak*_*ung 0 distributed tensorflow

It seems that tf.train.replica_device_setter does not allow specifying which GPU to use.

What I want to do is the following:

 with tf.device(
     tf.train.replica_device_setter(
         worker_device='/job:worker/task:%d/gpu:%d' % (deviceindex, gpuindex))):
     <build-some-tf-graph>

Yar*_*tov 7

If your parameters are not sharded, you can do it with a simplified version of replica_device_setter like the one below:

def assign_to_device(worker=0, gpu=0, ps_device="/job:ps/task:0/cpu:0"):
    """Returns a device function that pins variables to ps_device and
    all other ops to the given worker GPU."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op == "Variable":
            # Variables live on the parameter server
            return ps_device
        else:
            # Everything else runs on the chosen worker GPU
            return "/job:worker/task:%d/gpu:%d" % (worker, gpu)
    return _assign

with tf.device(assign_to_device(1, 2)):
  # this op goes on worker 1 gpu 2
  my_op = tf.ones(())
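The placement rule can be illustrated without TensorFlow or a running cluster. This is a minimal sketch, assuming a hypothetical FakeNodeDef stand-in for tf.NodeDef, that shows how the device function maps variable ops to the parameter server and everything else to the chosen worker GPU:

```python
# Stand-in for tf.NodeDef (assumption: no TensorFlow installed);
# only the .op attribute matters for the placement decision.
class FakeNodeDef:
    def __init__(self, op):
        self.op = op  # e.g. "Variable" or "MatMul"

def assign_to_device(worker=0, gpu=0, ps_device="/job:ps/task:0/cpu:0"):
    """TF-free version of the answer's device function."""
    def _assign(node_def):
        if node_def.op == "Variable":
            # Variables are pinned to the parameter server
            return ps_device
        # All other ops go to the requested worker GPU
        return "/job:worker/task:%d/gpu:%d" % (worker, gpu)
    return _assign

placer = assign_to_device(worker=1, gpu=2)
print(placer(FakeNodeDef("Variable")))  # /job:ps/task:0/cpu:0
print(placer(FakeNodeDef("MatMul")))    # /job:worker/task:1/gpu:2
```

The same closure pattern is what tf.device accepts: any callable taking an op and returning a device string.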