Distributed TensorFlow: ValueError "When using replicas, all Variables must have their device set: name: \"Variable\""

spr*_*vem 6 python tensorflow

I am trying to write a distributed variational autoencoder in TensorFlow, running the cluster in standalone mode.

My cluster consists of three machines, named m1, m2 and m3. I am trying to run one ps server on m1 and two worker servers on m2 and m3 (following the example trainer program in the distributed TensorFlow documentation). On m3 I get the following error message:

Traceback (most recent call last): 
 File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module> 
   save_model_secs=600) 
 File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__ 
   self._verify_setup() 
 File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup 
   "their device set: %s" % op) 
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable" 
attr { 
 key: "container" 
 value { 
   s: "" 
 } 
} 
attr { 
 key: "dtype" 
 value { 
   type: DT_INT32 
 } 
} 
attr { 
 key: "shape" 
 value { 
   shape { 
   } 
 } 
} 
attr { 
 key: "shared_name" 
 value { 
   s: "" 
 } 
}

Here is the part of my code that defines the network and the Supervisor.

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":

    #set distributed device
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):

        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)
        #print(type(lower_bound))

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    #saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()

    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), 
                             logdir=LogDir,
                             init_op=init_op,
                             summary_op=summary_op,
    #                         saver=saver,
                             global_step=global_step,
                             save_model_secs=600)
    print("create sv done")

I am sure there is something wrong with my code, but I don't know how to fix it. Any suggestions? Thanks a lot!

mrr*_*rry 7

The problem stems from the definition of the global_step variable:

global_step = tf.Variable(0)

This definition falls outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training this is a common source of errors (because if different replicas decide to place the variable on different devices, they will not share the same value), so TensorFlow includes a sanity check that prevents it.
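To see what that sanity check is guarding against, here is a toy illustration in plain Python (not TensorFlow's actual implementation; FakeVariable and verify_setup are made up for this sketch). It walks the variables and rejects any that has no device string, which is exactly the shape of the error in the traceback above:

```python
class FakeVariable:
    """Stand-in for a TF variable: just a name and a device string."""
    def __init__(self, name, device=""):
        self.name = name
        self.device = device  # "" means no device was ever assigned

def verify_setup(variables):
    """Mimic the Supervisor's check: every variable must be pinned to a device."""
    for v in variables:
        if not v.device:
            raise ValueError(
                "When using replicas, all Variables must have their "
                "device set: name: %r" % v.name)

# A variable created under a device scope carries its placement...
pinned = FakeVariable("model/weights", device="/job:ps/task:0")
# ...while one created outside any scope does not (like global_step here).
unpinned = FakeVariable("Variable")

verify_setup([pinned])            # passes: device is set
try:
    verify_setup([pinned, unpinned])
except ValueError as e:
    print(e)                      # rejects the device-less variable
```

In the real Supervisor this check runs during construction (the _verify_setup frame in the traceback), which is why the error appears before training even starts.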

Fortunately, the solution is simple. You can either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block, as follows:

with tf.device("/job:ps/task:0"):
    global_step = tf.Variable(0, name="global_step")
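The other option mentioned above, creating global_step inside the replica_device_setter block, would look roughly like this (a sketch only, reusing the question's FLAGS and clusterSpec; replica_device_setter then assigns the variable to a ps device along with the model's other variables):

```python
# Sketch: define global_step inside the same device scope as the model,
# so the replica_device_setter pins it to a parameter-server device.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    global_step = tf.Variable(0, name="global_step", trainable=False)
    # ... build the rest of the training graph here ...
```

Marking it trainable=False keeps the optimizer from trying to update the step counter as if it were a model parameter.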