I am trying to write a distributed variational autoencoder on TensorFlow, running the cluster in standalone mode.
My cluster consists of 3 machines, named m1, m2 and m3. I am trying to run one ps server on m1 and two worker servers on m2 and m3 (following the example trainer program in the distributed TensorFlow documentation). On m3 I get the following error message:
Traceback (most recent call last):
  File "/home/yama/mfs/ZhuSuan/examples/vae.py", line 241, in <module>
    save_model_secs=600)
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 334, in __init__
    self._verify_setup()
  File "/mfs/yama/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 863, in _verify_setup
    "their device set: %s" % op)
ValueError: When using replicas, all Variables must have their device set: name: "Variable"
op: "Variable"
attr {
  key: "container"
  value {
    s: ""
  }
}
attr {
  key: "dtype"
  value {
    type: DT_INT32
  }
}
attr {
  key: "shape"
  value {
    shape {
    }
  }
}
attr {
  key: "shared_name"
  value {
    s: ""
  }
}
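For context, the cluster and the server object used below are created along the lines of the example trainer program. The following is only a sketch under that assumption (the host/port strings are placeholders; my code only relies on clusterSpec, server, FLAGS.job_name and FLAGS.task_index existing):

import tensorflow as tf

# Cluster layout as described above: 1 ps server on m1, 2 workers on m2 and m3.
# The host/port strings here are placeholders.
flags = tf.app.flags
flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "Index of the task within its job")
FLAGS = flags.FLAGS

clusterSpec = tf.train.ClusterSpec({
    "ps": ["m1:2222"],
    "worker": ["m2:2222", "m3:2222"],
})
server = tf.train.Server(clusterSpec,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)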
Here is the part of my code that defines the network and the Supervisor:
if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    # set distributed device
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=clusterSpec)):
        # Build the training computation graph
        x = tf.placeholder(tf.float32, shape=(None, x_train.shape[1]))
        optimizer = tf.train.AdamOptimizer(learning_rate=0.001, epsilon=1e-4)
        with tf.variable_scope("model") as scope:
            with pt.defaults_scope(phase=pt.Phase.train):
                train_model = M1(n_z, x_train.shape[1])
                train_vz_mean, train_vz_logstd = q_net(x, n_z)
                train_variational = ReparameterizedNormal(
                    train_vz_mean, train_vz_logstd)
                grads, lower_bound = advi(
                    train_model, x, train_variational, lb_samples, optimizer)
                infer = optimizer.apply_gradients(grads)
                # print(type(lower_bound))

        # Build the evaluation computation graph
        with tf.variable_scope("model", reuse=True) as scope:
            with pt.defaults_scope(phase=pt.Phase.test):
                eval_model = M1(n_z, x_train.shape[1])
                eval_vz_mean, eval_vz_logstd = q_net(x, n_z)
                eval_variational = ReparameterizedNormal(
                    eval_vz_mean, eval_vz_logstd)
                eval_lower_bound = is_loglikelihood(
                    eval_model, x, eval_variational, lb_samples)
                eval_log_likelihood = is_loglikelihood(
                    eval_model, x, eval_variational, ll_samples)

    # saver = tf.train.Saver()
    summary_op = tf.merge_all_summaries()
    global_step = tf.Variable(0)
    init_op = tf.initialize_all_variables()
    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir=LogDir,
                             init_op=init_op,
                             summary_op=summary_op,
                             # saver=saver,
                             global_step=global_step,
                             save_model_secs=600)
    print("create sv done")
I think there must be something wrong with my code, but I don't know how to fix it. Any suggestions? Thanks a lot!
The problem stems from the definition of the global_step variable:
global_step = tf.Variable(0)
This definition falls outside the scope of the with tf.device(tf.train.replica_device_setter(...)): block above, so no device is assigned to global_step. In replicated training this is a common source of errors (because if different replicas decide to place the variable on different devices, they will not share the same value), so TensorFlow includes a sanity check that prevents it.
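For intuition, tf.train.replica_device_setter pins Variable ops to a parameter-server device and leaves other ops on the worker device, which is why the variables created inside that block pass the check. A minimal sketch (the names w and y are just for illustration):

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    w = tf.Variable(tf.zeros([10]))  # Variable op -> placed on "/job:ps/task:0"
    y = w * 2.0                      # ordinary op -> stays on the worker device

print(w.device)  # "/job:ps/task:0"
print(y.device)  # "/job:worker/task:<task_index>"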
Fortunately, the solution is simple. You can either define global_step inside the with tf.device(tf.train.replica_device_setter(...)): block above, or add a small with tf.device("/job:ps/task:0"): block, as follows:
with tf.device("/job:ps/task:0"):
global_step = tf.Variable(0, name="global_step")
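If you prefer the first option, move the definition up into the existing device block so the replica device setter assigns it to the parameter server for you; a sketch of how the relevant lines would look (trainable=False is the usual convention for a step counter, though it is not what the check complains about):

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=clusterSpec)):
    # ... build the model as before ...
    global_step = tf.Variable(0, name="global_step", trainable=False)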