I use two workers/replicas and one parameter server, like:
--ps_hosts='hosta.com:2222' --worker_hosts='hosta.com:2223,hostb.com:2223'
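For context, this is roughly how those flags get turned into a cluster spec and a per-task server. It is only a sketch: the flag names (ps_hosts, worker_hosts, job_name, task_id) are assumptions to make the setup concrete, not copied from my actual script.

import tensorflow as tf

# Assumed flag names, for illustration only.
tf.app.flags.DEFINE_string('ps_hosts', '', 'comma-separated ps hosts')
tf.app.flags.DEFINE_string('worker_hosts', '', 'comma-separated worker hosts')
tf.app.flags.DEFINE_string('job_name', 'worker', "'ps' or 'worker'")
tf.app.flags.DEFINE_integer('task_id', 0, 'index of this task within its job')
FLAGS = tf.app.flags.FLAGS

# One ps task and two worker tasks, matching --ps_hosts/--worker_hosts above.
cluster = tf.train.ClusterSpec({'ps': FLAGS.ps_hosts.split(','),
                                'worker': FLAGS.worker_hosts.split(',')})
server = tf.train.Server(cluster, job_name=FLAGS.job_name,
                         task_index=FLAGS.task_id)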
and I wrap the optimizer with tf.train.SyncReplicasOptimizer along the lines of:
# Wrap the base optimizer so that gradients from both replicas are
# aggregated before a single update is applied.
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=2,
    replica_id=FLAGS.task_id,
    total_num_replicas=2,
    variables_to_average=variables_to_average)
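The rest of the wiring follows the usual pattern for SyncReplicasOptimizer in this TF 0.8-era API (chief queue runner plus init-tokens op). Again, only a sketch: grads, global_step, is_chief, and FLAGS.train_dir are assumed names rather than excerpts from my script.

train_op = opt.apply_gradients(grads, global_step=global_step)

# Worker 0 (the chief) runs the queue runner that aggregates gradients
# from all replicas and seeds the synchronization token queue.
chief_queue_runner = opt.get_chief_queue_runner()
init_tokens_op = opt.get_init_tokens_op()

sv = tf.train.Supervisor(is_chief=is_chief,
                         logdir=FLAGS.train_dir,
                         global_step=global_step)
sess = sv.prepare_or_wait_for_session(server.target)
if is_chief:
    sv.start_queue_runners(sess, [chief_queue_runner])
    sess.run(init_tokens_op)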
Because of the cross-machine network communication, I can see from the logs that worker0 (hosta.com:2223) runs much faster than worker1 (hostb.com:2223). It looks like worker0 is not waiting for the gradients from worker1: even after I killed worker1's job, worker0 kept processing. And worker0 prints many duplicated log lines like
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.659749: step 29010, loss = 0.40(812.0 examples/sec; 0.315 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.990509: step 29010, loss = 0.59(775.3 examples/sec; 0.330 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.650522: step 29013, loss = 0.56(774.0 examples/sec; 0.331 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.989555: step 29013, loss = 0.47(756.3 examples/sec; 0.338 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.549120: step 29016, loss = 0.49(816.6 examples/sec; 0.313 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 …