Posted by Liu*_*Jia

Distributed TensorFlow: tf.train.SyncReplicasOptimizer does not seem to synchronize

I am using two workers/replicas and one parameter server, like this:

--ps_hosts='hosta.com:2222' --worker_hosts='hosta.com:2223,hostb.com:2223'
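
For context, a typical way to bring up such a cluster from those flags in pre-1.0 TensorFlow is sketched below; the flag definitions and job-name handling are my assumptions about the surrounding script, not code from the post.

import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('ps_hosts', 'hosta.com:2222', 'comma-separated parameter servers')
tf.app.flags.DEFINE_string('worker_hosts', 'hosta.com:2223,hostb.com:2223', 'comma-separated workers')
tf.app.flags.DEFINE_string('job_name', 'worker', "'ps' or 'worker'")
tf.app.flags.DEFINE_integer('task_id', 0, 'index of this task within its job')

# Build the cluster description shared by every process.
cluster = tf.train.ClusterSpec({'ps': FLAGS.ps_hosts.split(','),
                                'worker': FLAGS.worker_hosts.split(',')})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)

if FLAGS.job_name == 'ps':
    server.join()  # a parameter server only serves variables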

and tf.train.SyncReplicasOptimizer like this:

opt = tf.train.SyncReplicasOptimizer(
            opt,
            replicas_to_aggregate=2,
            replica_id=FLAGS.task_id,
            total_num_replicas=2,
            variables_to_average=variables_to_average)
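
For reference, this pre-1.0 SyncReplicasOptimizer relies on extra pieces that must be wired in on the chief: starting the optimizer's own queue runner and seeding the sync-token queue. A minimal sketch under those assumptions follows; the Supervisor setup, `loss`, and `global_step` are placeholders, not code from the post.

# Assumes `opt` is the SyncReplicasOptimizer built above, `server` is the
# tf.train.Server for this task, and task 0 acts as the chief.
train_op = opt.minimize(loss, global_step=global_step)

is_chief = (FLAGS.task_id == 0)
if is_chief:
    chief_queue_runner = opt.get_chief_queue_runner()
    init_tokens_op = opt.get_init_tokens_op()

sv = tf.train.Supervisor(is_chief=is_chief,
                         global_step=global_step,
                         init_op=tf.initialize_all_variables())
with sv.managed_session(server.target) as sess:
    if is_chief:
        # The chief starts the optimizer's own queue runner, which performs
        # the gradient aggregation, and seeds the token queue so workers
        # can proceed.
        sv.start_queue_runners(sess, [chief_queue_runner])
        sess.run(init_tokens_op)
    while not sv.should_stop():
        sess.run(train_op)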

Because of the cross-machine network communication, I can see from the logs that worker0 (hosta.com:2223) is much faster than worker1 (hostb.com:2223). It looks like worker0 is not waiting for the gradients from worker1: even after I killed worker1's job, worker0 was still running. worker0 also prints many duplicated log lines for the same step, such as:

INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.659749: step 29010, loss = 0.40(812.0 examples/sec; 0.315  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.990509: step 29010, loss = 0.59(775.3 examples/sec; 0.330  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.650522: step 29013, loss = 0.56(774.0 examples/sec; 0.331  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.989555: step 29013, loss = 0.47(756.3 examples/sec; 0.338  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.549120: step 29016, loss = 0.49(816.6 examples/sec; 0.313  sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 …

Tags: distributed, synchronized, tensorflow
