Tags: python, distributed, cluster-computing, server, tensorflow
I'm having some trouble with TensorFlow's new distributed runtime, which lets us run TensorFlow across several processes.
I just want to run two tf.constant ops on two tasks, but my code never terminates. It looks like this:
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222", "localhost:2223"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
    with tf.device("/job:local/replica:0/task:1"):
        const2 = tf.constant("Hello I am the second constant")
    print(sess.run([const1, const2]))
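For context on why the example above never terminates: sess.run() on task 0's server has to place const2 on task 1, but no server process is listening on localhost:2223, so the runtime blocks while it retries the connection. A minimal, TensorFlow-free sketch (the cluster dict mirrors the ClusterSpec above) that checks which declared addresses actually have a listener:

```python
import socket

# Mirrors the ClusterSpec above: job "local" declares two tasks.
cluster = {"local": ["localhost:2222", "localhost:2223"]}

def is_listening(addr):
    """Return True if something accepts TCP connections at host:port."""
    host, port = addr.split(":")
    with socket.socket() as s:
        s.settimeout(0.2)
        return s.connect_ex((host, int(port))) == 0

# Every (job, task_index) pair needs its own live tf.train.Server process.
status = [(job, i, addr, is_listening(addr))
          for job, addrs in cluster.items()
          for i, addr in enumerate(addrs)]
for job, i, addr, up in status:
    print(job, i, addr, "listening" if up else "no listener")
```

With only task 0's server started, the second address reports no listener, which is exactly what the session keeps waiting for.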
The following code works (with only one localhost:2222):
import tensorflow as tf

cluster = tf.train.ClusterSpec({"local": ["localhost:2222"]})
server = tf.train.Server(cluster,
                         job_name="local",
                         task_index=0)

with tf.Session(server.target) as sess:
    with tf.device("/job:local/replica:0/task:0"):
        const1 = tf.constant("Hello I am the first constant")
        const2 = tf.constant("Hello I am the second constant")
    print(sess.run([const1, const2]))

out: ['Hello I am the first constant', 'Hello I am the second constant']
Maybe I'm not understanding how these functions work... so if you have any idea, please let me know.
Thanks ;)
EDIT
OK, I found out that it can't be run the way I did it from an IPython notebook. It has to be a standalone Python program executed from the terminal. But now I have a new problem when running the code: the server tries to connect to both of the given ports, even though I told it to run only one task. My new code looks like this:
import tensorflow as tf

tf.app.flags.DEFINE_string('job_name', '', 'One of local worker')
tf.app.flags.DEFINE_string('local', '', """Comma-separated list of hostname:port for the """)
tf.app.flags.DEFINE_integer('task_id', 0, 'Task ID of local/replica running the training')
tf.app.flags.DEFINE_integer('constant_id', 0, 'the constant we want to run')
FLAGS = tf.app.flags.FLAGS

local_host = FLAGS.local.split(',')
cluster = tf.train.ClusterSpec({"local": local_host})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_id)

with tf.Session(server.target) as sess:
    if FLAGS.constant_id == 0:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const1 = tf.constant("Hello I am the first constant")
        print(sess.run(const1))
    if FLAGS.constant_id == 1:
        with tf.device('/job:local/task:' + str(FLAGS.task_id)):
            const2 = tf.constant("Hello I am the second constant")
        print(sess.run(const2))
I run it with the following command line:
python test_distributed_tensorflow.py --local=localhost:3000,localhost:3001 --job_name=local --task_id=0 --constant_id=0
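The flag handling in the script can be mirrored with the standard library's argparse (a TensorFlow-free sketch; the flag names match the tf.app.flags definitions above, and the sample argv matches the command line):

```python
import argparse

# Stdlib equivalent of the tf.app.flags definitions in the script above.
parser = argparse.ArgumentParser()
parser.add_argument("--job_name", default="")
parser.add_argument("--local", default="")
parser.add_argument("--task_id", type=int, default=0)
parser.add_argument("--constant_id", type=int, default=0)

argv = ("--local=localhost:3000,localhost:3001"
        " --job_name=local --task_id=0 --constant_id=0").split()
args = parser.parse_args(argv)

local_host = args.local.split(",")  # same split as FLAGS.local.split(',')
print(local_host, args.job_name, args.task_id, args.constant_id)
# → ['localhost:3000', 'localhost:3001'] local 0 0
```

Note that the cluster spec built from --local always contains both addresses, regardless of which task_id this particular process serves.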
and I get the following log:
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0)
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job local -> {localhost:3000, localhost:3001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:3000
E0518 15:27:11.794873779 10884 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
E0518 15:27:12.795184395 10884 tcp_client_posix.c:173] failed to connect to 'ipv4:127.0.0.1:3001': socket error: connection refused
...
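Those repeated gRPC errors are ordinary TCP connection-refused failures: the server for task 1 was never started, so nothing listens on localhost:3001 and every connection attempt is rejected. The same error can be reproduced at the socket level (the sketch grabs a throwaway free port from the OS so it is self-contained, instead of hard-coding 3001):

```python
import errno
import socket

# Ask the OS for a free port, then release it so nothing listens there.
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# Connecting to a port with no listener fails just like the log above.
client = socket.socket()
client.settimeout(1.0)
result = client.connect_ex(("127.0.0.1", port))
client.close()

print("connection refused:", result == errno.ECONNREFUSED)
```

gRPC simply retries this failing connect in a loop, which is why the same error line repeats once per second.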
编辑2
I found the solution: you simply have to start every task that you declare to the server. So I have to run this:
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=0 --constant_id=0 & \
python test_distributed_tensorflow.py --local=localhost:2345,localhost:2346 --job_name=local --task_id=1 --constant_id=1
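The rule behind this fix can be sketched as: enumerate every (job, task_index) pair declared in the cluster and launch one server process per pair (a TensorFlow-free sketch; the generated command strings mirror the two commands above):

```python
# Mirrors the ClusterSpec passed via --local in the commands above.
cluster = {"local": ["localhost:2345", "localhost:2346"]}

# One server process per (job, task_index) pair declared in the cluster.
commands = ["python test_distributed_tensorflow.py"
            " --local=%s --job_name=%s --task_id=%d --constant_id=%d"
            % (",".join(addrs), job, i, i)
            for job, addrs in cluster.items()
            for i, _ in enumerate(addrs)]
for cmd in commands:
    print(cmd)
```

Every process gets the full address list (so they all agree on the cluster) but a different task_id, and once all of them are up, the cross-task sess.run calls can complete.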
I hope this helps someone ;)