Duplicating a TensorFlow graph

Question

MBZ*_*MBZ 8 python tensorflow

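What is the best way to duplicate a TensorFlow graph and keep it up to date?

Ideally, I'd like to put the duplicated graph on another device (e.g. from GPU to CPU) and then update the copy from time to time.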

Answer 1

rda*_*olf 12

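Short answer: you probably want checkpoint files (permalink).

Long answer:

Let's be clear about the setup here. I'll assume that you have two devices, A and B, and that you are training on A and running inference on B. Periodically, you want to update the parameters on the device running inference with the new parameters found during training on the other. The tutorial linked above is a good place to start. It shows how tf.train.Saver objects work, and you shouldn't need anything more complicated here.

Here is an example: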

import tensorflow as tf

def build_net(graph, device):
  with graph.as_default():
    with graph.device(device):
      # Input placeholders
      inputs = tf.placeholder(tf.float32, [None, 784])
      labels = tf.placeholder(tf.float32, [None, 10])
      # Initialization
      w0 = tf.get_variable('w0', shape=[784,256], initializer=tf.contrib.layers.xavier_initializer())
      w1 = tf.get_variable('w1', shape=[256,256], initializer=tf.contrib.layers.xavier_initializer())
      w2 = tf.get_variable('w2', shape=[256,10], initializer=tf.contrib.layers.xavier_initializer())
      b0 = tf.Variable(tf.zeros([256]))
      b1 = tf.Variable(tf.zeros([256]))
      b2 = tf.Variable(tf.zeros([10]))
      # Inference network
      h1  = tf.nn.relu(tf.matmul(inputs, w0)+b0)
      h2  = tf.nn.relu(tf.matmul(h1,w1)+b1)
      output = tf.nn.softmax(tf.matmul(h2,w2)+b2)
      # Training network
      cross_entropy = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(output), reduction_indices=[1]))
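      optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
      # Saver used to write and restore checkpoints
      saver = tf.train.Saver()
      # tf.global_variables_initializer() replaces the deprecated tf.initialize_all_variables()
      return tf.global_variables_initializer(), inputs, labels, output, optimizer, saver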

Code for the training program:

def programA_main():
  from tensorflow.examples.tutorials.mnist import input_data
  mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
  # Build training network on device A
  graphA = tf.Graph()
  init, inputs, labels, _, training_net, saver = build_net(graphA, '/cpu:0')
  with tf.Session(graph=graphA) as sess:
    sess.run(init)
    for step in range(1,10000):
      batch = mnist.train.next_batch(50)
      sess.run(training_net, feed_dict={inputs: batch[0], labels: batch[1]})
      # Write the current parameters to disk every 100 steps
      if step%100==0:
        saver.save(sess, '/tmp/graph.checkpoint')
        print('saved checkpoint')

...and code for the inference program:

def programB_main():
  from tensorflow.examples.tutorials.mnist import input_data
  mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
  # Build inference network on device B
  graphB = tf.Graph()
  init, inputs, _, inference_net, _, saver = build_net(graphB, '/cpu:0')
  with tf.Session(graph=graphB) as sess:
    batch = mnist.test.next_batch(50)

    saver.restore(sess, '/tmp/graph.checkpoint')
    print('loaded checkpoint')
    out = sess.run(inference_net, feed_dict={inputs: batch[0]})
    print(out[0])

    import time; time.sleep(2)

    saver.restore(sess, '/tmp/graph.checkpoint')
    print('loaded checkpoint')
    out = sess.run(inference_net, feed_dict={inputs: batch[0]})
    # Print the same example as before so the two outputs are directly comparable
    print(out[0])

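If you start the training program and then the inference program, you'll see the inference program produce two different outputs (from the same input batch). This is because it picks up the parameters that the training program has checkpointed in the meantime.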

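Now, this program obviously isn't your end point. We don't do any real synchronization, and you'll have to decide what "periodic" means with respect to checkpointing. But it should give you an idea of how to sync parameters from one network to another.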
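
As a rough illustration of one way to define "periodic" on the inference side, the sketch below polls the checkpoint location and triggers a reload only when something there has changed. It is not part of the programs above: `checkpoint_mtime` and `poll_for_updates` are hypothetical helper names, and matching every file that starts with the checkpoint path is an assumption made so the check does not depend on how the Saver lays out its files on disk.

```python
import glob
import os
import time


def checkpoint_mtime(prefix):
    """Return the newest modification time among files whose path starts
    with `prefix`, or None if no such file exists yet."""
    mtimes = []
    for path in glob.glob(glob.escape(prefix) + '*'):
        try:
            mtimes.append(os.path.getmtime(path))
        except OSError:
            pass  # the file vanished between listing and stat; ignore it
    return max(mtimes) if mtimes else None


def poll_for_updates(prefix, on_update, interval_s=2.0, max_polls=None):
    """Call `on_update(prefix)` whenever the checkpoint appears to have changed.

    Polls forever when `max_polls` is None. Returns the number of reloads.
    """
    last_seen = None
    polls = 0
    reloads = 0
    while max_polls is None or polls < max_polls:
        current = checkpoint_mtime(prefix)
        if current is not None and current != last_seen:
            on_update(prefix)  # hypothetical hook: reload parameters here
            last_seen = current
            reloads += 1
        polls += 1
        time.sleep(interval_s)
    return reloads
```

In the inference program, the callback could be something like `lambda p: saver.restore(sess, p)`. Comparing modification times is only a heuristic: it can miss two saves that land within the filesystem's timestamp resolution, and it does nothing to stop a read that overlaps a write.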

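One final warning: this does not mean that the two networks are necessarily deterministic. There are known non-deterministic elements in TensorFlow (e.g., this), so be wary if you need exactly the same answer. But that is the hard truth about running on multiple devices.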
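
If you do need to check that two copies of the network agree, comparing within a tolerance is usually more realistic than testing for exact equality. The sketch below assumes that one source of small differences is floating-point summation order, which is a general property of floating-point arithmetic rather than something taken from the programs above. `outputs_match` is a hypothetical helper and its tolerances are arbitrary placeholders.

```python
import math


def outputs_match(a, b, rel_tol=1e-5, abs_tol=1e-7):
    """Return True when two equally long sequences of floats agree
    element by element within the given tolerances."""
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in zip(a, b))


# Floating-point addition is not associative, so summing the same values
# in a different order can give a slightly different result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                   # False
print(outputs_match([left], [right]))  # True
```

For the networks above you would pass two rows of `out`, for example `outputs_match(list(out_a[0]), list(out_b[0]))`, where `out_a` and `out_b` are hypothetical names for the outputs from the two devices.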

Good luck!
