Averaging gradients over several batches in TensorFlow

nik*_*iko 7 machine-learning backpropagation gradient-descent tensorflow tensorflow-gpu

This may be a duplicate of Tensorflow: How to get per-instance gradients in a batch?. I'm asking it anyway, because none of the answers there are satisfying and the goal here is slightly different.

I have a very large network that fits on my GPU, but the largest batch size I can feed is 32; anything bigger causes the GPU to run out of memory. I would like to use larger batches to get a more accurate approximation of the gradient.
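For a loss that averages over the examples in a batch, the gradient of one large batch is exactly the mean of the gradients of equally sized sub-batches, which is what makes the accumulation approach below valid. A quick NumPy check of that claim (the linear model and mean loss here are illustrative, not taken from the question):

```python
import numpy as np

def grad_mean_loss(x_batch, n_out=3):
    # For loss = mean(x @ W), the gradient w.r.t. W does not depend on W:
    # d loss / d W[j, k] = mean over rows of x[:, j] / n_out
    b = len(x_batch)
    return x_batch.T @ np.ones((b, n_out)) / (b * n_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(96, 3))

# Gradient computed on the full batch of 96
full = grad_mean_loss(x)

# Mean of the gradients of three equally sized sub-batches of 32
parts = [grad_mean_loss(x[i * 32:(i + 1) * 32]) for i in range(3)]
avg = sum(parts) / 3

assert np.allclose(full, avg)
```

Note that this equality relies on the sub-batches being the same size and on the loss being a per-example mean; it breaks for operations that mix examples across the batch (see the BatchNorm caveat in the comments below the answer).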

Concretely, suppose I want to compute the gradient for a large batch of size 96 by feeding 3 batches of 32 in sequence. The best way I know of is to use Optimizer.compute_gradients() and Optimizer.apply_gradients(). Here is a small example of how that can work:

import tensorflow as tf
import numpy as np

learn_rate = 0.1

W_init = np.array([ [1,2,3], [4,5,6], [7,8,9] ], dtype=np.float32)
x_init = np.array([ [11,12,13], [14,15,16], [17,18,19] ], dtype=np.float32)

X = tf.placeholder(dtype=np.float32, name="x")
W = tf.Variable(W_init, dtype=np.float32, name="w")
y = tf.matmul(X, W, name="y")
loss = tf.reduce_mean(y, name="loss")

opt = tf.train.GradientDescentOptimizer(learn_rate)
grad_vars_op = opt.compute_gradients(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Compute the gradients for each batch
grads_vars1 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,0]})
grads_vars2 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,1]})
grads_vars3 = sess.run(grad_vars_op, feed_dict = {X: x_init[None,2]})

# Separate the gradients from the variables
grads1 = [ grad for grad, var in grads_vars1 ]
grads2 = [ grad for grad, var in grads_vars2 ]
grads3 = [ grad for grad, var in grads_vars3 ]
varl   = [ var  for grad, var in grads_vars1 ]

# Average the gradients
grads  = [ (g1 + g2 + g3)/3 for g1, g2, g3 in zip(grads1, grads2, grads3)]

sess.run(opt.apply_gradients(zip(grads,varl)))

print("Weights after 1 gradient")
print(sess.run(W))

Now, this is all very ugly and inefficient, because the forward pass runs on the GPU, the averaging happens on the CPU, and applying the gradients happens on the GPU again.

Moreover, this code throws an exception, because grads is a list of np.arrays; to make it work, one would have to create a tf.placeholder for every gradient.

I'm sure there must be a better and more efficient way of doing this. Any suggestions?

Ish*_*nal 9

You can create non-trainable copies of the trainable_variables and accumulate the batch gradients in them. Here are a few simple steps:

...
opt = tf.train.GradientDescentOptimizer(learn_rate)

# number of batches to accumulate gradients over
n_batches = 3
# constant to scale the summed gradients into an average
const = tf.constant(1.0 / n_batches)
# get all trainable variables
t_vars = tf.trainable_variables()
# create a non-trainable copy of every trainable variable, initialized to 0
accum_tvars = [tf.Variable(tf.zeros_like(t_var.initialized_value()), trainable=False)
               for t_var in t_vars]
# create an op to re-initialize all accumulator variables to 0
zero_ops = [tv.assign(tf.zeros_like(tv)) for tv in accum_tvars]

# compute the gradients for one batch
batch_grads_vars = opt.compute_gradients(loss, t_vars)
# add each batch gradient (scaled by const) into its accumulator
accum_ops = [accum_tvars[i].assign_add(tf.scalar_mul(const, gv[0]))
             for i, gv in enumerate(batch_grads_vars)]

# apply the accumulated gradients
train_step = opt.apply_gradients([(accum_tvars[i], gv[1])
                                  for i, gv in enumerate(batch_grads_vars)])

while True:
    # re-initialize the accumulated gradients
    sess.run(zero_ops)

    # feed the sub-batches one by one, accumulating their gradients
    for i in range(n_batches):
        sess.run(accum_ops, feed_dict={X: x_init[None, i]})

    # take one optimizer step with the averaged gradient
    sess.run(train_step)

  • Good solution. But it looks like there should be a step that averages the gradients. (3 upvotes)
  • Two fairly critical problems: 1. This generally does not work: if you use any operation that acts across the batch (such as BatchNorm), it is not mathematically equivalent. 2. I wrote some code based on this idea, and although it copies the gradients accurately, it does not actually seem to work. https://gist.github.com/Multihuntr/b8cb68316842ff68cab3062740a2a730 I don't think I made any logic mistakes. (2 upvotes)
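Stripped of the TensorFlow machinery, the accepted answer's zero/accumulate/apply pattern reduces to: reset an accumulator, add each batch gradient pre-scaled by 1/n_batches, then take a single optimizer step with the accumulated value. A minimal NumPy sketch of that loop, checking it matches one step on the concatenated big batch (the linear model, mean loss, and learning rate are illustrative assumptions):

```python
import numpy as np

learn_rate = 0.1
n_batches = 3

rng = np.random.default_rng(1)
w = rng.normal(size=3)
batches = [rng.normal(size=(32, 3)) for _ in range(n_batches)]

def batch_grad(x):
    # gradient of loss = mean(x @ w) w.r.t. w is the column mean of x
    return x.mean(axis=0)

# zero_ops: reset the accumulator
accum = np.zeros_like(w)
# accum_ops: add each batch gradient, pre-scaled by 1/n_batches
for x in batches:
    accum += batch_grad(x) / n_batches
# train_step: one gradient-descent step with the accumulated (averaged) gradient
w_accum = w - learn_rate * accum

# Equivalent single step on the concatenated batch of 96
w_big = w - learn_rate * batch_grad(np.concatenate(batches))
assert np.allclose(w_accum, w_big)
```

The division by n_batches during accumulation plays the role of `const` in the answer, which is why no separate averaging step is needed afterwards.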