Conceptual understanding of GradientTape.gradient

Rob*_* L. 4 python neural-network deep-learning tensorflow2.0

Background

In TensorFlow 2 there is a class called GradientTape which is used to record operations on tensors, the results of which can then be differentiated and fed to some minimization algorithm. For example, from the documentation we have this example:

x = tf.constant(3.0)
with tf.GradientTape() as g:
  g.watch(x)
  y = x * x
dy_dx = g.gradient(y, x) # Will compute to 6.0

The docstring of the gradient method implies that the first argument can be not just a single tensor, but a list of tensors:

 def gradient(self,
               target,
               sources,
               output_gradients=None,
               unconnected_gradients=UnconnectedGradients.NONE):
    """Computes the gradient using operations recorded in context of this tape.

    Args:
      target: a list or nested structure of Tensors or Variables to be
        differentiated.
      sources: a list or nested structure of Tensors or Variables. `target`
        will be differentiated against elements in `sources`.
      output_gradients: a list of gradients, one for each element of
        target. Defaults to None.
      unconnected_gradients: a value which can either hold 'none' or 'zero' and
        alters the value which will be returned if the target and sources are
        unconnected. The possible values and effects are detailed in
        'UnconnectedGradients' and it defaults to 'none'.

    Returns:
      a list or nested structure of Tensors (or IndexedSlices, or None),
      one for each element in `sources`. Returned structure is the same as
      the structure of `sources`.

    Raises:
      RuntimeError: if called inside the context of the tape, or if called more
       than once on a non-persistent tape.
      ValueError: if the target is a variable or if unconnected gradients is
       called with an unknown value.
    """

In the example above, it is easy to see that y, the target, is the function to be differentiated, and x is the variable with respect to which the "gradient" is taken.

From my limited experience, the gradient method seems to return a list of tensors, one per element of sources, and each of these gradients is a tensor with the same shape as the corresponding member of sources.
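
For example, a minimal sketch of what I mean (the variables x1, x2 and the function y below are made up for illustration):

import tensorflow as tf

x1 = tf.Variable([1.0, 2.0])        # shape (2,)
x2 = tf.Variable([[3.0], [4.0]])    # shape (2, 1)

with tf.GradientTape() as g:
    y = tf.reduce_sum(x1 ** 2) + tf.reduce_sum(x2)

grads = g.gradient(y, [x1, x2])
print(grads[0].shape)  # (2,)   -- same shape as x1
print(grads[1].shape)  # (2, 1) -- same shape as x2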

Question

The behavior of gradient described above makes sense when target is a single function, since mathematically the gradient vector should have the same dimension as the domain of the function.

However, if target is a list of tensors, the output of gradient still has the same shape as sources. Why is that? If target is thought of as a list of functions, shouldn't the output resemble something like a Jacobian? How should I interpret this behavior conceptually?
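
For instance, a small sketch of the difference I have in mind (a single vector variable x, made up for illustration), comparing the Jacobian-like output I expected with what gradient actually returns:

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])

with tf.GradientTape(persistent=True) as g:
    y = x * x                        # y_i = x_i ** 2, so the target has 3 elements

print(g.jacobian(y, x).numpy())      # (3, 3) matrix of dy_i/dx_j -- what I expected
print(g.gradient(y, x).numpy())      # [2. 4. 6.] -- same shape as x, not a Jacobian
del g                                # release resources held by the persistent tape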

Vla*_*lad 5

That is how tf.GradientTape().gradient() is defined. It has the same functionality as tf.gradients(), except that the latter cannot be used in eager mode. From the docs of tf.gradients():

It returns a list of Tensors of length len(xs) where each tensor is the sum(dy/dx) for y in ys

where xs are the sources and ys are the target.

Example 1

So let's say target = [y1, y2] and sources = [x1, x2]. The result will be:

[dy1/dx1 + dy2/dx1, dy1/dx2 + dy2/dx2]
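
A quick sketch of this behavior (the concrete functions y1 and y2 below are just for illustration):

import tensorflow as tf

x1 = tf.Variable(2.0)
x2 = tf.Variable(3.0)

with tf.GradientTape() as g:
    y1 = x1 * x2         # dy1/dx1 = x2 = 3, dy1/dx2 = x1 = 2
    y2 = x1 + x2 ** 2    # dy2/dx1 = 1,      dy2/dx2 = 2 * x2 = 6

grads = g.gradient([y1, y2], [x1, x2])
print(grads[0].numpy())  # 4.0 = dy1/dx1 + dy2/dx1
print(grads[1].numpy())  # 8.0 = dy1/dx2 + dy2/dx2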

Example 2

Compute gradients of a per-sample loss (a tensor) vs. a reduced loss (a scalar):

Let w, b be two variables. 
xentropy = [y1, y2] # tensor
reduced_xentropy = 0.5 * (y1 + y2) # scalar
grads = [dy1/dw + dy2/dw, dy1/db + dy2/db]
reduced_grads = [d(reduced_xentropy)/dw, d(reduced_xentropy)/db]
              = [d(0.5 * (y1 + y2))/dw, d(0.5 * (y1 + y2))/db] 
              == 0.5 * grads

A TensorFlow example of the above snippet:

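Below is a minimal sketch along those lines (the data, the linear model, and the use of tf.nn.sparse_softmax_cross_entropy_with_logits are assumptions made for illustration):

import tensorflow as tf

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # a batch of two samples
labels = tf.constant([0, 1])
w = tf.Variable(tf.ones((2, 2)))
b = tf.Variable(tf.zeros((2,)))

def per_sample_loss():
    logits = tf.matmul(x, w) + b
    # one loss value per sample -> tensor of shape (2,), like `xentropy` above
    return tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)

# Gradient of the per-sample loss tensor: the per-sample gradients are summed.
with tf.GradientTape() as g:
    xentropy = per_sample_loss()
grads = g.gradient(xentropy, [w, b])

# Gradient of the reduced (mean) loss: same values scaled by 1 / batch_size.
with tf.GradientTape() as g:
    reduced_xentropy = tf.reduce_mean(per_sample_loss())
reduced_grads = g.gradient(reduced_xentropy, [w, b])

print(grads[0].numpy())           # sum of the per-sample gradients w.r.t. w
print(reduced_grads[0].numpy())   # 0.5 * grads[0] for a batch of two samples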

If you compute the loss (xentropy) for each element of the batch, the final gradient of each variable will be the sum of the gradients for every sample in the batch (which makes sense).