What does the copy_initial_weights documentation mean in the higher library for PyTorch?

Cha*_*ker 15 machine-learning deep-learning pytorch

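I was trying to use the higher library for meta-learning, and I am having trouble understanding what copy_initial_weights means. The docs say:

copy_initial_weights - if true, the weights of the patched module are copied to form the initial weights of the patched module, and thus are not part of the gradient tape when unrolling the patched module. If this is set to False, the actual module weights will be the initial weights of the patched module. This is useful when doing MAML, for example.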

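But that doesn't make much sense to me, for the following reasons:

For example, "the weights of the patched module are copied to form the initial weights of the patched module" doesn't make sense to me, because when the context manager is initiated a patched module does not exist yet. So it is unclear what is being copied, from where, and to where (and why copying is something we want to do).

Also, "unrolling the patched module" does not make sense to me. We usually unroll a computation graph produced by a for loop. A patched module is just a neural net that has been modified by this library, so "unrolling" it is ambiguous.

Also, there is no technical definition of "gradient tape".

Also, when describing what False does, saying that it is useful for MAML isn't actually useful, because it doesn't even hint at why it is useful for MAML.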

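Overall, this makes the context manager impossible to use.

Any explanations and examples of what that flag does, in more precise terms, would be really valuable.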



Ale*_*rov 2

Short version

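Calling higher.innerloop_ctx with model as an argument creates a temporary patched model and an unrolled optimizer for that model: (fmodel, diffopt). The expectation is that in the inner loop fmodel will iteratively receive some input and compute an output and a loss, and then diffopt.step(loss) will be called. Each time diffopt.step is called, fmodel creates the next version of its parameters, fmodel.parameters(time=T), which is a new tensor computed from the previous ones (with the full graph retained, so gradients can be computed through the process). If at any point the user calls backward on any tensor, regular PyTorch gradient computation/accumulation starts, in a way that lets gradients propagate to, e.g., the optimizer's parameters (such as lr and momentum, if they were passed to higher.innerloop_ctx as tensors requiring gradients using override).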

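The creation-time version of fmodel's parameters, fmodel.parameters(time=0), is a copy of the original model's parameters. If copy_initial_weights=True is provided (the default), then fmodel.parameters(time=0) will be a clone+detach'ed version of model's parameters (i.e. it will preserve the values but sever all connections to the original model). If copy_initial_weights=False is provided, then fmodel.parameters(time=0) will be a clone'd version of model's parameters and will therefore allow gradients to propagate to the original model's parameters (see the PyTorch docs on clone).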
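The difference between the two settings can be reproduced with plain PyTorch tensors, without higher. The following is a minimal sketch, not higher's actual code: the tensor w stands in for one parameter of the original model, and the names w0_copied and w0_linked are made up for illustration.

```python
import torch

# Stand-in for one parameter of the original `model`.
w = torch.tensor([2.0], requires_grad=True)

# Analogue of copy_initial_weights=True: clone, then cut the graph link.
w0_copied = w.clone().detach().requires_grad_()
# Analogue of copy_initial_weights=False: clone only, the link to `w` is kept.
w0_linked = w.clone()

# Backward through the detached copy: the gradient stops at the copy.
loss_copied = (w0_copied * 3.0).sum()
loss_copied.backward()
grad_after_copied = w.grad  # still None, nothing reached `w`

# Backward through the plain clone: the gradient flows back into `w`.
loss_linked = (w0_linked * 3.0).sum()
loss_linked.backward()
grad_after_linked = w.grad.clone()

print(grad_after_copied, w0_copied.grad, grad_after_linked)
```

In the first case the gradient lands on the copy itself, which is a new leaf tensor; in the second case the clone is a non-leaf node, so the gradient passes through it to w.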

Terminology clarifications

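  • gradient tape here refers to the graph PyTorch uses to walk back through computations and propagate gradients to all leaf tensors that require gradients. If at some point you cut the link to some leaf tensor requiring gradients (e.g. as is done for fnet.parameters() in the copy_initial_weights=True case), then the original model.parameters() will no longer be "on the gradient tape" for your meta_loss.backward() computation.

  • unrolling the patched module here refers to the part of the meta_loss.backward() computation in which PyTorch goes through all fnet.parameters(time=T), starting from the latest and ending with the earliest (higher does not control this process; it is just regular PyTorch gradient computation. higher is only in charge of how these new time=T parameters are created from the previous ones each time diffopt.step is called, and of making fnet always use the latest ones for the forward computation).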

Long version

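Let's start from the beginning. The main functionality (the only functionality, really) of the higher library is unrolling a model's parameter optimization in a differentiable manner. This can take the form of using a differentiable optimizer directly, e.g. through higher.get_diff_optim, or the form of higher.innerloop_ctx.

The higher.innerloop_ctx option wraps the creation of a "stateless" model fmodel from an existing model for you, and gives you an "optimizer" diffopt for this fmodel. So, as summarized in higher's README.md, it allows you to switch from: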

model = MyModel()
opt = torch.optim.Adam(model.parameters())

for xs, ys in data:
    opt.zero_grad()
    logits = model(xs)
    loss = loss_function(logits, ys)
    loss.backward()
    opt.step()

to:

model = MyModel()
opt = torch.optim.Adam(model.parameters())

with higher.innerloop_ctx(model, opt) as (fmodel, diffopt):
    for xs, ys in data:
        logits = fmodel(xs)  # modified `params` can also be passed as a kwarg
        loss = loss_function(logits, ys)  # no need to call loss.backward()
        diffopt.step(loss)  # note that `step` must take `loss` as an argument!

    # At the end of your inner loop you can obtain these e.g. ...
    grad_of_grads = torch.autograd.grad(
        meta_loss_fn(fmodel.parameters()), fmodel.parameters(time=0))

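The difference between training model and calling diffopt.step to update fmodel is that fmodel does not update its parameters in place, as opt.step() does in the first snippet. Instead, each time diffopt.step is called, new versions of the parameters are created in such a way that fmodel uses the new ones for the next step while all previous ones are still preserved.

That is, fmodel starts with only fmodel.parameters(time=0) available, but after you have called diffopt.step N times you can ask fmodel for fmodel.parameters(time=i) for any i up to and including N. Notice that fmodel.parameters(time=0) does not change at all during this process; it is just that every time fmodel is applied to some input, it uses the latest version of the parameters it currently has.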
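The same idea can be sketched by hand in plain PyTorch, assuming a single weight and plain gradient descent without momentum. The list versions is a hypothetical stand-in for what fmodel.parameters(time=i) exposes; this is not how higher stores parameters internally, only an illustration of "new tensor per step, old ones preserved, graph kept".

```python
import torch

torch.manual_seed(0)
x = torch.rand(20, 1, dtype=torch.float64)
y = 3.5 * x

w_init = torch.tensor([[-1.0]], dtype=torch.float64, requires_grad=True)
lr = torch.tensor(0.3, dtype=torch.float64, requires_grad=True)

# versions[t] plays the role of fmodel.parameters(time=t)
versions = [w_init]
for _ in range(5):
    w = versions[-1]  # the forward pass always uses the latest version
    loss = ((x @ w - y) ** 2).mean()
    # create_graph=True keeps the gradient computation itself differentiable
    (g,) = torch.autograd.grad(loss, w, create_graph=True)
    # Out-of-place update: a new tensor is created, earlier versions are untouched
    versions.append(w - lr * g)

# A meta-loss on the final weight; backward() walks back through every version
meta_loss = ((versions[-1] - 3.5) ** 2).sum()
meta_loss.backward()

print(len(versions), w_init.grad, lr.grad)
```

Because every update is an ordinary differentiable expression, a single backward call reaches both the initial weight and the learning rate.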

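Now, what exactly is fmodel.parameters(time=0)? It is created by higher and depends on copy_initial_weights. If copy_initial_weights==True, then fmodel.parameters(time=0) are clone'd and detach'ed parameters of model. Otherwise they are only clone'd, but not detach'ed!

That means that when we do the meta-optimization step, the original model's parameters will actually accumulate gradients if and only if copy_initial_weights==False. In MAML we want to optimize model's starting weights, so we do need to get gradients from the meta-optimization step.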
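To make the MAML point concrete, here is a rough sketch of a MAML-style outer loop written in plain PyTorch rather than with higher. Everything in it is an assumption made for illustration: the toy tasks (each task multiplies its input by a different constant), the helper names make_task and adapted_loss, and the learning rates. The key line is w_meta.clone(), which plays the role of copy_initial_weights=False; replacing it with w_meta.clone().detach() would leave w_meta.grad empty and the meta-optimizer with nothing to apply.

```python
import torch

torch.manual_seed(0)

# Meta-parameter: the shared starting weight that the outer loop optimizes.
w_meta = torch.tensor([[0.0]], requires_grad=True)
meta_opt = torch.optim.SGD([w_meta], lr=1.0)


def make_task(multiplier, n=32):
    """Hypothetical toy task: learn y = multiplier * x."""
    x = torch.rand(n, 1)
    return x, multiplier * x


tasks = [make_task(m) for m in (2.0, 3.0, 4.0)]


def adapted_loss(w_start, x, y, inner_lr=0.5, inner_steps=3):
    """Loss after a few differentiable gradient steps starting from w_start."""
    w = w_start
    for _ in range(inner_steps):
        loss = ((x @ w - y) ** 2).mean()
        (g,) = torch.autograd.grad(loss, w, create_graph=True)
        w = w - inner_lr * g
    return ((x @ w - y) ** 2).mean()


history = []
for _ in range(20):
    meta_opt.zero_grad()
    total = 0.0
    for x, y in tasks:
        # clone() without detach(): gradients can flow back to w_meta
        task_loss = adapted_loss(w_meta.clone(), x, y)
        task_loss.backward()  # accumulates into w_meta.grad across tasks
        total += task_loss.item()
    meta_opt.step()
    history.append(total)

print(history[0], history[-1], w_meta.item())
```

The summed post-adaptation loss shrinks over the outer iterations because each meta-step moves the starting weight toward a point from which a few inner steps work well on all tasks.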

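I think one of the issues here is that higher lacks simple toy examples that demonstrate what is going on, and instead rushes to show more serious things as examples. So let me try to fill that gap and demonstrate what is going on with the simplest toy example I could come up with (a model with a single weight that multiplies the input by that weight):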

import torch
import torch.nn as nn
import torch.optim as optim
import higher
import numpy as np

np.random.seed(1)
torch.manual_seed(3)
N = 100
actual_multiplier = 3.5
meta_lr = 0.00001
loops = 5 # how many iterations in the inner loop we want to do

x = torch.tensor(np.random.random((N,1)), dtype=torch.float64) # features for inner training loop
y = x * actual_multiplier # target for inner training loop
model = nn.Linear(1, 1, bias=False).double() # simplest possible model - multiply input x by weight w without bias
meta_opt = optim.SGD(model.parameters(), lr=meta_lr, momentum=0.)


def run_inner_loop_once(model, verbose, copy_initial_weights):
    lr_tensor = torch.tensor([0.3], requires_grad=True)
    momentum_tensor = torch.tensor([0.5], requires_grad=True)
    opt = optim.SGD(model.parameters(), lr=0.3, momentum=0.5)
    with higher.innerloop_ctx(model, opt, copy_initial_weights=copy_initial_weights, override={'lr': lr_tensor, 'momentum': momentum_tensor}) as (fmodel, diffopt):
        for j in range(loops):
            if verbose:
                print('Starting inner loop step j=={0}'.format(j))
                print('    Representation of fmodel.parameters(time={0}): {1}'.format(j, str(list(fmodel.parameters(time=j)))))
                print('    Notice that fmodel.parameters() is same as fmodel.parameters(time={0}): {1}'.format(j, (list(fmodel.parameters())[0] is list(fmodel.parameters(time=j))[0])))
            out = fmodel(x)
            if verbose:
                print('    Notice how `out` is `x` multiplied by the latest version of weight: {0:.4} * {1:.4} == {2:.4}'.format(x[0,0].item(), list(fmodel.parameters())[0].item(), out[0].item()))
            loss = ((out - y)**2).mean()
            diffopt.step(loss)

        if verbose:
            # after all inner training let's see all steps' parameter tensors
            print()
            print("Let's print all intermediate parameters versions after inner loop is done:")
            for j in range(loops+1):
                print('    For j=={0} parameter is: {1}'.format(j, str(list(fmodel.parameters(time=j)))))
            print()

        # let's imagine now that our meta-learning optimization is trying to check how far we got in the end from the actual_multiplier
        weight_learned_after_full_inner_loop = list(fmodel.parameters())[0]
        meta_loss = (weight_learned_after_full_inner_loop - actual_multiplier)**2
        print('  Final meta-loss: {0}'.format(meta_loss.item()))
        meta_loss.backward() # will only propagate gradient to original model parameter's `grad` if copy_initial_weights=False
        if verbose:
            print('  Gradient of final loss we got for lr and momentum: {0} and {1}'.format(lr_tensor.grad, momentum_tensor.grad))
            print('  If you change number of iterations "loops" to much larger number final loss will be stable and the values above will be smaller')
        return meta_loss.item()

print('=================== Run Inner Loop First Time (copy_initial_weights=True) =================\n')
meta_loss_val1 = run_inner_loop_once(model, verbose=True, copy_initial_weights=True)
print("\nLet's see if we got any gradient for initial model parameters: {0}\n".format(list(model.parameters())[0].grad))

print('=================== Run Inner Loop Second Time (copy_initial_weights=False) =================\n')
meta_loss_val2 = run_inner_loop_once(model, verbose=False, copy_initial_weights=False)
print("\nLet's see if we got any gradient for initial model parameters: {0}\n".format(list(model.parameters())[0].grad))

print('=================== Run Inner Loop Third Time (copy_initial_weights=False) =================\n')
final_meta_gradient = list(model.parameters())[0].grad.item()
# Now let's double-check `higher` library is actually doing what it promised to do, not just giving us
# a bunch of hand-wavy statements and difficult to read code.
# We will do a simple SGD step using meta_opt changing initial weight for the training and see how meta loss changed
meta_opt.step()
meta_opt.zero_grad()
meta_step = - meta_lr * final_meta_gradient # how much meta_opt actually shifted initial weight value
meta_loss_val3 = run_inner_loop_once(model, verbose=False, copy_initial_weights=False)

meta_loss_gradient_approximation = (meta_loss_val3 - meta_loss_val2) / meta_step

print()
print('Side-by-side meta_loss_gradient_approximation and gradient computed by `higher` lib: {0:.4} VS {1:.4}'.format(meta_loss_gradient_approximation, final_meta_gradient))

Which produces the following output:

=================== Run Inner Loop First Time (copy_initial_weights=True) =================

Starting inner loop step j==0
    Representation of fmodel.parameters(time=0): [tensor([[-0.9915]], dtype=torch.float64, requires_grad=True)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=0): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * -0.9915 == -0.4135
Starting inner loop step j==1
    Representation of fmodel.parameters(time=1): [tensor([[-0.1217]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=1): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * -0.1217 == -0.05075
Starting inner loop step j==2
    Representation of fmodel.parameters(time=2): [tensor([[1.0145]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=2): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 1.015 == 0.4231
Starting inner loop step j==3
    Representation of fmodel.parameters(time=3): [tensor([[2.0640]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=3): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 2.064 == 0.8607
Starting inner loop step j==4
    Representation of fmodel.parameters(time=4): [tensor([[2.8668]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    Notice that fmodel.parameters() is same as fmodel.parameters(time=4): True
    Notice how `out` is `x` multiplied by the latest version of weight: 0.417 * 2.867 == 1.196

Let's print all intermediate parameters versions after inner loop is done:
    For j==0 parameter is: [tensor([[-0.9915]], dtype=torch.float64, requires_grad=True)]
    For j==1 parameter is: [tensor([[-0.1217]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==2 parameter is: [tensor([[1.0145]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==3 parameter is: [tensor([[2.0640]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==4 parameter is: [tensor([[2.8668]], dtype=torch.float64, grad_fn=<AddBackward0>)]
    For j==5 parameter is: [tensor([[3.3908]], dtype=torch.float64, grad_fn=<AddBackward0>)]

  Final meta-loss: 0.011927987982895929
  Gradient of final loss we got for lr and momentum: tensor([-1.6295]) and tensor([-0.9496])
  If you change number of iterations "loops" to much larger number final loss will be stable and the values above will be smaller

Let's see if we got any gradient for initial model parameters: None

=================== Run Inner Loop Second Time (copy_initial_weights=False) =================

  Final meta-loss: 0.011927987982895929

Let's see if we got any gradient for initial model parameters: tensor([[-0.0053]], dtype=torch.float64)

=================== Run Inner Loop Third Time (copy_initial_weights=False) =================

  Final meta-loss: 0.01192798770078706

Side-by-side meta_loss_gradient_approximation and gradient computed by `higher` lib: -0.005311 VS -0.005311
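The last line of the output compares a finite-difference estimate of the meta-gradient with the gradient obtained through backpropagation. The same kind of check can be sketched in pure Python for a scalar model, assuming PyTorch-style SGD with momentum (buf = momentum * buf + grad, then w = w - lr * buf). The value mean_x2 = 1/3 is an assumption standing in for the mean of x squared over the data, so the numbers will not match the output above; run_inner_loop and finite_difference are illustrative names only.

```python
def run_inner_loop(w0, lr=0.3, momentum=0.5, steps=5, target=3.5, mean_x2=1.0 / 3.0):
    """Return (meta_loss, d meta_loss / d w0) for a scalar linear model."""
    w, buf = w0, 0.0
    dw, dbuf = 1.0, 0.0  # derivatives of w and buf with respect to w0
    for _ in range(steps):
        # Gradient of mean((w * x - target * x) ** 2) with respect to w
        grad = 2.0 * mean_x2 * (w - target)
        dgrad = 2.0 * mean_x2 * dw
        # Momentum buffer update, with its derivative carried alongside
        buf = momentum * buf + grad
        dbuf = momentum * dbuf + dgrad
        # Parameter update, again with its derivative
        w = w - lr * buf
        dw = dw - lr * dbuf
    meta_loss = (w - target) ** 2
    return meta_loss, 2.0 * (w - target) * dw


def finite_difference(w0, h=1e-4):
    """Central-difference estimate of d meta_loss / d w0."""
    loss_plus, _ = run_inner_loop(w0 + h)
    loss_minus, _ = run_inner_loop(w0 - h)
    return (loss_plus - loss_minus) / (2.0 * h)


_, analytic_grad = run_inner_loop(-1.0)
fd_grad = finite_difference(-1.0)
print(analytic_grad, fd_grad)
```

The derivative is carried through each update by the chain rule, which is what backpropagation through the unrolled inner loop computes in the other direction; the two estimates agree up to floating-point error.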