PyTorch converges worse than TensorFlow when using the Adam optimizer

FFT*_*FFT 13 gradient-descent deep-learning tensorflow pytorch

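My PyTorch training program converges worse than the TensorFlow implementation of the same model. When I switch from Adam to SGD, the losses are identical. With Adam, the losses differ from the first epoch onward. I believe I am using the same settings in both programs. Any ideas on how to debug this would be helpful!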

Loss with SGD

PyTorch

0.1504615843296051
0.10858417302370071
0.08603279292583466

TensorFlow

0.15046157
0.108584
0.08603277

Loss with Adam

PyTorch

0.0031117501202970743
0.0020642257295548916
0.0019268309697508812
0.0016333406092599034
0.0017334128497168422
0.0014430736191570759
0.0010424457723274827
0.0012145100627094507
0.0011195113183930516
0.0009501167223788798
0.0009987876983359456
0.0007953296881169081
0.00075263757025823
0.0008374055614694953
0.000735406531020999

TensorFlow

0.0036667113
0.0032563617
0.0021536187
0.0015266595
0.0013580231
0.0013878695
0.0011856346
0.0011136091
0.00091276
0.000890126
0.00088381825
0.0007283067
0.00081382995
0.0006670901
0.00046282331

Adam optimizer settings

TF 1.15.3:

adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)

# default parameters from the documentation at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam"

PyTorch

torch.optim.Adam(params=model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)

Training

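  • I initialize both models with the same weights, loaded from a file.
  • I train and test on a single data sample, which is also loaded from a file. I train for 1000 iterations and test for 1 iteration, with a batch size of 1.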

Debugging so far

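  • As described above, I use the same parameters and data.
  • I ran a single forward-backward pass with the Adam optimizer and saved the data and gradients of every layer. I plotted the results. Everything looks the same, with differences between 1e-6 and 1e-10. The losses are also identical up to rounding error.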
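The check described above covers the forward and backward pass, so a natural next step would be to run the same comparison on the parameters right after the first optimizer step. A minimal sketch of such a per-layer comparison follows. The helper names `max_abs_diff` and `compare_layers`, the layer names, and all numbers are made up for illustration; real dumps would be passed in flattened form (for example `arr.ravel()` for a NumPy array).

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two flat sequences."""
    a, b = list(a), list(b)
    if len(a) != len(b):
        raise ValueError("sequences differ in length: %d vs %d" % (len(a), len(b)))
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def compare_layers(torch_vals, tf_vals, tol=1e-6):
    """Map each layer name to (max abs difference, within-tolerance flag)."""
    report = {}
    for name in sorted(torch_vals):
        diff = max_abs_diff(torch_vals[name], tf_vals[name])
        report[name] = (diff, diff <= tol)
    return report


# Made-up values standing in for flattened per-layer dumps taken
# after one optimizer step in each framework.
torch_vals = {"conv1.weight": [0.10, -0.20, 0.30], "fc.bias": [0.05]}
tf_vals = {"conv1.weight": [0.10, -0.20, 0.30 + 5e-8], "fc.bias": [0.05 + 1e-3]}

report = compare_layers(torch_vals, tf_vals)
for name, (diff, ok) in report.items():
    print("%-14s max|diff| = %.3e  %s" % (name, diff, "ok" if ok else "MISMATCH"))
```

If gradients agree but the updated parameters do not, the difference would have to come from the optimizer itself (its hyperparameters or its restored state) rather than from the model or the loss.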

Saving and loading the PyTorch model

def train(...):
    ...
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break

        # `in` is a reserved word in Python, so the input is named `inp` here
        inp = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")

        # adjust from the TF layout
        inp = inp.squeeze(3)
        inp = np.expand_dims(inp, axis=0)
        # ... do the same for out1 and out2

        inp, out1, out2 = \
            torch.from_numpy(inp).to(device), \
            torch.from_numpy(out1).to(device), \
            torch.from_numpy(out2).to(device)

        optimizer.zero_grad()
        out1_hat, out2_hat = model(inp)

        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()

        optimizer.step()

    save_checkpoint({'state_dict': model.state_dict(),
                    'optimizer': optimizer.state_dict()},
                    latest_filename=latest_checkpoint_path)

Saving and loading the TensorFlow model

sess.run(tf.global_variables_initializer())
writer = tf.summary.FileWriter(my_path, graph=sess.graph)

restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, load_path)

saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)

counter = 0
while run:
    counter += 1
    if counter > 1000:
        break

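    # `in` is a reserved word in Python, so the loaded arrays carry an `_np` suffix.
    # The placeholder names in_ph, out1_ph, out2_ph are assumed; the post reused
    # in, out1, out2 for both the arrays and the feed_dict keys.
    in_np = np.load("")
    out1_np = np.load("")
    out2_np = np.load("")
    out1_np = out1_np[0, :, :, :]
    out1_np = out1_np[:, :, :, np.newaxis]
    out2_np = out2_np[0, :, :, :]
    out2_np = out2_np[:, :, :, np.newaxis]
    in_np = in_np[0, :, :, :]
    in_np = in_np[:, :, :, np.newaxis]
    _, _loss = sess.run([optimizer, loss],
                        feed_dict={in_ph: in_np, out1_ph: out1_np, out2_ph: out2_np})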

save_path = saver.save(sess, my_save_path, global_step=int(_global_step))

sess.close()
tf.reset_default_graph()

Kyl*_*e C 0

The default epsilon in TF is 1e-7, not 1e-8. See here and here.
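To get a feel for how much epsilon can matter, here is a minimal scalar sketch of the Adam update in plain Python. It is not either library's implementation. The gradients are made up and deliberately tiny, since epsilon only becomes visible when sqrt(v) is of a similar magnitude. The `eps_hat` switch is an extra assumption of this sketch: the TF 1.x `AdamOptimizer` docstring describes its epsilon as the "epsilon hat" of the Kingma and Ba paper, which is added before the bias correction is folded into the step size, so the same numeric epsilon may not mean the same thing in both frameworks.

```python
import math


def adam_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999,
              eps=1e-8, eps_hat=False):
    """One scalar Adam update; returns (new_param, m, v).

    eps_hat=False: eps is added to sqrt(v_hat), as in Algorithm 1 of the paper.
    eps_hat=True:  eps is added to sqrt(v) and the bias corrections are folded
                   into the step size (the "epsilon hat" form).
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    bc1 = 1 - beta1 ** t  # bias correction for the first moment
    bc2 = 1 - beta2 ** t  # bias correction for the second moment
    if eps_hat:
        step = lr * math.sqrt(bc2) / bc1
        new_param = param - step * m / (math.sqrt(v) + eps)
    else:
        new_param = param - (lr / bc1) * m / (math.sqrt(v / bc2) + eps)
    return new_param, m, v


def run(grads, **kwargs):
    """Apply adam_step to a fixed gradient sequence, starting from param=1.0."""
    param, m, v = 1.0, 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        param, m, v = adam_step(param, g, m, v, t, **kwargs)
    return param


grads = [1e-7, 2e-7, 1.5e-7]  # made-up, deliberately tiny gradients

p_eps8 = run(grads, eps=1e-8)
p_eps7 = run(grads, eps=1e-7)
p_eps8_hat = run(grads, eps=1e-8, eps_hat=True)

print("eps=1e-8          :", repr(p_eps8))
print("eps=1e-7          :", repr(p_eps7))
print("eps=1e-8, eps_hat :", repr(p_eps8_hat))
```

With gradients much larger than epsilon, the three variants agree to many digits, so it may be worth checking the gradient magnitudes in the real model before attributing the whole gap to epsilon. Setting `epsilon` and `eps` explicitly in both frameworks removes any dependence on the defaults.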