How should the learning rate change as the batch size changes?

Tan*_*may 12 machine-learning deep-learning

Should I change the learning rate when I increase or decrease the batch size of the mini-batches used in SGD? If so, how?

For reference, I was discussing this with someone, and they said that when the batch size is increased, the learning rate should be decreased to some extent.

My understanding is that when I increase the batch size, the computed average gradient is less noisy, so I can either keep the same learning rate or increase it.

Also, if I use an adaptive learning-rate optimizer, such as Adam or RMSProp, then I think I could leave the learning rate unchanged.

Please correct me if I am mistaken, and share any insight on this.
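
For reference, here is a quick numpy sanity check of the "less noisy gradient" intuition above; the least-squares toy loss and all the numbers are made up purely for illustration:

```python
# Toy check: the variance of the mini-batch gradient of a least-squares loss
# shrinks roughly like 1 / batch_size as the batch grows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))               # made-up inputs
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=10_000)  # made-up targets
w = np.zeros(20)                                # fixed point at which gradients are evaluated

def minibatch_grad(batch_size):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size  # gradient of the batch MSE

for b in [8, 32, 128, 512]:
    grads = np.stack([minibatch_grad(b) for _ in range(200)])
    print(b, grads.var(axis=0).mean())           # mean per-coordinate gradient variance
```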

SvG*_*vGA 21

In addition to the paper mentioned in Dmytro's answer, you can refer to the following article: Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (October 2018). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. The authors provide mathematical and empirical foundations for the idea that the ratio of learning rate to batch size influences the generalization capacity of a DNN. They show that this ratio plays a major role in the width of the minima found by SGD: the higher the ratio, the wider the minima and the better the generalization.

  • Not sure this answers the question of how to change the learning rate with batch size? It only points out that a larger ratio of learning rate to batch size gives better generalization. (2)

Dmy*_*pko 14

Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient expectation constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997

However, recent experiments with large mini-batches suggest a simpler linear scaling rule: multiply the learning rate by k when using a mini-batch of size kN. See P. Goyal et al.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
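
As a concrete illustration of the two rules (the baseline of lr = 0.1 at batch size 256 below is an arbitrary example, not a value from either paper):

```python
import math

def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a reference learning rate when the batch size changes by a factor k.

    rule="sqrt"   -> multiply by sqrt(k), as in Krizhevsky (2014)
    rule="linear" -> multiply by k, as in Goyal et al. (2017)
    """
    k = new_batch / base_batch
    if rule == "sqrt":
        return base_lr * math.sqrt(k)
    if rule == "linear":
        return base_lr * k
    raise ValueError(f"unknown rule: {rule}")

print(scale_lr(0.1, 256, 2048, rule="sqrt"))    # ~0.28
print(scale_lr(0.1, 256, 2048, rule="linear"))  # 0.8
```

Note that Goyal et al. pair the linear rule with a gradual warmup over the first few epochs, which becomes important at very large batch sizes.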

I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate can probably stay the same if the batch size does not change substantially.

  • Where do you get that the learning rate may remain the same if the batch size does not change substantially? I have never seen this in theory or in practice. (4)
  • Every article on deep learning (using datasets such as CIFAR or MNIST) should start with the sentence "Theory suggests...". (3)

Ent*_*Fox 11

Learning Rate Scaling for Dummies

I've always found the heuristics, which seem to vary somewhere between scaling with the square root of the batch size and scaling with the batch size itself, to be a bit hand-wavy and fluffy, as is often the case in Deep Learning. Hence I devised my own theoretical framework to answer this question. If you are interested, the full pre-print is here (currently under review): https://arxiv.org/abs/2006.09092

Learning Rate is a function of the Largest Eigenvalue

Let me start with two small sub-questions, which answer the main question

  • Are there any cases where we can a priori know the optimal learning rate?

Yes, for the convex quadratic, the optimal learning rate is given as 2/(λ+μ), where λ and μ represent the largest and smallest eigenvalues of the Hessian (Hessian = the second derivative of the loss, ∇∇L, which is a matrix) respectively.
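
A quick numerical check of that claim, on a made-up 2D quadratic (not an example from the paper):

```python
# Gradient descent on f(x) = 0.5 * x^T A x, whose Hessian is A.
# The optimal constant step size for this problem is 2 / (lambda_max + lambda_min).
import numpy as np

A = np.diag([10.0, 1.0])              # Hessian with eigenvalues 10 and 1
lr_opt = 2 / (10.0 + 1.0)             # = 2/11, roughly 0.18

def steps_to_converge(lr, tol=1e-8, max_iter=10_000):
    x = np.array([1.0, 1.0])
    for i in range(max_iter):
        if np.linalg.norm(x) < tol:
            return i
        x = x - lr * (A @ x)          # gradient of f is A x
    return max_iter

for lr in [0.05, 0.10, lr_opt, 0.19]:
    print(f"lr={lr:.3f}: {steps_to_converge(lr)} steps")   # lr_opt converges fastest
```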

  • How do we expect these eigenvalues (which represent how much the loss changes along an infinitesimal move in the direction of the eigenvectors) to change as a function of batch size?

This is actually a little more tricky to answer (it is what I made the theory for in the first place), but it goes something like this.

Let us imagine that we have all the data, which would give us the full Hessian H. But now instead we only sub-sample this Hessian, so we use a batch Hessian B. We can simply re-write B = H + (B - H) = H + E, where E is now some error or fluctuation matrix.

Under some technical assumptions on the nature of the elements of E, we can assume this fluctuation to be a zero-mean random matrix, and so the batch Hessian becomes a fixed matrix plus a random matrix.

For this model, the change in the eigenvalues (which determine how large the learning rate can be) is known. In my paper there is another, fancier model, but the answer is more or less the same.
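
Here is a rough numpy illustration of that picture (a made-up least-squares problem, not the model from the paper): the Hessian computed on a batch is the full-data Hessian plus a fluctuation, and its largest eigenvalue shrinks toward the full-data value as the batch grows.

```python
# For a least-squares loss, the Hessian on a batch of inputs Xb is (2/b) Xb^T Xb,
# so we can watch the top eigenvalue of the batch Hessian B approach that of H.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 50))                 # made-up design matrix
H = 2 * X.T @ X / len(X)                          # full-data Hessian
print("full data:", np.linalg.eigvalsh(H)[-1])

for b in [16, 64, 256, 1024, 4096]:
    tops = []
    for _ in range(20):
        idx = rng.choice(len(X), size=b, replace=False)
        Xb = X[idx]
        B = 2 * Xb.T @ Xb / b                     # batch Hessian = H + E
        tops.append(np.linalg.eigvalsh(B)[-1])
    print(b, np.mean(tops))                       # decreases toward the full-data value
```

Since, as in the quadratic case above, the usable learning rate is bounded by roughly 2 divided by the largest eigenvalue, a smaller top eigenvalue at larger batch sizes is exactly what permits larger steps.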

What actually happens? Experiments and Scaling Rules

I attach a plot of what happens in the case that the largest eigenvalue from the full data matrix is far outside that of the noise matrix (usually the case). As we increase the mini-batch size, the size of the noise matrix decreases and so the largest eigenvalue also decreases in size, hence larger learning rates can be used. This effect is initially proportional and continues to be approximately proportional until a threshold after which no appreciable decrease happens.

[Plot: change in the largest eigenvalue as a function of mini-batch size]

How well does this hold in practice? The answer, as shown in my plot below for VGG-16 without batch norm (see the paper for batch normalisation and ResNets), is: very well.

[Plot: learning-rate scaling with batch size for VGG-16 without batch norm]

I would hasten to add that for adaptive methods the story is a little different, because you have an interplay between the eigenvalues, the estimated eigenvalues, and the estimated eigenvectors! So you actually end up with a square-root rule, up to a threshold. Why nobody discusses or publishes this result is, honestly, a little beyond me.

[Plot: learning-rate scaling with batch size for adaptive methods]

However, if you want my practical advice: stick with SGD, and if your batch size is small, scale the learning rate proportionally as the batch size increases, and then do not increase it beyond a certain point.
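
In code, that advice might look something like the sketch below; the base values and the cap are placeholders you would tune for your own setup:

```python
def practical_sgd_lr(base_lr, base_batch, new_batch, max_scale=8.0):
    """Linear scaling of the SGD learning rate with batch size,
    capped beyond a threshold (all numbers illustrative, not prescriptive)."""
    scale = min(new_batch / base_batch, max_scale)
    return base_lr * scale

# e.g. a learning rate tuned as 0.05 at batch size 128:
for b in [128, 256, 1024, 4096]:
    print(b, practical_sgd_lr(0.05, 128, b))
```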

  • Impressive answer. Do the results in https://arxiv.org/abs/2103.00065 confirm or contradict your model? (2)
  • "Would you mind being more specific about which results?" I am referring to this: > "At the Edge of Stability, the behavior of gradient descent on the real neural training objective is irreconcilably different from its behavior on the quadratic Taylor approximation: the former makes consistent (if unstable) progress, whereas the latter diverges (and this divergence happens quickly, as we demonstrate in Appendix D). Thus, the behavior of gradient descent at the Edge of Stability is inherently non-quadratic." (2)