Tan*_*may 12 machine-learning deep-learning
Should I change the learning rate when I increase or decrease the mini-batch size used in SGD? If so, how?
For reference, I was discussing this with someone who said that when the batch size is increased, the learning rate should be decreased to some extent.
My understanding is that when I increase the batch size, the computed average gradient is less noisy, so I can keep the same learning rate or even increase it.
Also, if I use an adaptive learning-rate optimizer such as Adam or RMSProp, I suppose I can leave the learning rate untouched.
Please correct me if I am mistaken, and I would appreciate any insight on this.
SvG*_*vGA 21
In addition to the paper mentioned in Dmytro's answer, you can refer to: Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018). Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio. The authors provide the mathematical and empirical foundation for the idea that the ratio of learning rate to batch size influences the generalization capacity of a DNN. They show that this ratio plays a major role in the width of the minima found by SGD: the higher the ratio, the wider the minima and the better the generalization.
Dmy*_*pko 14
Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient expectation constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
However, recent experiments with large mini-batches suggest a simpler linear scaling rule: when using a mini-batch of size kN, multiply the learning rate by k. See P. Goyal et al.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, https://arxiv.org/abs/1706.02677
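In code, the two heuristics amount to something like the following minimal sketch (base_lr and base_batch_size are just placeholders for whatever you tuned at the original batch size; neither paper prescribes this exact function):

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a tuned SGD learning rate when the mini-batch size changes."""
    k = new_batch_size / base_batch_size
    if rule == "sqrt":      # Krizhevsky: keep the variance of the gradient estimate constant
        return base_lr * k ** 0.5
    if rule == "linear":    # Goyal et al.: linear scaling rule
        return base_lr * k
    raise ValueError(f"unknown rule: {rule}")

# Example: a learning rate of 0.1 tuned at batch size 256, moving to batch size 1024 (k = 4)
print(scaled_learning_rate(0.1, 256, 1024, rule="sqrt"))    # 0.2
print(scaled_learning_rate(0.1, 256, 1024, rule="linear"))  # 0.4
```

Note that Goyal et al. combine the linear rule with a gradual warm-up of the learning rate over the first few epochs, which matters at very large batch sizes.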
I would say that with Adam, Adagrad and other adaptive optimizers, the learning rate can stay the same if the batch size does not change substantially.
Ent*_*Fox 11
I've always found the heuristics, which seem to vary somewhere between scaling with the square root of the batch size and scaling linearly with the batch size, to be a bit hand-wavy and fluffy, as is often the case in deep learning. Hence I devised my own theoretical framework to answer this question. If you are interested, the full pre-print (currently under review) is here: https://arxiv.org/abs/2006.09092
Let me start with two small sub-questions, which together answer the main question.
First, is there an optimal learning rate at all? Yes: for the convex quadratic, the optimal learning rate is given as 2/(λ_max + λ_min), where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian (the Hessian being the second derivative of the loss, ∇²L, which is a matrix), respectively.
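As a quick numerical check of that formula (a toy example of my own, not taken from the paper), gradient descent on a random positive-definite quadratic does better over a fixed step budget at step size 2/(λ_max + λ_min) than at smaller step sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 20))
H = A @ A.T + 0.1 * np.eye(20)          # a random positive-definite "Hessian"
lam = np.linalg.eigvalsh(H)
lr_opt = 2.0 / (lam.max() + lam.min())  # the 2/(lambda_max + lambda_min) rule

w0 = rng.standard_normal(20)

def final_loss(lr, steps=200):
    w = w0.copy()
    for _ in range(steps):
        w -= lr * (H @ w)               # gradient of L(w) = 0.5 * w^T H w is H w
    return 0.5 * w @ H @ w

for lr in [0.25 * lr_opt, 0.5 * lr_opt, lr_opt]:
    print(f"lr = {lr:.4f}   loss after 200 steps = {final_loss(lr):.3e}")
# lr_opt gives the fastest worst-case contraction; any lr above 2/lambda_max diverges.
```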
Second, how does the optimal learning rate change when we move from the full-data Hessian to a mini-batch Hessian? This is actually a little more tricky to answer (it is what I built the theory for in the first place), but it goes something like this.
Let us imagine that we have all the data, which would give us the full Hessian H. But instead we only sub-sample this Hessian, so we work with a batch Hessian B. We can simply rewrite B = H + (B − H) = H + E, where E is now some error or fluctuation matrix.
Under some technical assumptions on the nature of the elements of E, we can take these fluctuations to be a zero-mean random matrix, so the batch Hessian becomes a fixed matrix plus a random matrix.
For this model, the change in the eigenvalues (which determines how large the learning rate can be) is known. In my paper there is another, fancier model, but the answer is more or less the same.
I attach a plot of what happens when the largest eigenvalue of the full-data matrix lies far outside that of the noise matrix (which is usually the case). As we increase the mini-batch size, the size of the noise matrix decreases, and so the largest eigenvalue also decreases in size; hence larger learning rates can be used. This effect is initially proportional, and continues to be approximately proportional, up to a threshold after which no appreciable decrease happens.
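To make that picture concrete, here is a rough numerical sketch of the H + E model (my own toy illustration; the assumption that the noise scale shrinks like 1/sqrt(batch size), and the prefactor of 4, are modelling choices made purely so the effect is visible at this size):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# "Full-data" Hessian H: one dominant eigenvalue (10.0) plus a bulk of small ones.
H = np.diag(np.concatenate(([10.0], rng.uniform(0.0, 1.0, n - 1))))

for batch_size in [8, 32, 128, 512, 2048, 8192]:
    # Fluctuation matrix E: zero-mean symmetric noise whose scale is assumed
    # to shrink like 1/sqrt(batch_size).
    G = 4.0 / np.sqrt(batch_size) * rng.standard_normal((n, n))
    E = (G + G.T) / 2.0
    lam_max = np.linalg.eigvalsh(H + E).max()   # top eigenvalue of the batch Hessian B = H + E
    print(f"batch {batch_size:5d}   lambda_max {lam_max:6.2f}   max stable lr ~ {2.0 / lam_max:.3f}")
```

The top eigenvalue of B shrinks towards that of H as the batch grows and then flattens out, which is the proportional-then-saturating behaviour described above.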
How well does this hold in practice? The answer, as shown in my plot for VGG-16 without batch norm (see the paper for batch normalisation and ResNets), is: very well.
I would hasten to add that for adaptive methods the argument is a little different, because there you have an interplay between the eigenvalues, the estimated eigenvalues, and the estimated eigenvectors! So you actually end up with a square-root rule, again up to a threshold. Why nobody discusses or has published this result is, honestly, a bit beyond me.
But if you want my practical advice: stick with SGD, scale the learning rate proportionally to the increase in batch size if your batch size is small, and do not increase it beyond a certain point.
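In code, that advice amounts to something like the following sketch (the cap value is a hypothetical placeholder; the real threshold depends on the model and dataset and has to be found empirically):

```python
def practical_lr(base_lr, base_batch_size, new_batch_size, max_lr=0.4):
    """Scale the SGD learning rate linearly with batch size, but never above a cap."""
    return min(base_lr * new_batch_size / base_batch_size, max_lr)

# Example: a learning rate of 0.05 tuned at batch size 128
for bs in [128, 256, 512, 1024, 4096]:
    print(bs, practical_lr(0.05, 128, bs))
```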