Sha*_*hai 14 machine-learning neural-network gradient-descent deep-learning caffe
When facing difficulties during training (nans, loss not converging, etc.), it is sometimes useful to get a more verbose training log by setting debug_info: true in the 'solver.prototxt' file.
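If you prefer to flip this flag programmatically rather than editing the file by hand, a minimal sketch using pycaffe's protobuf bindings could look like the following (the file names 'solver.prototxt' and 'solver_debug.prototxt' are placeholders):

    from caffe.proto import caffe_pb2
    from google.protobuf import text_format

    # Parse the existing solver definition (placeholder file name).
    solver_param = caffe_pb2.SolverParameter()
    with open('solver.prototxt') as f:
        text_format.Merge(f.read(), solver_param)

    # Same effect as adding the line "debug_info: true" to solver.prototxt.
    solver_param.debug_info = True

    with open('solver_debug.prototxt', 'w') as f:
        f.write(text_format.MessageToString(solver_param))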
The training log then looks something like:
    I1109 ...] [Forward] Layer data, top blob data data: 0.343971
    I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
    I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
    I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
    I1109 ...] [Forward] Layer relu1, top blob conv1 data: 0.0337982
    I1109 ...] [Forward] Layer conv2, top blob conv2 data: 0.0249297
    I1109 ...] [Forward] Layer conv2, param blob 0 data: 0.00875855
    I1109 ...] [Forward] Layer conv2, param blob 1 data: 0
    I1109 ...] [Forward] Layer relu2, top blob conv2 data: 0.0128249
    ...
    I1109 ...] [Forward] Layer fc1, top blob fc1 data: 0.00728743
    I1109 ...] [Forward] Layer fc1, param blob 0 data: 0.00876866
    I1109 ...] [Forward] Layer fc1, param blob 1 data: 0
    I1109 ...] [Forward] Layer loss, top blob loss data: 2031.85
    I1109 ...] [Backward] Layer loss, bottom blob fc1 diff: 0.124506
    I1109 ...] [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
    I1109 ...] [Backward] Layer fc1, param blob 0 diff: 0.483772
    I1109 ...] [Backward] Layer fc1, param blob 1 diff: 4079.72
    ...
    I1109 ...] [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
    I1109 ...] [Backward] Layer conv2, param blob 0 diff: 0.00661093
    I1109 ...] [Backward] Layer conv2, param blob 1 diff: 0.10995
    I1109 ...] [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
    I1109 ...] [Backward] Layer conv1, param blob 0 diff: 0.0220984
    I1109 ...] [Backward] Layer conv1, param blob 1 diff: 0.0429201
    E1109 ...] [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)
What does it mean?
Sha*_*hai 16
At first glance, you can see that this log section is divided into two parts: [Forward] and [Backward]. Recall that neural-network training is done via forward-backward propagation:
A training example (batch) is fed to the net, and a forward pass outputs the current prediction.
Based on this prediction a loss is computed. The loss is then differentiated, and a gradient is estimated and propagated backward using the chain rule.
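In pycaffe terms, one such iteration can be sketched roughly as follows (the solver file name is a placeholder; solver.step(1) bundles the forward pass, the backward pass and the weight update):

    import caffe

    solver = caffe.SGDSolver('solver.prototxt')  # placeholder path
    solver.net.forward()    # forward pass: fills the data part of every blob
    solver.net.backward()   # backward pass: fills the diff part of every blob
    # equivalently: solver.step(1) runs forward + backward and applies the update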
Caffe Blob data structure
Just a quick re-cap. Caffe uses the Blob data structure to store data/weights/parameters etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
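A rough pycaffe illustration of these two parts (the file names are placeholders; the layer name 'conv1' is taken from the log below):

    import caffe

    net = caffe.Net('train_val.prototxt', 'weights.caffemodel', caffe.TEST)  # placeholders

    top = net.blobs['conv1']             # the layer's output ("top") Blob
    filters, bias = net.params['conv1']  # param blob 0 (filters) and param blob 1 (bias)

    print(top.data.shape, top.diff.shape)   # data: forward values, diff: backpropagated gradients
    print(filters.data.shape, bias.data.shape)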
Forward pass
In this part of the log you will see all the layers listed from bottom to top. For each layer you will see:
    I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
    I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
    I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
Layer "conv1"是一个卷积层,有2个param blob:过滤器和偏差.因此,日志有三行.过滤器blob(param blob 0)有data
    I1109 ...] [Forward] Layer conv1, param blob 0 data: 0.00899114
That is, the current L2 norm of the convolution filter weights is 0.00899.
The current bias (param blob 1):
    I1109 ...] [Forward] Layer conv1, param blob 1 data: 0
meaning that the bias is currently set to 0.
Last but not least, the "conv1" layer has an output, a "top" named "conv1" (how original...). The L2 norm of the output is
    I1109 ...] [Forward] Layer conv1, top blob conv1 data: 0.0645037
Note that all the L2 values for the [Forward] pass are reported on the data part of the Blobs in question.
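If you want to relate such a figure to the underlying tensor yourself, here is a small sketch (placeholder file names again; depending on your Caffe version the logged value may be scaled differently, so compare both statistics below against your own log):

    import numpy as np
    import caffe

    net = caffe.Net('train_val.prototxt', 'weights.caffemodel', caffe.TEST)  # placeholders
    w = net.params['conv1'][0].data                  # the conv1 filter weights
    print('L2 norm:  %g' % np.sqrt((w ** 2).sum()))
    print('mean |w|: %g' % np.abs(w).mean())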
Loss and gradient
At the very end of the [Forward] pass comes the loss layer:
    I1109 ...] [Forward] Layer loss, top blob loss data: 2031.85
    I1109 ...] [Backward] Layer loss, bottom blob fc1 diff: 0.124506
In this example the batch loss is 2031.85; the gradient of the loss w.r.t. fc1 is computed and passed to the diff part of the fc1 Blob. The L2 magnitude of the gradient is 0.1245.
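In pycaffe this step looks roughly like the following sketch (the blob names 'loss' and 'fc1' come from the log above, and the solver path is a placeholder):

    import caffe

    solver = caffe.SGDSolver('solver.prototxt')   # placeholder path
    solver.net.forward()
    solver.net.backward()
    print(float(solver.net.blobs['loss'].data))   # the batch loss (2031.85 in the log above)
    print(solver.net.blobs['fc1'].diff.shape)     # gradient written into the diff part of fc1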
Backward pass
All the remaining layers are listed in this part, from top to bottom. You can see that the L2 magnitudes reported now belong to the diff part of the Blobs (the parameters and the layers' inputs).
Finally
The last log line of this iteration:
    [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)
reports the total L1 and L2 magnitudes of both the data and the gradients.
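These totals can be reproduced from pycaffe roughly as follows (a sketch; it aggregates over all of the net's parameter blobs, which may differ slightly from the exact set Caffe includes in this line):

    import numpy as np
    import caffe

    solver = caffe.SGDSolver('solver.prototxt')   # placeholder path
    solver.step(1)
    params = [p for blobs in solver.net.params.values() for p in blobs]
    data = np.concatenate([p.data.ravel() for p in params])
    diff = np.concatenate([p.diff.ravel() for p in params])
    print('L1 norm = (%g, %g); L2 norm = (%g, %g)' % (
        np.abs(data).sum(), np.abs(diff).sum(),
        np.sqrt((data ** 2).sum()), np.sqrt((diff ** 2).sum())))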
What should I look for?
If you have nans in your loss, see at which point your data or diff turns into nan: at which layer? At which iteration? (A sketch that automates these checks is given after this list.)
Look at the gradient magnitudes: they should be reasonable. If you start seeing values around e+8, your data/gradients are starting to blow up. Lower your learning rate!
See that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning. If you started from random weights, consider generating random weights with a higher variance.
Look for activations (rather than gradients) going to zero. If you are using "ReLU", this means your inputs/weights lead you to regions where the ReLU gates are "not active", resulting in "dead neurons". Consider normalizing your inputs to have zero mean, adding "BatchNorm" layers, or setting negative_slope in ReLU.
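A small sketch that automates the checks above during training (the solver path is a placeholder; the iteration count and the blow-up threshold are arbitrary):

    import numpy as np
    import caffe

    solver = caffe.SGDSolver('solver.prototxt')   # placeholder path
    for it in range(1000):
        solver.step(1)
        # nan/inf check on activations (the data part of every blob)
        for name, blob in solver.net.blobs.items():
            if not np.all(np.isfinite(blob.data)):
                print('iter %d: nan/inf in data of blob "%s"' % (it, name))
        # zero / exploding gradient check on the parameters (the diff part)
        for name, blobs in solver.net.params.items():
            for i, p in enumerate(blobs):
                m = np.abs(p.diff).max()
                if m == 0:
                    print('iter %d: all-zero diff for "%s" param %d (no learning)' % (it, name, i))
                elif m > 1e8:
                    print('iter %d: exploding diff for "%s" param %d' % (it, name, i))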