Nan in summary histogram

all*_*len 6 tensorflow

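My program sometimes hits this problem (not on every run). When it does, I can always reproduce the error by loading the last model I saved before the program crashed with NaN. When I rerun from that model, the first training step seems fine: the model produces a loss (I print the loss and it looks normal), but after the gradients are applied, the values of the embedding variables become NaN.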

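So what is the root cause of the NaN problem? I am stuck because I do not know how to debug this any further. With the same data and parameters the program runs fine most of the time and only hits this problem on some runs.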

Loading existing model from: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000
Train from restored model: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5235 get requests, put_count=4729 evicted_count=1000 eviction_rate=0.211461 and unsatisfied allocation rate=0.306781
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110
2016-10-04 21:45:39 epoch:1.87 train_step:18001 duration:0.947 elapsed:0.947 train_avg_metrics:['loss:0.527']  ['loss:0.527']
2016-10-04 21:45:39 epoch:1.87 eval_step: 18001 duration:0.001 elapsed:0.948 ratio:0.001
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1
     [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]]
Traceback (most recent call last):
  File "./train.py", line 308, in <module>
    tf.app.run()

小智 11

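Sometimes, during the first iterations of training, the model may predict only a single class. If, by random chance, the predicted value for a class turns out to be 0 for all training examples, the categorical cross-entropy loss takes the log of 0 and can evaluate to NaN.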

Make sure you add a small value when computing the loss, for example tf.log(predictions + 1e-8). This helps overcome the numerical instability.
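
As a rough illustration of the point above, here is a minimal NumPy sketch (NumPy stands in for TensorFlow here, and the function names, labels, and predictions are made up for the example). It shows how a predicted probability of exactly 0 turns the unguarded loss into NaN, while adding a small epsilon inside the log keeps it finite.

```python
import numpy as np


def naive_cross_entropy(labels, predictions):
    """Mean categorical cross-entropy with no guard against log(0)."""
    # Suppress NumPy's warnings so the NaN shows up only in the result.
    with np.errstate(divide="ignore", invalid="ignore"):
        per_example = -np.sum(labels * np.log(predictions), axis=1)
    return float(np.mean(per_example))


def stable_cross_entropy(labels, predictions, eps=1e-8):
    """Same loss, with a small epsilon added inside the log."""
    per_example = -np.sum(labels * np.log(predictions + eps), axis=1)
    return float(np.mean(per_example))


# Hypothetical data: one-hot labels, and a model that puts all of its
# probability mass on class 0 for every example.
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
predictions = np.array([[1.0, 0.0], [1.0, 0.0]])

naive_loss = naive_cross_entropy(labels, predictions)    # 0 * log(0) gives NaN
stable_loss = stable_cross_entropy(labels, predictions)  # stays finite

print("naive:", naive_loss)
print("stable:", stable_loss)
```

The epsilon value of 1e-8 is taken from the answer; whether it is small enough for a given model is a judgment call.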


Dmi*_*yal 8

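NaN is usually a sign of model instability, for example exploding gradients. It can go unnoticed: the loss simply stops shrinking. Logging weight summaries makes the problem explicit. I suggest lowering the learning rate as a first measure. If that does not help, please post your code here. Without seeing it, it is hard to suggest anything more specific.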
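
To make the learning-rate point concrete, here is a toy sketch in plain Python. It is not the asker's model: the loss f(w) = w**2, the function name, and both learning rates are assumptions chosen only to show the effect. With a step size that is too large, the weight grows until it overflows to inf and then becomes NaN after one more update, while a smaller step size converges.

```python
import math


def run_gradient_descent(learning_rate, steps=2000, w0=1.0):
    """Run plain gradient descent on the toy loss f(w) = w**2.

    Returns the final weight, which may be NaN if the updates diverge.
    """
    w = w0
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # standard gradient step
    return w


# Hypothetical learning rates, picked only for illustration.
diverged = run_gradient_descent(learning_rate=1.5)   # |w| doubles every step
converged = run_gradient_descent(learning_rate=0.1)  # |w| shrinks every step

print("lr=1.5 ->", diverged, "is NaN:", math.isnan(diverged))
print("lr=0.1 ->", converged)
```

Note that in the diverging run the weight stays finite until the very last steps before the overflow, which is consistent with a loss that still prints normally one step before the variables turn into NaN.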


Ale*_*nko 8

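I hit a similar error and tried different learning rates, batch sizes, loss functions, and model architectures without any luck. But then I noticed that I could train my model just fine if I did not use the TensorBoard callback. It looks like "Nan in summary histogram" refers to saving the histograms of the model weights, which somehow makes those NaNs explicit.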

Turning off histograms in the TensorBoard callback solved my problem:

tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=0)
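
If the weights really do contain NaN, switching histograms off will not remove them, so it may be worth checking which variables are affected. The following is only a sketch under assumptions: it uses NumPy arrays in place of real model variables, and the helper name and the weight names are invented for the example.

```python
import numpy as np


def find_non_finite(named_weights):
    """Return the names of arrays that contain at least one NaN or inf."""
    return [
        name
        for name, values in named_weights.items()
        if not np.all(np.isfinite(values))
    ]


# Hypothetical weights; in a real run these would be the current values
# of the model's variables, fetched as NumPy arrays.
weights = {
    "image_mlp/w_h": np.array([[0.1, np.nan], [0.3, 0.4]]),
    "embedding": np.array([[0.5, -0.2]]),
}

bad = find_non_finite(weights)
print("variables with NaN/inf:", bad)
```

Running a check like this after each training step would show the first step at which a variable goes bad, and which variable it is.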