Correctly loading a model to resume training (meta graph, ckpts)

Moo*_*dra 0 python deep-learning tensorflow

I'm having trouble loading a model in order to resume training. I'm practicing with a simple two-layer (fully connected) NN on the CIFAR dataset.

NN setup:

#full_connected_layers


import tensorflow as tf
import numpy as np


#input _-> hidden ->


def inference(data_samples, image_pixels, hidden_units, classes, reg_constant):

  with tf.variable_scope('Layer1'):
    # Define the variables
    weights = tf.get_variable(
      name='weights',
      shape=[image_pixels, hidden_units],
      initializer=tf.truncated_normal_initializer(
        stddev=1.0 / np.sqrt(float(image_pixels))),
      regularizer=tf.contrib.layers.l2_regularizer(reg_constant)
    )

    biases = tf.Variable(tf.zeros([hidden_units]), name='biases')

    # Define the layer's output
    hidden = tf.nn.relu(tf.matmul(data_samples, weights) + biases)

  with tf.variable_scope('Layer2'):
    # Define the variables
    weights = tf.get_variable('weights', [hidden_units, classes],
      initializer=tf.truncated_normal_initializer(
        stddev=1.0 / np.sqrt(float(hidden_units))),
      regularizer=tf.contrib.layers.l2_regularizer(reg_constant))

    biases = tf.Variable(tf.zeros([classes]), name='biases')

    # Define the layer's output
    logits = tf.matmul(hidden, weights) + biases

  # Define the summary operation for the 'logits' tensor
  tf.summary.histogram('logits', logits)

  return logits


def loss(logits, labels):
  '''Calculates the loss from logits and labels.

  Args:
    logits: Logits tensor, float - [batch size, number of classes].
    labels: Labels tensor, int64 - [batch size].

  Returns:
    loss: Loss tensor of type float.
  '''

  with tf.name_scope('Loss'):
    # Operation to determine the cross entropy between logits and labels
    cross_entropy = tf.reduce_mean(
      tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels, name='cross_entropy'))

    # Operation for the loss function
    loss = cross_entropy + tf.add_n(tf.get_collection(
      tf.GraphKeys.REGULARIZATION_LOSSES))

    # Add a scalar summary for the loss
    tf.summary.scalar('loss', loss)

  return loss




def training(loss, learning_rate):


  # Create a variable to track the global step
  global_step = tf.Variable(0, name='global_step', trainable=False)


  train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)

  #train_step = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(
    #loss, global_step=global_step)


  return train_step




def evaluation(logits, labels):


  with tf.name_scope('Accuracy'):
    # Operation comparing prediction with true label
    correct_prediction = tf.equal(tf.argmax(logits,1), labels)

    # Operation calculating the accuracy of the predictions
    accuracy =  tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # Summary operation for the accuracy
    tf.summary.scalar('train_accuracy', accuracy)

  return accuracy

The model is saved as follows:

if (i + 1) % 500 == 0:
  saver.save(sess, MODEL_DIR, global_step=i)
  print('Saved checkpoint')

Saved model files

In this directory: C:\Users\Moondra\Desktop\CIFAR - PROJECT\parameters_no_changes

I have the following files, among others (for example model.ckpt-499.index):

model.ckpt-999.meta
model.ckpt-999.index
model.ckpt-999.data-00000-of-00001

My attempt at loading the model

import numpy as np
import tensorflow as tf
import time
from datetime import datetime
import os
import data_helpers
import full_connected_layers
import itertools


learning_rate = .0001
max_steps = 3000
batch_size = 400

checkpoint = r'C:\Users\Moondra\Desktop\CIFAR - PROJECT\parameters_no_changes\model.ckpt-999'





with tf.Session() as sess:
    saver = tf.train.import_meta_graph(r'C:\Users\Moondra\Desktop\CIFAR - PROJECT' +
                                    '\\parameters_no_changes\model.ckpt-999.meta')
    saver.restore(sess, checkpoint)


data_sets = data_helpers.load_data()

images = tf.get_default_graph().get_tensor_by_name('images:0') #image placeholder
labels = tf.get_default_graph().get_tensor_by_name('image-labels:0') #placeholder
loss = tf.get_default_graph().get_tensor_by_name('Loss/add:0')
#global_step = tf.get_default_graph().get_tensor_by_name('global_step/initial_value_1:0')

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss)



accuracy = tf.get_default_graph().get_tensor_by_name('Accuracy/Mean:0')




with tf.Session() as sess:

  #sess.run(tf.global_variables_initializer())
  zipped_data = zip(data_sets['images_train'], data_sets['labels_train'])
  batches = data_helpers.gen_batch(list(zipped_data), batch_size,
    max_steps)



  for i in range(max_steps):

    # Get next input data batch
    batch = next(batches)
    images_batch, labels_batch = zip(*batch)
    feed_dict = {
      images: images_batch,
      labels: labels_batch

       }

    if i % 100 == 0:
      train_accuracy = sess.run(accuracy, feed_dict=feed_dict)
      print('Step {:d}, training accuracy {:g}'.format(i, train_accuracy))




    ts,loss_  =sess.run([train_step, loss], feed_dict=feed_dict)

Errors and confusion

1) Should I restore with latest_checkpoint, like this:

saver.restore(sess, tf.train.latest_checkpoint('./'))

I've seen some tutorials that simply point to the folder containing the .data and .index files.

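2) That brings me to my second question: what should I pass as the second argument to saver.restore? At the moment I'm just pointing to the folder/directory that contains these files.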
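For reference, `saver.restore` expects a checkpoint prefix as its second argument: the path shared by the `.index` and `.data-*` files (for example `...\parameters_no_changes\model.ckpt-999`), not the folder and not any single file. `tf.train.latest_checkpoint` is the call that takes a folder and returns the newest prefix recorded there. The sketch below mimics that mapping in plain Python, without TensorFlow. `checkpoint_prefixes` is a hypothetical helper written for illustration, and the file list is modeled on the one in the question (the `checkpoint` bookkeeping file is assumed to be among the "others").

```python
import os
import re


def checkpoint_prefixes(filenames):
    """Return the checkpoint prefixes found in a file listing, ordered by step."""
    prefixes = []
    for name in filenames:
        # Each saved checkpoint has exactly one '<prefix>.index' file.
        if name.endswith('.index'):
            prefixes.append(name[:-len('.index')])

    def step(prefix):
        # The step number is the trailing '-<digits>' added by global_step.
        match = re.search(r'-(\d+)$', prefix)
        return int(match.group(1)) if match else -1

    return sorted(prefixes, key=step)


# Hypothetical listing, modeled on the files named in the question.
files = [
    'checkpoint',
    'model.ckpt-499.index',
    'model.ckpt-999.meta',
    'model.ckpt-999.index',
    'model.ckpt-999.data-00000-of-00001',
]

prefixes = checkpoint_prefixes(files)
# This joined path is the kind of value saver.restore() takes.
newest = os.path.join('parameters_no_changes', prefixes[-1])
print(prefixes)
print(newest)
```

The `checkpoint` string defined near the top of the loading script in the question already has this prefix form.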

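3) I deliberately did not initialize any variables, because I was told that doing so would overwrite the stored weight and bias values. That seems to cause this error: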

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Layer1/weights [[Node: Layer1/weights/read = Identity[T=DT_FLOAT, _class=["loc:@Layer1/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](Layer1/weights)]]

4) However, if I initialize all variables with this code:

sess.run(tf.global_variables_initializer())

my model seems to start training from scratch (rather than resuming training).

Does this mean I should explicitly load all the weights and biases via get_tensor? If so, how would I handle a network with 20+ layers?

5) When I run this command

for i in tf.get_default_graph().get_operations():
    print(i.values)

I see a lot of global_step tensors/operations:

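<bound method Operation.values of <tf.Operation 'global_step/initial_value' type=Const>>
<bound method Operation.values of <tf.Operation 'global_step' type=VariableV2>>
<bound method Operation.values of <tf.Operation 'global_step/Assign' type=Assign>>
<bound method Operation.values of <tf.Operation 'global_step/read' type=Identity>>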

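I tried to load this variable into my current graph, but I don't know which of these I should fetch with get_tensor_by_name. Most of them result in a "does not exist" error.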
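A tensor name is the name of the operation that produces it, followed by `:` and the output index, so the first output of the `global_step` operation is `global_step:0`; names that do not follow that pattern are not found. Note also that `print(i.values)` prints the bound method rather than calling it, which is why the listing shows `<bound method ...>`; `i.values()` or `i.outputs` lists the tensors themselves. Here is a minimal sketch, assuming a TensorFlow install that provides `tf.compat.v1` and using a fresh toy graph rather than the imported one:

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # assumption: TF 1.x-style graphs, as in the question

graph = tf.Graph()
with graph.as_default():
    # Same definition as in training() from the question.
    global_step = tf.Variable(0, name='global_step', trainable=False)

# Map each operation name to the names of the tensors it outputs.
outputs_by_op = {
    op.name: [tensor.name for tensor in op.outputs]
    for op in graph.get_operations()
}
for op_name, tensor_names in outputs_by_op.items():
    print(op_name, '->', tensor_names)

# The variable is reachable as '<operation name>:<output index>'.
step_tensor = graph.get_tensor_by_name('global_step:0')
```

Running the same loop on the imported graph should show which `...:0` names actually exist for the loss and accuracy as well.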

6) The same goes for loss: which of these should I load into my graph with get_tensor?

These are the options:

<bound method Operation.values of <tf.Operation 'Loss/Const' type=Const>>
<bound method Operation.values of <tf.Operation 'Loss/Mean' type=Mean>>
<bound method Operation.values of <tf.Operation 'Loss/AddN' type=AddN>>
<bound method Operation.values of <tf.Operation 'Loss/add' type=Add>>
<bound method Operation.values of <tf.Operation 'Loss/loss/tags' type=Const>>
<bound method Operation.values of <tf.Operation 'Loss/loss' type=ScalarSummary>>

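7) Finally, when I look at all the nodes of the graph, I see a lot of gradient operations, but I don't see any node related to train_step (the Python variable I created that points to the Gradient Descent optimizer). Does this mean I don't need to load it into this graph with get_tensor?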
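One point that may help here: `minimize()` returns an operation, and the Python variable name `train_step` is not stored in the graph, so no node carries that name unless one is passed explicitly. The sketch below is only an illustration on a toy graph (`w` and the quadratic loss are made up, and `tf.compat.v1` is assumed to be available); it names the training op and looks it up again with `get_operation_by_name`. In the question's `training()` no name is given; as far as I know the op then takes the optimizer's default name, `GradientDescent`, which is worth confirming against the op listing.

```python
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # assumption: TF 1.x-style graphs and sessions

graph = tf.Graph()
with graph.as_default():
    # Toy stand-in for the real network: one weight and a quadratic loss.
    w = tf.get_variable('w', shape=[], initializer=tf.ones_initializer())
    loss = tf.square(w, name='loss')
    global_step = tf.Variable(0, name='global_step', trainable=False)
    # Naming the op explicitly makes it easy to find again later.
    train_step = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step, name='train_step')
    init = tf.global_variables_initializer()

# Look the training op up by name, as one would after import_meta_graph.
found = graph.get_operation_by_name('train_step')
print(found.name)

with tf.Session(graph=graph) as sess:
    sess.run(init)
    sess.run(found)  # one gradient step; also increments global_step
    step_after, w_after = sess.run([global_step, w])
```

Fetching the saved op this way avoids building a second optimizer on top of an imported graph.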

Thanks.

Mih*_*uja 5

I usually perform this sequence of operations:

  1. Initialize

  2. Restore

This translates into code like this:

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    ...

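This avoids the uninitialized-value error, and the restore then overwrites the initial values with the values from the checkpoint.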
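The initialize-then-restore order can be tried end to end on a toy graph. The sketch below is not the question's CIFAR network: `w` and `bump` are made-up stand-ins for the weights and `train_step`, and it assumes a TensorFlow install that provides `tf.compat.v1`. It also keeps restoring and training inside one session, since variable values belong to the session they were restored into. The question's script restores in one `with tf.Session()` block and trains in a second one, which on its own could explain the uninitialized-value error.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()  # assumption: TF 1.x-style graphs and sessions

ckpt_dir = tempfile.mkdtemp()
prefix = os.path.join(ckpt_dir, 'model.ckpt')

# First run: build a toy graph, update a variable once, save a checkpoint.
with tf.Graph().as_default():
    w = tf.get_variable('w', shape=[2], initializer=tf.zeros_initializer())
    bump = tf.assign_add(w, [1.0, 2.0], name='bump')  # stand-in for train_step
    saver = tf.train.Saver()
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(bump)
        saver.save(sess, prefix, global_step=0)

# Second run: rebuild the graph from the .meta file and resume.
with tf.Graph().as_default() as graph:
    latest = tf.train.latest_checkpoint(ckpt_dir)  # folder in, prefix out
    saver = tf.train.import_meta_graph(latest + '.meta')
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # 1. initialize
        saver.restore(sess, latest)                  # 2. restore overwrites
        restored_w = sess.run(graph.get_tensor_by_name('w:0'))
        # Keep training inside this same session.
        sess.run(graph.get_operation_by_name('bump'))
        resumed_w = sess.run(graph.get_tensor_by_name('w:0'))

print(restored_w, resumed_w)
```

If the restored values matched the initializer instead of the saved ones, that would indicate the restore did not run in the session used for training.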