How can I use TensorFlow's tf.train.string_input_producer to produce data for several epochs?

dan*_*che 2 python neural-network tensorflow

When I wanted to use tf.train.string_input_producer to load data for 2 epochs, I used

filename_queue = tf.train.string_input_producer(filenames=['data.csv'], num_epochs=2, shuffle=True)

col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
    [col1, col2, col3], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True)

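But then I found that this op does not produce what I want.

It only yields each sample in data.csv 2 times, and the order in which the samples come out is unclear. For example, with 3 rows of data in data.csv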

[[1]
[2]
[3]]

it produces (each sample appears exactly 2 times, but the order is arbitrary)

[1]
[1]
[3]
[2]
[2]
[3]

But what I want is (each epoch kept separate, with shuffling inside each epoch)

(epoch 1:)
[1]
[2]
[3]
(epoch 2:)
[1]
[3]
[2]

Also, how can I tell when 1 epoch is finished? Is there some flag variable? Thanks!

My code is here.

import tensorflow as tf

def read_my_file_format(filename_queue):
    reader = tf.TextLineReader()
    key, value = reader.read(filename_queue)
    record_defaults = [['1'], ['1'], ['1']]  
    col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults, field_delim='-')
    # col1 = list(map(int, col1.split(',')))
    # col2 = list(map(int, col2.split(',')))
    return col1, col2, col3

def input_pipeline(filenames, batch_size, num_epochs=1):
  filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=num_epochs, shuffle=True)
  col1,col2,col3 = read_my_file_format(filename_queue)

  min_after_dequeue = 10
  capacity = min_after_dequeue + 3 * batch_size
  col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
    [col1, col2, col3], batch_size=batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True)
  return col1_batch, col2_batch, col3_batch

filenames=['1.txt']
batch_size = 3
num_epochs = 1
a1,a2,a3=input_pipeline(filenames, batch_size, num_epochs)

with tf.Session() as sess:
  sess.run(tf.local_variables_initializer())
  # start populating filename queue
  coord = tf.train.Coordinator()
  threads = tf.train.start_queue_runners(coord=coord)
  try:
    while not coord.should_stop():
      a, b, c = sess.run([a1, a2, a3])
      print(a, b, c)
  except tf.errors.OutOfRangeError:
    print('Done training, epoch reached')
  finally:
    coord.request_stop()

  coord.join(threads) 

My data looks like

1,2-3,4-A
7,8-9,10-B
12,13-14,15-C
17,18-19,20-D
22,23-24,25-E
27,28-29,30-F
32,33-34,35-G
37,38-39,40-H

mrr*_*rry 11

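As Nicolas observes, the tf.train.string_input_producer() API does not let you detect when the end of an epoch is reached; instead, it concatenates all epochs into one long stream of records. For this reason, we recently added (in TensorFlow 1.2) the tf.contrib.data API, which makes it possible to express more sophisticated pipelines, including your use case.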
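To make the "one long stream" point concrete, here is a pure-Python toy model (not TensorFlow's actual queue implementation). The helper `shuffle_buffer` is a hypothetical name, and its buffering rule is only an approximation of what `min_after_dequeue` does. It shows that shuffling a concatenated stream still yields every row `num_epochs` times but loses the epoch boundaries, while shuffling one epoch at a time keeps them.

```python
import random


def shuffle_buffer(stream, min_after_dequeue, rng):
    """Toy shuffling queue (an assumption, not TensorFlow code).

    Items are buffered; once the buffer holds more than
    `min_after_dequeue` items, a random one is emitted.
    """
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > min_after_dequeue:
            yield buffer.pop(rng.randrange(len(buffer)))
    # Input exhausted: drain the remaining items in random order.
    rng.shuffle(buffer)
    yield from buffer


rows = [1, 2, 3]
num_epochs = 2
rng = random.Random(0)

# Queue-style input: the file is replayed num_epochs times as ONE stream,
# so rows from different epochs share the same shuffle buffer.
concatenated = [row for _ in range(num_epochs) for row in rows]
mixed = list(shuffle_buffer(concatenated, min_after_dequeue=10, rng=rng))
print('concatenated stream:', mixed)

# Per-epoch input: a fresh pass (and a fresh buffer) for every epoch,
# so each epoch is a permutation of the rows.
per_epoch = [
    list(shuffle_buffer(rows, min_after_dequeue=10, rng=rng))
    for _ in range(num_epochs)
]
for epoch_index, epoch_rows in enumerate(per_epoch, start=1):
    print('epoch', epoch_index, epoch_rows)
```

In the first case nothing in the output marks where epoch 1 stops, which matches the behaviour described in the question.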

The following code snippet shows how you would write your program using tf.contrib.data:

import tensorflow as tf

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = (tf.contrib.data.TextLineDataset(filenames)
               .map(lambda line: tf.decode_csv(
                    line, record_defaults=[['1'], ['1'], ['1']], field_delim='-'))
               .shuffle(buffer_size=10)  # Equivalent to min_after_dequeue=10.
               .batch(batch_size))

    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator() 

filenames=['1.txt']
batch_size = 3
num_epochs = 10
iterator = input_pipeline(filenames, batch_size)

# `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator.    
a1, a2, a3 = iterator.get_next()

with tf.Session() as sess:
    for _ in range(num_epochs):
        # Resets the iterator at the beginning of an epoch.
        sess.run(iterator.initializer)

        try:
            while True:
                a, b, c = sess.run([a1, a2, a3])
                print(a, b, c)
        except tf.errors.OutOfRangeError:
            # This will be raised when you reach the end of an epoch (i.e. the
            # iterator has no more elements).
            pass                 

        # Perform any end-of-epoch computation here.
        print('Done training, epoch reached')

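  • Exceptions are the only mechanism TensorFlow currently has for signaling that a requested value could not be computed. (It's similar to how Python uses the `StopIteration` exception to signal the end of an iterator in its own iterator protocol.) It's certainly possible to wrap this in some library code, and I suggest one way of doing this in [this GitHub comment](https://github.com/tensorflow/tensorflow/issues/7951#issuecomment-303546037). (2 upvotes)
  • Why not simply `while not sess.run(epoch_done): ...`? `epoch_done` would be a variable set by the queue and reset by `iterator.initializer`. (2 upvotes)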
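As an illustration of the "wrap it in library code" idea from the first comment, here is a minimal pure-Python sketch. It is not necessarily the approach taken in the linked GitHub comment. `OutOfRangeError` and `FakeIterator` are hypothetical stand-ins for `tf.errors.OutOfRangeError` and for a session running an initializable iterator, so that the snippet runs without TensorFlow; `epoch_batches` is likewise an invented helper name.

```python
class OutOfRangeError(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


class FakeIterator:
    """Hypothetical stand-in for an initializable iterator run in a session."""

    def __init__(self, batches):
        self._batches = batches
        self._position = 0

    def initialize(self):
        # Plays the role of sess.run(iterator.initializer).
        self._position = 0

    def get_next(self):
        # Plays the role of sess.run([a1, a2, a3]).
        if self._position >= len(self._batches):
            raise OutOfRangeError()
        batch = self._batches[self._position]
        self._position += 1
        return batch


def epoch_batches(iterator):
    """Yield the batches of one epoch.

    The end-of-data exception is turned into ordinary generator
    exhaustion, so callers can use a plain `for` loop.
    """
    iterator.initialize()
    while True:
        try:
            batch = iterator.get_next()
        except OutOfRangeError:
            return
        yield batch


iterator = FakeIterator([('1,2', '3,4', 'A'), ('7,8', '9,10', 'B')])
seen = []
for epoch in range(2):
    for batch in epoch_batches(iterator):
        seen.append((epoch, batch))
    # The inner loop ending is the "epoch finished" signal.
    print('Done with epoch', epoch)
```

With a wrapper like this the training loop never handles the exception itself, which gives roughly the ergonomics the second comment asks for without a separate `epoch_done` variable.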