I followed this instruction and wrote the following code to create a Dataset for images (COCO2014 training set):
from pathlib import Path
import tensorflow as tf
def image_dataset(filepath, image_size, batch_size, norm=True):
    def preprocess_image(image):
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, image_size)
        if norm:
            image /= 255.0  # normalize to [0,1] range
        return image

    def load_and_preprocess_image(path):
        image = tf.read_file(path)
        return preprocess_image(image)

    all_image_paths = [str(f) for f in Path(filepath).glob('*')]
    path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
    ds = path_ds.map(load_and_preprocess_image, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.shuffle(buffer_size=len(all_image_paths))
    ds = ds.repeat()
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds
ds = image_dataset(train2014_dir, (256, 256), 4, False)
image = ds.make_one_shot_iterator().get_next('images')
# image is then fed to the network
This code always runs out of both host memory (32 GB) and GPU memory (11 GB), and the process gets killed. Here are the messages shown on the terminal.

I also noticed that the program gets stuck at sess.run(opt_op). What is wrong, and how can I fix it?
The problem is this line:
ds = ds.shuffle(buffer_size = len(all_image_paths))
The buffer that Dataset.shuffle() uses is an in-memory buffer, so by setting buffer_size to the number of images you are effectively trying to load the entire dataset into memory.

You have a couple of options (which you can combine) to fix this:

1. Reduce the buffer size to a much smaller number.

2. Move the shuffle() before the map(). This means the shuffle happens before the images are loaded, so only the filename strings are kept in the in-memory shuffle buffer rather than huge image tensors; see the sketch after this list.
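A minimal sketch of image_dataset with both fixes applied, assuming the same TF 1.x-style API used in the question (the cap of 1000 elements on the shuffle buffer is just an example value, not a recommendation):

from pathlib import Path
import tensorflow as tf

def image_dataset(filepath, image_size, batch_size, norm=True):
    def preprocess_image(image):
        image = tf.image.decode_jpeg(image, channels=3)
        image = tf.image.resize(image, image_size)
        if norm:
            image /= 255.0  # normalize to [0,1] range
        return image

    def load_and_preprocess_image(path):
        return preprocess_image(tf.read_file(path))

    all_image_paths = [str(f) for f in Path(filepath).glob('*')]
    path_ds = tf.data.Dataset.from_tensor_slices(all_image_paths)
    # Shuffle the lightweight path strings instead of decoded image tensors,
    # and cap the buffer at a modest size (1000 here is arbitrary).
    ds = path_ds.shuffle(buffer_size=min(len(all_image_paths), 1000))
    ds = ds.map(load_and_preprocess_image,
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.repeat()
    ds = ds.batch(batch_size)
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds

With this ordering, the shuffle buffer only ever holds filename strings, so its memory footprint stays small even if you keep buffer_size equal to the dataset size.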