TensorFlow Dataset API causes graph size to explode

Oha*_*eir 5 tensorflow

I have a very big dataset for training.

I'm using the Dataset API like so:

# Build the input pipeline from the in-memory lists of image paths and labels.
self._dataset = tf.contrib.data.Dataset.from_tensor_slices((self._images_list, self._labels_list))

# Decode each image, then batch, shuffle, and repeat indefinitely.
self._dataset = self._dataset.map(self.load_image)
self._dataset = self._dataset.batch(batch_size)
self._dataset = self._dataset.shuffle(buffer_size=shuffle_buffer_size)
self._dataset = self._dataset.repeat()

self._iterator = self._dataset.make_one_shot_iterator()

If I use only a small amount of the data for training, all is well. If I use all my data, TensorFlow crashes with this error: ValueError: GraphDef cannot be larger than 2GB.

It seems like TensorFlow tries to embed all the data in the graph instead of loading only the data it needs, but I'm not sure...
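A minimal sketch that seems to confirm this suspicion, assuming the data is held in NumPy arrays (the `images`/`labels` arrays below are made-up stand-ins for my real lists): serializing the default graph shows it growing with the size of the data passed to `from_tensor_slices`.

import numpy as np
import tensorflow as tf

# Made-up in-memory data standing in for self._images_list / self._labels_list.
images = np.random.rand(100000, 256).astype(np.float32)
labels = np.random.randint(0, 10, size=100000)

# from_tensor_slices() embeds the arrays into the graph as constant nodes...
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
iterator = dataset.make_one_shot_iterator()

# ...so the serialized GraphDef grows with the data and eventually hits the 2GB limit.
graph_size = len(tf.get_default_graph().as_graph_def().SerializeToString())
print("GraphDef size: %d bytes" % graph_size)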

Any advice would be great!

Update: I found a solution/workaround.

According to this post: Tensorflow Dataset API doubles graph protobuff filesize

I replaced make_one_shot_iterator() with make_initializable_iterator() and, of course, called the iterator initializer after creating the session:

init = tf.global_variables_initializer()
sess.run(init)
sess.run(train_data._iterator.initializer)
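For context, here is a minimal sketch of the pipeline-side change, using the same names as the snippet in the question; only the iterator construction differs:

# Same pipeline as before; only the iterator type changes.
self._dataset = tf.contrib.data.Dataset.from_tensor_slices((self._images_list, self._labels_list))
self._dataset = self._dataset.map(self.load_image)
self._dataset = self._dataset.batch(batch_size)
self._dataset = self._dataset.shuffle(buffer_size=shuffle_buffer_size)
self._dataset = self._dataset.repeat()

# make_initializable_iterator() instead of make_one_shot_iterator();
# the iterator must then be explicitly initialized in the session.
self._iterator = self._dataset.make_initializable_iterator()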

But I'm leaving the question open, since to me this seems like a workaround and not a solution...

dra*_*nie 3

https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays

Note that the above code snippet will embed the features and labels arrays in your TensorFlow graph as tf.constant() operations. This works well for a small dataset, but wastes memory (because the contents of the array will be copied multiple times) and can run into the 2GB limit for the tf.GraphDef protocol buffer. As an alternative, you can define the Dataset in terms of tf.placeholder() tensors, and feed the NumPy arrays when you initialize an iterator over the dataset.

Instead of using

dataset = tf.data.Dataset.from_tensor_slices((features, labels))

use

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
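The remaining step (a short sketch following the same guide page) is to create an initializable iterator and feed the real NumPy arrays when initializing it, so the data is never baked into the GraphDef:

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

sess = tf.Session()
# The arrays are fed at initialization time instead of being embedded
# in the graph as constants, so the GraphDef stays small.
sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                          labels_placeholder: labels})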