Writing a tf.dataset back to TFRecord

Question

yuv*_*blr 6 tensorflow tensorflow-datasets tensorflow2.0

After creating a tf.data.Dataset, I want to write it out to TFRecords.

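One way is to iterate over the entire dataset and write each element to a TFRecord file after calling SerializeToString. But that is not the most efficient approach.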
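For reference, the manual approach described above might look roughly like the sketch below. The feature names "x" and "y", the toy dataset, and the helper serialize_example are assumptions made only for illustration; they are not from the original question.

```python
import os
import tempfile

import tensorflow as tf


def serialize_example(x, y):
    """Hypothetical helper: pack one (x, y) pair into a serialized tf.train.Example."""
    feature = {
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[x])),
        "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[y])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


# Toy dataset standing in for the dataset you already have (assumption).
dataset = tf.data.Dataset.from_tensor_slices(([0.5, 1.5, 2.5], [0, 1, 0]))

# Iterate over the dataset eagerly and write one record per element.
path = os.path.join(tempfile.mkdtemp(), "data.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for x, y in dataset.as_numpy_iterator():
        writer.write(serialize_example(float(x), int(y)))

# Read the file back to confirm how many records were written.
num_records = sum(1 for _ in tf.data.TFRecordDataset(path))
```

Every element passes through Python one at a time here, which is why this route tends to be slow for large datasets.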

Is there a simpler way to do this? Is there an API for it in TF 2.0?

Answer 1

nes*_*uno 6

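You can use TensorFlow Datasets (tfds): this library is not only a collection of ready-to-use tf.data.Dataset objects, but also a toolchain for converting raw data into TFRecords.

Adding a new dataset is straightforward if you follow the official guide. In short, you only need to implement the methods _info and _generate_examples.

In particular, _generate_examples is the method tfds uses to create the rows in the TFRecords. Every element yielded by _generate_examples is a dictionary, and each dictionary becomes one row in the TFRecord file.

For example (taken from the official documentation), the _generate_examples below is what tfds uses to save the TFRecords; each record has the fields "image_description", "image", and "label".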

def _generate_examples(self, images_dir_path, labels):
  # Read the input data out of the source files
  for image_file in tf.io.gfile.listdir(images_dir_path):
    ...
  with tf.io.gfile.GFile(labels) as f:
    ...

  # And yield examples as feature dictionaries
  for image_id, description, label in data:
    yield image_id, {
        "image_description": description,
        "image": "%s/%s.jpeg" % (images_dir_path, image_id),
        "label": label,
    }

In your case, you can simply take the tf.data.Dataset object you already have, loop over it (inside the _generate_examples method), and yield the rows of the TFRecord.
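A minimal sketch of that loop is shown below. To keep it runnable without a full tfds builder, it is written as a standalone generator; the name generate_examples_from_dataset, the toy dataset, and its feature names are assumptions for illustration. In a real builder the same loop would sit inside _generate_examples.

```python
import tensorflow as tf


def generate_examples_from_dataset(dataset):
    """Yield (key, feature_dict) pairs from an existing tf.data.Dataset.

    Hypothetical helper: in a tfds builder this loop would be the body
    of the _generate_examples method.
    """
    for index, element in enumerate(dataset.as_numpy_iterator()):
        # String tensors come back as bytes, so decode them to str.
        yield index, {
            "image_description": element["image_description"].decode("utf-8"),
            "label": int(element["label"]),
        }


# Toy dataset standing in for the dataset you already have (assumption).
dataset = tf.data.Dataset.from_tensor_slices({
    "image_description": ["a cat", "a dog"],
    "label": [0, 1],
})

examples = list(generate_examples_from_dataset(dataset))
```

The element index is used as the key only because the toy data has no natural identifier; the documentation example above uses image_id for the same purpose.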

This way, tfds takes care of the serialization, and you will find the TFRecords in the folder created for the dataset under ~/tensorflow_datasets.
