How to merge multiple tfrecords files into one file?

Question

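My problem is that creating a single tfrecords file for my data would take about 15 days: the data has 500,000 pairs of templates, and each template is 32 frames (images). To save time, I have 3 GPUs, so I thought I could create three tfrecords files, one per GPU, and finish creating the tfrecords in 5 days. But when I searched for a way to merge these three files into one, I could not find a proper solution.

So, is there any way to merge these three files into one file? Or, given that I am using the Dataset API, is there any way to train my network by feeding it batches of examples drawn from all three tfrecords files?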

Answer 1

Mol*_*ins 9

Addressing the question title directly for anyone looking to merge multiple .tfrecord files:

The most convenient approach is to use the tf.data API (adapted from an example in the docs):

import tensorflow as tf

# Create a dataset from multiple .tfrecord files
# (dir1 ... dir4 stand for the paths to the individual files)
list_of_tfrecord_files = [dir1, dir2, dir3, dir4]
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)

# Write every record of the dataset to a single .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
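
If the files are uncompressed, another option is to concatenate them byte for byte, since a TFRecord file is a flat sequence of length-prefixed records with no file-level header. Below is a minimal sketch using only the standard library. The function name `concat_tfrecords` is an illustrative assumption, and it should not be relied on for GZIP- or ZLIB-compressed files without checking that the reader accepts the result:

```python
import shutil
from pathlib import Path


def concat_tfrecords(input_paths, output_path):
    """Append the raw bytes of each input file to output_path, in order.

    Assumes every input is an uncompressed TFRecord file.
    Returns the size of the merged file in bytes.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream in chunks so large files never sit fully in memory.
                shutil.copyfileobj(src, out)
    return Path(output_path).stat().st_size
```

A call would look like `concat_tfrecords(['part1.tfrecord', 'part2.tfrecord', 'part3.tfrecord'], 'merged.tfrecord')`, where the file names are placeholders. Records keep the order of the input list, so shuffling still has to happen at read time.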


However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single tensorflow dataset.

You may also refer to a longer discussion of multiple .tfrecord files on Data Science Stack Exchange.

Answer 2

hol*_*scn 6

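Since this question was asked two months ago, I assume you have already found a solution. For anyone who comes after: the answer is no, you do not need to create a single HUGE tfrecord file. Just use the new Dataset API: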

import os
import tensorflow as tf

dataset = tf.data.TFRecordDataset(filenames_to_read,
    compression_type=None,    # or 'GZIP', 'ZLIB' if you compressed your data
    buffer_size=10240,        # read buffer size in bytes; 0 means no buffering
    num_parallel_reads=os.cpu_count()  # or None to read the files sequentially
)

# Optionally prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)

# Decode each example
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())

dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...


For more details, check the documentation.
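
The `single_example_parser` passed to `map` is not shown in the answer. Here is a minimal sketch of what it might look like, assuming each record is a serialized `tf.train.Example` with a bytes feature `image_raw` and an integer feature `label`. Both feature names are hypothetical and must match whatever was used when the files were written:

```python
import tensorflow as tf

# Hypothetical schema: adjust keys and dtypes to match how the records were written.
FEATURE_SPEC = {
    "image_raw": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}


def single_example_parser(serialized_example):
    """Parse one serialized tf.train.Example into an (image_bytes, label) pair."""
    parsed = tf.io.parse_single_example(serialized_example, FEATURE_SPEC)
    return parsed["image_raw"], parsed["label"]
```

Any image decoding or reshaping into 32-frame templates would go inside this function, after the call to `tf.io.parse_single_example`.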

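In fact, yes, you are right. I found that the solution is simply to pass the list of file names to tf.data.TFRecordDataset(). I forgot to post that as an answer. However, on another, smaller dataset, I noticed that passing a single tfrecord file gives better accuracy than passing multiple tfrecords files, and I do not know why. I think the only difference between the two approaches is how the data gets shuffled. So do you think having one tfrecords file is better than using multiple tfrecords files? (2 upvotes)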
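
If shuffling is indeed the difference, one way to mix records across files is to shuffle the file list and interleave the reads before the record-level shuffle. This is a sketch only: the function name `mixed_record_dataset` and its parameters are assumptions, and nothing in this thread establishes that it closes the accuracy gap described in the comment:

```python
import tensorflow as tf


def mixed_record_dataset(filenames, shuffle_buffer, seed=None):
    """Read several TFRecord files with records interleaved across files, then shuffled."""
    files = tf.data.Dataset.from_tensor_slices(filenames)
    # Shuffle the file order so no file is always read first.
    files = files.shuffle(buffer_size=len(filenames), seed=seed)
    # Take one record from each file in turn instead of reading file after file.
    records = files.interleave(
        lambda path: tf.data.TFRecordDataset(path),
        cycle_length=len(filenames),
        block_length=1,
    )
    # Record-level shuffle on top of the interleaved stream.
    return records.shuffle(buffer_size=shuffle_buffer, seed=seed)
```

The returned dataset still yields serialized records, so it can be followed by the same `map`, `batch`, and `repeat` calls as in the answer.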

Asked:	7 years, 5 months ago
Viewed:	2876 times
Active:	5 years, 11 months ago