My problem is that creating a single tfrecords file for my data would take about 15 days: it holds 500,000 pairs of templates, each template being 32 frames (images). To save time, and since I have 3 GPUs, I thought I could create three tfrecords files, one on each GPU, and finish creating the tfrecords in 5 days. But then I searched for a way to merge these three files into one file and could not find a proper solution.
So, is there a way to merge these three files into one file, or, given that I am using the Dataset API, is there a way to train my network by feeding it batches of examples drawn from all three tfrecords files?
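For reference, this is roughly how I planned to shard the writing work, using plain worker processes (a sketch only; serialize_pair stands for my own serialization code, and the shard file names are placeholders):

import multiprocessing
import tensorflow as tf

def write_shard(pairs, out_path):
    # Write one shard; each pair becomes one serialized tf.train.Example.
    with tf.io.TFRecordWriter(out_path) as writer:
        for pair in pairs:
            example = serialize_pair(pair)  # my own serialization function
            writer.write(example.SerializeToString())

# Split the 500,000 template pairs into three shards and write them in parallel.
shards = [pairs[0::3], pairs[1::3], pairs[2::3]]
paths = ['train-0.tfrecord', 'train-1.tfrecord', 'train-2.tfrecord']
with multiprocessing.Pool(3) as pool:
    pool.starmap(write_shard, zip(shards, paths))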
Addressing the question title directly for anyone looking to merge multiple .tfrecord files:
The most convenient approach would be to use the tf.data API (adapting an example from the docs):
import tensorflow as tf

# Create a dataset reading from multiple .tfrecord files
list_of_tfrecord_files = [file1, file2, file3, file4]  # paths to the .tfrecord files
dataset = tf.data.TFRecordDataset(list_of_tfrecord_files)

# Save the combined dataset to a single .tfrecord file
filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(dataset)
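Note that tf.data.experimental.TFRecordWriter.write returns an op in graph mode, so under TF 1.x you have to run it inside a session (sess.run(writer.write(dataset))); under TF 2.x eager execution the call writes the file immediately.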
However, as pointed out by holmescn, you'd likely be better off leaving the .tfrecord files as separate files and reading them together as a single TensorFlow dataset.
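A minimal sketch of that approach, assuming three shard files named as below, interleaving records from all of them into one dataset:

import tensorflow as tf

files = ['train-0.tfrecord', 'train-1.tfrecord', 'train-2.tfrecord']
# Cycle between the three shards so records from all files are mixed together.
dataset = tf.data.Dataset.from_tensor_slices(files).interleave(
    tf.data.TFRecordDataset,
    cycle_length=3,  # read all three files concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE)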
You may also refer to a longer discussion regarding multiple .tfrecord files on Data Science Stack Exchange.
Since the question was asked two months ago, I assume you have already found a solution. If not: the answer is no, you do not need to create a single HUGE tfrecord file. Just use the new Dataset API:
import os
import tensorflow as tf

dataset = tf.data.TFRecordDataset(filenames_to_read,
    compression_type=None,             # or 'GZIP' / 'ZLIB' if you compressed your data
    buffer_size=10240,                 # any buffer size you want; 0 means no buffering
    num_parallel_reads=os.cpu_count()  # None means the files are read sequentially
)
# Maybe you want to prefetch some data first.
dataset = dataset.prefetch(buffer_size=batch_size)
# Decode the example
dataset = dataset.map(single_example_parser, num_parallel_calls=os.cpu_count())
dataset = dataset.shuffle(buffer_size=number_larger_than_batch_size)
dataset = dataset.batch(batch_size).repeat(num_epochs)
...
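The pipeline above assumes you supply a single_example_parser function. A minimal sketch, assuming each record stores the raw frame bytes and a label (the feature keys and shapes are hypothetical; match them to whatever you actually wrote into the records):

import tensorflow as tf

# Assumed shape of one template: 32 frames of HEIGHT x WIDTH grayscale images.
HEIGHT, WIDTH = 64, 64

def single_example_parser(serialized_example):
    # The feature keys below are assumptions; use the keys from your writer.
    features = tf.io.parse_single_example(
        serialized_example,
        features={
            'frames': tf.io.FixedLenFeature([], tf.string),
            'label': tf.io.FixedLenFeature([], tf.int64),
        })
    frames = tf.io.decode_raw(features['frames'], tf.uint8)
    frames = tf.reshape(frames, [32, HEIGHT, WIDTH, 1])
    return frames, features['label']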
For more details, please check the documentation.