How can I create multiple TFRecord files instead of making one big file and then splitting it?

Question

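I'm working with a fairly large time series dataset: I prepare SequenceExamples and then write them to a TFRecord. This produces one very large file (over 100 GB), but I would like to store it in chunks. I tried: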

file = '/path/to/tf_record_0.tfrecords'
file_index = 0

for record in dataset:
    # fill the time series window, prepare the sequence_example, etc.

    if os.path.exists(file) and os.path.getsize(file) > 123456789:
        file = file.replace(str(file_index), str(file_index + 1))
        file_index += 1

    with tf.io.TFRecordWriter(file) as writer:
        writer.write(sequence_example.SerializeToString())

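...but because TFRecordWriter opens the file the way Python's open(file, mode='w') does, it overwrites the file every time it enters the with block (besides being a very ugly solution), and from what I have read there is no way to change this behavior. Changing the path in file inside the with block obviously raises an error.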

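So my question is: is there a way to start the next TFRecord file once the current one reaches a certain size, while I am still looping over and processing my dataset? And is there any benefit to having smaller TFRecord files when the only bottleneck I face is running out of system memory? If I am correct, TensorFlow can read a single large file from disk without problems (although there may be other reasons why people prefer multiple files).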

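One thing I can think of is to keep some kind of buffer, a list of sequences that are ready to be saved, and to create and save a TFRecord whenever that buffer reaches a certain threshold.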
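The buffering idea could look roughly like the sketch below. It is only an illustration: `write_buffered`, `buffer_size`, and `writer_factory` are hypothetical names, and the writer is passed in as a parameter so that `tf.io.TFRecordWriter` (or anything else with `write` and `close`) can be plugged in. Note that the buffer itself lives in memory, so the threshold has to stay small if memory is tight.

```python
def write_buffered(serialized_records, path_template, buffer_size, writer_factory):
    """Collect serialized records in a list; write each full buffer to its own file.

    writer_factory is an assumed parameter: any callable that takes a path and
    returns an object with write(bytes) and close(), e.g. tf.io.TFRecordWriter.
    """
    paths = []
    buffer = []

    def flush():
        # Each flush creates a new numbered file, so nothing is overwritten.
        path = path_template.format(len(paths))
        writer = writer_factory(path)
        try:
            for item in buffer:
                writer.write(item)
        finally:
            writer.close()
        paths.append(path)
        buffer.clear()

    for record in serialized_records:
        buffer.append(record)
        if len(buffer) >= buffer_size:
            flush()
    if buffer:
        # Write the final, partially filled buffer.
        flush()
    return paths
```

With TensorFlow, a call might be `write_buffered(examples, '/path/to/tf_record_{}.tfrecords', 10000, tf.io.TFRecordWriter)`, where `examples` yields the output of `sequence_example.SerializeToString()`.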

Answer 1

TF_*_*ort 0

With TensorFlow 2.1.0, you could try the following approach for this kind of problem.

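import os

import tensorflow as tf

file = '/content/tmp/records/tf_record_{}.tfrecords'
os.makedirs(os.path.dirname(file), exist_ok=True)
file_index_count = 0

# Placeholder input and serializer, assumed here only so the snippet runs;
# substitute your own data and your own serialize_example.
array = list(range(100))

def serialize_example(values):
    feature = {'values': tf.train.Feature(int64_list=tf.train.Int64List(value=values))}
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

file_limit = 10
for index in range(file_limit):
    # The with block closes the writer, so each file is flushed to disk
    with tf.io.TFRecordWriter(file.format(file_index_count)) as tfrecord_writer:
        ####
        # Add your preprocessing code here,
        # e.g. split the array into 10 parts
        ####
        test_in = array[int(10 * index): int(10 * (index + 1))]
        serial_test_in = serialize_example(test_in)
        tfrecord_writer.write(serial_test_in)
    ###
    # Add your condition for advancing the file index here,
    # e.g. if size < totalsize
    ###
    file_index_count += 1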

This generates 10 TFRecord files, named tf_record_0.tfrecords through tf_record_9.tfrecords.
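If the goal is to roll over by size rather than to write a fixed number of files, one option is to keep a single writer open and swap it for a new one once enough bytes have been written. The sketch below is an assumption-laden illustration, not part of the TensorFlow API: `ShardedWriter`, `max_bytes`, and `writer_factory` are hypothetical names. It estimates the file size by counting the serialized bytes plus the 16 bytes of framing that the uncompressed TFRecord format adds per record, instead of calling `os.path.getsize` on a file that may not be flushed yet.

```python
class ShardedWriter:
    """Rolls over to a new numbered file once the current one reaches max_bytes."""

    # TFRecord framing per record: 8-byte length, 4-byte length CRC, 4-byte data CRC.
    RECORD_OVERHEAD = 16

    def __init__(self, path_template, max_bytes, writer_factory):
        self.path_template = path_template    # e.g. '/path/to/tf_record_{}.tfrecords'
        self.max_bytes = max_bytes
        self.writer_factory = writer_factory  # assumed: e.g. tf.io.TFRecordWriter
        self.index = 0
        self.bytes_written = 0
        self.writer = None

    def write(self, serialized):
        # Close the current file first if it has already reached the size limit.
        if self.writer is not None and self.bytes_written >= self.max_bytes:
            self.close()
            self.index += 1
        # Open the next file lazily, so no empty trailing file is created.
        if self.writer is None:
            self.writer = self.writer_factory(self.path_template.format(self.index))
            self.bytes_written = 0
        self.writer.write(serialized)
        self.bytes_written += len(serialized) + self.RECORD_OVERHEAD

    def close(self):
        if self.writer is not None:
            self.writer.close()
            self.writer = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

# Possible usage with TensorFlow (not executed here):
#   with ShardedWriter('/path/to/tf_record_{}.tfrecords', 123456789,
#                      tf.io.TFRecordWriter) as writer:
#       for record in dataset:
#           ...  # prepare sequence_example
#           writer.write(sequence_example.SerializeToString())
```

Because the size check happens before each write, a file can exceed `max_bytes` by at most one record, and records are never split across files.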
