Unexplained RAM usage and potential memory leak when using tf.data.TFRecordDataset

str*_*160 5 python tensorflow tensorflow2.0 tensorflow2.x

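Background

We are fairly new to TensorFlow. We are working on a DL problem involving a video dataset. Because of the volume of data involved, we decided to preprocess the videos and store the frames as JPEGs in TFRecord files. We then plan to use tf.data.TFRecordDataset to feed the data to our model.

The videos have been processed into segments, each consisting of 16 frames, in a serialised tensor. Each frame is a 128*128 RGB image encoded as JPEG. Each serialised segment is stored in the TFRecords, along with some metadata, as a serialised tf.train.Example.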
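For a rough sense of scale, the figures above can be turned into a decoded size per segment. This is only a back-of-envelope sketch: it assumes each decoded value is one uint8 byte, the helper names are made up for illustration, and the batch size of 16 is just an example value.

```python
# Back-of-envelope size of decoded data, using the figures stated above.
FRAMES_PER_SEGMENT = 16   # frames in one segment
FRAME_HEIGHT = 128
FRAME_WIDTH = 128
CHANNELS = 3              # RGB
BYTES_PER_VALUE = 1       # assumption: uint8 after JPEG decoding


def decoded_segment_bytes():
    """Bytes occupied by one fully decoded segment."""
    return (FRAMES_PER_SEGMENT * FRAME_HEIGHT * FRAME_WIDTH
            * CHANNELS * BYTES_PER_VALUE)


def decoded_batch_bytes(batch_size):
    """Bytes occupied by one decoded batch of segments."""
    return batch_size * decoded_segment_bytes()


if __name__ == "__main__":
    seg = decoded_segment_bytes()
    print(f"one segment: {seg} bytes ({seg / 2**20:.2f} MiB)")
    batch = decoded_batch_bytes(16)  # example batch size
    print(f"batch of 16: {batch} bytes ({batch / 2**20:.2f} MiB)")
```

So a decoded segment is 786,432 bytes (0.75 MiB), and a batch of 16 decoded segments is 12 MiB. The JPEG-encoded records on disk will be smaller than this by an amount that depends on the content.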

TensorFlow version: 2.1

Code

Below is the code we use to create a tf.data.TFRecordDataset from the TFRecords. You can ignore the num and file fields.

import os
import math
import tensorflow as tf

# Corresponding changes are to be made here
# if the feature description in tf2_preprocessing.py
# is changed
feature_description = {
    'segment': tf.io.FixedLenFeature([], tf.string),
    'file': tf.io.FixedLenFeature([], tf.string),
    'num': tf.io.FixedLenFeature([], tf.int64)
}


def build_dataset(dir_path, batch_size=16, file_buffer=500*1024*1024,
                  shuffle_buffer=1024, label=1):
    '''Return a tf.data.Dataset based on all TFRecords in dir_path
    Args:
    dir_path: path to directory containing the TFRecords
    batch_size: size of batch ie #training examples per element of the dataset
    file_buffer: for TFRecords, size in bytes
    shuffle_buffer: #examples to buffer while shuffling
    label: target label for the example
    '''
    # glob pattern for files
    file_pattern = os.path.join(dir_path, '*.tfrecord')
    # stores shuffled filenames
    file_ds = tf.data.Dataset.list_files(file_pattern)
    # read from multiple files in parallel
    ds = tf.data.TFRecordDataset(file_ds,
                                 num_parallel_reads=tf.data.experimental.AUTOTUNE,
                                 buffer_size=file_buffer)
    # randomly draw examples from the shuffle buffer
    ds = ds.shuffle(buffer_size=shuffle_buffer,
                    reshuffle_each_iteration=True)
    # batch the examples
    # dropping remainder for now, trouble when parsing - adding labels
    ds = ds.batch(batch_size, drop_remainder=True)
    # parse the records into the correct types
    ds = ds.map(lambda x: _my_parser(x, label, batch_size),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.prefetch(tf.data.experimental.AUTOTUNE)
    return ds


def _my_parser(examples, label, batch_size):
    '''Parses a batch of serialised tf.train.Example(s)
    Args:
    examples: a batch of serialised tf.train.Example(s)
    Returns:
    a tuple (segment, label)
    where segment is a tensor of shape (#in_batch, #frames, h, w, #channels)
    '''
    # ex will be a tensor of serialised tensors
    ex = tf.io.parse_example(examples, features=feature_description)
    ex['segment'] = tf.map_fn(lambda x: _parse_segment(x),
                              ex['segment'], dtype=tf.uint8)
    # ignoring filename and segment num for now
    # returns a tuple (tensor1, tensor2)
    # tensor1 is a batch of segments, tensor2 is the corresponding labels
    return (ex['segment'], tf.fill((batch_size, 1), label))


def _parse_segment(segment):
    '''Parses a segment and returns it as a tensor
    A segment is a serialised tensor of a number of encoded jpegs
    '''
    # now a tensor of encoded jpegs
    parsed = tf.io.parse_tensor(segment, out_type=tf.string)
    # now a tensor of shape (#frames, h, w, #channels)
    parsed = tf.map_fn(lambda y: tf.io.decode_jpeg(y), parsed, dtype=tf.uint8)
    return parsed


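Problem

While training, our model crashed because it ran out of RAM. We investigated by running some tests and profiling memory with memory-profiler using the --include-children flag.

All of these tests were run (CPU only) by simply iterating over the dataset multiple times with the following code: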

count = 0
dir_path = 'some/path'
ds = build_dataset(dir_path, file_buffer=some_value)
for itr in range(100):
    print(itr)
    for itx in ds:
        count += 1

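The measurements in this question come from memory-profiler. As a dependency-free cross-check, the same iteration loop could log the process's peak resident set size after each pass, using the standard-library resource module (Unix only). This is a minimal sketch and not what was actually run: peak_rss_mib and iterate_and_log are hypothetical helper names, and range(1000) is a stand-in for the real dataset. Note that ru_maxrss is a high-water mark, so it can show the peak growing across passes but never shrinking.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def iterate_and_log(dataset, epochs):
    """Iterate `dataset` `epochs` times, recording peak RSS after each pass."""
    count = 0
    peaks = []
    for epoch in range(epochs):
        for _ in dataset:
            count += 1
        peaks.append(peak_rss_mib())
        print(f"epoch {epoch}: peak RSS {peaks[-1]:.1f} MiB")
    return count, peaks


if __name__ == "__main__":
    # Stand-in iterable; in the real test this would be build_dataset(...).
    iterate_and_log(range(1000), epochs=3)
```

If the peak keeps climbing pass after pass on a dataset that is the same size every time, that would be consistent with a leak. A peak that flattens out after the first pass or two would point more towards buffers simply filling up.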

The subset of TFRecords we are working with right now is about 3GB in total. We would prefer to use TF2.1, but we can also test with TF2.2.

According to the TF2 docs, file_buffer is in bytes.
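To put the two file_buffer values used in the trials into perspective, here is a rough upper-bound calculation. It rests on an assumption that is not confirmed anywhere in this question: that each file being read in parallel may hold its own buffer of up to file_buffer bytes. The reader counts are hypothetical too, since the code above leaves that choice to AUTOTUNE, and file_buffer_upper_bound is an illustrative name.

```python
MIB = 1024 * 1024


def file_buffer_upper_bound(file_buffer_bytes, num_readers):
    """Upper bound on read-buffer memory, ASSUMING one buffer per open reader."""
    return file_buffer_bytes * num_readers


if __name__ == "__main__":
    # num_readers values are hypothetical; the real count is not known
    # from the code in this question.
    for file_buffer in (500 * MIB, 1 * MIB):
        for num_readers in (1, 4, 16):
            total = file_buffer_upper_bound(file_buffer, num_readers)
            print(f"file_buffer={file_buffer // MIB:>3} MiB, "
                  f"readers={num_readers:>2}: up to {total / MIB:,.0f} MiB")
```

Under that assumption, 500 MiB buffers could add up to several GiB with only a handful of parallel readers, whereas 1 MiB buffers would stay negligible. Whether the buffers are really allocated this way would need to be checked against the TensorFlow source or documentation.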

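Trial 1: file_buffer = 500*1024*1024, TF2.1

Trial 2: file_buffer = 500*1024*1024, TF2.2. This one looks much better.

Trial 3: file_buffer = 1024*1024, TF2.1. We don't have a plot, but RAM maxed out at about 4.5GB.

Trial 4: file_buffer = 1024*1024, TF2.1, but with prefetch set to 10

I think there is a memory leak here, since we can see memory usage gradually increasing over time.

All of the trials below were run for only 50 iterations instead of 100.

Trial 5: file_buffer = 500*1024*1024, TF2.1, prefetch = 2, all other AUTOTUNE values set to 16.

Trial 6: file_buffer = 1024*1024, everything else the same as above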

Questions

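1. How does the value of file_buffer affect memory usage? Comparing Trial 1 and Trial 3, file_buffer was reduced by a factor of 500, but memory usage only dropped by half. Is the file buffer value really in bytes?
2. The parameters of Trial 6 looked promising, but trying to train the model with the same parameters failed because it ran out of memory again.
3. Is there a bug in TF2.1? Why is there such a big difference between Trial 1 and Trial 2?
4. Should we keep using AUTOTUNE or go back to constant values?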

I would be happy to run more tests with different parameters. Thanks in advance!

Asked:	5 years, 6 months ago
Viewed:	550 times
Active:	5 years, 6 months ago