Understanding ResourceExhaustedError: OOM when allocating tensor with shape

Question

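I'm trying to implement the skip-thought model using tensorflow, and the current version is placed here.

Currently I'm using one GPU of my machine (2 GPUs in total), and the GPU info is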

2017-09-06 11:29:32.657299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:02:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB

However, I get an OOM when I try to feed data to the model. I tried to debug it as follows:

I run the following snippet right after sess.run(tf.global_variables_initializer())

    logger.info('Total: {} params'.format(
        np.sum([
            np.prod(v.get_shape().as_list())
            for v in tf.trainable_variables()
        ])))

and got 2017-09-06 11:29:51,333 INFO main main.py:127 - Total: 62968629 params, which is roughly 240 MB if they all use tf.float32. The output of tf.global_variables is

[<tf.Variable 'embedding/embedding_matrix:0' shape=(155229, 200) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'encoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'decoder/weights:0' shape=(200, 155229) dtype=float32_ref>,
 <tf.Variable 'decoder/biases:0' shape=(155229,) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'decoder/previous_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/kernel:0' shape=(400, 400) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/gates/bias:0' shape=(400,) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/kernel:0' shape=(400, 200) dtype=float32_ref>,
 <tf.Variable 'decoder/next_decoder/rnn/gru_cell/candidate/bias:0' shape=(200,) dtype=float32_ref>,
 <tf.Variable 'global_step:0' shape=() dtype=int32_ref>]
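As a sanity check, the logged total can be reproduced by multiplying out the shapes in this list. This is a minimal sketch in plain Python, assuming 4 bytes per float32 parameter and leaving out global_step (it is int32 and presumably not trainable); the `gru_cell` and `shapes` names are only for illustration.

```python
from math import prod

# Shapes copied from the variable list above.
# The three GRU cells (encoder, previous_decoder, next_decoder) share the same shapes.
gru_cell = [(400, 400), (400,), (400, 200), (200,)]
shapes = (
    [(155229, 200)]               # embedding/embedding_matrix
    + gru_cell                    # encoder
    + [(200, 155229), (155229,)]  # decoder/weights, decoder/biases
    + gru_cell * 2                # previous_decoder, next_decoder
)

total_params = sum(prod(s) for s in shapes)
total_mib = total_params * 4 / 2**20  # 4 bytes per float32 parameter

print(total_params)         # 62968629
print(round(total_mib, 1))  # 240.2
```

The parameters themselves are therefore a small fraction of the 11 GB card; almost all of them sit in the embedding matrix and the decoder output weights.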

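For training, I have a data array whose shape is (164652, 3, 30), namely sample_size x 3 x time_step; the 3 here refers to the previous sentence, the current sentence and the next sentence. This training data is about 57 MB and is stored in a loader. I then wrote a generator function to get the sentences, which looks like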

def iter_batches(self, batch_size=128, time_major=True, shuffle=True):

    num_samples = len(self._sentences)
    if shuffle:
        samples = self._sentences[np.random.permutation(num_samples)]
    else:
        samples = self._sentences

    batch_start = 0
    while batch_start < num_samples:
        batch = samples[batch_start:batch_start + batch_size]

        lens = (batch != self._vocab[self._vocab.pad_token]).sum(axis=2)
        y, x, z = batch[:, 0, :], batch[:, 1, :], batch[:, 2, :]
        if time_major:
            yield (y.T, lens[:, 0]), (x.T, lens[:, 1]), (z.T, lens[:, 2])
        else:
            yield (y, lens[:, 0]), (x, lens[:, 1]), (z, lens[:, 2])
        batch_start += batch_size

The training loop looks like

for epoch in range(num_epochs):
    batches = loader.iter_batches(batch_size=args.batch_size)
    try:
        (y, y_lens), (x, x_lens), (z, z_lens) = next(batches)
        _, summaries, loss_val = sess.run(
            [train_op, train_summary_op, st.loss],
            feed_dict={
                st.inputs: x,
                st.sequence_length: x_lens,
                st.previous_targets: y,
                st.previous_target_lengths: y_lens,
                st.next_targets: z,
                st.next_target_lengths: z_lens
            })
    except StopIteration:
        ...

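Then I got an OOM. If I comment out the whole try body (so that no data is fed), the script runs just fine.

I have no idea why I get an OOM at such a small data scale. With nvidia-smi I always get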

Wed Sep  6 12:03:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59                 Driver Version: 384.59                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P2    60W / 275W |  10623MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   43C    P2    62W / 275W |  10621MiB / 11171MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     32748    C   python3                                      10613MiB |
|    1     32748    C   python3                                      10611MiB |
+-----------------------------------------------------------------------------+

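I cannot see the actual GPU usage of my script, since tensorflow always grabs all the memory at the beginning. The actual problem here is that I don't know how to debug this.

I've read some posts about OOM on StackOverflow. Most of them happened when feeding a large test set to the model, and feeding the data in smaller batches avoids the problem. But I don't see why such a small combination of data and parameters fails on my 11 GB 1080Ti, since the error is just an attempt to allocate a matrix of size [3840 x 155229]. (This is the output matrix of the decoder: 3840 = 30 (time_steps) x 128 (batch_size), and 155229 is vocab_size.)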

2017-09-06 12:14:45.787566: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ********************************************************************************************xxxxxxxx
2017-09-06 12:14:45.787597: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229]
2017-09-06 12:14:45.788735: W tensorflow/core/framework/op_kernel.cc:1158] Resource exhausted: OOM when allocating tensor with shape[3840,155229]
     [[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]]
2017-09-06 12:14:45.790453: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2857 get requests, put_count=2078 evicted_count=1000 eviction_rate=0.481232 and unsatisfied allocation rate=0.657683
2017-09-06 12:14:45.790482: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3840,155229]
     [[Node: decoder/previous_decoder/Add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](decoder/previous_decoder/MatMul, decoder/biases/read)]]
     [[Node: GradientDescent/update/_146 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2166_GradientDescent/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:

Any help would be appreciated. Thanks in advance.

Answer 1

Den*_*ker 17

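Let's take the issues one by one:

About tensorflow pre-allocating all memory: you can use the following snippet to let tensorflow allocate memory only when it is needed, so that you can see how the usage develops.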

gpu_options = tf.GPUOptions(allow_growth=True)
session = tf.InteractiveSession(config=tf.ConfigProto(gpu_options=gpu_options))

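This works equally well with tf.Session() instead of tf.InteractiveSession() if you prefer.

Second thing, about the sizes: since there is no information about the size of your network, we cannot estimate what is going wrong. However, you can debug the network step by step. For example, create a network with only one layer, get its output, create a session, feed values once, and look at how much memory you consume. Repeat this debugging session, adding layers, until you see the point where you run out of memory.

Please be aware that a 3840 x 155229 output is a really big output. It means roughly 600M values, which at 4 bytes each (float32) is about 2.22 GB for that one layer alone. If you have any other layers of similar size, they all add up and fill your GPU memory quickly.

Also, this is only for the forward pass. If you use this layer for training, backpropagation and the ops added by the optimizer will roughly multiply this size by 2. So, for training, you consume ~5 GB just for the output layer.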
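The arithmetic behind these estimates can be reproduced in a few lines. This is a rough sketch, assuming float32 activations (4 bytes each) and treating the factor of 2 for training as a rule of thumb rather than an exact figure; `tensor_gib` is a hypothetical helper name.

```python
def tensor_gib(shape, bytes_per_element=4):
    """Size in GiB of a dense tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element / 2**30

logits_shape = (30 * 128, 155229)  # (time_steps * batch_size, vocab_size)
forward = tensor_gib(logits_shape)

print(round(forward, 2))      # 2.22
print(round(2 * forward, 2))  # 4.44, forward tensor plus a same-sized gradient
```

The log names decoder/previous_decoder/Add as the failing op, and its input decoder/previous_decoder/MatMul has the same shape. The model also has a next_decoder, so several tensors of this size may be alive at once.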

I suggest you revise your network and try to reduce the batch size or the parameter count so that your model fits on the GPU.

The new TF 2.0 way of allocating memory is: `gpus = tf.config.experimental.list_physical_devices('GPU')`, then `for gpu in gpus: tf.config.experimental.set_memory_growth(gpu, True)` (2 upvotes)

Answer 2

roc*_*yne 7

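This may not make sense technically, but after experimenting for a while, this is what I found.

Environment: Ubuntu 16.04

When you run the command

nvidia-smi

you get the total memory consumption of the installed Nvidia graphics card. An example is shown in the figure.

When you run your neural network, the consumption may change.

The memory consumption is usually attributed to python. For some strange reason, if this process fails to terminate successfully, the memory is never freed. If you then try to run another instance of your neural network application, you will get a memory allocation error. The hard way is to find the process by its process ID and terminate it. The easy way is to restart your computer and try again. If the error is caused by a bug in your code, however, neither will help.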
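To locate a leftover process without rebooting, `nvidia-smi` can list compute processes in CSV form with `--query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits`. The sketch below only parses output of that kind. The sample text is made up to mirror the process table in the question, `parse_compute_apps` is a hypothetical helper, and it assumes the memory column is a plain integer in MiB.

```python
import csv
import io

def parse_compute_apps(text):
    """Parse 'pid, process_name, used_memory' CSV rows into dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, used_mib = fields
        rows.append({"pid": int(pid), "name": name, "used_mib": int(used_mib)})
    return rows

# Illustrative sample, modelled on the nvidia-smi table in the question.
sample = "32748, python3, 10613\n32748, python3, 10611\n"

for row in parse_compute_apps(sample):
    print(row["pid"], row["name"], row["used_mib"])
```

On a live machine the text would come from running the command above (for example through `subprocess.run(..., capture_output=True, text=True).stdout`), and a stale PID could then be stopped with `kill`.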

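Terminating the process with 'kill' worked for me, so that is the easiest way. (2 upvotes)
Ever since this answer was posted, I've been doing a lot of "up arrow + Enter" in the terminal while my ML model is training... thanks! (2 upvotes)
@Acy try `watch nvidia-smi` :) (2 upvotes)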

Answer 3

Jib*_*hew 5

You are running out of memory. You can reduce the batch size, which will slow down training but lets the data fit.
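To get a feel for how far the batch size has to drop, the size estimate for the decoder output can be inverted. This is a rough sketch, assuming float32, the question's 30 time steps and 155229-word vocabulary, and a budget you choose for how many bytes a single output tensor may occupy; `max_batch_size` is an illustrative helper, not a TensorFlow API.

```python
def max_batch_size(budget_bytes, time_steps=30, vocab_size=155229, bytes_per_element=4):
    """Largest batch size whose [time_steps * batch, vocab_size] tensor fits the budget."""
    bytes_per_sample = time_steps * vocab_size * bytes_per_element
    return budget_bytes // bytes_per_sample

one_gib = 2**30
print(max_batch_size(one_gib))  # 57
```

This counts only one tensor of that shape. Training keeps several of them alive, so the usable batch size is likely smaller.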

Answer 4

Mar*_*kus 5

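I know your question is about tensorflow. Anyway, my use case was Keras with the tensorflow backend, and it produced the same OOM error.

My solution with Keras on the tf backend was to use Keras' fit_generator() method. Before that I had only used the fit() method (which produced the OOM error).

fit_generator() is generally useful if you cannot fit your data into main memory, or if you need to use CPU resources in parallel with GPU training. For example, see this excerpt from the docs:

The generator is run in parallel to the model, for efficiency. For instance, this allows you to do real-time data augmentation on images on CPU in parallel to training your model on GPU.

Apparently, this also helps prevent the graphics card memory from overflowing.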
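The kind of generator that fit_generator() consumes can be sketched without Keras. This is a minimal sketch, assuming the data supports slicing (lists or NumPy arrays) and that the generator should loop forever, leaving the caller to decide how many batches make up an epoch. `batch_generator` is a hypothetical name, and this version is not thread-safe.

```python
from itertools import islice

def batch_generator(inputs, targets, batch_size):
    """Yield (inputs, targets) slices of at most batch_size items, forever."""
    n = len(inputs)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield inputs[start:end], targets[start:end]

xs = list(range(10))
ys = [x * x for x in xs]

# Take four batches: three cover the data once, the fourth wraps around.
for batch_x, batch_y in islice(batch_generator(xs, ys, batch_size=4), 4):
    print(batch_x, batch_y)
```

With Keras, such a generator would be passed to fit_generator() together with a steps_per_epoch value. Note that each step still needs GPU memory in proportion to the batch size, so the generator mainly helps when the whole data set was the thing that did not fit.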

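Edit: If you need some inspiration on how to develop your own (thread-safe) generator by extending Keras' Sequence class, which can then be used with fit_generator(), you can look at some information I provided in this Q&A.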

Asked:	8 years, 2 months ago
Viewed:	25836 times
Active:	6 years, 1 month ago