tensorflow gpu - Can memory growth and memory limit be used together?

Cha*_*pat 6 python memory gpu tensorflow

The official TF documentation [1] describes two ways to control GPU memory allocation.

Memory growth lets TF grow its memory allocation as usage requires:

tf.config.experimental.set_memory_growth(gpus[0], True)

Virtual device configuration sets a hard memory limit:

tf.config.experimental.set_virtual_device_configuration(
  gpus[0],
  [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])


So, can these two mechanisms be combined, or are they mutually exclusive and contradictory?

In other words: can we set memory growth to true and at the same time impose a memory limit?

References

[1] https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

mon*_*mon 0

It is tricky but possible.

Problem

First, it is not possible to set both growth control and the memory limit programmatically as shown below, because the memory growth control does not take effect.

from typing import (
    Optional,
    List
)
import tensorflow as tf


def set_both_growth_and_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return

    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu

            # Set memory growth control
            # Currently, memory growth needs to be the same across GPUs
            print(f"setting memory_growth: index:[{index}] gpu:{gpu}")
            tf.config.experimental.set_memory_growth(gpu, True)

            # Set memory limit
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )

        logical_gpus = tf.config.list_logical_devices('GPU')

    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] has been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

set_both_growth_and_limit(1024)
-----
setting memory_growth: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
setting memory_limit: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Before creating a tensor, check the GPU usage.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    405738      C   /home/user/venv/ml/bin/python3               80MiB |
+---------------------------------------------------------------------------------------+


Create a tensor. The memory growth control should prevent GPU memory from being allocated all the way up to the 1G limit, but that is not what happens.

x = tf.random.uniform([3, 3])

print("Is the Tensor on GPU #0:  ")
print(x.device.endswith('GPU:0'))

Check GPU usage. Memory growth control did not take effect; memory was allocated up to the limit.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    405738      C   /home/user/venv/ml/bin/python3      -----> 1106MiB |
+---------------------------------------------------------------------------------------+
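The 1106 MiB figure is consistent with TF grabbing the whole 1024 MiB pool at once on first use, on top of the roughly 80 MiB already resident before the tensor was created (assuming that baseline is pure CUDA-context overhead):

```python
baseline_mib = 80    # usage before the tensor, from the first nvidia-smi snapshot
limit_mib = 1024     # the configured memory limit, allocated in full

print(baseline_mib + limit_mib)  # 1104, within a few MiB of the observed 1106
```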

Setting memory growth control and the memory limit in separate steps does not work either.

from typing import (
    Optional,
    List
)
import tensorflow as tf


def list_logical_devices_both_set_control_and_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return

    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu

            # Set memory growth control
            # Currently, memory growth needs to be the same across GPUs
            print(f"setting memory_growth: index:[{index}] gpu:{gpu}")
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')

            # Set memory limit
            # Calling list_logical_devices above initializes the runtime,
            # which prevents the configuration below from being applied.
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
            logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] has been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

list_logical_devices_both_set_control_and_limit(1024)
-----
setting memory_growth: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
setting memory_limit: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Memory growth must be set before GPU [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] has been initialized

Solution

Therefore, use the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true for memory growth control, and use set_logical_device_configuration for the memory limit.

Set TF_FORCE_GPU_ALLOW_GROWTH

$ export TF_FORCE_GPU_ALLOW_GROWTH=true
$ echo ${TF_FORCE_GPU_ALLOW_GROWTH}
-----
true
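The variable can also be passed per invocation, so it only affects a single process. Here a one-liner simply echoes the value back; in practice, replace the `-c` snippet with the actual training script:

```shell
TF_FORCE_GPU_ALLOW_GROWTH=true python3 -c 'import os; print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])'
# prints: true
```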

Or

Jupyter notebook

%env TF_FORCE_GPU_ALLOW_GROWTH=true
%env TF_FORCE_GPU_ALLOW_GROWTH
-----
env: TF_FORCE_GPU_ALLOW_GROWTH=true

Set the memory limit

Make sure TF_FORCE_GPU_ALLOW_GROWTH is set.

import os
print(os.environ['TF_FORCE_GPU_ALLOW_GROWTH'])
-----
true
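Alternatively, the variable can be set from Python itself, as long as this happens before tensorflow is imported (a sketch; the import is shown commented out because the ordering is the whole point):

```python
import os

# Must run before `import tensorflow as tf`: TF reads the variable when it
# initializes the GPU runtime, which happens on first use after import.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])  # true
```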

Then call set_logical_device_configuration to set the memory limit.

from typing import (
    Optional,
    List
)
import tensorflow as tf



def set_memory_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return

    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu

            # Set memory limit
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
            logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] has been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

Verify growth control and memory limit

Create a tensor. Memory growth should now be controlled, so GPU memory is not allocated all the way up to the 1G limit.

set_memory_limit(1024)

x = tf.random.uniform([3, 3])
print("Is the Tensor on GPU #0:  ")
print(x.device.endswith('GPU:0'))
-----
Is the Tensor on GPU #0:  
True

Check the GPU and confirm that memory growth has taken effect.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    563664      C   /home/user/venv/ml/bin/python3          ---> 84MiB |
+---------------------------------------------------------------------------------------+

Create a tensor that exceeds the 1G limit (e.g. 2GB). It fails with an out-of-memory error.

GIGA = tf.pow(1024, 3)
x = tf.ones(shape=(GIGA, tf.int8.size, 2), dtype=tf.int8)
-----
2023-11-25 22:08:51.910042: W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.00GiB (rounded to 2147483648)requested by op Fill
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
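The 2.00 GiB in the log follows directly from the shape arithmetic (tf.int8.size is 1, and each int8 element occupies one byte); the same arithmetic shows the 500 MB tensor created next fits under the limit:

```python
GIGA = 1024 ** 3
MEGA = 1024 ** 2

# shape (GIGA, tf.int8.size, 2) -> element count, at 1 byte per int8 element
two_gib_bytes = GIGA * 1 * 2
# shape (MEGA, tf.int8.size, 500), the smaller tensor created next
small_bytes = MEGA * 1 * 500

print(two_gib_bytes)         # 2147483648, the "rounded to" figure in the log
print(small_bytes // MEGA)   # 500 (MiB), comfortably under the 1024 MiB limit
```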

Create a 500MB tensor, which is below the limit.

MEGA = tf.pow(1024, 2)
x = tf.ones(shape=(MEGA, tf.int8.size, 500), dtype=tf.int8)
tf.shape(x)
-----
<tf.Tensor: shape=(3,), dtype=int32, numpy=array([1048576,       1,     500], dtype=int32)>

Check the GPU.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    563664      C   /home/user/venv/ml/bin/python3         ---> 596MiB |
+---------------------------------------------------------------------------------------+

References

Allocate only a subset of the available memory, or grow the memory usage only as the process needs it. TensorFlow provides two methods to control this.

The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program runs and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released since that can lead to memory fragmentation. To limit a specific set of GPUs, use the tf.config.set_visible_devices method. Another way to enable this option is to set the environment variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Use tf.config.set_logical_device_configuration to configure a virtual GPU device and set a hard limit on the total memory to allocate on the GPU.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

If memory growth is enabled for a PhysicalDevice, the runtime initialization will not allocate all memory on the device.

Set the logical device configuration for a tf.config.PhysicalDevice. Once the runtime is initialized, a visible tf.config.PhysicalDevice will by default have a single tf.config.LogicalDevice associated with it. Specifying a list of tf.config.LogicalDeviceConfiguration objects allows multiple devices to be created on the same tf.config.PhysicalDevice.

Logical device configurations can be modified by calling this function as long as the runtime is uninitialized. After the runtime is initialized, calling this function raises a RuntimeError.

Return a list of logical devices created by the runtime. Calling tf.config.list_logical_devices triggers the runtime to configure any tf.config.PhysicalDevice visible to it, thereby preventing further configuration.