tensorflow gpu - Can memory growth and memory limit be used together?

Cha*_*pat 6 python memory gpu tensorflow

The official TF documentation [1] describes two ways to control GPU memory allocation.

Memory growth lets TF grow its memory allocation as usage requires:

tf.config.experimental.set_memory_growth(gpus[0], True)

Virtual device configuration sets a hard memory limit:

tf.config.experimental.set_virtual_device_configuration(
  gpus[0],
  [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])


So, can these two mechanisms be combined, or are they mutually exclusive and contradictory?

In other words: can we set memory growth to true and at the same time impose a memory limit?

References

[1] https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth

mon*_*mon 0

It is tricky but possible.

Problem

First, it is not possible to set both growth control and the memory limit programmatically as shown below, because the memory growth control does not take effect.

from typing import (
    Optional,
    List
)
import tensorflow as tf


def set_both_growth_and_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return

    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu

            # Set memory growth control
            # Currently, memory growth needs to be the same across GPUs
            print(f"setting memory_growth: index:[{index}] gpu:{gpu}")
            tf.config.experimental.set_memory_growth(gpu, True)

            # Set memory limit
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )

        logical_gpus = tf.config.list_logical_devices('GPU')

    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] has been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

set_both_growth_and_limit(1024)
-----
setting memory_growth: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
setting memory_limit: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Before creating a tensor, check the GPU usage.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    405738      C   /home/user/venv/ml/bin/python3               80MiB |
+---------------------------------------------------------------------------------------+


Create a tensor. The memory growth control should prevent GPU memory from being allocated all the way up to the 1G limit, but that is not what happens.

x = tf.random.uniform([3, 3])

print("Is the Tensor on GPU #0:  ")
print(x.device.endswith('GPU:0'))

Check GPU usage. Memory growth control did not take effect; memory was allocated up to the limit.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    405738      C   /home/user/venv/ml/bin/python3      -----> 1106MiB |
+---------------------------------------------------------------------------------------+
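The 1106 MiB figure is consistent with TF grabbing the whole 1024 MiB pool at once on first use, on top of the roughly 80 MiB already resident before the tensor was created (assuming that baseline is pure CUDA-context overhead):

```python
baseline_mib = 80    # usage before the tensor, from the first nvidia-smi snapshot
limit_mib = 1024     # the configured memory limit, allocated in full

print(baseline_mib + limit_mib)  # 1104, within a few MiB of the observed 1106
```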

Setting memory growth control and the memory limit in separate steps does not work either.

from typing import (
    Optional,
    List
)
import tensorflow as tf


def list_logical_devices_both_set_control_and_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return

    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu

            # Set memory growth control
            # Currently, memory growth needs to be the same across GPUs
            print(f"setting memory_growth: index:[{index}] gpu:{gpu}")
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')

            # Set memory limit
            # Calling list_logical_devices above initializes the runtime,
            # which prevents the configuration below from being applied.
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
            logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] has been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

list_logical_devices_both_set_control_and_limit(1024)
-----
setting memory_growth: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
setting memory_limit: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Memory growth must be set before GPU [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] has been initialized

Solution

Therefore, use the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true for memory growth control, and use set_logical_device_configuration for the memory limit.

Set TF_FORCE_GPU_ALLOW_GROWTH

$ export TF_FORCE_GPU_ALLOW_GROWTH=true
$ echo ${TF_FORCE_GPU_ALLOW_GROWTH}
-----
true
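The variable can also be passed per invocation, so it only affects a single process. Here a one-liner simply echoes the value back; in practice, replace the `-c` snippet with the actual training script:

```shell
TF_FORCE_GPU_ALLOW_GROWTH=true python3 -c 'import os; print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])'
# prints: true
```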

Or

Jupyter notebook

%env TF_FORCE_GPU_ALLOW_GROWTH=true
%env TF_FORCE_GPU_ALLOW_GROWTH
-----
env: TF_FORCE_GPU_ALLOW_GROWTH=true

Set the memory limit

Make sure TF_FORCE_GPU_ALLOW_GROWTH is set.

import os
print(os.environ['TF_FORCE_GPU_ALLOW_GROWTH'])
-----
true
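Alternatively, the variable can be set from Python itself, as long as this happens before tensorflow is imported (a sketch; the import is shown commented out because the ordering is the whole point):

```python
import os

# Must run before `import tensorflow as tf`: TF reads the variable when it
# initializes the GPU runtime, which happens on first use after import.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])  # true
```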

Then call set_logical_device_configuration to set the memory limit.

from typing import (
    Optional,
    List
)
import tensorflow as tf



def set_memory_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return

    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu

            # Set memory limit
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
            logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] has been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

Verify growth control and memory limit

Create a tensor. Memory growth should now be controlled, so GPU memory is not allocated all the way up to the 1G limit.

set_memory_limit(1024)

x = tf.random.uniform([3, 3])
print("Is the Tensor on GPU #0:  ")
print(x.device.endswith('GPU:0'))
-----
Is the Tensor on GPU #0:  
True

Check the GPU and confirm that memory growth has taken effect.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    563664      C   /home/user/venv/ml/bin/python3          ---> 84MiB |
+---------------------------------------------------------------------------------------+

Create a tensor that exceeds the 1G limit (e.g. 2GB). It fails with an out-of-memory error.

GIGA = tf.pow(1024, 3)
x = tf.ones(shape=(GIGA, tf.int8.size, 2), dtype=tf.int8)
-----
2023-11-25 22:08:51.910042: W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.00GiB (rounded to 2147483648)requested by op Fill
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
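The 2.00 GiB in the log follows directly from the shape arithmetic (tf.int8.size is 1, and each int8 element occupies one byte); the same arithmetic shows the 500 MB tensor created next fits under the limit:

```python
GIGA = 1024 ** 3
MEGA = 1024 ** 2

# shape (GIGA, tf.int8.size, 2) -> element count, at 1 byte per int8 element
two_gib_bytes = GIGA * 1 * 2
# shape (MEGA, tf.int8.size, 500), the smaller tensor created next
small_bytes = MEGA * 1 * 500

print(two_gib_bytes)         # 2147483648, the "rounded to" figure in the log
print(small_bytes // MEGA)   # 500 (MiB), comfortably under the 1024 MiB limit
```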

Create a 500MB tensor, which is below the limit.

MEGA = tf.pow(1024, 2)
x = tf.ones(shape=(MEGA, tf.int8.size, 500), dtype=tf.int8)
tf.shape(x)
-----
<tf.Tensor: shape=(3,), dtype=int32, numpy=array([1048576,       1,     500], dtype=int32)>

Check the GPU.

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    563664      C   /home/user/venv/ml/bin/python3         ---> 596MiB |
+---------------------------------------------------------------------------------------+

References

Allocate only a subset of the available memory, or grow the memory usage only as the process needs it. TensorFlow provides two methods to control this.

The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program runs and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released since that can lead to memory fragmentation. To limit a specific set of GPUs, use the tf.config.set_visible_devices method. Another way to enable this option is to set the environment variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

Use tf.config.set_logical_device_configuration to configure a virtual GPU device and set a hard limit on the total memory to allocate on the GPU.

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

If memory growth is enabled for a PhysicalDevice, the runtime initialization will not allocate all memory on the device.

Set the logical device configuration for a tf.config.PhysicalDevice. Once the runtime is initialized, a visible tf.config.PhysicalDevice will by default have a single tf.config.LogicalDevice associated with it. Specifying a list of tf.config.LogicalDeviceConfiguration objects allows multiple devices to be created on the same tf.config.PhysicalDevice.

Logical device configurations can be modified by calling this function as long as the runtime is uninitialized. After the runtime is initialized, calling this function raises a RuntimeError.

Return a list of logical devices created by the runtime. Calling tf.config.list_logical_devices triggers the runtime to configure any tf.config.PhysicalDevice visible to it, thereby preventing further configuration.