Tags: python, memory, gpu, tensorflow
The official TF documentation [1] describes two ways to control GPU memory allocation.
Memory growth lets TF grow its memory usage as needed:
tf.config.experimental.set_memory_growth(gpus[0], True)
Virtual device configuration sets a memory limit:
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
Can these two approaches be combined, or are they mutually exclusive and contradictory?
In other words: can we set memory growth to true while at the same time imposing a memory limit?
References
[1] https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
This is tricky, but possible.
First, it is not possible to set both growth control and a memory limit programmatically, as below, because the memory growth setting does not take effect.
from typing import (
    Optional,
    List
)
import tensorflow as tf

def set_both_growth_and_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return
    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu
            # Set memory growth control
            # Currently, memory growth needs to be the same across GPUs
            print(f"setting memory_growth: index:[{index}] gpu:{gpu}")
            tf.config.experimental.set_memory_growth(gpu, True)
            # Set memory limit
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
        logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] have been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

set_both_growth_and_limit(1024)
setting memory_growth: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
setting memory_limit: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Before creating a tensor, check the GPU usage.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 405738 C /home/user/venv/ml/bin/python3 80MiB |
+---------------------------------------------------------------------------------------+
Create a tensor. Memory growth control should prevent GPU memory from being allocated all the way up to the 1 GB limit, but it does not.
x = tf.random.uniform([3, 3])
print("Is the Tensor on GPU #0: ")
print(x.device.endswith('GPU:0'))
Check the GPU usage. Memory growth control has not taken effect: memory has been allocated up to the limit.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 405738 C /home/user/venv/ml/bin/python3 -----> 1106MiB |
+---------------------------------------------------------------------------------------+
Setting memory growth control and the memory limit in separate calls does not work either.
from typing import (
    Optional,
    List
)
import tensorflow as tf

def list_logical_devices_both_set_control_and_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return
    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu
            # Set memory growth control
            # Currently, memory growth needs to be the same across GPUs
            print(f"setting memory_growth: index:[{index}] gpu:{gpu}")
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            # Set memory limit
            # Calling list_logical_devices above initializes the runtime and
            # prevents the further configuration attempted by
            # set_logical_device_configuration below.
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
        logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] have been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err

list_logical_devices_both_set_control_and_limit(1024)
-----
setting memory_growth: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
setting memory_limit: index:[0] gpu:PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Memory growth must be set before GPU [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] have been initialized
Therefore, use the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true for memory growth control, and use set_logical_device_configuration for the memory limit.
Shell
$ export TF_FORCE_GPU_ALLOW_GROWTH=true
$ echo ${TF_FORCE_GPU_ALLOW_GROWTH}
-----
true
Or, in a Jupyter notebook:
%env TF_FORCE_GPU_ALLOW_GROWTH=true
%env TF_FORCE_GPU_ALLOW_GROWTH
-----
env: TF_FORCE_GPU_ALLOW_GROWTH=true
Make sure TF_FORCE_GPU_ALLOW_GROWTH is set in the Python process.
import os
print(os.environ['TF_FORCE_GPU_ALLOW_GROWTH'])
-----
true
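Note that the environment variable only takes effect if it is set before TensorFlow initializes its GPU devices; setting it at the very top of the script, before importing tensorflow, is the safest. A minimal sketch (the tensorflow import is shown commented out purely to make the ordering explicit):

```python
import os

# TF_FORCE_GPU_ALLOW_GROWTH is read when TensorFlow initializes the GPU
# devices, so set it before TensorFlow gets a chance to touch the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Only import TensorFlow after the variable is set:
# import tensorflow as tf

print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])  # → true
```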
Then call set_logical_device_configuration to set the memory limit.
from typing import (
    Optional,
    List
)
import tensorflow as tf

def set_memory_limit(
        memory_limit: Optional[int] = None,
):
    gpus: List[tf.config.PhysicalDevice] = tf.config.list_physical_devices('GPU')
    if not gpus:
        return
    _current: Optional[tf.config.PhysicalDevice] = None
    try:
        for index, gpu in enumerate(gpus):
            _current = gpu
            # Set memory limit
            print(f"setting memory_limit: index:[{index}] gpu:{gpu}")
            tf.config.set_logical_device_configuration(
                device=gpu,
                logical_devices=[
                    tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit)
                ]
            )
        logical_gpus = tf.config.list_logical_devices('GPU')
    except RuntimeError as err:
        print(f"Memory growth must be set before GPU [{_current}] have been initialized")
        raise err
    except ValueError as err:
        print(f"Invalid GPU device [{_current}]")
        raise err
Create a tensor. Memory growth is now in control, preventing GPU memory from being allocated up to the 1 GB limit.
set_memory_limit(1024)
x = tf.random.uniform([3, 3])
print("Is the Tensor on GPU #0: ")
print(x.device.endswith('GPU:0'))
-----
Is the Tensor on GPU #0:
True
Check the GPU and confirm that memory growth has taken effect.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 563664 C /home/user/venv/ml/bin/python3 ---> 84MiB |
+---------------------------------------------------------------------------------------+
Create a tensor that exceeds the 1 GB limit (e.g. 2 GB). It fails with an out-of-memory error.
GIGA = tf.pow(1024, 3)
x = tf.ones(shape=(GIGA, tf.int8.size, 2), dtype=tf.int8)
-----
2023-11-25 22:08:51.910042: W tensorflow/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.00GiB (rounded to 2147483648)requested by op Fill
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
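As a sanity check on the numbers: the shape (GIGA, tf.int8.size, 2) with dtype int8 (one byte per element, so tf.int8.size == 1) works out to exactly the 2 GiB the allocator reports, twice the 1024 MiB memory_limit. A quick back-of-the-envelope check in plain Python:

```python
GIGA = 1024 ** 3   # elements along the first axis
INT8_BYTES = 1     # tf.int8.size: bytes per int8 element

# shape = (GIGA, 1, 2) → number of elements, at 1 byte each:
tensor_bytes = GIGA * INT8_BYTES * 2 * INT8_BYTES
print(tensor_bytes)              # → 2147483648, the "rounded to 2147483648" in the log
print(tensor_bytes / (1024**2))  # → 2048.0 MiB, double the 1024 MiB limit
```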
Create a 500 MB tensor, below the limit.
MEGA = tf.pow(1024, 2)
x = tf.ones(shape=(MEGA, tf.int8.size, 500), dtype=tf.int8)
tf.shape(x)
-----
<tf.Tensor: shape=(3,), dtype=int32, numpy=array([1048576, 1, 500], dtype=int32)>
Check the GPU.
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 563664 C /home/user/venv/ml/bin/python3 ---> 596MiB |
+---------------------------------------------------------------------------------------+
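The 596 MiB reported by nvidia-smi is consistent with the 500 MiB tensor plus the runtime's baseline footprint (roughly the 84 MiB observed before any tensor was created), with the small remainder being CUDA/allocator overhead. A rough check:

```python
# Figures taken from the nvidia-smi output above.
baseline_mib = 84  # process footprint before creating any tensor

# The int8 tensor of shape (1024**2, 1, 500) at 1 byte per element:
tensor_mib = (1024**2 * 1 * 500) / (1024**2)
print(tensor_mib)                 # → 500.0
print(baseline_mib + tensor_mib)  # → 584.0, close to the observed 596 MiB;
                                  #   the difference is allocator overhead
```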
You may want to allocate only a subset of the available memory, or to grow the memory usage only as the process needs it. TensorFlow provides two methods to control this.
The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program runs and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released, since that can lead to memory fragmentation. Another way to enable this option is to set the environment variable TF_FORCE_GPU_ALLOW_GROWTH to true. This configuration is platform specific.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)
The second method is to configure a virtual GPU device with tf.config.set_logical_device_configuration and set a hard limit on the total memory to allocate on the GPU.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
    try:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
If memory growth is enabled for a PhysicalDevice, the runtime initialization will not allocate all of the memory on the device.
set_logical_device_configuration sets the logical device configuration for a tf.config.PhysicalDevice. Once the runtime is initialized, a visible tf.config.PhysicalDevice has, by default, a single tf.config.LogicalDevice associated with it. Specifying a list of tf.config.LogicalDeviceConfiguration objects allows multiple devices to be created on the same tf.config.PhysicalDevice.
Logical device configurations can be modified by calling set_logical_device_configuration as long as the runtime is uninitialized. Once the runtime is initialized, calling it raises a RuntimeError.
tf.config.list_logical_devices returns the list of logical devices created by the runtime. Calling it triggers the runtime to configure any tf.config.PhysicalDevice visible to it, thereby preventing further configuration.
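As an illustration of that last point, a single physical GPU can be split into several logical devices, each with its own memory_limit. A sketch under the assumption that at least one GPU is present and the runtime is not yet initialized (on a CPU-only machine the device list is empty and nothing is configured):

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Split the first physical GPU into two 512 MiB logical GPUs.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=512),
             tf.config.LogicalDeviceConfiguration(memory_limit=512)])
        # This call initializes the runtime; no further configuration
        # is possible afterwards.
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(logical_gpus), "logical GPUs on", len(gpus), "physical GPU(s)")
    except RuntimeError as e:
        # Raised if the runtime was already initialized.
        print(e)
```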