Apr*_*cot 5 python gpu tensorflow
我正在学习使用 Tensorflow 进行对象检测。为了加快训练过程,我采用了一个具有 4 个 GPU 的 AWS g3.16xlarge 实例。我正在使用以下代码运行训练过程:
export CUDA_VISIBLE_DEVICES=0,1,2,3
python object_detection/train.py --logtostderr --pipeline_config_path=/home/ubuntu/builder/rcnn.config --train_dir=/home/ubuntu/builder/experiments/training/
Run Code Online (Sandbox Code Playgroud)
在 rcnn.config 里面 - 我已经设置了batch-size = 1. 在运行时,我得到以下输出:
控制台输出
2018-11-09 07:25:50.104310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-11-09 07:25:50.104385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3
2018-11-09 07:25:50.104395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0: Y N N N
2018-11-09 07:25:50.104402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1: N Y N N
2018-11-09 07:25:50.104409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2: N N Y N
2018-11-09 07:25:50.104416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3: N N N Y
2018-11-09 07:25:50.104429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M60, pci bus id: 0000:00:1b.0, compute capability: 5.2)
2018-11-09 07:25:50.104439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla M60, pci bus id: 0000:00:1c.0, compute capability: 5.2)
2018-11-09 07:25:50.104446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
2018-11-09 07:25:50.104455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)
Run Code Online (Sandbox Code Playgroud)
当我运行时nvidia-smi,我得到以下输出:
nvidia-smi 输出
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 Off | 0000:00:1B.0 Off | 0 |
| N/A 52C P0 129W / 150W | 7382MiB / 7612MiB | 92% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M60 Off | 0000:00:1C.0 Off | 0 |
| N/A 33C P0 38W / 150W | 7237MiB / 7612MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla M60 Off | 0000:00:1D.0 Off | 0 |
| N/A 40C P0 38W / 150W | 7237MiB / 7612MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla M60 Off | 0000:00:1E.0 Off | 0 |
| N/A 34C P0 39W / 150W | 7237MiB / 7612MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 97860 C python 7378MiB |
| 1 97860 C python 7233MiB |
| 2 97860 C python 7233MiB |
| 3 97860 C python 7233MiB |
+-----------------------------------------------------------------------------+
Run Code Online (Sandbox Code Playgroud)
并**nvidia-smi dmon**提供以下输出:
# gpu pwr temp sm mem enc dec mclk pclk
# Idx W C % % % % MHz MHz
0 158 69 90 69 0 0 2505 1177
1 38 36 0 0 0 0 2505 556
2 38 45 0 0 0 0 2505 556
3 39 37 0 0 0 0 2505 556
Run Code Online (Sandbox Code Playgroud)
我对每个输出感到困惑。当我在程序识别 4 个不同 GPU 的可用性时读取控制台输出时,在 nvidia-smi 输出中,挥发性 GPU-Util 百分比仅显示第一个 GPU,其余的为零。然而,同一张表打印了底部所有 4 个 gpu 的内存使用情况。并且 nvidia-smi dmon 仅打印第一个 gpu 的 sm 值,而其他 gpu 的值为零。从这个博客我了解零dmon表示 GPU 是免费的。
我想了解的是,train.py 是否利用了我实例中的所有 4 个 GPU。如果它没有利用所有 GPU,我如何确保object_detection/train.py针对所有 GPU 优化 tensorflow。
检查它是否正在返回所有 GPU 的列表。
tf.test.gpu_device_name()
Run Code Online (Sandbox Code Playgroud)
返回 GPU 设备的名称(如果可用)或空字符串。
那么你可以做这样的事情来使用所有可用的 GPU。
# Creates a graph.
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
with tf.device(d):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))
Run Code Online (Sandbox Code Playgroud)
您会看到以下输出:
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K20m, pci bus
id: 0000:02:00.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla K20m, pci bus
id: 0000:03:00.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla K20m, pci bus
id: 0000:83:00.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla K20m, pci bus
id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/device:GPU:3
Const_2: /job:localhost/replica:0/task:0/device:GPU:3
MatMul_1: /job:localhost/replica:0/task:0/device:GPU:3
Const_1: /job:localhost/replica:0/task:0/device:GPU:2
Const: /job:localhost/replica:0/task:0/device:GPU:2
MatMul: /job:localhost/replica:0/task:0/device:GPU:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[ 44. 56.]
[ 98. 128.]]
Run Code Online (Sandbox Code Playgroud)
用于检查 GPU 是否已找到且可用于以下命令的 Python 代码tensorflow:
## Libraries import
import tensorflow as tf
## Test GPU
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
print('')
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
11468 次 |
| 最近记录: |