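After setting up a system with 2 Tesla K80 cards, I noticed when running nvidia-smi that one of the 4 GPUs is under heavy load, even though it reports "No running processes found". Why does this happen, and how can I fix it?
Here is the output from nvidia-smi: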
? compute-0-1: ~/> nvidia-smi
Mon Sep 26 14:48:00 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.77 Driver Version: 361.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:05:00.0 Off | 0 |
| N/A 34C P0 57W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:06:00.0 Off | 0 |
| N/A 26C P0 76W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:85:00.0 Off | 0 |
| N/A 33C P0 60W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:86:00.0 Off | 0 |
| N/A 24C P0 74W / 149W | 0MiB / 11441MiB | 71% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
This NVIDIA forum thread addresses the issue. To fix it, enable persistence mode:
sudo nvidia-smi -pm 1
After running this command, nvidia-smi gives the following output:
? compute-0-1: ~/> nvidia-smi
Mon Sep 26 14:55:21 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.77 Driver Version: 361.77 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 0000:05:00.0 Off | 0 |
| N/A 36C P8 27W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 0000:06:00.0 Off | 0 |
| N/A 28C P8 30W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 On | 0000:85:00.0 Off | 0 |
| N/A 37C P8 28W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 On | 0000:86:00.0 Off | 0 |
| N/A 27C P8 72W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
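To confirm the setting without reading the full table, the persistence state can be queried per GPU. The following is a minimal sketch, not part of the original answer: the helper name `gpus_without_persistence` is hypothetical, and it assumes `nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader` prints lines such as `0, Enabled` or `3, Disabled`.

```shell
#!/bin/sh
# Hypothetical helper (not from the original answer): reads CSV lines of the
# form "index, persistence_mode" on stdin and prints the index of every GPU
# whose persistence mode is anything other than "Enabled".
gpus_without_persistence() {
    awk -F', *' '$2 != "Enabled" { print $1 }'
}

# Query the driver only when nvidia-smi is installed on this machine.
# Assumption: the query output uses the values "Enabled" / "Disabled".
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
        | gpus_without_persistence
fi
```

If the script prints nothing, every GPU reports persistence mode as enabled. Note that `nvidia-smi -pm 1` does not survive a reboot, so a check like this can be useful after the node restarts.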