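We do a lot of GPGPU computing (mostly with CUDA, but some with OpenCL). Users regularly hit memory errors when they run their code, but only on one of our hosts, so I suspect one of its cards is faulty. Sometimes the error takes down the whole system; other times only the program crashes.
What is the easiest, quickest, and most thorough way to fully test a GPU for possible faults?
I know of a few programs from NVIDIA, some of them part of the CUDA SDK: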
deviceQuery
nvidia-smi
But I need something more thorough. Suggestions? Experiences?
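The de facto standard seems to be CUDA GPU Memtest. As @c2h5oh mentioned, it appears to be based on the memtest86 test patterns, so I am fairly confident it does a good job. It ran relatively quickly on the high-end GPUs I tested (about 30 minutes on a Quadro 6000 and about 20 minutes on a Tesla C2075). Unlike memtest86, it runs inside the operating system, so monitoring it works a bit differently. You will probably want to send stdout and stderr to files for later review. Consider running it like this, so that you can still look up what the tests found if you lose the terminal output: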
cuda_memtest 2>cuda_memtest.stderr 1>cuda_memtest.stdout &
tail -f cuda_memtest.stdout &
tail -f cuda_memtest.stderr &
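If you run the test often, the redirect-to-files approach can be wrapped in a small helper. This is only a sketch: `run_logged` and the `.status` file are names I made up for illustration, and nothing in it is specific to cuda_memtest.

```shell
#!/usr/bin/env bash
# run_logged PREFIX CMD [ARGS...]
# Hypothetical helper (not part of cuda_memtest): runs CMD with stdout and
# stderr captured in PREFIX.stdout and PREFIX.stderr, and records the exit
# status in PREFIX.status so it survives a lost terminal.
run_logged() {
    local prefix=$1
    shift
    local status=0
    "$@" >"${prefix}.stdout" 2>"${prefix}.stderr" || status=$?
    echo "${status}" >"${prefix}.status"
    return "${status}"
}
```

You could start it with `run_logged cuda_memtest cuda_memtest &` and then follow both files with `tail -f cuda_memtest.stdout cuda_memtest.stderr`.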
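You also need to make sure nobody else is using the system and/or the card while the test runs. You can put the GPU into exclusive-process compute mode, which lets only one process at a time hold a context on it (changing the mode requires root):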
nvidia-smi --compute-mode=EXCLUSIVE_PROCESS
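On a host with several cards you may want to lock one GPU, test it, and release it again. Here is a rough sketch of that sequence. `with_exclusive_gpu` and `maybe_run` are hypothetical names of my own, restoring the mode to `DEFAULT` is an assumption (your cards may have been in a different mode beforehand), and `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set so that the CUDA device index matches the index `nvidia-smi` uses.

```shell
#!/usr/bin/env bash
# Runs its arguments, or only prints them when DRY_RUN=1.
maybe_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# with_exclusive_gpu GPU_INDEX CMD [ARGS...]
# Hypothetical wrapper: puts one GPU into exclusive-process mode, runs CMD
# with only that GPU visible to CUDA, then switches the mode back.
with_exclusive_gpu() {
    local gpu=$1
    shift
    local status=0
    maybe_run nvidia-smi -i "${gpu}" --compute-mode=EXCLUSIVE_PROCESS || return 1
    maybe_run env CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="${gpu}" "$@" || status=$?
    # Assumption: DEFAULT was the mode before the test started.
    maybe_run nvidia-smi -i "${gpu}" --compute-mode=DEFAULT
    return "${status}"
}
```

Running `DRY_RUN=1 with_exclusive_gpu 0 cuda_memtest` first prints the three commands without touching the card, which is a cheap way to check them before running the real thing as root.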
Here is some output from sample runs on the Quadro and the Tesla, in case you are curious about what information the tests report:
[09/07/2012 11:56:22][hydro][0]:Running cuda memtest, version 1.2.2
[09/07/2012 11:56:23][hydro][0]:Warning: Getting serial number failed
[09/07/2012 11:56:23][hydro][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
[09/07/2012 11:56:23][hydro][0]:num_gpus=1
[09/07/2012 11:56:23][hydro][0]:Device name=Quadro 6000, global memory size=6441992192
[09/07/2012 11:56:23][hydro][0]:major=2, minor=0
[09/07/2012 11:56:24][hydro][0]:Attached to device 0 successfully.
[09/07/2012 11:56:24][hydro][0]:Allocated 6040 MB
[09/07/2012 11:56:24][hydro][0]:Test0 [Walking 1 bit]
[09/07/2012 11:56:30][hydro][0]:Test0 finished in 5.7 seconds
[09/07/2012 11:56:30][hydro][0]:Test1 [Own address test]
[09/07/2012 11:56:33][hydro][0]:Test1 finished in 3.5 seconds
[09/07/2012 11:56:33][hydro][0]:Test2 [Moving inversions, ones&zeros]
[09/07/2012 11:57:05][hydro][0]:Test2 finished in 32.3 seconds
[09/07/2012 11:57:05][hydro][0]:Test3 [Moving inversions, 8 bit pat]
[09/07/2012 11:57:37][hydro][0]:Test3 finished in 31.9 seconds
[09/07/2012 11:57:37][hydro][0]:Test4 [Moving inversions, random pattern]
[09/07/2012 11:57:53][hydro][0]:Test4 finished in 15.9 seconds
[09/07/2012 11:57:53][hydro][0]:Test5 [Block move, 64 moves]
[09/07/2012 11:57:59][hydro][0]:Test5 finished in 6.3 seconds
[09/07/2012 11:57:59][hydro][0]:Test6 [Moving inversions, 32 bit pat]
[09/07/2012 12:18:46][hydro][0]:Test6 finished in 1246.6 seconds
[09/07/2012 12:18:46][hydro][0]:Test7 [Random number sequence]
[09/07/2012 12:19:06][hydro][0]:Test7 finished in 19.8 seconds
[09/07/2012 12:19:06][hydro][0]:Test8 [Modulo 20, random pattern]
[09/07/2012 12:19:06][hydro][0]:test8[mod test]: p1=0x13472f5f, p2=0xecb8d0a0
[09/07/2012 12:20:34][hydro][0]:Test8 finished in 88.0 seconds
[09/07/2012 12:20:34][hydro][0]:Test10 [Memory stress test]
[09/07/2012 12:20:34][hydro][0]:Test10 with pattern=0x55f6c69858704128
[09/07/2012 12:21:11][hydro][0]:Test10 finished in 36.8 seconds
[09/07/2012 12:21:11][hydro][0]:Test0 [Walking 1 bit]
[09/07/2012 12:21:16][hydro][0]:Test0 finished in 5.8 seconds

[09/06/2012 18:49:07][hydro][0]:Running cuda memtest, version 1.2.2
[09/06/2012 18:49:10][hydro][0]:Warning: Getting serial number failed
[09/06/2012 18:49:10][hydro][0]:NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.41 Fri Apr 6 23:18:58 PDT 2012
[09/06/2012 18:49:10][hydro][0]:num_gpus=1
[09/06/2012 18:49:10][hydro][0]:Device name=Tesla C2075, global memory size=5636292608
[09/06/2012 18:49:10][hydro][0]:major=2, minor=0
[09/06/2012 18:49:11][hydro][0]:Attached to device 0 successfully.
[09/06/2012 18:49:11][hydro][0]:Allocated 5273 MB
[09/06/2012 18:49:11][hydro][0]:Test0 [Walking 1 bit]
[09/06/2012 18:49:22][hydro][0]:Test0 finished in 11.1 seconds
[09/06/2012 18:49:22][hydro][0]:Test1 [Own address test]
[09/06/2012 18:49:25][hydro][0]:Test1 finished in 3.1 seconds
[09/06/2012 18:49:25][hydro][0]:Test2 [Moving inversions, ones&zeros]
[09/06/2012 18:49:52][hydro][0]:Test2 finished in 27.4 seconds
[09/06/2012 18:49:52][hydro][0]:Test3 [Moving inversions, 8 bit pat]
[09/06/2012 18:50:20][hydro][0]:Test3 finished in 27.9 seconds
[09/06/2012 18:50:20][hydro][0]:Test4 [Moving inversions, random pattern]
[09/06/2012 18:50:34][hydro][0]:Test4 finished in 13.7 seconds
[09/06/2012 18:50:34][hydro][0]:Test5 [Block move, 64 moves]
[09/06/2012 18:50:39][hydro][0]:Test5 finished in 5.5 seconds
[09/06/2012 18:50:39][hydro][0]:Test6 [Moving inversions, 32 bit pat]
[09/06/2012 19:08:34][hydro][0]:Test6 finished in 1074.9 seconds
[09/06/2012 19:08:34][hydro][0]:Test7 [Random number sequence]
[09/06/2012 19:08:51][hydro][0]:Test7 finished in 17.1 seconds
[09/06/2012 19:08:51][hydro][0]:Test8 [Modulo 20, random pattern]
[09/06/2012 19:08:51][hydro][0]:test8[mod test]: p1=0x63136646, p2=0x9cec99b9
[09/06/2012 19:10:10][hydro][0]:Test8 finished in 78.4 seconds
[09/06/2012 19:10:10][hydro][0]:Test10 [Memory stress test]
[09/06/2012 19:10:10][hydro][0]:Test10 with pattern=0x26341d134a89ac2b
[09/06/2012 19:10:39][hydro][0]:Test10 finished in 29.0 seconds
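If you saved stdout to a file, the per-test timings can be pulled out of it afterwards. This is a small sketch that relies only on the "TestN finished in S seconds" lines shown in the output above; `summarize_memtest_log` is a name I made up, and it does not try to detect errors, since no failing run is shown here.

```shell
#!/usr/bin/env bash
# summarize_memtest_log [FILE...]
# Hypothetical helper: prints "TestN seconds" for every finished test found
# in the given log files (or stdin), followed by the summed run time.
summarize_memtest_log() {
    awk '
        / finished in [0-9.]+ seconds/ {
            line = $0
            # Drop the "[date time][host][gpu]:" prefix.
            sub(/^.*\]:/, "", line)
            # Remaining words: TestN finished in S seconds
            split(line, words, " ")
            total += words[4]
            printf "%s %.1f\n", words[1], words[4]
        }
        END { printf "total %.1f\n", total }
    ' "$@"
}
```

Called as `summarize_memtest_log cuda_memtest.stdout`, it makes it easy to see where the time goes. In the first Quadro pass above, Test6 alone accounts for 1246.6 of roughly 1487 seconds.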