Using valgrind to spot errors in MPI code

Shi*_*bli 7 valgrind mpi

I have a serial code that works perfectly, but when I run it with mpirun -n 2 ./out it gives the following error:

./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090

I tried using valgrind like this:

valgrind --leak-check=yes mpirun -n 2 ./out

I get the following output:

==3494== Memcheck, a memory error detector
==3494== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3494== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3494== Command: mpirun -n 2 ./out
==3494== 
Grid_0/NACA0012.msh
Grid_0/NACA0012.msh
>>> Number of cells: 7734
>>> Number of cells: 7734
0.000000  0         1.470622e-02
*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 3497 RUNNING AT orhan
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
==3494== 
==3494== HEAP SUMMARY:
==3494==     in use at exit: 131,120 bytes in 2 blocks
==3494==   total heap usage: 1,064 allocs, 1,062 frees, 231,859 bytes allocated
==3494== 
==3494== LEAK SUMMARY:
==3494==    definitely lost: 0 bytes in 0 blocks
==3494==    indirectly lost: 0 bytes in 0 blocks
==3494==      possibly lost: 0 bytes in 0 blocks
==3494==    still reachable: 131,120 bytes in 2 blocks
==3494==         suppressed: 0 bytes in 0 blocks
==3494== Reachable blocks (those to which a pointer was found) are not shown.
==3494== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3494== 
==3494== For counts of detected and suppressed errors, rerun with: -v
==3494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

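I am not very experienced with valgrind, but as far as I can tell it reports no problems. Are there better valgrind options for finding the source of this particular error?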

Rob*_*ham 19

You can never go wrong with Jonathan Dursi's answer, but let me add that with more than one processor, reading valgrind output can be painful.

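Instead of writing the output to the console, dump it to a log file. Of course, you cannot dump both processes into the same log file. valgrind interprets '%p' in the file name as the process ID, so you get two (or more) log files: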

mpiexec -np 2 valgrind --leak-check=full \
    --show-reachable=yes --log-file=nc.vg.%p ./noncontig_coll2 -fname blah

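Once the run has finished, one quick way to see which process reported errors is to pull the last "ERROR SUMMARY" line out of each log. The following is only a minimal sketch: it assumes the nc.vg. prefix from the command above, and summarize_logs is a hypothetical helper name, not something valgrind provides.

```shell
#!/bin/sh
# Sketch: print the final "ERROR SUMMARY" line of each per-process valgrind log.
# summarize_logs is a hypothetical helper; pass it the --log-file prefix you used.
summarize_logs() {
    prefix=$1
    for log in "$prefix"*; do
        # Skip the unexpanded pattern when no log file matches.
        [ -f "$log" ] || continue
        # Memcheck normally ends its log with an "ERROR SUMMARY" line.
        summary=$(grep 'ERROR SUMMARY' "$log" | tail -n 1)
        printf '%s: %s\n' "$log" "${summary:-no ERROR SUMMARY line found}"
    done
}
```

Calling `summarize_logs nc.vg.` in the directory that holds the logs prints one line per process. A log without a summary line usually means that process was killed before valgrind could finish its report.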

Jon*_*rsi 18

Note that with the invocation above,

valgrind --leak-check=yes mpirun -n 2 ./out

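you are running valgrind on the program mpirun, which has presumably been extensively tested and works correctly, rather than on ./out, the program you know has a problem.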

To run valgrind on your test program, you need to do this:

mpirun -n 2 valgrind --leak-check=yes ./out

This uses mpirun to launch 2 processes, each of which runs valgrind --leak-check=yes ./out.
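The corrected argument order can be combined with the per-process log files suggested in the other answer. The sketch below is one hedged way to wrap that up: run_under_valgrind, the vg. log prefix, and the DRY_RUN switch are all made up for illustration, while the mpirun and valgrind options are the ones already used in this thread.

```shell
#!/bin/sh
# Sketch: start N MPI processes, each wrapped in its own valgrind instance.
# run_under_valgrind and DRY_RUN are hypothetical names used for illustration.
run_under_valgrind() {
    nprocs=$1
    shift
    # valgrind goes after mpirun so it wraps the program, not the launcher.
    # %p in --log-file is expanded by valgrind to the process ID.
    set -- mpirun -n "$nprocs" valgrind --leak-check=yes --log-file=vg.%p "$@"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Only show the command that would be executed.
        echo "$*"
    else
        "$@"
    fi
}
```

With DRY_RUN set to 1, `run_under_valgrind 2 ./out` prints `mpirun -n 2 valgrind --leak-check=yes --log-file=vg.%p ./out` without running anything, which makes it easy to check the ordering before launching a real job.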

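  • As a side note, running valgrind on MPI programs usually produces a lot of false positives to wade through, so it is always worth also running the program through valgrind serially. The serial run does not actually crash, but the bad memory access behind the crash is probably still happening there, just with less serious consequences. (2 upvotes)
  • Also worth noting is the valgrind manual page on running with MPI: https://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap (2 upvotes)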