I have been debugging this program for 2 weeks. It is only 93 lines long, but I still cannot find the bug. Please help me.
The program runs fine on my laptop, but it hangs when I run it on my lab's clusters (the Shanghai Supercomputing Center and the Jinan Supercomputing Center in China).
The logic of the program is very simple. There are 2 MPI processes. One is the master (rank 0) and the other is the slave (rank 1). The master waits for requests on tag 0. The slave sends a message to the master on tag 0 every second and then waits for an ACK message. Once the master receives a request, it sends an ACK message back to the slave on tag 100.
The problem is that after a few seconds the program hangs. The master is stuck in MPI_Recv, waiting for a request on tag 0, and the slave is stuck in MPI_Ssend, trying to send a message to the master on tag 0. These two calls should match each other, but I don't know why the program hangs.
Some hints: the program does not hang in any of the following cases:
1. A sleep() call is added after pthread_create(&tid,&attr,master_server_handler,NULL); in the void *master_server(void *null_arg) function.
or
2. master_server_handler is created with the default (joinable) pthread attributes instead of the detached ones, i.e. pthread_create(&tid,&attr,master_server_handler,NULL); is replaced with pthread_create(&tid,NULL,master_server_handler,NULL);.
or
3. master_server_handler() is called directly instead of pthread_create(&tid,&attr,master_server_handler,NULL);.
or
4. The MPI_Ssend in void *master_server_handler(void *arg) is replaced with MPI_Send.
In each of these cases the program runs fine. All of these modifications can be found in the comments in the program below.
I have no idea why it hangs. I have tried both Open MPI and MPICH2, and the program hangs with both of them.
Any hints please...
If necessary, I can provide VPN access to my lab so you can log into the clusters yourself. (Email: 674322022@qq.com)
BTW: I enabled thread support when compiling Open MPI and MPICH2. For Open MPI the configure options were --with-threads=posix --enable-mpi-thread-multiple; for MPICH2 the option was --enable-threads.
The machines in my lab run CentOS. The output of uname -a is: Linux node5 2.6.18-238.12.1.el5xen #1 SMP Tue May 31 14:02:29 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux.
I run the program with: mpiexec -n 2 ./a.out
Below are the source code, the program's output, and the backtraces of the two processes when the program is stuck.
#include "stdio.h"
#include "pthread.h"
#include "stdlib.h"
#include "string.h"
#include "mpi.h"
void send_heart_beat();
void *heart_beat_daemon(void *null_arg);
void *master_server(void *null_arg);
void *master_server_handler(void *arg);
int main(int argc,char *argv[])
{
int p,id;
pthread_t tid;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&p);
MPI_Comm_rank(MPI_COMM_WORLD,&id);
if(id==0)
{
//master
pthread_create(&tid,NULL,master_server,NULL);
pthread_join(tid,NULL);
}
else
{
//slave
pthread_create(&tid,NULL,heart_beat_daemon,NULL);
pthread_join(tid,NULL);
}
MPI_Finalize();
return 0;
}
void *heart_beat_daemon(void *null_arg)
{
while(1)
{
sleep(1);
send_heart_beat();
}
}
void send_heart_beat()
{
char send_msg[5];
char ack_msg[5];
strcpy(send_msg,"AAAA");
MPI_Ssend(send_msg,5,MPI_CHAR,0,0,MPI_COMM_WORLD);
MPI_Recv(ack_msg,5,MPI_CHAR,0,100,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
}
void *master_server(void *null_arg)
{
char msg[5];
pthread_t tid;
pthread_attr_t attr;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr,PTHREAD_CREATE_DETACHED);
while(1)
{
MPI_Recv(msg,5,MPI_CHAR,1,0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
pthread_create(&tid,&attr,master_server_handler,NULL);
// sleep(2);
// master_server_handler(NULL);
// pthread_create(&tid,NULL,master_server_handler,fun_arg);
// pthread_join(tid,NULL);
}
}
void *master_server_handler(void *arg)
{
static int count;
char ack[5];
count ++;
printf("recved a msg %d\n",count);
strcpy(ack,"ACK:");
MPI_Ssend(ack,5,MPI_CHAR,1,100,MPI_COMM_WORLD);
// MPI_Send(ack,5,MPI_CHAR,1,100,MPI_COMM_WORLD);
}
recved a msg 1
recved a msg 2
recved a msg 3
recved a msg 4
recved a msg 5
recved a msg 6
recved a msg 7
recved a msg 8
recved a msg 9
recved a msg 10
recved a msg 11
recved a msg 12
recved a msg 13
recved a msg 14
recved a msg 15
(gdb) bt
#0 opal_progress () at runtime/opal_progress.c:175
#1 0x00002b17ed288f75 in opal_condition_wait (addr=<value optimized out>,
count=<value optimized out>, datatype=<value optimized out>, src=1, tag=0,
comm=0x601520, status=0x0) at ../../../../opal/threads/condition.h:99
#2 ompi_request_wait_completion (addr=<value optimized out>,
count=<value optimized out>, datatype=<value optimized out>, src=1, tag=0,
comm=0x601520, status=0x0) at ../../../../ompi/request/request.h:377
#3 mca_pml_ob1_recv (addr=<value optimized out>, count=<value optimized out>,
datatype=<value optimized out>, src=1, tag=0, comm=0x601520, status=0x0)
at pml_ob1_irecv.c:105
#4 0x00002b17ed1ef049 in PMPI_Recv (buf=0x2b17f2495120, count=5,
type=0x601320, source=1, tag=0, comm=0x601520, status=0x0) at precv.c:78
#5 0x0000000000400d75 in master_server (null_arg=0x0) at main.c:73
#6 0x0000003b5a00683d in start_thread () from /lib64/libpthread.so.0
#7 0x0000003b594d526d in clone () from /lib64/libc.so.6
(gdb) bt
#0 0x00002adff87ef975 in opal_atomic_cmpset_32 (btl=<value optimized out>, endpoint=<value optimized out>,
registration=0x0, convertor=0x124e46a8, order=0 '\000', reserve=32, size=0x2adffda74fe8, flags=3)
at ../../../../opal/include/opal/sys/amd64/atomic.h:85
#1 opal_atomic_lifo_pop (btl=<value optimized out>, endpoint=<value optimized out>, registration=0x0,
convertor=0x124e46a8, order=0 '\000', reserve=32, size=0x2adffda74fe8, flags=3)
at ../../../../opal/class/opal_atomic_lifo.h:100
#2 mca_btl_sm_prepare_src (btl=<value optimized out>, endpoint=<value optimized out>, registration=0x0,
convertor=0x124e46a8, order=0 '\000', reserve=32, size=0x2adffda74fe8, flags=3) at btl_sm.c:697
#3 0x00002adff8877678 in mca_bml_base_prepare_src (sendreq=0x124e4600, bml_btl=0x124ea860, size=5, flags=0)
at ../../../../ompi/mca/bml/bml.h:339
#4 mca_pml_ob1_send_request_start_rndv (sendreq=0x124e4600, bml_btl=0x124ea860, size=5, flags=0)
at pml_ob1_sendreq.c:815
#5 0x00002adff8869e82 in mca_pml_ob1_send_request_start (buf=0x2adffda75100, count=5,
datatype=<value optimized out>, dst=0, tag=0, sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x601520)
at pml_ob1_sendreq.h:363
#6 mca_pml_ob1_send (buf=0x2adffda75100, count=5, datatype=<value optimized out>, dst=0, tag=0,
sendmode=MCA_PML_BASE_SEND_SYNCHRONOUS, comm=0x601520) at pml_ob1_isend.c:119
#7 0x00002adff87d2be6 in PMPI_Ssend (buf=0x2adffda75100, count=5, type=0x601320, dest=0, tag=0,
comm=0x601520) at pssend.c:76
#8 0x0000000000400cf4 in send_heart_beat () at main.c:55
#9 0x0000000000400cb6 in heart_beat_daemon (null_arg=0x0) at main.c:44
#10 0x0000003b5a00683d in start_thread () from /lib64/libpthread.so.0
#11 0x0000003b594d526d in clone () from /lib64/libc.so.6
MPI provides four different levels of thread support: MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE. To be able to make MPI calls concurrently from different threads, you must initialize MPI with the MPI_THREAD_MULTIPLE level of thread support and make sure that the library actually provides that level:
int provided;

MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if (provided < MPI_THREAD_MULTIPLE)
{
    printf("Error: the MPI library doesn't provide the required thread level\n");
    MPI_Abort(MPI_COMM_WORLD, 0);
}
If you call MPI_Init instead of MPI_Init_thread, the library is free to choose whatever default thread support level its creators consider best. For Open MPI that is MPI_THREAD_SINGLE, i.e. no thread support. You can control the default level by setting the environment variable OMPI_MPI_THREAD_LEVEL, but this is not recommended; MPI_Init_thread should be used instead.
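Applied to the program in the question, this means the MPI_Init call in main() would be replaced by the same MPI_Init_thread/provided check before any threads are created. A minimal stand-alone sketch of that initialization is shown below; the rank variable and the printed messages are only illustrative and not part of the original program:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* Request full multi-threaded support instead of the implementation default */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_MULTIPLE)
    {
        /* A lower level was granted: concurrent MPI calls from several
           threads would not be safe, so give up immediately */
        printf("Error: MPI_THREAD_MULTIPLE not provided (got level %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 0);
    }

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: running with MPI_THREAD_MULTIPLE\n", rank);

    MPI_Finalize();
    return 0;
}

The thread level that was actually granted can also be queried later in the run with MPI_Query_thread(&provided), which is handy when the initialization code itself cannot be changed.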