rit*_*ter 18 c c++ fork system libc++
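On a Linux system I am trying to run a program at runtime by calling system(). The call returns with a non-zero status.
Calling WEXITSTATUS on the returned status gives "127".
According to the man page of system, this code indicates that /bin/sh could not be executed:
If /bin/sh could not be executed, the exit status will be that of a command that does exit(127).
I checked: /bin/sh is a link to bash. bash is there. I can execute it from the shell.
Now, how can I find out why /bin/sh could not be executed? Is there any kernel history or something like that?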
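A side note on reading that status: 127 is also what the shell itself returns when it cannot find the command it was asked to run, so the number alone does not tell the two cases apart. A minimal sketch of decoding the value returned by system, assuming a POSIX system; the helper name describe_system_status is made up for illustration and is not part of the code in question:

```cpp
#include <cstdlib>
#include <string>
#include <sys/wait.h>

// Hypothetical helper: turn the value returned by system() into readable text.
std::string describe_system_status(int ret) {
    // -1 means system() could not create or wait for the child at all.
    if (ret == -1) return "system() itself failed (fork or wait error)";
    if (WIFEXITED(ret)) {
        int code = WEXITSTATUS(ret);
        if (code == 127)
            return "exit 127: shell could not be executed, "
                   "or the shell could not find the command";
        return "exited, status=" + std::to_string(code);
    }
    if (WIFSIGNALED(ret))
        return "killed by signal " + std::to_string(WTERMSIG(ret));
    return "not recognized";
}
```

For example, describe_system_status(system("/no/such/program")) reports the 127 case even though /bin/sh itself started fine.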
Edit:
After the very helpful tip (see below), I ran strace -f -p <PID> on the process. This is what I get during the call to system:
Process 16080 detached
[pid 11779] <... select resumed> ) = ? ERESTARTNOHAND (To be restarted)
[pid 11774] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 127}], 0, NULL) = 16080
[pid 11779] --- SIGCHLD (Child exited) @ 0 (0) ---
[pid 11779] rt_sigaction(SIGCHLD, {0x2ae0ff898ae2, [CHLD], SA_RESTORER|SA_RESTART, 0x32dd2302d0}, <unfinished ...>
[pid 11774] rt_sigaction(SIGINT, {0x2ae1042070f0, [], SA_RESTORER|SA_SIGINFO, 0x32dd2302d0}, <unfinished ...>
[pid 11779] <... rt_sigaction resumed> {0x2ae0ff898ae2, [CHLD], SA_RESTORER|SA_RESTART, 0x32dd2302d0}, 8) = 0
[pid 11779] sendto(5, "a", 1, 0, NULL, 0 <unfinished ...>
[pid 11774] <... rt_sigaction resumed> NULL, 8) = 0
[pid 11779] <... sendto resumed> ) = 1
[pid 11779] rt_sigreturn(0x2 <unfinished ...>
[pid 11774] rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x32dd2302d0}, <unfinished ...>
[pid 11779] <... rt_sigreturn resumed> ) = -1 EINTR (Interrupted system call)
[pid 11779] select(16, [9 15], [], NULL, NULL <unfinished ...>
[pid 11774] <... rt_sigaction resumed> NULL, 8) = 0
[pid 11774] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 11774] write(1, "Problems calling nvcc jitter: ex"..., 49) = 49
[pid 11774] rt_sigaction(SIGINT, {0x1, [], SA_RESTORER, 0x32dd2302d0}, {0x2ae1042070f0, [], SA_RESTORER|SA_SIGINFO, 0x32dd2302d0}, 8) = 0
[pid 11774] rt_sigaction(SIGQUIT, {0x1, [], SA_RESTORER, 0x32dd2302d0}, {SIG_DFL, [], SA_RESTORER, 0x32dd2302d0}, 8) = 0
[pid 11774] rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
[pid 11774] clone(Process 16081 attached (waiting for parent)
Process 16081 resumed (parent 11774 ready)
child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x7fff0177ab68) = 16081
[pid 16081] rt_sigaction(SIGINT, {0x2ae1042070f0, [], SA_RESTORER|SA_SIGINFO, 0x32dd2302d0}, <unfinished ...>
[pid 11774] wait4(16081, Process 11774 suspended
<unfinished ...>
[pid 16081] <... rt_sigaction resumed> NULL, 8) = 0
[pid 16081] rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x32dd2302d0}, NULL, 8) = 0
[pid 16081] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 16081] execve("/bin/sh", ["sh", "-c", 0xdda1d98], [/* 58 vars */]) = -1 EFAULT (Bad address)
[pid 16081] exit_group(127) = ?
Process 11774 resumed
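When it comes to the call of /bin/sh, it says "Bad address". Why?
Edit:
Here is the whole part that involves the failing call to system (here the safe copy to a buffer is already in place):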
std::ostringstream jit_command;
jit_command << string(CUDA_DIR) << "/bin/nvcc -v --ptxas-options=-v ";
jit_command << "-arch=" << string(GPUARCH);
jit_command << " -m64 --compiler-options -fPIC,-shared -link ";
jit_command << fname_src << " -I$LIB_PATH/include -o " << fname_dest;
string gen = jit_command.str();
cout << gen << endl;
char* cmd = new(nothrow) char[gen.size()+1];
if (!cmd) __error_exit("no memory for jitter command");
strcpy(cmd,gen.c_str());
int ret;
if (ret=system(cmd)) {
    cout << "Problems calling nvcc jitter: ";
    if (WIFEXITED(ret)) {
        printf("exited, status=%d\n", WEXITSTATUS(ret));
    } else if (WIFSIGNALED(ret)) {
        printf("killed by signal %d\n", WTERMSIG(ret));
    } else if (WIFSTOPPED(ret)) {
        printf("stopped by signal %d\n", WSTOPSIG(ret));
    } else if (WIFCONTINUED(ret)) {
        printf("continued\n");
    } else {
        printf("not recognized\n");
    }
    cout << "Checking shell.. ";
    if(system(NULL))
        cout << "ok!\n";
    else
        cout << "nope!\n";
    __error_exit("Nvcc error\n");
}
delete[] cmd;
return true;
Output:
/usr/local/cuda/bin/nvcc -v --ptxas-options=-v -arch=sm_20 -m64 --compiler-options -fPIC,-shared -link bench_cudp_Oku2fm.cu -I$LIB_PATH/include -o bench_cudp_Oku2fm.o
Problems calling nvcc jitter: exited, status=127
Checking shell.. ok!
Edit (the first version of the code):
string gen = jit_command.str();
cout << gen << endl;
int ret;
if (ret=system(gen.c_str())) {
    ....
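The complexity of the string creation is not the problem here. As strace shows, the problem is a "Bad address". It is a legal string. A "Bad address" should not occur.
As far as I know, std::string::c_str() returns a const char * that might point to scratch space of libc++ where a read-only copy of the string might be kept.
Unfortunately the error is not really reproducible. The call to system succeeds several times before it fails.
I don't want to be hasty, but it smells like a bug in the kernel, libc, or the hardware.
Edit:
I produced a more verbose strace output (strace -f -v -s 2048 -e trace=process -p $!) of the failing execve system call:
First a succeeding call: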
[pid 2506] execve("/bin/sh", ["sh", "-c", "/usr/local/cuda/bin/nvcc -v --ptxas-options=-v -arch=sm_20 -m64 --compiler-options -fPIC,-shared -link /home/user/toolchain/kernels-empty/bench_cudp_U11PSy.cu -I$LIB_PATH/include -o /home/user/toolchain/kernels-empty/bench_cudp_U11PSy.o"], ["MODULE_VERSION_STACK=3.2.8", ... ]) = 0
Now the failing one:
[pid 17398] execve("/bin/sh", ["sh", "-c", 0x14595af0], <list of vars>) = -1 EFAULT (Bad address)
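Here <list of vars> is exactly the same in both calls. So it does not seem to be the list of environment variables that causes the bad address. As Chris Dodd mentioned, the third argv entry passed to execve is the raw pointer 0x14595af0, which strace thinks (and the kernel agrees) is invalid. strace does not recognize it as a string (so it prints the hex value rather than the string).
Edit:
I inserted a print of the pointer value cmd to see what the value of this pointer is in the parent process: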
string gen = jit_command.str();
cout << gen << endl;
char* cmd = new(nothrow) char[gen.size()+1];
if (!cmd) __error_exit("no memory for jitter command");
strcpy(cmd,gen.c_str());
cout << "cmd = " << (void*)cmd << endl;
int ret;
if (ret=system(cmd)) {
    cout << "failed cmd = " << (void*)cmd << endl;
    cout << "Problems calling nvcc jitter: ";
Output (for the failing call):
cmd = 0x14595af0
failed cmd = 0x14595af0
Problems calling nvcc jitter: exited, status=127
Checking shell.. ok!
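It is the same pointer value as the third argv entry in the strace output. (I updated the strace output above.)
Regarding the idea that the cmd pointer looks like a 32-bit value: I checked the value of the cmd pointer for a succeeding call. I can't see any difference in structure. This is one of the values of cmd when the call to system then succeeds: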
cmd = 0x145d4f20
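So the pointer is valid before the call to system. As the strace output above suggests, the child process (after the call to fork) receives the correct pointer value. But, for some reason, the pointer value is marked invalid in the child process.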
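Right now we think it's either:
Edit:
Meanwhile, let me post a workaround. It is so silly to be forced to implement something like this... but it works. The following block of code is executed if the call to system fails. It allocates new command strings and retries until it succeeds (well, not indefinitely).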
list<char*> listPtr;
int maxtry=1000;
do{
    char* tmp = new(nothrow) char[gen.size()+1];
    if (!tmp) __error_exit("no memory for jitter command");
    strcpy(tmp,gen.c_str());
    listPtr.push_back( tmp );
} while ((ret=system(listPtr.back())) && (--maxtry>0));
while(listPtr.size()) {
    delete[] listPtr.back();
    listPtr.pop_back();
}
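Edit:
I just saw that this workaround did not work in one particular run. It went all the way, 1000 attempts, all with newly allocated cmd command strings. All 1000 failed. Not only that: I tried on a different Linux host (same Linux/software configuration).
Taking this into account, one can probably rule out a hardware problem (it would have to be present on 2 physically different hosts). So is it a kernel bug?
Edit:
torek, I will try installing a modified version of the system call. Give me some time for that.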
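This is a strange one. strace understands that the arguments to execve are (pointers to) strings, so it prints the pointed-to strings unless a pointer is invalid, in which case it prints the raw hex value of the pointer. So the line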
[pid 16081] execve("/bin/sh", ["sh", "-c", 0xdda1d98], [/* 58 vars */]) = -1 EFAULT (Bad address)
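makes perfect sense: the third argv entry passed to execve is the raw pointer 0xdda1d98, which strace thinks (and the kernel agrees) is invalid. So the question is how an invalid pointer gets here. This should be cmd, which has just come back from new.
I would suggest putting the line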
printf("cmd=%p\n", (void*)cmd);
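just before the call to system, to find out what the C code thinks the pointer is.
Looking at the rest of the strace, it looks like you are running on a 64-bit system (judging from the pointers that are printed), while the invalid 0xdda1d98 looks like a 32-bit pointer, so it would seem to be some kind of 32/64-bit screwup (someone saving and restoring only 32 bits of a 64-bit register, or some such).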