ead*_*ead 13 c optimization gcc x86-64 calling-convention
I'm trying to understand the implications of the System V AMD64 ABI's calling convention, looking at the following example:
struct Vec3{
double x, y, z;
};
struct Vec3 do_something(void);
void use(struct Vec3 * out){
*out = do_something();
}
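A Vec3 variable is of class MEMORY, so the caller (use) must allocate space for the returned variable and pass it as a hidden pointer to the callee (i.e. do_something). That is what we see in the resulting assembler (on godbolt, compiled with -O2):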
use:
pushq %rbx
movq %rdi, %rbx ;remember out
subq $32, %rsp ;memory for returned object
movq %rsp, %rdi ;hidden pointer to %rdi
call do_something
movdqu (%rsp), %xmm0 ;copy memory to out
movq 16(%rsp), %rax
movups %xmm0, (%rbx)
movq %rax, 16(%rbx)
addq $32, %rsp ;unwind/restore
popq %rbx
ret
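I understand that an alias of the pointer out (e.g. a global variable) could be used inside do_something, and thus out cannot be passed as the hidden pointer to do_something: if it were, out would be changed inside do_something rather than when do_something returns, so some calculations could go wrong. For example, this version of do_something would return faulty results: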
struct Vec3 global; //initialized somewhere
struct Vec3 do_something(void){
struct Vec3 res;
res.x = 2*global.x;
res.y = global.y+global.x;
res.z = 0;
return res;
}
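If out were an alias for the global variable global and were passed as the hidden pointer in %rdi, res would also be an alias of global, because the compiler would use the memory pointed to by the hidden pointer directly (a kind of RVO in C), without actually creating a temporary object and copying it on return. Then res.y would be 2*x+y (where x, y are the old values of global) and not x+y, as it would be for any other hidden pointer.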
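A minimal sketch of that failure mode, written as ordinary C rather than relying on what any compiler does with the real hidden pointer. The names do_something_lowered, use_with_temporary and use_forwarding are hypothetical; the explicit ret parameter stands in for the hidden pointer in %rdi.

```c
struct Vec3 { double x, y, z; };

struct Vec3 global;

/* Hypothetical model of do_something after lowering: `ret` plays the
   role of the hidden pointer and the result is written straight into it. */
void do_something_lowered(struct Vec3 *ret) {
    ret->x = 2 * global.x;
    ret->y = global.y + global.x;  /* reads global.x again, after the store above */
    ret->z = 0;
}

/* Model of what the compilers emit for use(): fresh slot, then a copy. */
void use_with_temporary(struct Vec3 *out) {
    struct Vec3 tmp;
    do_something_lowered(&tmp);
    *out = tmp;
}

/* Model of the hoped-for tail call: out itself becomes the return slot. */
void use_forwarding(struct Vec3 *out) {
    do_something_lowered(out);
}
```

With global set to {1, 10, 5}, use_with_temporary(&global) leaves y at 11 (x+y), while use_forwarding(&global) leaves y at 12 (2*x+y), because global.x has already been overwritten when it is read the second time.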
It was suggested that using restrict should solve the problem, i.e.
void use(struct Vec3 *restrict out){
*out = do_something();
}
because now the compiler knows that there is no alias of out that could be used in do_something, so the assembler could be as simple as this:
use:
jmp do_something ; %rdi is now the hidden pointer
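However, this is the case for neither gcc nor clang: the assembler stays unchanged (see on godbolt).
What prevents the use of out as the hidden pointer?
NB: The desired (or very similar) behavior can be achieved with a slightly different function signature: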
struct Vec3 use_v2(){
return do_something();
}
which results in (see on godbolt):
use_v2:
pushq %r12
movq %rdi, %r12
call do_something
movq %r12, %rax
popq %r12
ret
A function is allowed to assume its return-value object (pointed-to by a hidden pointer) is not the same object as anything else. i.e. that its output pointer (passed as a hidden first arg) doesn't alias anything.
You could think of this as the hidden first arg output pointer having an implicit restrict on it. (Because in the C abstract machine, the return value is a separate object, and the x86-64 System V ABI specifies that the caller provides space. x86-64 SysV doesn't give the caller license to introduce aliasing.)
Using an otherwise-private local as the destination (instead of separate dedicated space and then copying to a real local) is fine, but pointers that may point to something reachable another way must not be used. This requires escape analysis to make sure that a pointer to such a local hasn't been passed outside of the function.
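To make the distinction concrete, here is a small sketch with two callers. The names leaked, private_local and escaped_local are hypothetical, and this do_something body is invented for illustration only. The comments describe what a compiler is allowed to do under the reasoning above, not what any particular compiler is guaranteed to emit.

```c
#include <stddef.h>

struct Vec3 { double x, y, z; };

/* Hypothetical place where an escaped pointer ends up. */
static struct Vec3 *leaked = NULL;

struct Vec3 do_something(void) {
    /* The callee can reach the caller's object through `leaked`. */
    double base = leaked ? leaked->x : 0.0;
    struct Vec3 r = { base + 1.0, base + 2.0, base + 3.0 };
    return r;
}

/* v's address is never taken, so nothing else can name it: using v
   itself as the return-value object is unobservable. */
double private_local(void) {
    struct Vec3 v = do_something();
    return v.x + v.y + v.z;
}

/* &v has escaped before the call, so the callee may read v while
   building its result. The caller has to assume that and keep the
   return-value object separate from v. */
double escaped_local(void) {
    struct Vec3 v = { 10.0, 0.0, 0.0 };
    leaked = &v;
    v = do_something();
    leaked = NULL;
    return v.x + v.y + v.z;
}
```

Escape analysis is what separates the two cases: in private_local it can prove no other path to v exists, in escaped_local it cannot.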
I think the x86-64 SysV calling convention models the C abstract machine here by having the caller provide a real return-value object. It does not force the callee to invent a temporary whenever that would be needed to make sure all the writes to the retval happen after any other writes; that's not what "the caller provides space for the return value" means, IMO.
That's definitely how GCC and other compilers interpret it in practice, which is a big part of what matters in a calling convention that's been around this long (since a year or two before the first AMD64 silicon, so very early 2000s).
Here's a case where your optimization would break if it were done:
struct Vec3{
double x, y, z;
};
struct Vec3 glob3;
__attribute__((noinline))
struct Vec3 do_something(void) { // copy glob3 to retval in some order
return (struct Vec3){glob3.y, glob3.z, glob3.x};
}
__attribute__((noinline))
void use(struct Vec3 * out){ // copy do_something() result to *out
*out = do_something();
}
void caller(void) {
use(&glob3);
}
With the optimization you're suggesting, do_something's output object would be glob3. But it also reads glob3.
A valid implementation for do_something would be to copy elements from glob3 to (%rdi) in source order, which would do glob3.x = glob3.y before reading glob3.x as the 3rd element of the return value.
That is in fact exactly what gcc -O1 does (Godbolt compiler explorer):
do_something:
movq %rdi, %rax # tmp90, .result_ptr
movsd glob3+8(%rip), %xmm0 # glob3.y, glob3.y
movsd %xmm0, (%rdi) # glob3.y, <retval>.x
movsd glob3+16(%rip), %xmm0 # glob3.z, _2
movsd %xmm0, 8(%rdi) # _2, <retval>.y
movsd glob3(%rip), %xmm0 # glob3.x, _3
movsd %xmm0, 16(%rdi) # _3, <retval>.z
ret
Notice the glob3.y, <retval>.x store before the load of glob3.x.
So without restrict anywhere in the source, GCC already emits asm for do_something that assumes no aliasing between the retval and glob3.
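As a rough model in plain C (the function names and the explicit ret parameter are hypothetical stand-ins for the hidden pointer), the first function below mirrors the load/store order of the asm above. The second shows what a callee would have to do instead if the caller were allowed to pass an aliasing pointer: finish every load before the first store.

```c
struct Vec3 { double x, y, z; };

struct Vec3 glob3;

/* Same order as the gcc -O1 output: each element is stored as soon as
   it is loaded. Only correct if ret does not alias glob3. */
void do_something_in_order(struct Vec3 *ret) {
    ret->x = glob3.y;
    ret->y = glob3.z;
    ret->z = glob3.x;  /* sees the new x if ret == &glob3 */
}

/* Hypothetical alias-tolerant variant: all loads happen before any
   store, i.e. the callee invents the temporary itself. */
void do_something_alias_safe(struct Vec3 *ret) {
    double y = glob3.y, z = glob3.z, x = glob3.x;
    ret->x = y;
    ret->y = z;
    ret->z = x;
}
```

Starting from glob3 = {1, 2, 3}, passing &glob3 to the in-order version yields {2, 3, 2} instead of the expected {2, 3, 1}; the alias-tolerant version yields {2, 3, 1} either way, at the cost of holding every element in a register or on the stack first.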
I don't think using struct Vec3 *restrict out would help at all: that only tells the compiler that inside use() you won't access the *out object through any other name. Since use() doesn't reference glob3, it's not UB to pass &glob3 as an arg to a restrict version of use.
I may be wrong here; @M.M argues in comments that *restrict out might make this optimization safe because the execution of do_something() happens during use(). (Compilers still don't actually do it, but maybe they would be allowed to for restrict pointers.)
Update: Richard Biener said in the GCC missed-optimization bug-report that M.M is correct, and if the compiler can prove that the function returns normally (not exception or longjmp), the optimization is legal in theory (but still not something GCC is likely to look for):
If so, restrict would make this optimization safe if we can prove that do_something is "noexcept" and doesn't longjmp.
Yes.
There's a noexcept declaration, but there isn't (AFAIK) a nolongjmp declaration you can put on a prototype.
So that means it's only possible (even in theory) as an inter-procedural optimization when we can see the other function's body. Unless noexcept also means no longjmp.
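A sketch of why the "returns normally" condition matters, again modelling the hidden pointer as an explicit parameter. All names here are hypothetical, and target is a file-scope object so its value after longjmp is well defined. In the source-level program, `*out = do_something();` never performs the assignment if do_something leaves via longjmp, so *out has to remain untouched.

```c
#include <setjmp.h>

struct Vec3 { double x, y, z; };

static jmp_buf env;
static struct Vec3 target;

/* Model of a callee that starts filling its return slot and then
   leaves via longjmp instead of returning. */
void do_something_lowered(struct Vec3 *ret) {
    ret->x = 99.0;
    longjmp(env, 1);
}

/* Separate return slot: the copy to *out is never reached. */
void use_with_temporary(struct Vec3 *out) {
    struct Vec3 tmp;
    do_something_lowered(&tmp);
    *out = tmp;
}

/* out forwarded as the return slot: the partial write lands in *out. */
void use_forwarding(struct Vec3 *out) {
    do_something_lowered(out);
}

/* Runs one variant and reports target.x after the non-local exit. */
double observe(void (*use_fn)(struct Vec3 *)) {
    target = (struct Vec3){1.0, 2.0, 3.0};
    if (setjmp(env) == 0)
        use_fn(&target);
    return target.x;
}
```

Here observe(use_with_temporary) returns 1.0, matching the source semantics, while observe(use_forwarding) returns 99.0. That visible difference is why the forwarding transformation would need proof that the callee cannot exit this way, even with restrict on out.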