Why do gcc/clang use two 128-bit xmm registers to pass a single value?

clo*_*ead 11 c c++ assembly sse clang

I stumbled upon something that I would like to understand, because it is giving me a headache. I have the following code:

#include <stdio.h>
#include <smmintrin.h>

typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

vec __attribute__((noinline)) square(vec a)
{
    vec x = { .m = _mm_mul_ps(a.m, a.m) };
    return x;
}

int main(int argc, char *argv[])
{
    float f = 4.9;
    vec a = (vec){f, f, f, f};
    vec res = square(a); // ?
    printf("%f %f %f %f\n", res.v.x, res.v.y, res.v.z, res.v.w);
    return 0;
}

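Now, in my mind, when calling square, main should put the value of a into xmm0, so that the square function can do mulps xmm0, xmm0 and be done with it.

That is not what happens when I compile with clang or gcc. Instead, the first 8 bytes of a are placed in xmm0 and the next 8 bytes in xmm1, which makes the square function more complicated because it has to stitch the two halves back together.

Any idea why?

Note: this is with -O3 optimization.

After further research, it seems to be related to the union type. If the function takes a plain __m128, the generated code expects the value in a single register (xmm0). But given that both types fit in xmm0, I don't understand why the value is split across two half-used registers when the vec type is used.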
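To make the observation above concrete, here is a minimal sketch in which the non-inlined function takes and returns a bare __m128, while the union is kept only for field access at the call site. The names square_m128 and square_vec are hypothetical and not part of the original code; the sketch assumes an x86-64 target with SSE and a gcc- or clang-compatible compiler.

```c
#include <xmmintrin.h>

typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

/* Hypothetical variant of square(): a bare __m128 argument is expected
   in a single register (xmm0), as described in the question. */
__m128 __attribute__((noinline)) square_m128(__m128 a)
{
    return _mm_mul_ps(a, a);
}

/* Hypothetical inline wrapper: callers keep using the union type, but
   only the __m128 member crosses the real function-call boundary. */
static inline vec square_vec(vec a)
{
    vec r = { .m = square_m128(a.m) };
    return r;
}
```

Whether the wrapper is actually inlined depends on the compiler and optimization level, so it is worth checking the generated assembly rather than assuming it.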

Jes*_*ter 5

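The compilers are simply following the calling convention prescribed by the System V Application Binary Interface, AMD64 Architecture Processor Supplement, section 3.2.3 Parameter Passing.

The relevant points are: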

We first define a number of classes to classify arguments. The
classes are corresponding to AMD64 register classes and defined as:

SSE The class consists of types that fit into a vector register.

SSEUP The class consists of types that fit into a vector register and can
be passed and returned in the upper bytes of it.

The size of each argument gets rounded up to eightbytes.
The basic types are assigned their natural classes:
Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are
in class SSE.

The classification of aggregate (structures and arrays) and union types
works as follows:

If the size of the aggregate exceeds a single eightbyte, each is
classified separately. 

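Applying the above rules means that the x, y and z, w pairs of the embedded struct are each classified separately as class SSE, which in turn means they must be passed in two separate registers. The presence of the m member has no effect in this case; you could even remove it.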
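The classification can be sketched as a toy model. This is a simplified illustration of the per-eightbyte merge step from section 3.2.3, not compiler code: it leaves out the x87 classes and the post-merge cleanup, and the names arg_class, merge_class, xmm_registers and vec_union_registers are invented for this example.

```c
#include <stddef.h>

/* Subset of the ABI argument classes; x87 classes are omitted here. */
typedef enum { NO_CLASS, INTEGER, SSE, SSEUP, MEMORY } arg_class;

/* Merge the classes that two overlapping fields assign to one eightbyte. */
static arg_class merge_class(arg_class a, arg_class b)
{
    if (a == b) return a;
    if (a == NO_CLASS) return b;
    if (b == NO_CLASS) return a;
    if (a == MEMORY || b == MEMORY) return MEMORY;
    if (a == INTEGER || b == INTEGER) return INTEGER;
    return SSE; /* e.g. SSE merged with SSEUP */
}

/* Each SSE eightbyte starts a new vector register; an SSEUP eightbyte
   travels in the upper bytes of the previous one. */
static int xmm_registers(const arg_class *classes, size_t n)
{
    int regs = 0;
    for (size_t i = 0; i < n; i++)
        if (classes[i] == SSE)
            regs++;
    return regs;
}

/* Classify the two eightbytes of the vec union from the question. */
static int vec_union_registers(void)
{
    const arg_class from_struct[2] = { SSE, SSE };   /* {x, y} and {z, w} */
    const arg_class from_m128[2]   = { SSE, SSEUP }; /* __m128 m */
    arg_class merged[2];

    for (size_t i = 0; i < 2; i++)
        merged[i] = merge_class(from_struct[i], from_m128[i]);
    return xmm_registers(merged, 2);
}
```

In this model the second eightbyte of the union merges SSE (from the struct) with SSEUP (from m) and comes out as SSE, so the union needs two registers, whereas a bare __m128 stays SSE followed by SSEUP and needs only one.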