混合双打时使用int和unsigned int之间的速度差异

lui*_*dro 19 c c++ performance x86

我有一个应用程序,内部循环的一部分基本上是:

double sum = 0;
for (int i = 0; i != N; ++i, ++data, ++x) sum += *data * x;
Run Code Online (Sandbox Code Playgroud)

如果x是unsigned int,那么代码需要3倍于int!

这是一个更大的代码库的一部分,但我把它归结为要点:

#include <iostream>                                      
#include <cstdlib>                                       
#include <vector>
#include <time.h>

typedef unsigned char uint8;

template<typename T>
double moments(const uint8* data, int N, T wrap) {
    T pos = 0;
    double sum = 0.;
    for (int i = 0; i != N; ++i, ++data) {
        sum += *data * pos;
        ++pos;
        if (pos == wrap) pos = 0;
    }
    return sum;
}

template<typename T>
const char* name() { return "unknown"; }

template<>
const char* name<int>() { return "int"; }

template<>
const char* name<unsigned int>() { return "unsigned int"; }

const int Nr_Samples = 10 * 1000;

template<typename T>
void measure(const std::vector<uint8>& data) {
    const uint8* dataptr = &data[0];
    double moments_results[Nr_Samples];
    time_t start, end;
    time(&start);
    for (int i = 0; i != Nr_Samples; ++i) {
        moments_results[i] = moments<T>(dataptr, data.size(), 128);
    }
    time(&end);
    double avg = 0.0;
    for (int i = 0; i != Nr_Samples; ++i) avg += moments_results[i];
    avg /= Nr_Samples;
    std::cout << "With " << name<T>() << ": " << avg << " in " << (end - start) << "secs" << std::endl;
}


int main() {
    std::vector<uint8> data(128*1024);
    for (int i = 0; i != data.size(); ++i) data[i] = std::rand();
    measure<int>(data);
    measure<unsigned int>(data);
    measure<int>(data);
    return 0;
}
Run Code Online (Sandbox Code Playgroud)

编译没有优化:

luispedro@oakeshott:/home/luispedro/tmp/so §g++  test.cpp    
luispedro@oakeshott:/home/luispedro/tmp/so §./a.out
With int: 1.06353e+09 in 9secs
With unsigned int: 1.06353e+09 in 14secs
With int: 1.06353e+09 in 9secs
Run Code Online (Sandbox Code Playgroud)

通过优化:

luispedro@oakeshott:/home/luispedro/tmp/so §g++  -O3  test.cpp
luispedro@oakeshott:/home/luispedro/tmp/so §./a.out
With int: 1.06353e+09 in 3secs
With unsigned int: 1.06353e+09 in 12secs
With int: 1.06353e+09 in 4secs
Run Code Online (Sandbox Code Playgroud)

我不明白为什么速度这么大的差异.我试着从生成的装配中找出它,但我无处可去.有人有什么想法?

这与硬件有关,还是gcc优化机制的限制?我打赌第二个.

我的机器是运行Ubuntu 9.10的Intel 32位.

编辑:自斯蒂芬问,这里是解编译源(来自-O3编译).我相信我得到了主循环:

int版本:

40: 0f b6 14 0b             movzbl (%ebx,%ecx,1),%edx
     sum += *data * pos;
44: 0f b6 d2                movzbl %dl,%edx
47: 0f af d0                imul   %eax,%edx
      ++pos;
4a: 83 c0 01                add    $0x1,%eax
      sum += *data * pos;
4d: 89 95 54 c7 fe ff       mov    %edx,-0x138ac(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
53: 31 d2                   xor    %edx,%edx
55: 3d 80 00 00 00          cmp    $0x80,%eax
5a: 0f 94 c2                sete   %dl
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
5d: 83 c1 01                add    $0x1,%ecx
      sum += *data * pos;
60: db 85 54 c7 fe ff       fildl  -0x138ac(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
66: 83 ea 01                sub    $0x1,%edx
69: 21 d0                   and    %edx,%eax
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6b: 39 f1                   cmp    %esi,%ecx
      sum += *data * pos;
6d: de c1                   faddp  %st,%st(1)
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6f: 75 cf                   jne    40
Run Code Online (Sandbox Code Playgroud)

未签名的版本:

50: 0f b6 34 13             movzbl (%ebx,%edx,1),%esi
      sum += *data * pos;
54: 81 e6 ff 00 00 00       and    $0xff,%esi
5a: 31 ff                   xor    %edi,%edi
5c: 0f af f0                imul   %eax,%esi
      ++pos;
5f: 83 c0 01                add    $0x1,%eax
      if (pos == wrap) pos = 0;
62: 3d 80 00 00 00          cmp    $0x80,%eax
67: 0f 94 c1                sete   %cl
  T pos = 0;
  double sum = 0.;
  for (int i = 0; i != N; ++i, ++data) {
6a: 83 c2 01                add    $0x1,%edx
      sum += *data * pos;
6d: 89 bd 54 c7 fe ff       mov    %edi,-0x138ac(%ebp)
73: 89 b5 50 c7 fe ff       mov    %esi,-0x138b0(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
79: 89 ce                   mov    %ecx,%esi
7b: 81 e6 ff 00 00 00       and    $0xff,%esi
      sum += *data * pos;
81: df ad 50 c7 fe ff       fildll -0x138b0(%ebp)
      ++pos;
      if (pos == wrap) pos = 0;
87: 83 ee 01                sub    $0x1,%esi
8a: 21 f0                   and    %esi,%eax
  for (int i = 0; i != N; ++i, ++data) {
8c: 3b 95 34 c7 fe ff       cmp    -0x138cc(%ebp),%edx
      sum += *data * pos;
92: de c1                   faddp  %st,%st(1)
  for (int i = 0; i != N; ++i, ++data) {
94: 75 ba                   jne    50
Run Code Online (Sandbox Code Playgroud)

这是-O3版本,这就是源线上下跳跃的原因.谢谢.

Ste*_*non 33

原因如下:许多常见架构(包括x86)都有硬件指令将signed int转换为double,但没有从unsigned到double的硬件转换,因此编译器需要在软件中合成转换.此外,Intel上唯一的无符号乘法是全宽乘法,而有符号乘法可以使用带符号的乘法低指令.

GCC从unsigned int到double的软件转换可能非常不理想(考虑到你观察到的减速幅度,几乎可以肯定),但是当使用有符号整数时,代码的预期行为会更快.

假设一个智能编译器,64位系统上的差异应该小得多,因为64位有符号整数 - >双转换可以用来有效地进行32位无符号转换.

编辑:说明一下,这个:

sum += *data * x;
Run Code Online (Sandbox Code Playgroud)

如果整数变量是有符号的,那么应该按照这些行编译成一些东西:

mov       (data),   %eax
imul      %ecx,     %eax
cvtsi2sd  %eax,     %xmm1
addsd     %xmm1,    %xmm0
Run Code Online (Sandbox Code Playgroud)

另一方面,如果整数变量是无符号的,cvtsi2sd则不能用于进行转换,因此需要软件解决方法.我希望看到这样的事情:

    mov       (data),   %eax
    mul       %ecx            // might be slower than imul
    cvtsi2sd  %eax,     %xmm1 // convert as though signed integer
    test      %eax,     %eax  // check if high bit was set
    jge       1f              // if it was, we need to adjust the converted
    addsd     (2^32),   %xmm1 // value by adding 2^32
1:  addsd     %xmm1,    %xmm0
Run Code Online (Sandbox Code Playgroud)

对于unsigned - > double转换,这将是"可接受的"codegen; 它可能很容易变得更糟.

所有这一切都假设浮点代码生成到SSE(我相信这是Ubuntu工具的默认值,但我可能是错的).

  • 这就是GCC即使在32位模式下所做的事情:将32位无符号扩展为64位有符号并将其转换为64位浮点加载指令. (3认同)