pthread 和 printf 的 C 性能都很差

Question

pthread 和 printf 的 C 性能都很差

Ant*_* R. 2 c performance multithreading pthreads

我正在使用大型数组测试 Linux 的 ac 代码以测量线程性能，当线程增加到最大内核（Intel 4770 为 8 个）时，应用程序可以很好地扩展，但这仅适用于我的代码的纯数学部分。

如果我为结果数组添加 printf 部分，那么即使重定向到文件，时间也会变得太大，从几秒到几分钟，而当 printf 这些数组应该只添加几秒时。

代码：

(gcc 7.5.0-Ubuntu 18.04)

没有 printf 循环：

gcc -O3 -m64 exp_multi.c -pthread -lm

Run Code Online (Sandbox Code Playgroud)

使用 printf 循环：

gcc -DPRINT_ARRAY -O3 -m64 exp_multi.c -pthread -lm

Run Code Online (Sandbox Code Playgroud)

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 8
static double xv[MAXSIZE];
static double yv[MAXSIZE];

/*    gcc -O3 -m64 exp_multi.c -pthread -lm    */


void* run(void *received_Val){
    int single_val = *((int *) received_Val);
    int r;
    int i;
    double p;

  for (r = 0; r < REIT; r++) {
    p = XXX + 0.00001*single_val*MAXSIZE/num_threads;

    for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i++) {

      xv[i]=p;
      yv[i]=exp(p);

    p += 0.00001;
    }

  }

    return 0;
}


int main(){
    int i;
      pthread_t tid[num_threads];


    for (i=0;i<num_threads;i++){
      int *arg = malloc(sizeof(*arg));
      if ( arg == NULL ) {
        fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
        exit(1);
      }

      *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }

    for(i=0; i<num_threads; i++)
    {
        pthread_join(tid[i], NULL);
    }

#ifdef PRINT_ARRAY
    for (i=0;i<MAXSIZE;i++){
        printf("x=%.20lf, e^x=%.20lf\n",xv[i],yv[i]);
    }
#endif

    return 0;
}

Run Code Online (Sandbox Code Playgroud)

pthread_create 中的 malloc 传递一个整数作为最后一个参数，如本文中所建议的。

我尝试过，没有成功，clang，添加 free(tid) 指令，避免使用 malloc 指令，反向循环，只有 1 个一维数组，1 个没有 pthread 的线程版本。

EDIT2：我认为 exp 函数是处理器资源密集型的，可能受到处理器一代实现的每核缓存或 SIMD 资源的影响。以下示例代码基于Stack Overflow 上发布的许可代码。

不管有没有 printf 循环，这段代码都运行得很快，math.h 的 exp 几年前已经得到了改进，它可以快 40 倍左右，至少在 Intel 4770 (Haswell) 上是这样，这个链接是一个已知的测试代码数学库与 SSE2 的比较，现在数学的 exp 速度应该接近针对浮点和 x8 并行计算优化的 AVX2 算法。

测试结果：expf 与其他 SSE2 算法（exp_ps）：

sinf .. ->            55.5 millions of vector evaluations/second ->  12 cycles/value
cosf .. ->            57.3 millions of vector evaluations/second ->  11 cycles/value
sincos (x87) .. ->    9.1 millions of vector evaluations/second ->   71 cycles/value
expf .. ->            61.4 millions of vector evaluations/second ->  11 cycles/value
logf .. ->            55.6 millions of vector evaluations/second ->  12 cycles/value
cephes_sinf .. ->     52.5 millions of vector evaluations/second ->  12 cycles/value
cephes_cosf .. ->     41.9 millions of vector evaluations/second ->  15 cycles/value
cephes_expf .. ->     18.3 millions of vector evaluations/second ->  35 cycles/value
cephes_logf .. ->     20.2 millions of vector evaluations/second ->  32 cycles/value
sin_ps .. ->          54.1 millions of vector evaluations/second ->  12 cycles/value
cos_ps .. ->          54.8 millions of vector evaluations/second ->  12 cycles/value
sincos_ps .. ->       54.6 millions of vector evaluations/second ->  12 cycles/value
exp_ps .. ->          42.6 millions of vector evaluations/second ->  15 cycles/value
log_ps .. ->          41.0 millions of vector evaluations/second ->  16 cycles/value

Run Code Online (Sandbox Code Playgroud)

/* Performance test exp(x) algorithm

based on AVX implementation of Giovanni Garberoglio


 Copyright (C) 2020 Antonio R.


     AVX implementation of exp:
     Modified code from this source: https://github.com/reyoung/avx_mathfun
     Based on "sse_mathfun.h", by Julien Pommier
     http://gruntthepeon.free.fr/ssemath/
     Copyright (C) 2012 Giovanni Garberoglio
     Interdisciplinary Laboratory for Computational Science (LISC)
     Fondazione Bruno Kessler and University of Trento
     via Sommarive, 18
     I-38123 Trento (Italy)


    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.
    Permission is granted to anyone to use this software for any purpose,
    including commercial applications, and to alter it and redistribute it
    freely, subject to the following restrictions:
    1. The origin of this software must not be misrepresented; you must not
       claim that you wrote the original software. If you use this software
       in a product, an acknowledgment in the product documentation would be
       appreciated but is not required.
    2. Altered source versions must be plainly marked as such, and must not be
       misrepresented as being the original software.
    3. This notice may not be removed or altered from any source distribution.
    (this is the zlib license)

  */

/*    gcc -O3 -m64 -Wall -mavx2 -march=haswell  expc.c -lm     */

#include <stdio.h>
#include <immintrin.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5


__m256 exp256_ps(__m256 x) {

/*
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}

int main(){
    int r;
    int i;
    float p;
    static float xv[MAXSIZE];
    static float yv[MAXSIZE];
    float *xp;
    float *yp;

  for (r = 0; r < REIT; r++) {
    p = XXX;
    xp = xv;
    yp = yv;

    for (i = 0; i < MAXSIZE; i += 8) {

    __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
    __m256 y = exp256_ps(x);

    _mm256_store_ps(xp,x);
    _mm256_store_ps(yp,y);

    xp += 8;
    yp += 8;
    p += 0.00008;
    }

  }

  for (i=0;i<MAXSIZE;i++){
      printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
  }

    return 0;
}

Run Code Online (Sandbox Code Playgroud)

为了进行比较，这是数学库中的 exp (x)、单线程和浮点的代码示例。

#include <stdio.h>
#include <math.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
/*    gcc -O3 -m64 exp_st.c -lm    */


int main(){
    int r;
    int i;
    float p;
    static float xv[MAXSIZE];
    static float yv[MAXSIZE];


  for (r = 0; r < REIT; r++) {
    p = XXX;

    for (i = 0; i < MAXSIZE; i++) {

      xv[i]=p;
      yv[i]=expf(p);

    p += 0.00001;
    }
  }

  for (i=0;i<MAXSIZE;i++){
      printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
  }

    return 0;
}

Run Code Online (Sandbox Code Playgroud)

解决方案：正如 Andreas Wenzel 所说，gcc 编译器足够智能，它决定没有必要将结果实际写入数组，这些写入会被编译器优化掉。在我根据新信息进行新的性能测试之后，或者在我犯了几个错误或我假设了错误的事实之前，结果似乎更清晰：exp (double arg)，或expf(float arg)，即x2+ exp(double arg) ，已得到改进，但它不如快速 AVX2 算法（x8 并行浮点 arg），后者比 SSE2 算法（x4 并行浮点 arg）快 6 倍左右。以下是一些结果，符合 Intel 超线程 CPU 的预期（SSE2 算法除外）：

exp（双参数）单线程：18 分 46 秒

exp (double arg) 4 个线程：5 分 4 秒

exp (双参数) 8 个线程：4 分 28 秒

expf (float arg) 单线程：7 分 32 秒

expf (float arg) 4 个线程：1 分 58 秒

expf (float arg) 8 个线程：1 分 41 秒

相对误差**：

i           x                     y = expf(x)           double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.518329620e+00   y = 8.059411496e-02   exp_dbl = 8.059411715e-02   rel_err =-2.716802480e-08
i = 372000  x =-1.278343201e+00   y = 2.784983218e-01   exp_dbl = 2.784983343e-01   rel_err =-4.490403948e-08
i = 496000  x =-3.867173195e-02   y = 9.620664716e-01   exp_dbl = 9.620664730e-01   rel_err =-1.481617428e-09
i = 620000  x = 1.201261759e+00   y = 3.324308872e+00   exp_dbl = 3.324308753e+00   rel_err = 3.571995830e-08
i = 744000  x = 2.441616058e+00   y = 1.149159718e+01   exp_dbl = 1.149159684e+01   rel_err = 2.955980805e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.921588898e+00   y = 1.372204742e+02   exp_dbl = 1.372204694e+02   rel_err = 3.563072184e-08

Run Code Online (Sandbox Code Playgroud)

* Julien Pommier的 SSE2 算法，x6,8 速度从 1 个线程增加到 8 个线程。我的性能测试代码使用aligned(16)作为传递给库的向量/4浮点数组的typedef联合，而不是对齐整个浮点数组。这可能会导致性能降低，至少对于其他 AVX2 代码而言，其多线程性能改进似乎对英特尔超线程也有好处，但速度较低，时间在 x2.5-x1.5 之间增加。也许 SSE2 代码可以通过更好的数组对齐来加速，但我无法改进：

exp_ps（x4 并行浮点参数）单线程：12 分 7 秒

exp_ps（x4 并行浮点参数）4 线程：3 分 10 秒

exp_ps（x4 并行浮点参数）8 个线程：1 分 46 秒

相对误差**：

i           x                     y = exp_ps(x)         double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.518329620e+00   y = 8.059412241e-02   exp_dbl = 8.059411715e-02   rel_err = 6.527768787e-08
i = 372000  x =-1.278343201e+00   y = 2.784983218e-01   exp_dbl = 2.784983343e-01   rel_err =-4.490403948e-08
i = 496000  x =-3.977407143e-02   y = 9.610065222e-01   exp_dbl = 9.610065335e-01   rel_err =-1.174323454e-08
i = 620000  x = 1.200158238e+00   y = 3.320642233e+00   exp_dbl = 3.320642334e+00   rel_err =-3.054731957e-08
i = 744000  x = 2.441616058e+00   y = 1.149159622e+01   exp_dbl = 1.149159684e+01   rel_err =-5.342903415e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.921588898e+00   y = 1.372204742e+02   exp_dbl = 1.372204694e+02   rel_err = 3.563072184e-08

Run Code Online (Sandbox Code Playgroud)

AVX2 算法（x8 并行浮点参数）单线程：1 分 45 秒

AVX2 算法（x8 并行浮点参数）4 线程：28 秒

AVX2 算法（x8 并行浮点参数）8 线程：27 秒

相对误差**：

i           x                     y = exp256_ps(x)      double precision exp        relative error

i = 0       x =-5.000000000e+00   y = 6.737946998e-03   exp_dbl = 6.737946999e-03   rel_err =-1.124224480e-10
i = 124000  x =-3.758316040e+00   y = 2.332298271e-02   exp_dbl = 2.332298229e-02   rel_err = 1.803005727e-08
i = 248000  x =-2.516632080e+00   y = 8.073104918e-02   exp_dbl = 8.073104510e-02   rel_err = 5.057888540e-08
i = 372000  x =-1.279417157e+00   y = 2.781994045e-01   exp_dbl = 2.781993997e-01   rel_err = 1.705288467e-08
i = 496000  x =-3.954863176e-02   y = 9.612231851e-01   exp_dbl = 9.612232069e-01   rel_err =-2.269774967e-08
i = 620000  x = 1.199879169e+00   y = 3.319715738e+00   exp_dbl = 3.319715775e+00   rel_err =-1.119642824e-08
i = 744000  x = 2.440370798e+00   y = 1.147729492e+01   exp_dbl = 1.147729571e+01   rel_err =-6.896860199e-08
i = 868000  x = 3.681602478e+00   y = 3.970997620e+01   exp_dbl = 3.970997748e+01   rel_err =-3.232306688e-08
i = 992000  x = 4.923286438e+00   y = 1.374535980e+02   exp_dbl = 1.374536045e+02   rel_err =-4.676466368e-08

Run Code Online (Sandbox Code Playgroud)

**相对错误是相同的，因为SSE2和AVX2的代码使用相同的算法，并且很可能也是库函数exp(x)的错误。

AVX2算法多线程源码

/* Performance test of a multithreaded exp(x) algorithm

based on AVX implementation of Giovanni Garberoglio


 Copyright (C) 2020 Antonio R.


     AVX implementation of exp:
     Modified code from this source: https://github.com/reyoung/avx_mathfun
     Based on "sse_mathfun.h", by Julien Pommier
     http://gruntthepeon.free.fr/ssemath/
     Copyright (C) 2012 Giovanni Garberoglio
     Interdisciplinary Laboratory for Computational Science (LISC)
     Fondazione Bruno Kessler and University of Trento
     via Sommarive, 18
     I-38123 Trento (Italy)


    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any damages
    arising from the use of this software.
    Permission is granted to anyone to use this software for any purpose,
    including commercial applications, and to alter it and redistribute it
    freely, subject to the following restrictions:
    1. The origin of this software must not be misrepresented; you must not
       claim that you wrote the original software. If you use this software
       in a product, an acknowledgment in the product documentation would be
       appreciated but is not required.
    2. Altered source versions must be plainly marked as such, and must not be
       misrepresented as being the original software.
    3. This notice may not be removed or altered from any source distribution.
    (this is the zlib license)

  */

  /*    gcc -O3 -m64 -mavx2 -march=haswell expc_multi.c -pthread -lm    */

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
#include <math.h>
#include <pthread.h>
#define MAXSIZE 1000000
#define REIT 100000
#define XXX -5
#define num_threads 4

typedef float  FLOAT32[MAXSIZE] __attribute__((aligned(4)));
static FLOAT32 xv;
static FLOAT32 yv;



__m256 exp256_ps(__m256 x) {

/*
  To increase the compatibility across different compilers the original code is
  converted to plain AVX2 intrinsics code without ingenious macro's,
  gcc style alignment attributes etc.
  Moreover, the part "express exp(x) as exp(g + n*log(2))" has been significantly simplified.
  This modified code is not thoroughly tested!
*/


__m256   exp_hi        = _mm256_set1_ps(88.3762626647949f);
__m256   exp_lo        = _mm256_set1_ps(-88.3762626647949f);

__m256   cephes_LOG2EF = _mm256_set1_ps(1.44269504088896341f);
__m256   inv_LOG2EF    = _mm256_set1_ps(0.693147180559945f);

__m256   cephes_exp_p0 = _mm256_set1_ps(1.9875691500E-4);
__m256   cephes_exp_p1 = _mm256_set1_ps(1.3981999507E-3);
__m256   cephes_exp_p2 = _mm256_set1_ps(8.3334519073E-3);
__m256   cephes_exp_p3 = _mm256_set1_ps(4.1665795894E-2);
__m256   cephes_exp_p4 = _mm256_set1_ps(1.6666665459E-1);
__m256   cephes_exp_p5 = _mm256_set1_ps(5.0000001201E-1);
__m256   fx;
__m256i  imm0;
__m256   one           = _mm256_set1_ps(1.0f);

        x     = _mm256_min_ps(x, exp_hi);
        x     = _mm256_max_ps(x, exp_lo);

  /* express exp(x) as exp(g + n*log(2)) */
        fx     = _mm256_mul_ps(x, cephes_LOG2EF);
        fx     = _mm256_round_ps(fx, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC);
__m256  z      = _mm256_mul_ps(fx, inv_LOG2EF);
        x      = _mm256_sub_ps(x, z);
        z      = _mm256_mul_ps(x,x);

__m256  y      = cephes_exp_p0;
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p1);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p2);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p3);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p4);
        y      = _mm256_mul_ps(y, x);
        y      = _mm256_add_ps(y, cephes_exp_p5);
        y      = _mm256_mul_ps(y, z);
        y      = _mm256_add_ps(y, x);
        y      = _mm256_add_ps(y, one);

  /* build 2^n */
        imm0   = _mm256_cvttps_epi32(fx);
        imm0   = _mm256_add_epi32(imm0, _mm256_set1_epi32(0x7f));
        imm0   = _mm256_slli_epi32(imm0, 23);
__m256  pow2n  = _mm256_castsi256_ps(imm0);
        y      = _mm256_mul_ps(y, pow2n);
        return y;
}


void* run(void *received_Val){
    int single_val = *((int *) received_Val);
    int r;
    int i;
    float p;
    float *xp;
    float *yp;

  for (r = 0; r < REIT; r++) {
    p = XXX + 0.00001*single_val*MAXSIZE/num_threads;
    xp = xv + single_val*MAXSIZE/num_threads;
    yp = yv + single_val*MAXSIZE/num_threads;

    for (i = single_val*MAXSIZE/num_threads; i < (single_val+1)*MAXSIZE/num_threads; i += 8) {

      __m256 x = _mm256_setr_ps(p, p + 0.00001, p + 0.00002, p + 0.00003, p + 0.00004, p + 0.00005, p + 0.00006, p + 0.00007);
      __m256 y = exp256_ps(x);

      _mm256_store_ps(xp,x);
      _mm256_store_ps(yp,y);

      xp += 8;
      yp += 8;
      p += 0.00008;

    }
  }

    return 0;
}


int main(){
    int i;
      pthread_t tid[num_threads];


    for (i=0;i<num_threads;i++){
      int *arg = malloc(sizeof(*arg));
      if ( arg == NULL ) {
        fprintf(stderr, "Couldn't allocate memory for thread arg.\n");
        exit(1);
      }

      *arg = i;
        pthread_create(&(tid[i]), NULL, run, arg);
    }

    for(i=0; i<num_threads; i++)
    {
        pthread_join(tid[i], NULL);
    }

    for (i=0;i<MAXSIZE;i++){
        printf("x=%.20f, e^x=%.20f\n",xv[i],yv[i]);
    }

    return 0;
}

Run Code Online (Sandbox Code Playgroud)

图表概览： exp (double arg) 没有 printf 循环，不是真正的性能，正如 Andreas Wenzel 发现的那样，当结果不是 printf 时，gcc 不会计算 exp(x)，即使 float 版本也慢得多，因为它的汇编指令不同。尽管图对于某些仅使用低级 CPU 缓存/寄存器的汇编算法可能很有用。 expf (float arg) 真实性能或使用 printf 循环 AVX2算法，性能最好。

Answer 1

And*_*zel 5

当您不在程序末尾打印数组时，gcc 编译器足够聪明，可以意识到计算结果没有可观察到的影响。因此，编译器决定没有必要将结果实际写入数组，因为这些结果从未被使用。相反，这些写入会被编译器优化掉。

此外，当您不打印结果时，库函数exp不会产生任何可观察到的影响，前提是调用它的输入不高到会导致浮点溢出（这会导致函数引发浮点）例外）。这也允许编译器优化这些函数调用。

正如您在 gcc 编译器为您的代码发出的汇编指令中所看到的那样，该指令不会打印结果，编译后的程序不会无条件调用该函数exp，而是测试该函数的输入exp是否高于7.09e2（以确保不会发生溢出）。只有发生溢出时，程序才会跳转到调用该函数的代码处exp。这是相关的汇编代码：

ucomisd xmm1, xmm3
jnb .L9

Run Code Online (Sandbox Code Playgroud)

在上面的汇编代码中，CPU寄存器xmm3包含双精度浮点值7.09e2。如果输入高于该常数，该函数exp将导致浮点溢出，因为结果无法用双精度浮点值表示。

由于您的输入始终有效且足够低，不会导致浮点溢出，因此您的程序将永远不会执行此跳转，因此它永远不会真正调用该函数exp。

这解释了为什么当您不打印结果时您的代码会快得多。如果您不打印结果，编译器将确定计算没有可观察到的效果，因此它将优化它们。

因此，如果你想让编译器真正执行计算，你必须确保计算有一些可观察到的效果。这并不意味着您必须实际打印所有结果（有几兆字节大）。如果您只打印一行取决于所有结果（例如所有结果的总和）就足够了。

但是，如果将对库函数的函数调用替换exp为对其他自定义函数的调用，那么，至少在我的测试中，编译器不够聪明，无法意识到函数调用没有可观察到的效果。在这种情况下，即使不打印计算结果，也无法优化函数调用。

由于上述原因，如果您想比较两个函数的性能，则必须确保计算实际进行，并确保结果具有可观察到的效果。否则，您将面临编译器优化至少部分计算的风险，并且比较将不公平。

归档时间：	5 年，3 月前
查看次数：	790 次
最近记录：	4 年，2 月前