是什么让 numpy.sum 比优化（自动向量化）C 循环更快？

Question

是什么让 numpy.sum 比优化（自动向量化）C 循环更快？

dna*_*lor 1 c floating-point numpy avx compiler-optimization

我正在尝试编写一个与numpy.sum双精度数组一样快的 C 程序，但似乎失败了。

以下是我衡量 numpy 性能的方法：

import numpy as np
import time

SIZE=4000000
REPS=5

xs = np.random.rand(SIZE)
print(xs.dtype)

for _ in range(REPS):
    start = time.perf_counter()
    r = np.sum(xs)
    end = time.perf_counter()
    print(f"{SIZE / (end-start) / 10**6:.2f} MFLOPS ({r:.2f})")

Run Code Online (Sandbox Code Playgroud)

输出是：

float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)

Run Code Online (Sandbox Code Playgroud)

现在尝试在 C 中做类似的事情：

float64
2941.61 MFLOPS (2000279.78)
3083.56 MFLOPS (2000279.78)
3406.18 MFLOPS (2000279.78)
3712.33 MFLOPS (2000279.78)
3661.15 MFLOPS (2000279.78)

Run Code Online (Sandbox Code Playgroud)

编译并gcc -o main -Wall -O3 -mavx main.c运行它，输出是：

1850.14 MFLOPS (1999882.86)
1857.01 MFLOPS (1999882.86)
1900.24 MFLOPS (1999882.86)
1903.86 MFLOPS (1999882.86)
1906.58 MFLOPS (1999882.86)

Run Code Online (Sandbox Code Playgroud)

显然这比 numpy 慢。

根据toppython 进程的 CPU 使用率约为 100%，因此 numpy 看起来并没有并行化任何东西。

C 代码似乎使用 256 位 AVX 寄存器（编译时xmm0 上-S有vaddsd指令）。这似乎是最好的选择，因为我使用的机器似乎不支持 AVX-512：

$ egrep 'model name|flags' /proc/cpuinfo  | head -n2
model name      : 13th Gen Intel(R) Core(TM) i9-13900K
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities

Run Code Online (Sandbox Code Playgroud)

numpy 使用了什么样的技巧来击败这段 C 代码？

Answer 1

Pet*_*des 6

您的循环根本没有优化，因为严格的 FP 数学是默认的。XMM0是128位寄存器，YMM0是对应的256位。vaddsd是 ADD Scalar Double，使用 XMM0 的低位元素。https://felixcloutier.com/x86/addsd

\n

让它clang -O3 -ffast-math -march=native矢量化并展开（4 倍）以获得 16 倍的加速，AVX 和指令级并行性各提高 4 倍（维基百科/现代微处理器\n90 分钟指南！），数组足够小，不会成为瓶颈L3 缓存带宽。（另一个大约 2 倍的性能可通过适用于 L1d（而不仅仅是 L2）的数组获得，例如，可以#pragma clang loop interleave_count(8)展开更多，对于已缓存阻止的代码，通常会在 L1d 缓存中命中。）

\n

您的 Raptor Lake CPU 有两个全流水线矢量 FP 添加单元，流水线长度为 3（结果准备好作为另一个添加的输入之前的延迟周期）。这个答案包括我在 i7-6700k Skylake 上的结果，除了 FP-add 管道有 4 个周期延迟之外，它具有相同的结果。

\n

@J\xc3\xa9r\xc3\xb4me Richard 评论说 NumPy 只是对 FP 数组的和进行标量成对求和，这比纯朴素的串行获得了一些 ILP。如果您遇到 DRAM 带宽瓶颈，也没关系。一个好处是 ISA 和可用 SIMD 功能之间的数值一致性，这是通过不使用它们来实现的。

\n

您正在寻找vaddpd ymm0, ymm0, [rdi]（256 位向量上的 Packed Double）。GCC 将执行此操作-ffast-math，其中之一是让它假装 FP 数学运算具有关联性，从而改变舍入误差。（在这种情况下，为了更好，例如，如果您与long double总和或 Kahan 误差补偿求和进行比较；这是朝着与成对求和相同的想法迈出的一步。）另请参阅https://gcc.gnu.org/维基/浮点数学

\n

gcc -O3 -march=native -ffast-math  foo.c\n

Run Code Online (Sandbox Code Playgroud)\n

这提供了大约 4 倍的加速，因为 FP ALU 延迟（CPU 上每 3 个周期 1 个向量而不是 1 个标量）仍然是比 L3 带宽更糟糕的瓶颈，绝对比 L2 缓存带宽更糟糕。

\n

SIZE=4000000次sizeof(double)为 30.52 MiB，因此它适合您的高端 Raptor Lake 的 36MiB L3 缓存。但要跑得更快，您将需要减少SIZE并加速REPS（也许在定时区域内放置一个重复循环。）整个程序的分析时间非常短perf stat，在我的 i7-6700k 上不到 100 毫秒Skylake 配备 DDR4-2666，其中大部分是启动阶段。clock()用代替的时间也很短clock_gettime。

\n

您的每核缓存大小为 48 KiB L1d、2 MiB L2（在 Golden Cove P 核上，在单个 Gracemont E 核上更少/更多）。 https://en.wikipedia.org/wiki/Raptor_Lake / https://chipsandcheese.com/2022/08/23/a-preview-of-raptor-lakes-improved-l2-caches/。 SIZE=6144将使数组达到 L1d 的完整大小。如果我们的目标是仅阵列占用 40 KiB 的空间，为其他一些东西留出空间，SIZE=5120. 不妨将其按 32 字节对齐，aligned_alloc这样我们就可以在每个时钟周期以 3 个向量（96 字节）从 L1d 缓存中读取它，而不是让缓存行分割每个其他向量。（https://chipsandcheese.com/2021/12/02/popping-the-hood-on-golden-cove//https://travisdowns.github.io/blog/2019/06/11/speed-limits。html#load-throughput-limit / https://uops.info/）

\n

为了达到接近峰值 FLOPS（由于我们不使用 FMA，所以在 2 倍以内），我们需要vaddpd在每个时钟周期运行 2 条指令。但它在 Golden Cove P 核心（Alder/Raptor Lake）上有 3 个周期的延迟，因此该产品同时运行latency * bandwidth6 个周期。vaddpd这是依赖链的最小数量，最好至少为 8 个。如果少于 8 个，循环携带的依赖链就会成为瓶颈，而不是吞吐量。\n( 为什么mulss 在 Haswell 上只需要 3 个周期，与 Agner 的指令表不同?（使用多个累加器展开 FP 循环））

\n

因此，您正在内部循环中寻找其他指令，例如vaddpd ymm1, ymm1, [rdi+32]. Golden Cove 的 3c 延迟/0.5c 倒数吞吐量vaddps/pd是由于专用 SIMD-FP 添加 ALU 而不是用于 mul/FMA 执行单元的 4 周期管道，后者自 Skylake 以来也用于添加/子。与 Haswell 不同，Golden Cove（Alder/Raptor Lake P-core）有两个这样的 ALU，因此吞吐量仍然与 FMA 一样好。

\n

GCC-funroll-loops在这里没有用处，展开循环但仍然只有一个累加器向量。（即使使用#pragma omp simd reduction (+:sum)和-fopenmp。） 默认情况下，Clang 将使用 4 个累加器展开。 它将-march=raptorlake展开 20，但仍然只有 4 个累加器，因此每个向量中添加 5 个。并且使用像这样的索引寻址模式，[rdi + 8*rcx + 32]每个vaddpd ymm, ymm, [reg+reg*8]都未层压到 2 uop，并没有尽可能地降低前端成本。仅涉及一个数组，因此使用指针增量而不是索引甚至不会有任何额外成本，它没有做任何聪明的事情，例如相对于具有负索引的数组末尾进行索引计数到零。但这并不是瓶颈。Golden Cove 的宽前端（6 uops）vaddpd [mem+idx]每个周期可以发出 3 条这样的指令，因此保持领先于后端（2/时钟）。即使是 4 宽 Skylake 也能跟上这种有限的展开。

\n

#pragma clang loop interleave_count(8)for()在用更多的累加器展开之前。（超过8它会忽略它，只执行 4 :/）对于您期望获得 L1d 命中的代码来说，这可能只是一个好主意；如果您预计阵列需要来自 L2 或更远的位置，则默认值很好。当然，在这种情况下，展开的非交错部分也只是浪费代码大小，并且如果n不是编译时常量，则会花费更多的清理代码。文档： https: //clang.llvm.org/docs/LanguageExtensions.html#extensions-for-loop-hint-optimizations

\n

使用默认值，没有编译指示，但是-O3 -ffast-math -march=native（在 Skylake 上也使用-mbranches-within-32B-boundaries），我们得到与 Clang 在 Raptor Lake 上使用的 4 个累加器交错相同的 20 展开。（它还完全展开REPS计时/打印循环，因此这个大循环重复 5 次。这几乎肯定比花费 1 个寄存器和几条指令来回收缓存中已经很热的代码更糟糕。）

\n

gcc -O3 -march=native -ffast-math  foo.c\n

Run Code Online (Sandbox Code Playgroud)\n

与使用 pragma 相比，当内联到时，展开 16，有 8 个累加器main。 4000不完全是 16 x 4 的倍数，因此循环退出条件位于循环中间的 8 个加法组之间。

\n

# clang 16  no pragma, unrolls by 20 with 4 accumulators\ninner_loop_top:\n    1360:       c5 fd 58 84 cb a0 fd ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x260]\n    1369:       c5 f5 58 8c cb c0 fd ff ff      vaddpd ymm1,ymm1,[rbx+rcx*8-0x240]\n    1372:       c5 ed 58 94 cb e0 fd ff ff      vaddpd ymm2,ymm2, [rbx+rcx*8-0x220]\n    137b:       c5 e5 58 9c cb 00 fe ff ff      vaddpd ymm3,ymm3, [rbx+rcx*8-0x200]\n    1384:       c5 fd 58 84 cb 20 fe ff ff      vaddpd ymm0,ymm0, [rbx+rcx*8-0x1e0]\n  ... ymm1, ymm2\n    139f:       c5 e5 58 9c cb 80 fe ff ff      vaddpd ymm3,ymm3,[rbx+rcx*8-0x180]\n\n... 2 more copies of ymm0..3, ending with the next insn, the first to use a 1-byte disp8\n    13e7:       c5 e5 58 5c cb 80       vaddpd ymm3,ymm3, [rbx+rcx*8-0x80]\n\n    13ed:       c5 fd 58 44 cb a0       vaddpd ymm0,ymm0, [rbx+rcx*8-0x60]\n    13f3:       c5 f5 58 4c cb c0       vaddpd ymm1,ymm1, [rbx+rcx*8-0x40]\n    13f9:       c5 ed 58 54 cb e0       vaddpd ymm2,ymm2, [rbx+rcx*8-0x20]\n    13ff:       c5 e5 58 1c cb          vaddpd ymm3,ymm3, [rbx+rcx*8]\n    1404:       48 83 c1 50             add    rcx,0x50\n    1408:       48 81 f9 ec 0f 00 00    cmp    rcx,0xfec\n    140f:       0f 85 4b ff ff ff       jne    1360 <main+0x80>\n

Run Code Online (Sandbox Code Playgroud)\n

我尝试更改源代码以鼓励编译器增加指针，但 clang 没有接受提示，在它使用的寄存器中发明了一个索引计数器 [rdi + r8*8 + 0x20]

\n

  const double * endp = array+size;\n#pragma clang loop interleave_count(8)\n  while (array != endp) {  // like a C++ range-for\n    sum += *array++;       // no benefit, clang pessimizes back to an index\n  }\n

Run Code Online (Sandbox Code Playgroud)\n

\n

更新了微基准源代码

\n

// #define SIZE 5120 // 40 KiB, fits in Raptor Lake\'s 48KiB\n#define SIZE 4000     // fits in SKL\'s 32KiB L1d cache\n#define REPS 5\n\n...\n\n        double *array = aligned_alloc(32, array_size * sizeof(double));\n//  double *array = malloc(array_size * sizeof(double));\n\n...\n\ndouble sum_array(const double *array, long size) {\n  double sum = 0.0;\n//#pragma clang loop interleave_count(8)   // uncomment this, optionally\n  for (size_t i = 0; i < size; ++i) {\n    sum += array[i];\n  }\n  return sum;\n}\n\n\nint main() {\n  double *xs = make_random_array(SIZE);\n  if (xs == NULL) return 1;\n\n  const int  inner_reps = 1000000;  // sum the array this many times each timed interval\n  for (int i = 0; i < REPS; i++) {\n    clock_t start_time = clock();\n    volatile double r;  // do something with the sum even when we don\'t print\n    for (int i=0 ; i<inner_reps ; i++){  // new inner loop\n       r = sum_array(xs, SIZE);\n       //  asm(""::"r"(xs) :"memory");  // forget about the array contents and redo the sum\n       // turned out not to be necessary, clang is still doing the work\n    }\n    clock_t end_time = clock();\n    double dt = (double)(end_time - start_time) / (CLOCKS_PER_SEC * inner_reps);\n    printf("%.2f MFLOPS (%.2f)\\n", (double)SIZE / dt / 1000000, r);\n  }\n\n  free(xs);\n  return 0;\n}\n

Run Code Online (Sandbox Code Playgroud)\n

添加const int inner_reps = 1000000;了每个定时间隔内的重复计数和一些确保优化器不会击败它的措施（Godbolt - 也SIZE减少到4000适合我的 32KiB L1d），在 4.2 GHz 的 Skylake 上，我得到了如预期加速 16 倍。

\n

Arch GNU/Linux 上的 GCC 13.2.1、clang 16.0.6、内核 6.5

\n

# Without any vectorization\n$ gcc -O3 -march=native -Wall arr-sum.c\ntaskset -c 1 perf stat  -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,fp_arith_inst_retired.256b_packed_single   -r1 ./a.out 1057.69 MFLOPS (2003.09)\n1059.17 MFLOPS (2003.09)\n1059.67 MFLOPS (2003.09)\n1060.30 MFLOPS (2003.09)\n1060.34 MFLOPS (2003.09)\n... perf results below\n\n# with 1 vector accumulator\n$ gcc -O3 -march=native -ffast-math -Wall arr-sum.c\n$ taskset -c 1 perf stat ... a.out\n4389.68 MFLOPS (2003.09)\n4389.32 MFLOPS (2003.09)\n4381.48 MFLOPS (2003.09)\n4393.57 MFLOPS (2003.09)\n4389.98 MFLOPS (2003.09)\n... perf results below\n\n# unrolled by 4 vectors\n$ clang -O3 -march=native -ffast-math -Wall arr-sum.c   # clang unrolls by default\n$ taskset -c 1 perf stat ... a.out\n17048.41 MFLOPS (2003.09)\n17072.49 MFLOPS (2003.09)\n17060.55 MFLOPS (2003.09)\n17081.02 MFLOPS (2003.09)\n17099.79 MFLOPS (2003.09)\n... perf results below, but including:\n     2,303,995,395      idq.mite_uops                    #    1.965 G/sec                     \n  # suffering from the JCC erratum in the inner loop; avoid it:\n\n$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c\n$ taskset -c 1 perf stat ... a.out\n17013.53 MFLOPS (2003.09)\n17061.79 MFLOPS (2003.09)\n17064.99 MFLOPS (2003.09)\n17109.44 MFLOPS (2003.09)\n17001.74 MFLOPS (2003.09)\n... perf results below; summary: 1.17 seconds\n     4,905,130,231      cycles                           #    4.178 GHz                       \n     5,941,872,098      instructions                     #    1.21  insn per cycle\n         5,165,165      idq.mite_uops                    #    4.399 M/sec\n     5,015,000,000      fp_arith_inst_retired.256b_packed_double #    4.271 G/sec\n\n # With  #pragma clang loop interleave_count(8) in the source\n # for unrolling by 8 instead of 4\n$ clang -O3 -march=native -mbranches-within-32B-boundaries -ffast-math -Wall arr-sum.c\n$ taskset -c 1 perf stat ... a.out\n28505.05 MFLOPS (2003.09)\n28553.48 MFLOPS (2003.09)\n28556.13 MFLOPS (2003.09)\n28597.37 MFLOPS (2003.09)\n28548.18 MFLOPS (2003.09)\n # imperfect scheduling and a front-end bottleneck from clang\'s bad choice of addressing-mode\n # means we don\'t get another 2x over the default.\n

Run Code Online (Sandbox Code Playgroud)\n

（通过perf stat -d，我还确认 L1d 缓存未命中率低于 1%。使用较大的数组大小（例如20000适合 Skylake 的 256K L2 缓存但不适合 L1d），我仍然相当接近每个时钟吞吐量 1 个向量。）

\n

在这种情况下， JCC勘误解决方法（仅限 Skylake 系列，而不是您的 CPU）提供的进一步加速可以忽略不计，即使使用传统解码，前端也不是瓶颈：出现问题时会发生未层压，因此解码器不会2-uop 指令令人窒息。在 4 倍展开的情况下，平均uops_issued.any吞吐量仍然仅为 2.18 个/时钟。

\n

因此，通过 AVX (4x) 向量化和 4 个累加器的指令级并行性，我们在 Skylake 上获得了 16 倍的加速。这仍然仅比vaddpd每个时钟周期 1 的平均值稍好一点（因为重复循环迭代之间存在 ILP），但 clang 的 4 个 dep 链仅为 Skylake 的 4 个周期延迟 x 2 insn/ 的一半周期吞吐量 = 8 个飞行中的 FP 数学指令。

\n

展开 4 后，桌面上还剩下 2 个性能因子（对于 Skylake，对于 Alder Lake 及更高版本，性能会降低。更新：我们通过获得了大部分性能）pragma。但这只有在 L1d 缓存中的数据很热、小心缓存阻塞的情况下才能实现，或者如果您在寄存器中对数据进行更多处理（更高的计算强度，而不仅仅是每次加载 1 次添加）。获得另一个完整的 2x 还需要一个能够识别 Sandybridge 系列未分层的优化器，而 clang 显然不会。Clang 默认的 4 个累加器似乎是合理的，更多的累加器意味着更多的初始化和清理工作，尽管仅使用 4 个累加器展开 20 个似乎过多，就像浪费 I-cache / uop-cache 占用空间。

\n

性能计数器结果

\n

仅在具有 Linux 内核和 perf 6.5 的 i7-6700k Skylake（EPP=性能）上计算用户空间。这是针对整个过程，包括启动，但是内部重复计数为 100 万意味着其总时间的绝大多数花费在我们关心的循环上，而不是启动。

\n

标量循环：
\n\'./a.out\' 的性能计数器统计信息（GCC O3 原生，无快速数学）：

\n

     18,902.70 msec task-clock                       #    1.000 CPUs utilized\n            54      context-switches                 #    2.857 /sec\n             0      cpu-migrations                   #    0.000 /sec\n            72      page-faults                      #    3.809 /sec\n79,099,401,032      cycles                           #    4.185 GHz\n35,069,666,963      instructions                     #    0.44  insn per cycle\n30,109,096,046      uops_issued.any                  #    1.593 G/sec\n50,096,899,159      uops_executed.thread             #    2.650 G/sec\n    46,353,551      idq.mite_uops                    #    2.452 M/sec\n             0      fp_arith_inst_retired.256b_packed_double #    0.000 /sec\n\n  18.902876984 seconds time elapsed\n\n  18.893778000 seconds user\n   0.000000000 seconds sys\n

Run Code Online (Sandbox Code Playgroud)\n

请注意 0 计数fp_arith_inst_retired.256b_packed_double- 没有 256 位 SIMD 指令。

\n

矢量化但未展开：
\n\'./a.out\' 的性能计数器统计信息（GCC O3-native-fast-math）：

\n

      4,559.54 msec task-clock                       #    1.000 CPUs utilized\n             8      context-switches                 #    1.755 /sec\n             0      cpu-migrations                   #    0.000 /sec\n            74      page-faults                      #   16.230 /sec\n19,093,881,407      cycles                           #    4.188 GHz\n20,060,557,627      instructions                     #    1.05  insn per cycle\n15,094,070,341      uops_issued.any                  #    3.310 G/sec\n20,075,885,996      uops_executed.thread             #    4.403 G/sec\n    12,015,692      idq.mite_uops                    #    2.635 M/sec\n 5,000,000,000      fp_arith_inst_retired.256b_packed_double #    1.097 G/sec\n\n   4.559770793 seconds time elapsed\n\n   4.557838000 seconds user\n   0.000000000 seconds sys\n

Run Code Online (Sandbox Code Playgroud)\n

矢量化，用 4 个累加器按 20 展开：
\n\'./a.out\' 的性能计数器统计信息：(Clang -O3-native-fast-math JCC-workaround)

\n

      1,174.07 msec task-clock                       #    1.000 CPUs utilized\n             5      context-switches                 #    4.259 /sec\n             0      cpu-migrations                   #    0.000 /sec\n            72      page-faults                      #   61.325 /sec\n 4,905,130,231      cycles                           #    4.178 GHz\n 5,941,872,098      instructions                     #    1.21  insn per cycle\n10,689,939,125      uops_issued.any                  #    9.105 G/sec\n10,566,645,887      uops_executed.thread

归档时间：	2 年，4 月前
查看次数：	155 次
最近记录：	2 年，4 月前