为什么通过std :: vector进行迭代比通过std :: array进行迭代要快？

Question

为什么通过std :: vector进行迭代比通过std :: array进行迭代要快？

tuk*_*ket 1 c++ performance benchmarking

我最近问了一个问题：为什么迭代std :: array比迭代std :: vector快得多？

正如人们很快指出的那样，我的基准测试存在许多缺陷。因此，当我尝试确定基准时，我注意到这std::vector并不慢std::array，实际上，情况恰恰相反。

#include <vector>
#include <array>
#include <stdio.h>
#include <chrono>

using namespace std;

constexpr int n = 100'000'000;
vector<int> v(n);
//array<int, n> v;

int main()
{
    int res = 0;
    auto start = chrono::steady_clock::now();
    for(int x : v)
        res += x;
    auto end = chrono::steady_clock::now();
    auto diff = end - start;
    double elapsed =
        std::chrono::duration_cast<
            std::chrono::duration<double, std::milli>
        >(end - start).count();
    printf("result: %d\ntime: %f\n", res, elapsed);
}

Run Code Online (Sandbox Code Playgroud)

我尝试从以前的基准进行改进的事情：

确保使用了结果，因此整个循环没有得到优化
使用-O3标志速度
使用std::chrono代替time命令。这样一来，我们就可以隔离要测量的部分（只是for循环）。变量的静态初始化以及类似的事情将无法衡量。

实测时间：

数组：

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 99.554109

Run Code Online (Sandbox Code Playgroud)

向量：

$ g++ arrVsVec.cpp -O3
$ ./a.out
result: 0
time: 30.734491

Run Code Online (Sandbox Code Playgroud)

我只是想知道这次我在做什么错。

观看Godbolt中的拆卸

Answer 1

Pet*_*des 6

内存映射/分配是惰性的：第一次访问页面将导致页面错误异常（#PF在 x86 上）。这包括 BSS 以及文件支持的映射，例如可执行文件的文本段。这些页面错误是“有效的”，因此它们不会导致传递 SIGSEGV；相反，如果需要，内核会分配一个物理页，并连接硬件页表，以便加载或存储可以重新运行，而不会第二次出错。

这是昂贵的，特别是如果内核在一个页面错误期间不“故障绕过”并准备多个页面。（特别是启用了 Spectre + Meltdown 缓解措施后，在当前的 x86-64 硬件上，用户 <-> 内核往返的成本会更高。）

您让的构造函数在动态分配¹std:vector之后将零写入数组。 在定时循环之外执行所有页面错误。 这发生在 main 之前，而实现正在运行静态对象的构造函数。^std::vector

但该数组是零初始化的，因此它被放置在 BSS 中。首先接触它的是你的循环。 您的array<>循环会补偿定时区域内的所有页面错误。

如果您过去new int[n]动态分配但不初始化内存块，您会看到与 static 相同的行为array<>。（如果 Linux 更愿意使用透明大页来进行动态分配而不是 BSS 映射，也许会更好一些。）

libstdc++ 和 libc++ 中的脚注 1 std::vector太愚蠢了，无法利用从操作系统获取已清零的页面，就像它使用calloc或等效的那样。如果库为归零内存提供一个new/delete兼容的分配器，则这是可能的。

C++ new/delete与 malloc/free/calloc/realloc 相比是残缺的。我不知道为什么 ISO C++ 省略了 calloc 和 realloc：两者对于大型分配都非常有用，尤其是 realloc 用于调整可简单复制对象的 std::vector 的大小，这些对象可能有空间在不复制的情况下增长其映射。但由于new/不能保证与/delete兼容，并且是可替换的，因此库不能很容易使用，甚至在幕后也是如此。mallocfreenewcallocrealloc

另一个因素：只读将 CoW 页映射到相同的物理零页

当读取（而不是写入）触发延迟分配时，它读取为零。（BSS 页面读取为零，新页面读取mmap(MAP_ANONYMOUS)为全零。）

连接硬件页表的（软）页错误处理程序不需要实际分配物理页（也称为页框）来支持该虚拟页。相反，Linux 将干净的（未写入的）匿名页面映射到单个物理清零页面。（这适用于所有任务。）

如果我们对数组进行多次传递，这会导致奇怪的情况，我们可以得到 TLB 未命中但 L1d 或 L3 命中（取决于是否存在大页），因为我们有多个虚拟页指向同一物理位置。

（一些CPU，例如AMD Ryzen，在L1d缓存中使用微标记来保存，其代价是即使相同的内存映射到多个虚拟地址，缓存也只能命中一个虚拟地址。Intel CPU使用true VIPT L1d缓存确实可以得到这个效果），

我为 Linux 制作了一个测试程序，它将使用madvise(MADV_HUGEPAGE)（鼓励内核对大页进行碎片整理）或madvise(MADV_NOHUGEPAGE)（即使对于只读情况也禁用大页）。

由于某些原因，Linux BSS 页面在写入时不使用大页面。仅读取它们确实会使用 2M 大页（对于 L1d 或 L2 来说太大，但适合 L3。但我们确实获得了所有 TLB 命中）。很难看到这一点，/proc/PID/smaps因为未写入的记忆根本不显示为“常驻”。（请记住，它在物理上由系统范围的共享零区域支持）。

我对基准代码进行了一些更改，以便根据命令行参数在读取或写入数组的init 传递之后多次重新运行求和循环。重复循环使其运行时间更长，因此我们可以获得更精确的计时，并分摊 init，以便我们从 perf 中获得有用的结果。

#include <vector> #include <array> #include <stdio.h> #include <chrono> #include <sys/mman.h> using namespace std; constexpr int n = 100'000'000; //vector<int> v(n); alignas(4096) array<int, n> v; //template<class T> __attribute__((noinline)) int toucharray(volatile int *vv, int write_init) { int res=vv[0]; for(int i=32 ; i<n ; i+=128) if(write_init) vv[i] = 0; else res += vv[i]; // volatile int sum = res; // noinline is fine, we don't need to stop multiple calls from CSEing return res; } template <class T> __attribute__((noinline,noclone)) int sum_container(T &vv) { unsigned int res=0; for(int x : vv) res += x; __attribute__((used)) static volatile int sink; sink = res; // a side-effect stops IPA from deciding that this is a pure function return res; } int main(int argc, char**argv) { int write_init = 0; int hugepage = 0; if (argc>1) { hugepage = argv[1][0] & 1; write_init = argv[1][0] & 2; } int repcount = 1000; if (argc>2) repcount = atoi(argv[2]); // TODO: option for no madvise. madvise(v.data(), n*sizeof(v[0]), MADV_SEQUENTIAL); madvise(v.data(), n*sizeof(v[0]), hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE); madvise(v.data(), n*sizeof(v[0]), MADV_WILLNEED); // SEQ and WILLNEED probably only matter for file-backed mappings to reduce hard page faults. // Probably not encouraging faultahead / around for lazy-allocation soft page fault toucharray(v.data(), write_init); int res = 0; auto start = chrono::steady_clock::now(); for(int i=0; i<repcount ; i++) res = sum_container(v); auto end = chrono::steady_clock::now(); double elapsed = std::chrono::duration_cast< std::chrono::duration<double, std::milli> >(end - start).count(); printf("result: %d\ntime: %f\n", res, elapsed); }
Run Code Online (Sandbox Code Playgroud)
最好的情况： clang++ -O3 -march=native (skylake) 实际上使用多个累加器展开，与 gcc -funroll-loops 不同，它的工作很愚蠢。

在我的配备 DDR4-2666 DRAM 的 Skylake i7-6700k 上，配置为 4.2GHz 最大睿频和调速器=性能 -

# using std::array<int,n> # 0&1 = 0 -> MADV_NOHUGEPAGE. 0&2 = 0 -> read-only init taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 0 1000 result: 0 time: 1961.952394 Performance counter stats for './touchpage-array-madv-nohuge-argc.clang 0 1000': 2,017.34 msec task-clock:u # 1.000 CPUs utilized 50 context-switches # 0.025 K/sec 0 cpu-migrations # 0.000 K/sec 97,774 page-faults # 0.048 M/sec 8,287,680,837 cycles # 4.108 GHz 14,500,762,859 instructions # 1.75 insn per cycle 13,688 mem_load_retired.l2_hit:u # 0.007 M/sec 12,501,329,912 mem_load_retired.l1_hit:u # 6196.927 M/sec 144,559 mem_inst_retired.stlb_miss_loads:u # 0.072 M/sec 2.017765632 seconds time elapsed 1.979410000 seconds user 0.036659000 seconds sys
Run Code Online (Sandbox Code Playgroud)
请注意相当多的 TLB 未命中（mem_inst_retired.stlb_miss_loads:u计算用户空间中的第二级 TLB 未命中）。还有 97k 页面错误。这几乎与覆盖 100M * 4 = 400MB 阵列所需的 4k 页面一样多，因此每页有 1 个错误，没有故障前/故障周围。

幸运的是，Skylake 有两个页面遍历单元，因此它可以并行执行两个推测性页面遍历。此外，所有数据访问都在 L1d 中进行，因此页表至少会在 L2 中保持热状态，从而加快页面遍历速度。

# using array # MADV_HUGEPAGE, read-only init taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 1 1000 result: 0 time: 5947.741408 Performance counter stats for './touchpage-array-argc.clang 1 1000': 5,951.40 msec task-clock:u # 1.000 CPUs utilized 9 context-switches # 0.002 K/sec 0 cpu-migrations # 0.000 K/sec 687 page-faults # 0.115 K/sec 24,377,094,416 cycles # 4.096 GHz 14,397,054,228 instructions # 0.59 insn per cycle 2,183,878,846 mem_load_retired.l2_hit:u # 366.952 M/sec 313,684,419 mem_load_retired.l1_hit:u # 52.708 M/sec 13,218 mem_inst_retired.stlb_miss_loads:u # 0.002 M/sec 5.951530513 seconds time elapsed 5.944087000 seconds user 0.003284000 seconds sys
Run Code Online (Sandbox Code Playgroud)
请注意，大约 1/10 的 TLB 未命中，但是在相同的大约 12G 内存负载中，只有 2G 在 L2 中命中，这可能要归功于成功的硬件预取。（不过其余的确实在 L3 中命中。）而且我们只有 687 个页面错误；故障处理和大页面的结合使这变得更加高效。

请注意，由于 L3 带宽的瓶颈，所花费的时间增加了 3 倍。

数组的 write-init 给了我们两全其美的结果：

# using array # MADV_HUGEPAGE (no apparent effect on BSS) and write-init taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-array-argc.clang 3 1000 result: 0 time: 16510.222762 Performance counter stats for './touchpage-array-argc.clang 3 1000': 17,143.35 msec task-clock:u # 1.000 CPUs utilized 341 context-switches # 0.020 K/sec 0 cpu-migrations # 0.000 K/sec 95,218 page-faults # 0.006 M/sec 70,475,978,274 cycles # 4.111 GHz 17,989,948,598 instructions # 0.26 insn per cycle 634,015,284 mem_load_retired.l2_hit:u # 36.983 M/sec 107,041,744 mem_load_retired.l1_hit:u # 6.244 M/sec 37,715,860 mem_inst_retired.stlb_miss_loads:u # 2.200 M/sec 17.147615898 seconds time elapsed 16.494211000 seconds user 0.625193000 seconds sys
Run Code Online (Sandbox Code Playgroud)
很多页面错误。TLB 未命中的次数也多得多。

std::vector 版本与 array 基本相同：

strace显示 madvise 不起作用，因为我没有对齐指针。glibc / libstdc++new倾向于返回一个页对齐 + 16 的指针，分配器簿记在前 16 个字节中。对于数组，我曾经alignas(4096)确保可以将其传递给 madvise。

madvise(0x7f760d133010, 400000000, MADV_HUGEPAGE) = -1 EINVAL (Invalid argument)

所以无论如何，通过我的内核调整设置，它只会尝试对 madvise 上的大页面进行内存碎片整理，而内存是相当碎片化的 ATM。所以它最终没有使用任何大页面。

taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles,instructions,mem_load_retired.l2_hit:u,mem_load_retired.l1_hit:u,mem_inst_retired.stlb_miss_loads:u ./touchpage-vector-argv.clang 3 1000 result: 0 time: 16020.821517 Performance counter stats for './touchpage-vector-argv.clang 3 1000': 16,159.19 msec task-clock:u # 1.000 CPUs utilized 17 context-switches # 0.001 K/sec 0 cpu-migrations # 0.000 K/sec 97,771 page-faults # 0.006 M/sec 66,146,780,261 cycles # 4.093 GHz 15,294,999,994 instructions # 0.23 insn per cycle 217,426,277 mem_load_retired.l2_hit:u # 13.455 M/sec 842,878,166 mem_load_retired.l1_hit:u # 52.161 M/sec 1,788,935 mem_inst_retired.stlb_miss_loads:u # 0.111 M/sec 16.160982779 seconds time elapsed 16.017206000 seconds user 0.119618000 seconds sys
Run Code Online (Sandbox Code Playgroud)
我不确定为什么 TLB 未命中率比 THP 只读测试高得多。也许通过接触更多内存来争用内存访问和/或逐出缓存页表最终会减慢页面行走速度，因此 TLB 预取无法跟上。

在约 12G 的负载中，硬件预取能够使大约 1G 的负载命中 L1d 或 L2 缓存。

Answer 2

Max*_*kin 5

差异是由于内存页array未驻留在进程地址空间中（全局作用域数组存储在.bss尚未分页的可执行文件的节中，它被零初始化）。鉴于vector刚刚已分配并填充零，因此其内存页面已经存在。

如果添加

std::fill_n(v.data(), n, 1);

Run Code Online (Sandbox Code Playgroud)

作为main将页面带入（故障前）的第一行，这使得array时间与相同vector。

在Linux上，您可以mlock(v.data(), v.size() * sizeof(v[0]));改为将页面放入地址空间。有关man mlock完整的详细信息，请参见。

@PeterCordes-这是实现的权衡。快速，低延迟，低功耗的VIPT L1高速缓存并不容易，因此AMD使用了称为“微标签”的东西。这样，您就可以仅使用V地址进行L1查找，从而将TLB移出关键路径。缺点是虚拟别名无法解析，因此您一次只能在高速缓存中拥有给定P地址的一个V别名。 (2认同)

归档时间：	6 年，3 月前
查看次数：	156 次
最近记录：	6 年，3 月前