为什么push_back比先前分配的向量的operator []慢

xis*_*xis 34 c++ stl vector c++11

我刚刚阅读了这篇博客http://lemire.me/blog/archives/2012/06/20/do-not-waste-time-with-stl-vectors/,比较了operator[]作业的性能和push_back预留的内存std::vector,我决定亲自尝试一下 操作很简单:

// for vector
bigarray.reserve(N);

// START TIME TRACK
for(int k = 0; k < N; ++k)
   // for operator[]:
   // bigarray[k] = k;
   // for push_back
   bigarray.push_back(k);
// END TIME TRACK

// do some dummy operations to prevent compiler optimize
long sum = accumulate(begin(bigarray), end(array),0 0);
Run Code Online (Sandbox Code Playgroud)

这是结果:

 ~/t/benchmark> icc 1.cpp -O3 -std=c++11
 ~/t/benchmark> ./a.out
[               1.cpp:   52]     0.789123s  --> C++ new
[               1.cpp:   52]     0.774049s  --> C++ new
[               1.cpp:   66]     0.351176s  --> vector
[               1.cpp:   80]     1.801294s  --> reserve + push_back
[               1.cpp:   94]     1.753786s  --> reserve + emplace_back
[               1.cpp:  107]     2.815756s  --> no reserve + push_back
 ~/t/benchmark> clang++ 1.cpp -std=c++11 -O3
 ~/t/benchmark> ./a.out
[               1.cpp:   52]     0.592318s  --> C++ new
[               1.cpp:   52]     0.566979s  --> C++ new
[               1.cpp:   66]     0.270363s  --> vector
[               1.cpp:   80]     1.763784s  --> reserve + push_back
[               1.cpp:   94]     1.761879s  --> reserve + emplace_back
[               1.cpp:  107]     2.815596s  --> no reserve + push_back
 ~/t/benchmark> g++ 1.cpp -O3 -std=c++11
 ~/t/benchmark> ./a.out
[               1.cpp:   52]     0.617995s  --> C++ new
[               1.cpp:   52]     0.601746s  --> C++ new
[               1.cpp:   66]     0.270533s  --> vector
[               1.cpp:   80]     1.766538s  --> reserve + push_back
[               1.cpp:   94]     1.998792s  --> reserve + emplace_back
[               1.cpp:  107]     2.815617s  --> no reserve + push_back
Run Code Online (Sandbox Code Playgroud)

对于所有编译器,vectorwith operator[]比原始指针快得多operator[].这导致了第一个问题:为什么?第二个问题是,我已经"保留"了内存,为什么opeator[]更快?

接下来的问题是,由于内存已经分配,​​为什么push_back时间慢于operator[]

测试代码如下:

#include <iostream>
#include <iomanip>
#include <vector>
#include <numeric>
#include <chrono>
#include <string>
#include <cstring>

#define PROFILE(BLOCK, ROUTNAME) ProfilerRun([&](){do {BLOCK;} while(0);}, \
        ROUTNAME, __FILE__, __LINE__);

template <typename T>
void ProfilerRun (T&&  func, const std::string& routine_name = "unknown",
                  const char* file = "unknown", unsigned line = 0)
{
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    using std::chrono::steady_clock;
    using std::cerr;
    using std::endl;

    steady_clock::time_point t_begin = steady_clock::now();

    // Call the function
    func();

    steady_clock::time_point t_end = steady_clock::now();
    cerr << "[" << std::setw (20)
         << (std::strrchr (file, '/') ?
             std::strrchr (file, '/') + 1 : file)
         << ":" << std::setw (5) << line << "]   "
         << std::setw (10) << std::setprecision (6) << std::fixed
         << static_cast<float> (duration_cast<microseconds>
                                (t_end - t_begin).count()) / 1e6
         << "s  --> " << routine_name << endl;

    cerr.unsetf (std::ios_base::floatfield);
}

using namespace std;

const int N = (1 << 29);

int routine1()
{
    int sum;
    int* bigarray = new int[N];
    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray[k] = k;
    }, "C++ new");
    sum = std::accumulate (bigarray, bigarray + N, 0);
    delete [] bigarray;
    return sum;
}

int routine2()
{
    int sum;
    vector<int> bigarray (N);
    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray[k] = k;
    }, "vector");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0);
    return sum;
}

int routine3()
{
    int sum;
    vector<int> bigarray;
    bigarray.reserve (N);
    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray.push_back (k);
    }, "reserve + push_back");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0);
    return sum;
}

int routine4()
{
    int sum;
    vector<int> bigarray;
    bigarray.reserve (N);
    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray.emplace_back(k);
    }, "reserve + emplace_back");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0);
    return sum;
}

int routine5()
{
    int sum;
    vector<int> bigarray;
    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray.push_back (k);
    }, "no reserve + push_back");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0);
    return sum;
}


int main()
{
    long s0 = routine1();
    long s1 = routine1();
    long s2 = routine2();
    long s3 = routine3();
    long s4 = routine4();
    long s5 = routine5();
    return int (s1 + s2);
}
Run Code Online (Sandbox Code Playgroud)

Zac*_*and 44

push_back做边界检查. operator[]才不是.因此,即使您已预留空间,push_back也将进行额外的条件检查operator[].此外,它将增加size值(保留仅设置capacity),因此它将每次更新.

简而言之,push_back正在做的不仅仅是operator[]做什么 - 这就是为什么它更慢(更准确).


dyp*_*dyp 28

正如Yakk和我发现的那样,可能还有另一个有趣的因素导致了明显的缓慢push_back.

第一个有趣的现象是,在原有的测试,使用new和原始阵列上运行的速度较慢比使用vector<int> bigarray(N);operator[]-超过2倍.更有意思的是,你可以通过插入得到两个相同的性能额外 memset的原始数组变体:

int routine1_modified()
{
    int sum;
    int* bigarray = new int[N];

    memset(bigarray, 0, sizeof(int)*N);

    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray[k] = k;
    }, "C++ new");
    sum = std::accumulate (bigarray, bigarray + N, 0);
    delete [] bigarray;
    return sum;
}
Run Code Online (Sandbox Code Playgroud)

当然,结论是PROFILE衡量与预期不同的东西.Yakk和我猜它与内存管理有关; 从Yakk对OP的评论:

resize将触及整个内存块.reserve会毫不动摇地分配.如果你有一个懒惰的分配器,在访问之前不会获取或分配物理内存页面,那么reserve在空向量上几乎是免费的(甚至不必为页面找到物理内存!),直到你写入页面为止(在这一点,他们必须找到).

我想到了类似的东西,所以通过用"strided memset"(一个分析工具可能得到更可靠的结果)触摸某些页面,尝试对这个假设进行小测试:

int routine1_modified2()
{
    int sum;
    int* bigarray = new int[N];

    for(int k = 0; k < N; k += PAGESIZE*2/sizeof(int))
        bigarray[k] = 0;

    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray[k] = k;
    }, "C++ new");
    sum = std::accumulate (bigarray, bigarray + N, 0);
    delete [] bigarray;
    return sum;
}
Run Code Online (Sandbox Code Playgroud)

通过将每一页的步幅从每一页改为每四页完全离开,我们可以很好地将时间从vector<int> bigarray(N);案例转换到new int[N]没有memset使用过的情况.

在我看来,这是一个强烈暗示,内存管理是测量结果的主要贡献者.


另一个问题是分支push_back.据称可以在很多答案,这是一个/主要的原因push_back很多比使用较慢operator[].实际上,将原始指针w/o memset与使用reserve+进行比较push_back,前者的速度要快两倍.

同样,如果我们添加一些UB(但稍后检查结果):

int routine3_modified()
{
    int sum;
    vector<int> bigarray;
    bigarray.reserve (N);

    memset(bigarray.data(), 0, sizeof(int)*N); // technically, it's UB

    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
            bigarray.push_back (k);
    }, "reserve + push_back");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0);
    return sum;
}
Run Code Online (Sandbox Code Playgroud)

此修改版本比使用new+完整版慢约2倍memset.因此,无论调用是什么push_back,2与仅设置元素(operator[]vector原始数组和原始数组中)相比,它都会导致因子减速.

但它是否需要分支push_back或附加操作?

// pseudo-code
void push_back(T const& p)
{
    if(size() == capacity())
    {
        resize( size() < 10 ? 10 : size()*2 );
    }

    (*this)[size()] = p; // actually using the allocator
    ++m_end;
}
Run Code Online (Sandbox Code Playgroud)

确实很简单,比如libstdc ++的实现.

我已经使用vector<int> bigarray(N);+ operator[]变体测试了它,并插入了一个模仿以下行为的函数调用push_back:

unsigned x = 0;
void silly_branch(int k)
{
    if(k == x)
    {
        x = x < 10 ? 10 : x*2;
    }
}

int routine2_modified()
{
    int sum;
    vector<int> bigarray (N);
    PROFILE (
    {
        for (unsigned int k = 0; k < N; ++k)
        {
            silly_branch(k);
            bigarray[k] = k;
        }
    }, "vector");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0);
    return sum;
}
Run Code Online (Sandbox Code Playgroud)

即使声明x为易失性,这只会对测量产生1%的影响.当然,你必须验证分支实际上是在操作码中,但我的汇编程序知识不允许我验证(at -O3).

现在有趣的一点是当我添加一个增量时会发生什么silly_branch:

unsigned x = 0;
void silly_branch(int k)
{
    if(k == x)
    {
        x = x < 10 ? 10 : x*2;
    }
    ++x;
}
Run Code Online (Sandbox Code Playgroud)

现在,修改后的routine2_modified运行速度比原来慢2倍routine2,与routine3_modified上面提出的包括UB提交内存页面相同.我没有发现这特别令人惊讶,因为它为循环中的每个写入添加了另一个写入,因此我们有两倍的工作和两倍的持续时间.


结论

那么你必须仔细查看汇编和分析工具来验证内存管理的假设,而额外的写入是一个很好的假设("正确").但我认为这些提示足够强大,可以说有一些更复杂的事情,而不仅仅是一个push_back更慢的分支.

这是完整的测试代码:

#include <iostream>
#include <iomanip>
#include <vector>
#include <numeric>
#include <chrono>
#include <string>
#include <cstring>

#define PROFILE(BLOCK, ROUTNAME) ProfilerRun([&](){do {BLOCK;} while(0);}, \
        ROUTNAME, __FILE__, __LINE__);
//#define PROFILE(BLOCK, ROUTNAME) BLOCK

template <typename T>
void ProfilerRun (T&&  func, const std::string& routine_name = "unknown",
                  const char* file = "unknown", unsigned line = 0)
{
    using std::chrono::duration_cast;
    using std::chrono::microseconds;
    using std::chrono::steady_clock;
    using std::cerr;
    using std::endl;

    steady_clock::time_point t_begin = steady_clock::now();

    // Call the function
    func();

    steady_clock::time_point t_end = steady_clock::now();
    cerr << "[" << std::setw (20)
         << (std::strrchr (file, '/') ?
             std::strrchr (file, '/') + 1 : file)
         << ":" << std::setw (5) << line << "]   "
         << std::setw (10) << std::setprecision (6) << std::fixed
         << static_cast<float> (duration_cast<microseconds>
                                (t_end - t_begin).count()) / 1e6
         << "s  --> " << routine_name << endl;

    cerr.unsetf (std::ios_base::floatfield);
}

using namespace std;

constexpr int N = (1 << 28);
constexpr int PAGESIZE = 4096;

uint64_t __attribute__((noinline)) routine1()
{
    uint64_t sum;
    int* bigarray = new int[N];
    PROFILE (
    {
        for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
            *p = k;
    }, "new (routine1)");
    sum = std::accumulate (bigarray, bigarray + N, 0ULL);
    delete [] bigarray;
    return sum;
}

uint64_t __attribute__((noinline)) routine2()
{
    uint64_t sum;
    int* bigarray = new int[N];

    memset(bigarray, 0, sizeof(int)*N);

    PROFILE (
    {
        for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
            *p = k;
    }, "new + full memset (routine2)");
    sum = std::accumulate (bigarray, bigarray + N, 0ULL);
    delete [] bigarray;
    return sum;
}

uint64_t __attribute__((noinline)) routine3()
{
    uint64_t sum;
    int* bigarray = new int[N];

    for(int k = 0; k < N; k += PAGESIZE/2/sizeof(int))
        bigarray[k] = 0;

    PROFILE (
    {
        for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
            *p = k;
    }, "new + strided memset (every page half) (routine3)");
    sum = std::accumulate (bigarray, bigarray + N, 0ULL);
    delete [] bigarray;
    return sum;
}

uint64_t __attribute__((noinline)) routine4()
{
    uint64_t sum;
    int* bigarray = new int[N];

    for(int k = 0; k < N; k += PAGESIZE/1/sizeof(int))
        bigarray[k] = 0;

    PROFILE (
    {
        for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
            *p = k;
    }, "new + strided memset (every page) (routine4)");
    sum = std::accumulate (bigarray, bigarray + N, 0ULL);
    delete [] bigarray;
    return sum;
}

uint64_t __attribute__((noinline)) routine5()
{
    uint64_t sum;
    int* bigarray = new int[N];

    for(int k = 0; k < N; k += PAGESIZE*2/sizeof(int))
        bigarray[k] = 0;

    PROFILE (
    {
        for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
            *p = k;
    }, "new + strided memset (every other page) (routine5)");
    sum = std::accumulate (bigarray, bigarray + N, 0ULL);
    delete [] bigarray;
    return sum;
}

uint64_t __attribute__((noinline)) routine6()
{
    uint64_t sum;
    int* bigarray = new int[N];

    for(int k = 0; k < N; k += PAGESIZE*4/sizeof(int))
        bigarray[k] = 0;

    PROFILE (
    {
        for (int k = 0, *p = bigarray; p != bigarray+N; ++p, ++k)
            *p = k;
    }, "new + strided memset (every 4th page) (routine6)");
    sum = std::accumulate (bigarray, bigarray + N, 0ULL);
    delete [] bigarray;
    return sum;
}

uint64_t __attribute__((noinline)) routine7()
{
    uint64_t sum;
    vector<int> bigarray (N);
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray[k] = k;
    }, "vector, using ctor to initialize (routine7)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine8()
{
    uint64_t sum;
    vector<int> bigarray;
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray.push_back (k);
    }, "vector (+ no reserve) + push_back (routine8)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine9()
{
    uint64_t sum;
    vector<int> bigarray;
    bigarray.reserve (N);
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray.push_back (k);
    }, "vector + reserve + push_back (routine9)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine10()
{
    uint64_t sum;
    vector<int> bigarray;
    bigarray.reserve (N);
    memset(bigarray.data(), 0, sizeof(int)*N);
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray.push_back (k);
    }, "vector + reserve + memset (UB) + push_back (routine10)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

template<class T>
void __attribute__((noinline)) adjust_size(std::vector<T>& v, int k, double factor)
{
    if(k >= v.size())
    {
        v.resize(v.size() < 10 ? 10 : k*factor);
    }
}

uint64_t __attribute__((noinline)) routine11()
{
    uint64_t sum;
    vector<int> bigarray;
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
        {
            adjust_size(bigarray, k, 1.5);
            bigarray[k] = k;
        }
    }, "vector + custom emplace_back @ factor 1.5 (routine11)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine12()
{
    uint64_t sum;
    vector<int> bigarray;
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
        {
            adjust_size(bigarray, k, 2);
            bigarray[k] = k;
        }
    }, "vector + custom emplace_back @ factor 2 (routine12)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine13()
{
    uint64_t sum;
    vector<int> bigarray;
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
        {
            adjust_size(bigarray, k, 3);
            bigarray[k] = k;
        }
    }, "vector + custom emplace_back @ factor 3 (routine13)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine14()
{
    uint64_t sum;
    vector<int> bigarray;
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray.emplace_back (k);
    }, "vector (+ no reserve) + emplace_back (routine14)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine15()
{
    uint64_t sum;
    vector<int> bigarray;
    bigarray.reserve (N);
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray.emplace_back (k);
    }, "vector + reserve + emplace_back (routine15)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

uint64_t __attribute__((noinline)) routine16()
{
    uint64_t sum;
    vector<int> bigarray;
    bigarray.reserve (N);
    memset(bigarray.data(), 0, sizeof(bigarray[0])*N);
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
            bigarray.emplace_back (k);
    }, "vector + reserve + memset (UB) + emplace_back (routine16)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

unsigned x = 0;
template<class T>
void /*__attribute__((noinline))*/ silly_branch(std::vector<T>& v, int k)
{
    if(k == x)
    {
        x = x < 10 ? 10 : x*2;
    }
    //++x;
}

uint64_t __attribute__((noinline)) routine17()
{
    uint64_t sum;
    vector<int> bigarray(N);
    PROFILE (
    {
        for (int k = 0; k < N; ++k)
        {
            silly_branch(bigarray, k);
            bigarray[k] = k;
        }
    }, "vector, using ctor to initialize + silly branch (routine17)");
    sum = std::accumulate (begin (bigarray), end (bigarray), 0ULL);
    return sum;
}

template<class T, int N>
constexpr int get_extent(T(&)[N])
{  return N;  }

int main()
{
    uint64_t results[] = {routine2(),
    routine1(),
    routine2(),
    routine3(),
    routine4(),
    routine5(),
    routine6(),
    routine7(),
    routine8(),
    routine9(),
    routine10(),
    routine11(),
    routine12(),
    routine13(),
    routine14(),
    routine15(),
    routine16(),
    routine17()};

    std::cout << std::boolalpha;
    for(int i = 1; i < get_extent(results); ++i)
    {
        std::cout << i << ": " << (results[0] == results[i]) << "\n";
    }
    std::cout << x << "\n";
}
Run Code Online (Sandbox Code Playgroud)

在旧计算机和慢速计算机上运行示例; 注意:

  • N == 2<<28,而不是2<<29OP
  • 用g ++ 4.9 20131022编译 -std=c++11 -O3 -march=native
[            temp.cpp:   71]     0.654927s  --> new + full memset (routine2)
[            temp.cpp:   54]     1.042405s  --> new (routine1)
[            temp.cpp:   71]     0.605061s  --> new + full memset (routine2)
[            temp.cpp:   89]     0.597487s  --> new + strided memset (every page half) (routine3)
[            temp.cpp:  107]     0.601271s  --> new + strided memset (every page) (routine4)
[            temp.cpp:  125]     0.783610s  --> new + strided memset (every other page) (routine5)
[            temp.cpp:  143]     0.903038s  --> new + strided memset (every 4th page) (routine6)
[            temp.cpp:  157]     0.602401s  --> vector, using ctor to initialize (routine7)
[            temp.cpp:  170]     3.811291s  --> vector (+ no reserve) + push_back (routine8)
[            temp.cpp:  184]     2.091391s  --> vector + reserve + push_back (routine9)
[            temp.cpp:  199]     1.375837s  --> vector + reserve + memset (UB) + push_back (routine10)
[            temp.cpp:  224]     8.738293s  --> vector + custom emplace_back @ factor 1.5 (routine11)
[            temp.cpp:  240]     5.513803s  --> vector + custom emplace_back @ factor 2 (routine12)
[            temp.cpp:  256]     5.150388s  --> vector + custom emplace_back @ factor 3 (routine13)
[            temp.cpp:  269]     3.789820s  --> vector (+ no reserve) + emplace_back (routine14)
[            temp.cpp:  283]     2.090259s  --> vector + reserve + emplace_back (routine15)
[            temp.cpp:  298]     1.288740s  --> vector + reserve + memset (UB) + emplace_back (routine16)
[            temp.cpp:  325]     0.611168s  --> vector, using ctor to initialize + silly branch (routine17)
1: true
2: true
3: true
4: true
5: true
6: true
7: true
8: true
9: true
10: true
11: true
12: true
13: true
14: true
15: true
16: true
17: true
335544320

  • 令人难以置信的是,这不是公认的答案 (5认同)

Die*_*ühl 8

在构造函数中分配数组时,编译器/库基本上可以memset()是原始填充,然后只需设置每个单独的值.使用时push_back(),std::vector<T>课程需要:

  1. 检查是否有足够的空间.
  2. 将结束指针更改为新位置.
  3. 设置实际值.

最后一步是在一次性分配内存时唯一需要做的事情.