boost::pool_allocator is significantly slower than std::allocator

Vic*_*ian 2 c++ boost boost-pool

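I am learning about memory pools and trying to use boost::pool_allocator in my project. Following the documentation, I wrote a small test to measure the time cost: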

#include <boost/pool/pool_alloc.hpp>
#include <chrono>
#include <iostream>
#include <vector>

template <typename Alloc>
void test()
{
    using namespace std::chrono;
    auto t0 = high_resolution_clock::now();
    for (int i = 0; i < 1000; ++i) {
        std::vector<int, Alloc> vec;
        for (int j = 0; j < 10000; ++j)
            vec.push_back(i + j);
    }
    auto t1 = high_resolution_clock::now();
    auto time_ms = duration<double>(t1 - t0).count() * 1e3;
    std::cout << "time cost: " << time_ms << " ms" << std::endl;
}

int main()
{
    test<std::allocator<int>>();
    test<boost::pool_allocator<int>>();
}

The result is:

time cost: 3.97602 ms
time cost: 91.3943 ms

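The Boost documentation says:

Pools are generally used when there is a lot of allocation and deallocation of small objects.

So I expected boost::pool_allocator to take less time than std::allocator in the code above, but the test shows that it is much worse.

Am I using boost::pool_allocator wrong? In what situations can I get a speedup from a memory pool (or just Boost pool/pool_allocator)?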

seh*_*ehe 7

For reference:

template <typename T,
    typename UserAllocator = default_user_allocator_new_delete,
    typename Mutex = details::pool::default_mutex,
    unsigned NextSize = 32,
    unsigned MaxSize = 0>
class pool_allocator;

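I figured the locking might be to blame. Also, there might be better hints (the NextSize and MaxSize parameters).

Let's test! Live On Compiler Explorer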

#include <boost/core/demangle.hpp>
#include <boost/pool/pool_alloc.hpp>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <typeinfo> // typeid
#include <utility>  // std::exchange
#include <vector>

using namespace std::chrono_literals;
auto static now = std::chrono::high_resolution_clock::now;

template <typename Alloc> void test(int run, Alloc alloc = {}) {
    auto load = [=](bool RESERVE, unsigned ITERATIONS = 1'000, unsigned SIZE = 10'000) {
        for (unsigned i = 0; i < ITERATIONS; ++i) {
            std::vector<int, Alloc> vec(alloc);
            if (RESERVE)
                vec.reserve(SIZE);
            for (unsigned j = 0; j < SIZE; ++j)
                vec.push_back(i + j);
        }
    };

    auto lap_time = [t0 = now()]() mutable {
        return now() - std::exchange(t0, now());
    };

    load(false); auto without_reserve = lap_time() / 1.0ms;
    load(true);  auto with_reserve    = lap_time() / 1.0ms;

    std::cout << "run " << run                                             //
              << " naive:    " << std::setw(7) << without_reserve << "ms"  //
              << " reserved: " << std::setw(7) << with_reserve    << "ms"  //
              << "(" << boost::core::demangle(typeid(Alloc).name()) << ")" //
              << std::endl;
}

void run_tests(int run) {
    test<std::allocator<int>>(run);

    using NullMx    = boost::details::pool::null_mutex;
    using Mx        = boost::details::pool::default_mutex;
    using Malloc    = boost::default_user_allocator_malloc_free;
    using NewDelete = boost::default_user_allocator_new_delete;

    // 
    // no hints
    //
    test<boost::pool_allocator<int, Malloc,    NullMx>>(run);
    test<boost::pool_allocator<int, NewDelete, NullMx>>(run);
    test<boost::pool_allocator<int, Malloc,    Mx>>(run);
    test<boost::pool_allocator<int, NewDelete, Mx>>(run);

    //
    // hinted
    //
    test<boost::pool_allocator<int, Malloc,    NullMx, 1'000, 0>>(run);
    test<boost::pool_allocator<int, NewDelete, NullMx, 1'000, 0>>(run);
    test<boost::pool_allocator<int, Malloc,    Mx,     1'000, 0>>(run);
    test<boost::pool_allocator<int, NewDelete, Mx,     1'000, 0>>(run);
}

int main()
{
    std::cout << std::fixed << std::setprecision(3);

    for (int run : {1,2,3}) {
        auto t0 = now();
        run_tests(run);
        std::cout << " -- Done (" << (now() - t0) / 1.ms << "ms)" << std::endl;
    }
}

Compiler Explorer shows some really inconsistent spikes; my own machine does not:

run 1 naive:      8.025ms reserved:   5.412ms(std::allocator<int>)
run 1 naive:     92.212ms reserved:  31.166ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:     93.466ms reserved:  29.901ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:     92.488ms reserved:  29.883ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 1 naive:     92.450ms reserved:  29.824ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 1 naive:     82.879ms reserved:  27.478ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:     82.775ms reserved:  28.187ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:     83.189ms reserved:  27.404ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 1 naive:     83.159ms reserved:  27.468ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (947.595ms)
run 2 naive:      8.007ms reserved:   5.543ms(std::allocator<int>)
run 2 naive:     92.225ms reserved:  29.882ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:     92.311ms reserved:  29.805ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:     92.601ms reserved:  29.873ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 2 naive:     92.421ms reserved:  30.028ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 2 naive:     83.028ms reserved:  27.493ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:     82.822ms reserved:  27.427ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:     83.230ms reserved:  27.493ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 2 naive:     83.104ms reserved:  27.466ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (944.958ms)
run 3 naive:      8.068ms reserved:   5.422ms(std::allocator<int>)
run 3 naive:     92.282ms reserved:  29.880ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:     92.064ms reserved:  29.960ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:     92.339ms reserved:  29.928ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 3 naive:     92.977ms reserved:  29.890ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 3 naive:     82.906ms reserved:  27.388ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:     82.784ms reserved:  27.585ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:     83.157ms reserved:  28.233ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 3 naive:     83.098ms reserved:  27.466ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (945.629ms)

Clearly, though, it is always slower. Let's profile.

Profiler

Comparing the standard allocator with boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>:

  • The standard allocator results in 48k calls to new/delete, which lead to just as many underlying calls to malloc/free:

    [image] [image]

  • The pool allocator shows greatly reduced numbers:

    [image] [image]

    And for malloc/free:

    [image] [image]
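
The call counts can be sanity-checked without a profiler. Here is a minimal sketch of a counting allocator that forwards to std::allocator; CountingAlloc, g_allocs and allocations_for are made-up names for illustration, not part of the benchmark above. If the standard library doubles the vector's capacity on growth (an assumption about the implementation in use), filling 10,000 ints takes 15 allocations without reserve and 1 with it, and (15 + 1) × 1,000 iterations × 3 runs = 48,000, which lines up with the 48k figure.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical helper for illustration: counts every allocate() call.
static std::size_t g_allocs = 0;

template <typename T>
struct CountingAlloc {
    using value_type = T;

    CountingAlloc() = default;
    template <typename U>
    CountingAlloc(const CountingAlloc<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++g_allocs;                              // one more request for n contiguous Ts
        return std::allocator<T>{}.allocate(n);  // forward to the standard allocator
    }
    void deallocate(T* p, std::size_t n) noexcept {
        std::allocator<T>{}.deallocate(p, n);
    }
};

// All CountingAlloc instances are interchangeable (stateless).
template <typename T, typename U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) noexcept { return false; }

// Fill one vector the way the benchmark's inner loop does and
// return how many allocations that took.
std::size_t allocations_for(std::size_t size, bool reserve) {
    g_allocs = 0;
    std::vector<int, CountingAlloc<int>> vec;
    if (reserve)
        vec.reserve(size);
    for (std::size_t j = 0; j < size; ++j)
        vec.push_back(static_cast<int>(j));
    return g_allocs;
}
```

Calling allocations_for(10000, false) and allocations_for(10000, true) gives the per-vector counts. The point is that the vector makes only a handful of requests, each for a large contiguous range, which is not the "many small objects" pattern the documentation describes.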

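So far, so good! So what is taking so much time?

[image]

unordered_malloc has a large number of lines inlined into it from various Boost Pool headers. The worst offenders are inlined from boost/pool/simple_segregated_storage.hpp (the second column is the cost as a percentage of the parent):

[image]

Those lines are in try_malloc_n: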

template <typename SizeType>
void * simple_segregated_storage<SizeType>::try_malloc_n(
    void * & start, size_type n, const size_type partition_size)
{
  void * iter = nextof(start);
  while (--n != 0)
  {
    void * next = nextof(iter);
    if (next != static_cast<char *>(iter) + partition_size)
    {
      // next == 0 (end-of-list) or non-contiguous chunk found
      start = iter;
      return 0;
    }
    iter = next;
  }
  return iter;
}

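It describes itself as follows:

The function attempts to find n contiguous chunks of size partition_size in the free list, starting at start. If it succeeds, it returns the last chunk in that contiguous sequence, so that the sequence is known by [start, {retval}]. If it fails, it does so either because it is at the end of the free list or because it hit a non-contiguous chunk. In either case, it returns 0 and sets start to the last considered chunk. You are at the end of the free list if nextof(start) == 0. Otherwise, start points to the last chunk in the contiguous sequence, and nextof(start) points to the first chunk in the next contiguous sequence (assuming an ordered free list).

So chasing free chunks across the segregated heap really is too costly in this scenario. A little napkin calculation shows that try_malloc_n accounts for 99.75% of the higher-level unordered_malloc call we saw earlier.

[image]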
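
To make that cost concrete, here is a toy model of the contiguous-run search. It is not Boost's code: ToyFreeList, mark_used and find_run are made-up names, chunks are modelled as array indices rather than pointers, and the model assumes an n-element request is served as n adjacent chunks, as the try_malloc_n description suggests. It only counts how many free-list links a request has to follow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marks the end of the toy free list.
constexpr std::size_t kEnd = static_cast<std::size_t>(-1);

// Toy model, NOT Boost code: chunk i is "adjacent" to chunk i + 1.
struct ToyFreeList {
    std::vector<std::size_t> next;  // next[i]: free chunk that follows i, or kEnd
    std::size_t first = kEnd;       // head of the free list
    std::size_t hops = 0;           // links followed by find_run so far

    // Start with every chunk free and linked in address order.
    explicit ToyFreeList(std::size_t chunks) : next(chunks, kEnd) {
        for (std::size_t i = 0; i + 1 < chunks; ++i)
            next[i] = i + 1;
        if (chunks != 0)
            first = 0;
    }

    // Unlink chunk i (linear scan; only used to fragment the list for the demo).
    void mark_used(std::size_t i) {
        if (first == i) { first = next[i]; next[i] = kEnd; return; }
        for (std::size_t p = first; p != kEnd; p = next[p])
            if (next[p] == i) { next[p] = next[i]; next[i] = kEnd; return; }
    }

    // Find n adjacent free chunks; returns the index of the first one, or kEnd.
    std::size_t find_run(std::size_t n) {
        if (n == 0)
            return kEnd;
        std::size_t start = first;
        while (start != kEnd) {
            std::size_t iter = start, len = 1;
            // Extend the run while the next free chunk is the adjacent one.
            while (len < n && next[iter] == iter + 1) {
                iter = next[iter];
                ++len;
                ++hops;
            }
            if (len == n)
                return start;
            start = next[iter];  // run broken: restart after the gap
            ++hops;
        }
        return kEnd;
    }
};
```

In this model a request for 10,000 chunks on an unfragmented list follows 9,999 links before it can succeed, and fragmentation adds restarts on top of that. A general-purpose malloc does not need to touch every element-sized slot of a block it hands out, which is one plausible reading of why the pool loses here.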

SHOCK: Alternative Implementation?

During my investigation I found a number of defines that can be used to get more insight, for example:

#define NDEBUG
//#define BOOST_POOL_INSTRUMENT 1
//#define BOOST_POOL_VALIDATE 1
//#define BOOST_POOL_VALGRIND 1

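Now, I used VALIDATE/INSTRUMENT and observed the expected effects (very verbose output and a slight performance degradation).

From reading the name and the code, I expected BOOST_POOL_VALGRIND to degrade performance similarly (after all, it should probably do extra work to avoid false-positive memory errors when running under Valgrind, right?). To my surprise, defining it makes the whole thing run lightning fast: Live On Compiler Explorer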

run 1 naive:      8.166ms reserved:   5.267ms(std::allocator<int>)
run 1 naive:      9.713ms reserved:   5.267ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:      8.853ms reserved:   5.226ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 1 naive:      8.990ms reserved:   5.282ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 1 naive:      8.899ms reserved:   5.246ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 1 naive:      8.620ms reserved:   5.237ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:      8.622ms reserved:   5.247ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 1 naive:      8.963ms reserved:   5.257ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 1 naive:      8.990ms reserved:   5.271ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (127.276ms)
run 2 naive:      7.965ms reserved:   5.208ms(std::allocator<int>)
run 2 naive:      8.503ms reserved:   5.236ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:      8.809ms reserved:   5.254ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 2 naive:      8.954ms reserved:   5.278ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 2 naive:      8.878ms reserved:   5.279ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 2 naive:      8.694ms reserved:   5.243ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:      8.661ms reserved:   5.249ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 2 naive:      8.920ms reserved:   5.248ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 2 naive:      8.952ms reserved:   5.261ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)
 -- Done (125.680ms)
run 3 naive:      7.949ms reserved:   5.221ms(std::allocator<int>)
run 3 naive:      8.498ms reserved:   5.238ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:      8.813ms reserved:   5.230ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 32u, 0u>)
run 3 naive:      9.033ms reserved:   5.279ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 32u, 0u>)
run 3 naive:      8.909ms reserved:   5.252ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 32u, 0u>)
run 3 naive:      8.605ms reserved:   5.244ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:      8.623ms reserved:   5.246ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, boost::details::pool::null_mutex, 1000u, 0u>)
run 3 naive:      8.918ms reserved:   5.247ms(boost::pool_allocator<int, boost::default_user_allocator_malloc_free, std::mutex, 1000u, 0u>)
run 3 naive:      8.969ms reserved:   5.268ms(boost::pool_allocator<int, boost::default_user_allocator_new_delete, std::mutex, 1000u, 0u>)

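Sadly, looking at the details confirms that it cheats by delegating directly to the standard library (while adding some overhead in the form of free_list/used_list address sets).

[image]

Summary

Yes, the standard pool/simple_segregated_storage implementation performs badly under this load. I cannot say for sure whether this is actually a bug, but going by the documentation (which you also quoted), it certainly looks like one.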
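
As for when a pool can pay off: the expensive part above is finding many adjacent chunks. When every request is for exactly one chunk, as with the nodes of a node-based container, a free list reduces to a constant-time pop and push. A minimal sketch of that case follows; it is a toy, not Boost's implementation, and ToyPool is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy fixed-size pool, NOT Boost code: hands out one chunk per call.
template <std::size_t ChunkSize>
class ToyPool {
    union Chunk {
        Chunk* next;  // meaningful only while the chunk is on the free list
        alignas(std::max_align_t) unsigned char bytes[ChunkSize];  // user storage
    };

    std::vector<Chunk> slab_;  // one up-front allocation holding every chunk
    Chunk* free_ = nullptr;    // head of the free list

public:
    explicit ToyPool(std::size_t chunks) : slab_(chunks) {
        // Thread the free list back to front so the first chunk is handed out first.
        for (std::size_t i = chunks; i-- > 0;) {
            slab_[i].next = free_;
            free_ = &slab_[i];
        }
    }
    ToyPool(const ToyPool&) = delete;
    ToyPool& operator=(const ToyPool&) = delete;

    // O(1): pop the head of the free list; nullptr when the pool is exhausted.
    void* allocate() {
        if (!free_)
            return nullptr;
        Chunk* c = free_;
        free_ = c->next;
        return c;
    }

    // O(1): push the chunk back onto the free list.
    void deallocate(void* p) {
        Chunk* c = static_cast<Chunk*>(p);
        c->next = free_;
        free_ = c;
    }
};
```

Neither operation walks the list or touches the system allocator after construction, so this is the access pattern where a pool has a chance to beat a general-purpose allocator. Whether it actually does on a given platform still has to be measured.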