为什么Range-v3在这个例子中比STL慢?

fas*_*eep 29 c++ performance range-v3

我正在玩Range-v3库以表现出一种荣耀,find_if并且好奇为什么谷歌基准测试一直将我的Range-v3代码排序比我的std::find_if方法更差.g ++和clang都用-O3和表示相同的模式#define NDEBUG.

我想到的具体例子是以下使用STL:

std::vector<int> lengths(large_number, random_number);

auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
auto accumulated_length = 0l;
auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
                              accumulated_length += val;
                              return to_find < accumulated_length;
                          });
auto found_index = std::distance(lengths.begin(), found);    
Run Code Online (Sandbox Code Playgroud)

出于此说明的目的,这有点做作,但通常会有一个随机生成器用于向量中的to_find变量和随机值lengths.

使用Range-v3库,我得到以下代码

using namespace ranges;    

std::vector<int> lengths(large_number, random_number);

auto const to_find = accumulate(lengths, 0l) / 2;

auto found_index = distance(lengths | view::partial_sum()
                                    | view::take_while([=](auto const i) {
                                         return i < to_find;
                                      }));
Run Code Online (Sandbox Code Playgroud)

我的问题是为什么Range-v3比STL实现慢.我知道这仍然是一个实验性的库,但是代码示例可能有问题,或者我是否滥用了范围概念?

编辑

google-bench驱动程序示例(不确定是否正确)

#define NDEBUG

#include <numeric>
#include <vector>

#include <benchmark/benchmark.h>

#include <range/v3/all.hpp>

static void stl_search(benchmark::State &state) {

    using namespace ranges;

    std::vector<long> lengths(state.range(0), 1l);

    auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;

    while (state.KeepRunning()) {

        auto accumulated_length = 0l;
        auto const found = std::find_if(lengths.begin(), lengths.end(), [&](auto const& val) {
                               accumulated_length += val;
                               return to_find < accumulated_length;
                           });
        volatile long val = std::distance(lengths.begin(), found);
    }
    state.SetBytesProcessed(int64_t(state.iterations()) *
                            int64_t(state.range(0)) * sizeof(long));
}

static void ranges_search(benchmark::State &state) {

    using namespace ranges;

    std::vector<long> lengths(state.range(0), 1l);

    auto const to_find = accumulate(lengths, 0l) / 2;

    while (state.KeepRunning())
    {
        volatile long val = distance(lengths | view::partial_sum()
                                             | view::take_while([=](auto const& i) {
                                                   return i <= to_find;
                                               }));
    }
    state.SetBytesProcessed(int64_t(state.iterations()) *
                            int64_t(state.range(0)) * sizeof(long));
}

BENCHMARK(ranges_search)->Range(8 << 8, 8 << 16);
BENCHMARK(stl_search)->Range(8 << 8, 8 << 16);

BENCHMARK_MAIN();
Run Code Online (Sandbox Code Playgroud)

ranges_search/2048          756 ns        756 ns     902091   20.1892GB/s
ranges_search/4096         1495 ns       1494 ns     466681   20.4285GB/s
ranges_search/32768       11872 ns      11863 ns      58902   20.5801GB/s
ranges_search/262144      94982 ns      94892 ns       7364   20.5825GB/s
ranges_search/524288     189870 ns     189691 ns       3688   20.5927GB/s
stl_search/2048             348 ns        348 ns    2000964   43.8336GB/s
stl_search/4096             690 ns        689 ns    1008295   44.2751GB/s
stl_search/32768           5497 ns       5492 ns     126097    44.452GB/s
stl_search/262144         44725 ns      44681 ns      15882   43.7122GB/s
stl_search/524288         91027 ns      90936 ns       7616   42.9563GB/s
Run Code Online (Sandbox Code Playgroud)

与clang 4.0.1和

ranges_search/2048         2309 ns       2307 ns     298507   6.61496GB/s
ranges_search/4096         4558 ns       4554 ns     154520   6.70161GB/s
ranges_search/32768       36482 ns      36454 ns      19191   6.69726GB/s
ranges_search/262144     287072 ns     286801 ns       2438   6.81004GB/s
ranges_search/524288     574230 ns     573665 ns       1209   6.80928GB/s
stl_search/2048             299 ns        298 ns    2340691   51.1437GB/s
stl_search/4096             592 ns        591 ns    1176783   51.6363GB/s
stl_search/32768           4692 ns       4689 ns     149460   52.0711GB/s
stl_search/262144         37718 ns      37679 ns      18611   51.8358GB/s
stl_search/524288         75247 ns      75173 ns       9244   51.9633GB/s
Run Code Online (Sandbox Code Playgroud)

与gcc 6.3.1.我的机器有一个Haswell一代处理器.两者都编译和执行

g++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
clang++ -Wall -O3 -std=c++14 Ranges.cpp -lbenchmark -lpthread && ./a.out
Run Code Online (Sandbox Code Playgroud)

Cas*_*sey 23

view::partial_sum超过一系列int产生一系列int.如果to_find > INT_MAX,内部累加器将溢出导致UB.在实践中,算法最有可能遍历整个输入并返回结束迭代器.

相反,你accumulated_length在非范围v3方法中是一个long.它不会溢出,因此在处理整个输入之前定义了行为/返回.

如果transform输入范围long在传递范围之前,则range-v3方法将具有正确的行为partial_sum:

auto found_index = distance(lengths
  | view::transform(convert_to<long>{}) | view::partial_sum()
  | view::take_while([=](auto const i) { return i < to_find; }));
Run Code Online (Sandbox Code Playgroud)

即使使用这种正确性修复,它仍然比我在测试中使用标准算法慢得多.编译此测试程序:

#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#ifdef USE_RV3
#include <range/v3/core.hpp>
#include <range/v3/algorithm.hpp>
#include <range/v3/numeric.hpp>
#include <range/v3/view.hpp>

#else
#include <algorithm>
#include <numeric>
#endif

int main() {
    constexpr size_t large_number = 1UL << 30;

    int random_number = 42;

    std::vector<int> const lengths(large_number, random_number);

    using clock_t = std::chrono::steady_clock;
    auto const start = clock_t::now();

#ifdef USE_RV3
    auto const to_find = ranges::accumulate(lengths, 0l) / 2;
#else
    auto const to_find = std::accumulate(lengths.begin(), lengths.end(), 0l) / 2;
#endif

    auto const elapsed1 = clock_t::now() - start;

#ifdef USE_RV3
    auto const found_index = ranges::distance(lengths
            | ranges::view::transform(ranges::convert_to<long>{})
            | ranges::view::partial_sum()
            | ranges::view::take_while([=](auto const i) { return !(to_find < i); }));
#else
    auto accumulated_length = 0l;
    auto found = std::find_if(lengths.begin(), lengths.end(), [&](auto const &val) {
                                accumulated_length += val;
                                return to_find < accumulated_length;
                            });
    auto const found_index = std::distance(lengths.begin(), found);
#endif

    auto const elapsed2 = clock_t::now() - start;

    std::cout << "elapsed1: "
        << std::chrono::duration<double, std::milli>(elapsed1).count()
        << " ms, to_find: " << to_find << "\n"
           "elapsed2: "
        << std::chrono::duration<double, std::milli>(elapsed2).count()
        << " ms, result: " << found_index << '\n';
}
Run Code Online (Sandbox Code Playgroud)

g++-6 -std=c++14 -Ofast -march=native -DNDEBUG rv3.cpp -I ~/dev/range-v3/include -S -o -

没有和使用-DUSE_RV3和区分组件输出都很有趣.通过初始化生成的代码elapsed1对于两种情况都是相同的.初始化elapsed1和之间的中间部分有显着差异elapsed2.gcc在优化std版本方面做得更好:热循环在一个代码运行中与分支一起用于终止条件.范围-V3版本更加丑陋,并且跳了很多次; 我推测我们需要摆弄实现的细节,partial_sum以使其对优化器更加透明.