Why is iterating over std::ranges::views::join so slow

Question

Why is iterating over std::ranges::views::join so slow

joe*_*ech 26 c++ benchmarking c++20 std-ranges

This is a follow-up of this SO Answer. Given a flat input range and three size_t dimensions, the code creates a nested random_access_range of random_access_ranges of random_access_ranges, modelling a three-dimensional array.

Quickbench

Iterating over the elements in the "multidimensional" range using a nested-for loop and indices is a bit slower than directly iterating over the elements of the input range (factor 4 slower). I suppose some performance drop can be expected, but a factor of 4 hurts a bit.

Even worse, recursively views::joining the multidimensional range back to a flat range and iterating over this flat range is slower by a factor of 20. Having read through the comments in this Github issue it can be expected that views::join will come with some additional overhead, but a factor of 20 seems a bit much.

How can this large overhead with views::join be explained? Am I using it wrong or is something fishy with my benchmark? Is there anything that can be done to speed up the code or are ranges just a poor choice for this kind of application?

Implementation

The code can be found under the Quickbench link above, I will add it here for completeness:

#include <vector>
#include <ranges>
#include <cassert>
#include <iostream>

template <size_t dim>
struct Slice {
    // default constructor leaves start at zero and end at dim. Correspondes to the whole dimension
    constexpr Slice() = default;

    // Create a slice with a single index
    constexpr Slice(size_t i) : begin(i), end(i+1) {
        assert( (0 <= i) && (i < dim));
    }

    // Create a slice with a start and an end index
    constexpr Slice(size_t s, size_t e) : begin(s), end(e+1) {
        assert( (0 <= s) && (s <= e) && (e < dim) );
    } 

    size_t begin {0};
    size_t end {dim};
};

// An adaptor object to interpret a flat range as a multidimensional array
template <size_t dim, size_t... dims>
struct MD {
    constexpr static auto dimensions = std::make_tuple(dim, dims...);
    consteval static size_t size(){
        if constexpr (sizeof...(dims) > 0) {
            return dim*(dims * ...);
        }
        else {
            return dim;
        }
    }

    // returns a multidimensional range over the elements in the flat array
    template <typename Rng>
    constexpr static auto slice(
        Rng&& range, 
        Slice<dim> const& slice, 
        Slice<dims> const&... slices
    )
    {        
        return slice_impl(range, 0, slice, slices...);
    }

    template <typename Rng>
    constexpr static auto slice_impl(
        Rng&& range, 
        size_t flat_index,  
        Slice<dim> const& slice, 
        Slice<dims> const&... slices
    )
    {
        if constexpr (std::ranges::sized_range<Rng>) { assert(std::size(range) >= size());  }
        static_assert(sizeof...(slices) == sizeof...(dims), "wrong number of slice arguments.");

        if constexpr (sizeof...(slices) == 0) 
        {
            // end recursion at inner most range
            return range | std::views::drop(flat_index*dim + slice.begin) | std::views::take(slice.end - slice.begin);
        }
        else 
        {
            // for every index to be kept in this dimension, recurse to the next dimension and increment the flat_index
            return std::views::iota(slice.begin, slice.end) | std::views::transform(
                [&range, flat_index, slices...](size_t i){
                    return MD<dims...>::slice_impl(range, flat_index*dim + i, slices...);
                }
            );
        }
    }

    // convenience overload for the full view
    template <typename Rng>
    constexpr static auto slice(Rng&& range){
        return slice(range, Slice<dim>{}, Slice<dims>{}...);
    }


};

// recursively join a range of ranges
// https://stackoverflow.com/questions/63249315/use-of-auto-before-deduction-of-auto-with-recursive-concept-based-fun
template <typename Rng>
auto flat(Rng&& rng) {
    using namespace std::ranges;

    auto joined = rng | views::join;    
    if constexpr (range<range_value_t<decltype(joined)>>) {
        return flat(joined);
    } else {
        return joined;
    }
}

Run Code Online (Sandbox Code Playgroud)

Test case

Iterate over two 6x6x6 slices out of 10x10x10 arrays and add the elements of one slice to another.

// define the dimensions of a 3d-array
constexpr size_t nx{10}, ny{10}, nz{10};

// define the contents of two nx x ny x nz arrays in and out
std::vector<double> Vout(nx*ny*nz, 0.);
std::vector<double> Vin(nx*ny*nz, 1.23);

// define some slice indices for each dimension
size_t lx{0}, ly{0}, lz{0};
size_t hx{5}, hy{5}, hz{5};

auto r_in = MD<nx,ny,nz>::slice(Vin, {lx, hx}, {ly, hy}, {lz, hz});
auto r_out = MD<nx,ny,nz>::slice(Vout, {lx, hx}, {ly, hy}, {lz, hz});

Run Code Online (Sandbox Code Playgroud)

Traditional for-loop

for (int k=lz; k<hz; ++k)
    for (int j=ly; j<hy; ++j)
        for (int i=lx; i<hx; ++i)
            Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];

Run Code Online (Sandbox Code Playgroud)

MDRanges setup

This benchmark just tests the time of creating the two MD<2,3,2> objects and the flat ranges without iterating over them.

Loop over flattened/joined ranges

// C++23: for (auto [o, i] : std::views::zip(flat(r_out), flat(r_in))) { o = i; }

auto r_in_flat = flat(r_in);
auto r_out_flat = flat(r_out);

auto o = r_out_flat.begin();
auto i = r_in_flat.begin();
for(; o != r_out_flat.end(); i++, o++){
    *o += *i;
}

Run Code Online (Sandbox Code Playgroud)

Nested loop using ranges

for (size_t x = 0; x <= hx-lx; ++x)
    for (size_t y = 0; y <= hy-ly; ++y)
        for (size_t z = 0; z <= hz-lz; ++z)
            r_out[x][y][z] += r_in[x][y][z];

Run Code Online (Sandbox Code Playgroud)

Edit 1:

I found an issue with the benchmark: The traditional loop missed some values because I used < in the condition of the for-loops where I should've used <=.

for (int k=lz; k<=hz; ++k)
  for (int j=ly; j<=hy; ++j)
      for (int i=lx; i<=hx; ++i)
          Vout[i+nx*(j+ny*k)] += Vin[i+nx*(j+ny*k)];

Run Code Online (Sandbox Code Playgroud)

With this, the difference is a little less dramatic: The nested loop using ranges is 2x slower than the traditional loop and the loop over the joined ranges 12x slower. Still, I would have hoped for a lower penalty.

Quickbench

Edit 2:

受@Newbies 评论的启发，我使用 1x1xN 数组运行了一些基准测试。有趣的是，第一次快速检查显示了非常糟糕的结果，其中非连接范围实现比本机嵌套循环慢 450 倍：https://quick-bench.com/q/-ZHPSTtvF4EZVg3JmuqMec4TYyU。

因此，我使用 1xN 数组运行了一些测试，以对我在实现中使用的范围模式进行基准测试：

drop_take：在最右边的维度中，我只是简单地输入std::views::drop前几个元素和std::views::take我要查找的元素数量。该范围的形式为take(drop(input_range)). 这种take_drop模式工作得很好，并且迭代它基本上与迭代输入范围一样快。

iota_transform：在除最右边的维度之外的所有其他维度中，我将每个索引std::views::transform的元素std::views::iota通过递归从右邻居维度获得的范围内。因此，对于右数第二个维度，我们创建一系列形式的范围transform(iota, LAMBDA(take(drop(input_range))))。基准测试表明，这会导致计算时间加倍（大概是因为缺乏矢量化？）。

join：不是一个“模式”，但我包含了一个迭代的基准join(transform(iota, LAMBDA(take(drop(input_range)))))。性能再次下降 5.5 倍。

快速工作台

那么iota_transform模式也许是一种反模式？使用std::views::iota基于索引列表构建一系列范围对我来说似乎是规范的，尽管开发人员可能没有考虑到范围作为输出std::views::transform。我想要迭代的实际范围位于传递给转换的 lambda 表达式中，所以这可能是编译器优化的硬障碍？

std::views::join但即便如此，仍然没有回答为什么速度要慢得多的问题。我无法理解为什么它需要 5.5 倍的计算时间。

Answer 1

tyl*_*ylo 4

我想您可能会在大约 30 - 45 分钟的演讲“ MultiDimension C++ - Bryce Adelstein Lelbach - CppNorth 2022 ”中找到答案。

彼得：公平地说，我会给出完整的答案：

总之，大多数优化器无法对展平/连接迭代器进行矢量化。他展示了几种方法，其中迭代器包含当前的 MD 索引，一种方法仅包含线性索引（然后必须重建 MD 索引，这更糟糕），以及一种使用协程的方法，该方法在某些编译器上效果更好 - 所以它需要良好的协程实现+编译器支持。

所谓的糟糕性能导致了这样的结论：C++23 std::mdspan 不会有扁平迭代器支持，我认为这是不明智的，但我想它可以在将来添加。

我最近碰巧在纯 C 中实现了一个多跨度类型，我发现扁平迭代器的速度大约是嵌套迭代器速度的一半。这就是我所期望的——考虑到这种功能的好处，这是非常值得的。

这里有一个godbolt 链接，它表明扁平化迭代性能可以很好（至少在 C 语言中）。

归档时间：	2 年，10 月前
查看次数：	1450 次
最近记录：	2 年，7 月前