Clang++: Why is this memcpy loop idiom not optimized when another struct member is added?

He3*_*xxx 5 c++ clang compiler-optimization

Given this code snippet:

#include <cstdint>
#include <cstddef>

struct Data {
  uint64_t a;
  //uint64_t b;
};

void foo(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
  for(uint64_t i = 0; i < count_in; ++i) {
    Data value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
  }
}


As expected, clang replaces the loop in foo with a memcpy call (godbolt), and -Rpass=loop-idiom reports:

example.cpp:16:59: remark: Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _Z3fooPvPmPSt4bytem function [-Rpass=loop-idiom]
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;


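However, when I uncomment the second member uint64_t b; in Data, clang no longer does this (godbolt). Is there a reason for this, or is it just a missed optimization? If it is the latter, are there any tricks to get clang to apply this optimization?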
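One trick that does not rely on the loop-idiom pass at all is to write the bulk copy by hand. The following is only a sketch: foo_explicit is a hypothetical name, and it assumes that the buffers do not overlap (which the __restrict qualifiers already promise) and that Data stays trivially copyable.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Data {
  uint64_t a;
  uint64_t b;  // second member enabled
};

// Hypothetical helper, not part of the original code. It has the same
// observable effect as the loop in foo, assuming non-overlapping buffers.
void foo_explicit(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    const std::byte* __restrict data_in,
    uint64_t count_in)
{
  // Append after the *count_out elements already stored in data_out.
  std::byte* dst =
      static_cast<std::byte*>(data_out) + sizeof(Data) * (*count_out);
  std::memcpy(dst, data_in, sizeof(Data) * count_in);
  *count_out += count_in;
}
```

This sidesteps the question rather than answering it: it does not explain why the loop form stops being recognized once Data has two members.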

I noticed that if I change the type of value to Data& (i.e., remove the temporary local copy), the memcpy optimization is still applied (godbolt).
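For reference, the Data& variant described above would look roughly like this. It is a sketch: foo_ref is a hypothetical name used only to distinguish it from foo, and it assumes the second member is enabled and that nothing else in the loop changes.

```cpp
#include <cstddef>
#include <cstdint>

struct Data {
  uint64_t a;
  uint64_t b;  // second member enabled
};

// Hypothetical name; identical to foo except for the type of value.
void foo_ref(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
  for (uint64_t i = 0; i < count_in; ++i) {
    // Bind a reference to the input element instead of copying it
    // into a local Data object first.
    Data& value =
        *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
    static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
  }
}
```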

Edit: Peter pointed out in the comments that the same thing also happens with a simpler, less noisy version of the loop.
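A reduced test case along those lines might look like the following. This is only a sketch, assuming the void*/std::byte* casts and the output counter are dropped; copy_all and its parameter names are hypothetical, and it is not necessarily the exact code from the comment. Whether clang forms a memcpy for it would need to be checked with -Rpass=loop-idiom.

```cpp
#include <cstdint>

struct Data {
  uint64_t a;
  uint64_t b;  // second member enabled
};

// Hypothetical reduced version: typed pointers, no casts, no counter.
void copy_all(Data* __restrict out, const Data* __restrict in, uint64_t count)
{
  for (uint64_t i = 0; i < count; ++i) {
    Data value = in[i];  // temporary local copy, as in the original loop
    out[i] = value;
  }
}
```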

The question remains: why is this not optimized?
