AVX2：计算 512 个浮点数组的点积

Question

AVX2：计算 512 个浮点数组的点积

App*_*per 15 c++ simd dot-product avx2 fma

我先说我是 SIMD 内在函数的完全初学者。

本质上，我有一个支持 AVX2 内在 ( Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz)的 CPU 。我想知道计算两个std::vector<float>size的点积的最快方法512。

我在网上做了一些挖掘，发现了这个和这个，这个堆栈溢出问题建议使用以下函数__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);，但是，这些都表明了执行点积的不同方法我不确定什么是正确的（和最快的）方法它。

特别是，我正在寻找对大小为 512 的向量执行点积的最快方法（因为我知道向量大小会影响实现）。

感谢您的帮助

编辑 1：我也对-mavx2gcc 标志有点困惑。如果我使用这些 AVX2 函数，我编译时是否需要添加标志？另外，-OFast如果我编写了一个天真的点积实现，gcc 是否能够为我做这些优化（比如我使用gcc 标志）？

编辑 2 如果有人有时间和精力，如果您能编写完整的实现，我将不胜感激。我相信其他初学者也会重视这些信息。

Answer 1

Soo*_*nts 15

_mm256_dp_ps仅对 2 到 4 个元素的点积有用；对于更长的向量，在循环中使用垂直 SIMD 并在最后减少到标量。在循环中使用_mm256_dp_ps和_mm256_add_ps会慢得多。

与 MSVC 和 ICC 不同，GCC 和 clang 要求您启用（使用命令行选项）使用内部函数的 ISA 扩展。

下面的代码可能接近 CPU 的理论性能极限。未经测试。

用 clang 或gcc -O3 -march=native. （至少需要-mavx -mfma，但-mtune隐含的选项-march也很好，其他-mpopcnt和其他东西也arch=native可以启用。对于大多数具有 FMA 的 CPU 而言，调整选项对于有效编译至关重要，特别是-mno-avx256-split-unaligned-load：为什么 gcc 不将 _mm256_loadu_pd 解析为单个 vmovupd ? )

或者用 MSVC 编译 -O2 -arch:AVX2

#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}

// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into noop, modern compilers return floats in the lowest lane of xmm0 register.
    return _mm_cvtss_f32( r1 );
}

Run Code Online (Sandbox Code Playgroud)

可能的进一步改进：

展开 8 个向量而不是 4 个。我检查了gcc 9.2 asm 输出，编译器只使用了 16 个可用向量中的 8 个向量寄存器。
确保两个输入向量排列，例如，使用自定义的分配器，其呼叫_aligned_malloc/_aligned_free上MSVC，或aligned_alloc/free海合会与铿锵。然后替换_mm256_loadu_ps为_mm256_load_ps.

要自动矢量化一个简单的标量点积，您还需要 OpenMP SIMD 或-ffast-math（由暗示-Ofast）让编译器将 FP 数学视为关联，即使它不是（因为四舍五入）。但是 GCC 在自动矢量化时不会使用多个累加器，即使它确实展开，所以你会遇到 FMA 延迟的瓶颈，而不是负载吞吐量。

（每个 FMA 2 次加载意味着此代码的吞吐量瓶颈是向量加载，而不是实际的 FMA 操作。）

SIMD + 多个累加器是实现成对求和的一部分；一种公认的技术，用于减轻 FP 数组求和的错误（其中将小元素添加到大总数中是不好的）。如果元素的大小大多相似，特别是如果它们都是正数，那么拥有许多较小的总数会显着减少舍入误差。请参阅我对 [Simd matmul 程序给出不同的数值结果](//stackoverflow.com/a/55478102) 的回答。 (2认同)

归档时间：	6 年，1 月前
查看次数：	3210 次
最近记录：	6 年，1 月前