Organizing multiple implementations (for SIMD)

Question

Mat*_* M. 5 c++ simd instruction-set intrinsics

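Admittedly, this is an open-ended/subjective question, but I'm looking for different ideas on how to "organize" multiple alternative implementations of the same functions.

I have a set of several functions, each of which has platform-specific implementations. Specifically, each has different implementations for particular SIMD types: NEON (64-bit), NEON (128-bit), SSE3, AVX2, etc. (plus one non-SIMD implementation).

All functions have a non-SIMD implementation. Not all functions are specialized for every SIMD type.

Currently, I have one monolithic file that uses a bunch of #ifdefs to implement the particular SIMD specializations. That worked while we were only specializing a handful of functions for one or two SIMD types. Now it is becoming unwieldy.

Effectively, I need something that behaves like virtual/override. The non-SIMD implementations would live in a base class, and the SIMD specializations (if any) would override them. But I don't want actual runtime polymorphism. This code is performance-critical, and many of the functions can (and should) be inlined.

Something along these lines would accomplish what I need (and it is still a mess of #ifdefs).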

// functions.h

void function1();
void function2();

#ifdef __ARM_NEON
#include "functions_neon64.h"
#elif __SSE3__
#include "functions_sse3.h"
#endif

#include "functions_unoptimized.h"


// functions_neon64.h
#ifndef FUNCTION1_IMPL
#define FUNCTION1_IMPL
void function1() {
  // NEON64 implementation
}
#endif


// functions_sse3.h
#ifndef FUNCTION2_IMPL
#define FUNCTION2_IMPL
void function2() {
  // SSE3 implementation
}
#endif


// functions_unoptimized.h
#ifndef FUNCTION1_IMPL
#define FUNCTION1_IMPL
void function1() {
  // Non-SIMD implementation
}
#endif

#ifndef FUNCTION2_IMPL
#define FUNCTION2_IMPL
void function2() {
  // Non-SIMD implementation
}
#endif


Does anyone have a better idea?

Answer 1

Tur*_*ght 5

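The following are just a few ideas that came to mind while thinking about this; there may be better solutions that I'm not aware of.

1. Tag dispatch

With tag dispatch you can define the order in which the compiler should consider the functions, e.g. in this case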

AVX2 -> SSE3 -> Neon128 -> Neon64 -> None


The first implementation found in that chain will be used: godbolt example

#include <iostream>

/**********************************
 ** functions.h *******************
 *********************************/

struct SIMD_None_t {};
struct SIMD_Neon64_t : SIMD_None_t {};
struct SIMD_Neon128_t : SIMD_Neon64_t {};
struct SIMD_SSE3_t : SIMD_Neon128_t {};
struct SIMD_AVX2_t : SIMD_SSE3_t {};
struct SIMD_Any_t : SIMD_AVX2_t  {};

#include "functions_unoptimized.h"

#ifdef __ARM_NEON
#include "functions_neon64.h"
#endif

#ifdef __SSE3__
#include "functions_sse3.h"
#endif

// etc...

#include "functions_stubs.h"



/**********************************
 ** functions_unoptimized.h *******
 *********************************/
inline int add(int a, int b, SIMD_None_t) {
    std::cout << "NONE" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon64.h ************
 *********************************/
inline int add(int a, int b, SIMD_Neon64_t) {
    std::cout << "NEON!" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon128.h ***********
 *********************************/
inline int add(int a, int b, SIMD_Neon128_t) {
    std::cout << "NEON128!" << std::endl;
    return a + b;
}

/**********************************
 ** functions_stubs.h ************* 
 *********************************/
inline int add(int a, int b) {
    return add(a, b, SIMD_Any_t{});
}

/**********************************
 ** main.cpp **********************
 *********************************/
#include "functions.h"

int main() {
    add(1, 2);
}


This will output NEON128!, because that is the best match in this example.
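
The most specialized overload wins because passing SIMD_Any_t{} needs a derived-to-base conversion for every candidate, and overload resolution ranks a conversion to a nearer base class above a conversion to a more distant one. A minimal single-file sketch of that rule follows. `Impl`, `which_add`, and `which_mul` are hypothetical names introduced only for this illustration, and the functions return a label instead of printing so the choice can be checked at compile time.

```cpp
// Hypothetical label type, used only to observe which overload was selected.
enum class Impl { None, Neon64, Neon128 };

// Same tag chain as in functions.h above.
struct SIMD_None_t {};
struct SIMD_Neon64_t : SIMD_None_t {};
struct SIMD_Neon128_t : SIMD_Neon64_t {};
struct SIMD_SSE3_t : SIMD_Neon128_t {};
struct SIMD_AVX2_t : SIMD_SSE3_t {};
struct SIMD_Any_t : SIMD_AVX2_t {};

// Overloads exist for three levels only; there is none for SSE3 or AVX2.
constexpr Impl which_add(SIMD_None_t) { return Impl::None; }
constexpr Impl which_add(SIMD_Neon64_t) { return Impl::Neon64; }
constexpr Impl which_add(SIMD_Neon128_t) { return Impl::Neon128; }

// A function that only has the unoptimized fallback.
constexpr Impl which_mul(SIMD_None_t) { return Impl::None; }

// SIMD_Any_t converts to every tag; the nearest base with an overload wins.
static_assert(which_add(SIMD_Any_t{}) == Impl::Neon128, "nearest base is Neon128");
static_assert(which_mul(SIMD_Any_t{}) == Impl::None, "falls back to the base");
```

Since `which_mul` only has the `SIMD_None_t` overload, the same call falls all the way back to the unoptimized version without any ambiguity.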

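Pros:

No #ifdef needed in the implementation headers
Callers don't need to be modified

Cons:

You need to add an extra parameter to every implementation
A dispatch function is needed to supply the extra parameter
(you could actually get rid of that function by adding , SIMD_Any_t{} at every call site, but that would be a lot of work)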

2. Put the functions into classes and use name lookup to select the right one

For example:

struct None { inline static int add(int a, int b) { return a + b; } };
struct Neon64 : None { inline static int add(int a, int b) { return a + b; } };
struct Neon128 : Neon64 {};

struct SIMD : Neon128 {};

// Usage:
int r = SIMD::add(1, 2);


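Because a derived class can hide members of its base classes, this is not ambiguous. (The function that gets called is always the one from the most derived class that implements it, so you can order your implementations.)

For your example, it could look like this: godbolt example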


#include <iostream>

/**********************************
 ** functions.h *******************
 *********************************/

#include "functions_unoptimized.h"

#ifdef __ARM_NEON
#include "functions_neon64.h"
#else
  struct SIMD_Neon64 : SIMD_None {};
#endif

#ifdef __ARM_NEON_128
#include "functions_neon128.h"
#else
  struct SIMD_Neon128 : SIMD_Neon64 {};
#endif

// etc...

struct SIMD : SIMD_Neon128 {};


/**********************************
 ** functions_unoptimized.h *******
 *********************************/
struct SIMD_None {
    inline static int sub(int a, int b) {
        std::cout << "NONE" << std::endl;
        return a - b;
    }
};

/**********************************
 ** functions_neon64.h ************
 *********************************/
struct SIMD_Neon64 : SIMD_None {
    inline static int sub(int a, int b) {
        std::cout << "Neon64" << std::endl;
        return a - b;
    }
};

/**********************************
 ** functions_neon128.h ***********
 *********************************/
struct SIMD_Neon128 : SIMD_Neon64 {
    inline static int sub(int a, int b) {
        std::cout << "Neon128" << std::endl;
        return a - b;
    }
};


/**********************************
 ** main.cpp **********************
 *********************************/
#include "functions.h"

int main() {
    SIMD::sub(2, 3);
}


This will output Neon128.
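
The lookup rule this relies on can be checked in a single file. The sketch below is an illustration under assumptions rather than the answer's code: `Impl`, `add_impl`, and `sub_impl` are hypothetical names, and the functions return a label instead of printing so the result can be verified with static_assert.

```cpp
// Hypothetical label type, used only to observe which level was selected.
enum class Impl { None, Neon64 };

struct SIMD_None {
    static constexpr Impl add_impl() { return Impl::None; }
    static constexpr Impl sub_impl() { return Impl::None; }
};

struct SIMD_Neon64 : SIMD_None {
    // Declaring sub_impl here hides SIMD_None::sub_impl for this class
    // and for everything derived from it.
    static constexpr Impl sub_impl() { return Impl::Neon64; }
};

struct SIMD_Neon128 : SIMD_Neon64 {};  // nothing specialized at this level
struct SIMD : SIMD_Neon128 {};

// add_impl is found in SIMD_None, sub_impl in SIMD_Neon64.
static_assert(SIMD::add_impl() == Impl::None, "only the fallback exists");
static_assert(SIMD::sub_impl() == Impl::Neon64, "most derived definition wins");
```

Name hiding works per name, not per signature: a struct that declares any function named sub hides every overload of sub in its bases, so an overload set has to be redeclared (or pulled in with a using-declaration) at each level that specializes one of its members.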

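Pros:

No #ifdef needed in the implementation headers
No dispatch function needed; the compiler automatically picks the best one
No extra function parameter needed

Cons:

You need to change every call to the functions and prefix it with SIMD::
You need to wrap all functions in structs and use inheritance, so it is somewhat more complex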

3. Use template specialization

If you have an enum of all possible SIMD implementations, e.g.:

enum class SIMD_Type {
    Min, // Dummy Value -> No Implementation found

    None,
    Neon64,
    Neon128,
    SSE3,
    AVX2,

    Max // Dummy Value -> Search downwards from here
};


you can use it to (recursively) walk through them until you find one that is specialized, e.g.:

template<SIMD_Type type = SIMD_Type::Max>
inline int add(int a, int b) {
    constexpr SIMD_Type nextType = static_cast<SIMD_Type>(static_cast<int>(type) - 1);
    return add<nextType>(a, b);
}

template<>
inline int add<SIMD_Type::Neon64>(int a, int b) {
    std::cout << "NEON!" << std::endl;
    return a + b;
}


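Here, a call to add(1, 2) will first call add<SIMD_Type::Max>, which in turn calls add<SIMD_Type::AVX2>, add<SIMD_Type::SSE3>, and add<SIMD_Type::Neon128>; the call to add<SIMD_Type::Neon64> then hits the specialization, so the recursion stops there.

If you want to make this a bit safer (to guard against long template instantiation chains), you can additionally add one specialization per function that stops the recursion if no specialization can be found, e.g.: godbolt example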

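template<>
inline int add<SIMD_Type::Min>(int a, int b) {
    // Note: this condition is always true, so the assertion never fires;
    // the specialization only terminates the instantiation chain.
    static_assert(SIMD_Type::Min == SIMD_Type::Min, "No implementation found!");
    return {};
}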


In your case, it could look like this:

#include <iostream>

/**********************************
 ** functions.h *******************
 *********************************/
enum class SIMD_Type {
    Min, // Dummy Value -> No Implementation found

    None,
    Neon64,
    Neon128,
    SSE3,
    AVX2,

    Max // Dummy Value -> Search downwards from here
};

#include "functions_stubs.h"

#include "functions_unoptimized.h"

#ifdef __ARM_NEON
#include "functions_neon64.h"
#endif

#ifdef __SSE3__
#include "functions_sse3.h"
#endif

// etc...

/**********************************
 ** functions_stubs.h *************
 *********************************/
template<SIMD_Type type = SIMD_Type::Max>
inline int add(int a, int b) {
    constexpr SIMD_Type nextType = static_cast<SIMD_Type>(static_cast<int>(type) - 1);
    return add<nextType>(a, b);
}

template<>
inline int add<SIMD_Type::Min>(int a, int b) {
    static_assert(SIMD_Type::Min == SIMD_Type::Min, "No implementation found!");
    return {};
}

/**********************************
 ** functions_unoptimized.h *******
 *********************************/
template<>
inline int add<SIMD_Type::None>(int a, int b) {
    std::cout << "NONE" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon64.h ************
 *********************************/
template<>
inline int add<SIMD_Type::Neon64>(int a, int b) {
    std::cout << "NEON!" << std::endl;
    return a + b;
}

/**********************************
 ** functions_neon128.h ***********
 *********************************/
template<>
inline int add<SIMD_Type::Neon128>(int a, int b) {
    std::cout << "NEON128!" << std::endl;
    return a + b;
}

/**********************************
 ** main.cpp **********************
 *********************************/
#include "functions.h"

int main() {
    add(1, 2);
}


This will output NEON128!
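
The static_assert in the SIMD_Type::Min specialization compares a value with itself, so it cannot fail; the specialization only ends the chain. One possible alternative, which is not part of the answer above, is to declare that specialization as deleted, so that a function with no implementation at any level becomes a compile error at the point of use. In this sketch `which_add` is a hypothetical name that returns the selected level instead of printing.

```cpp
enum class SIMD_Type { Min, None, Neon64, Neon128, SSE3, AVX2, Max };

// Primary template: nothing is specialized at this level, so try the next lower one.
template <SIMD_Type type = SIMD_Type::Max>
constexpr SIMD_Type which_add() {
    return which_add<static_cast<SIMD_Type>(static_cast<int>(type) - 1)>();
}

// Terminator (assumption of this sketch): reaching Min means no level was
// specialized, and calling a deleted function is ill-formed.
template <>
SIMD_Type which_add<SIMD_Type::Min>() = delete;

// Levels that do have an implementation.
template <>
constexpr SIMD_Type which_add<SIMD_Type::None>() { return SIMD_Type::None; }

template <>
constexpr SIMD_Type which_add<SIMD_Type::Neon64>() { return SIMD_Type::Neon64; }

// Max -> AVX2 -> SSE3 -> Neon128 fall through; Neon64 is the first specialization.
static_assert(which_add() == SIMD_Type::Neon64, "chain stops at Neon64");
static_assert(which_add<SIMD_Type::None>() == SIMD_Type::None, "direct request");
```

As long as the None level is specialized, the deleted terminator is never referenced, so it costs nothing for functions that have a fallback.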

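Pros:

No #ifdef needed in the implementation headers
Callers don't need to be modified

Cons:

An extra dispatch function is needed that recursively calls itself (until it reaches a specialization)
The compiler might not optimize away all of the recursive calls (although most compilers probably will)
Most compilers also provide a way to force certain functions to be inlined (__attribute__((always_inline)) / __forceinline), which you can add to the primary function template to make sure all of the recursive calls actually get inlined.
(Optional) Another function is needed to stop the recursive instantiation (not strictly required; the compiler will stop the recursive instantiation at some point)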
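
The forced-inlining suggestion could be wrapped in a small macro, roughly as sketched below. `SIMD_FORCE_INLINE` is a hypothetical project macro, not something the compilers provide, and the functions are marked constexpr only so that the result can be checked at compile time; real SIMD implementations generally would not be.

```cpp
// Hypothetical wrapper around the compiler-specific forced-inline spellings.
#if defined(_MSC_VER)
#  define SIMD_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define SIMD_FORCE_INLINE inline __attribute__((always_inline))
#else
#  define SIMD_FORCE_INLINE inline  // plain hint; the compiler may ignore it
#endif

enum class SIMD_Type { Min, None, Neon64, Neon128, SSE3, AVX2, Max };

// Primary template: forwards to the next lower level. Each step is a
// different instantiation, so forcing inlining here is not true recursion.
template <SIMD_Type type = SIMD_Type::Max>
SIMD_FORCE_INLINE constexpr int add(int a, int b) {
    return add<static_cast<SIMD_Type>(static_cast<int>(type) - 1)>(a, b);
}

// Unoptimized fallback. Explicit specializations do not inherit the
// inline-ness of the primary template, so the macro is repeated here.
template <>
SIMD_FORCE_INLINE constexpr int add<SIMD_Type::None>(int a, int b) {
    return a + b;
}

static_assert(add(1, 2) == 3, "chain reaches the fallback");
```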

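4. One file per function

This is by far the simplest option: just put each function (or a set of similar functions) into its own file and do the #ifdefs there.

That way, you have each function together with all of its SIMD specializations in one file, which should also make editing a lot easier.

For example: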

/**********************************
 ** functions.h *******************
 *********************************/

#include "functions_add.h"
#include "functions_sub.h"
// etc...

/**********************************
 ** functions_add.h ***************
 *********************************/
#if defined(__SSE3__)
// SSE3
inline int add(int a, int b) {
  return a + b;
}
#elif defined(__ARM_NEON)
// NEON
inline int add(int a, int b) {
  return a + b;
}
#else
// Fallback
inline int add(int a, int b) {
  return a + b;
}
#endif

/**********************************
 ** functions_sub.h ***************
 *********************************/
#if defined(__SSE3__)
// SSE3
inline int sub(int a, int b) {
  return a - b;
}
#elif defined(__ARM_NEON_128)
// NEON 128
inline int sub(int a, int b) {
  return a - b;
}
#else
// Fallback
inline int sub(int a, int b) {
  return a - b;
}
#endif


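Pros:

Each function and all of its specializations are in a single file, so it is much easier to determine which one gets called
Easy to implement and maintain, as long as you don't cram too many functions into a single file

Cons:

Potentially a lot of header files
The #ifdefs need to be repeated in every header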
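
One way to soften the last point, offered here as an assumption rather than as part of the answer, is to inspect the compiler's predefined macros in a single configuration header and let the per-function headers test only project-level macros. The file name `simd_config.h` and the `MYLIB_SIMD_*` macros are hypothetical. The function bodies are placeholders and are constexpr only so the sketch can be checked at compile time.

```cpp
// --- simd_config.h (hypothetical): the only place that inspects compiler macros ---
#if defined(__AVX2__)
#  define MYLIB_SIMD_AVX2 1
#endif
#if defined(__SSE3__)
#  define MYLIB_SIMD_SSE3 1
#endif
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#  define MYLIB_SIMD_NEON 1
#endif

// --- functions_add.h: tests only the project-level macros ---
#if defined(MYLIB_SIMD_SSE3)
constexpr int add(int a, int b) { return a + b; }  // SSE3 version would go here
#elif defined(MYLIB_SIMD_NEON)
constexpr int add(int a, int b) { return a + b; }  // NEON version would go here
#else
constexpr int add(int a, int b) { return a + b; }  // non-SIMD fallback
#endif

// Exactly one definition of add is visible, whichever branch was taken.
static_assert(add(1, 2) == 3, "one implementation is always selected");
```

The conditionals in each function header remain, but the knowledge of which compiler macro signals which instruction set lives in one place, so supporting another compiler or renaming a level touches a single file.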

Archived:	3 years, 10 months ago
Views:	347
Last recorded:	3 years, 10 months ago