I have some classes implementing certain computations that I must optimize for different SIMD implementations, e.g. Altivec and SSE. I don't want to pollute the code with #ifdef ... #endif blocks for every method I have to optimize, so I tried a couple of other approaches, but unfortunately I'm not too satisfied with how they turned out, for reasons I'll try to clarify. So I'm looking for advice on how to improve what I have done so far.
1. Different implementation files included directly
I have the same header file describing the class interface, and different "pseudo" implementation files for plain C++, Altivec and SSE, containing only the relevant methods:
// Algo.h
#ifndef ALGO_H_INCLUDED_
#define ALGO_H_INCLUDED_
class Algo
{
public:
Algo();
~Algo();
void process();
protected:
void computeSome();
void computeMore();
};
#endif
// Algo.cpp
#include "Algo.h"
Algo::Algo() { }
Algo::~Algo() { }
void Algo::process()
{
computeSome();
computeMore();
}
#if defined(ALTIVEC)
#include "Algo_Altivec.cpp"
#elif defined(SSE)
#include "Algo_SSE.cpp"
#else
#include "Algo_Scalar.cpp"
#endif
// Algo_Altivec.cpp
void Algo::computeSome()
{
}
void Algo::computeMore()
{
}
... same for the other implementation files
Pros:
cleaner, no #ifdefs all over the place
Cons:
2. Different implementation files with private inheritance
// Algo.h
class Algo : private AlgoImpl
{
... as before
};
// AlgoImpl.h
#ifndef ALGOIMPL_H_INCLUDED_
#define ALGOIMPL_H_INCLUDED_
class AlgoImpl
{
protected:
AlgoImpl();
~AlgoImpl();
void computeSomeImpl();
void computeMoreImpl();
};
#endif
// Algo.cpp
...
void Algo::computeSome()
{
computeSomeImpl();
}
void Algo::computeMore()
{
computeMoreImpl();
}
// Algo_SSE.cpp
AlgoImpl::AlgoImpl()
{
}
AlgoImpl::~AlgoImpl()
{
}
void AlgoImpl::computeSomeImpl()
{
}
void AlgoImpl::computeMoreImpl()
{
}
Pros:
cleaner, no #ifdefs all over the place
private inheritance == "is implemented in terms of"
Cons:
3. Basically approach 2, but with virtual functions in the AlgoImpl class. This would let me avoid duplicating the plain C++ code where needed, by providing empty implementations in the base class and overriding them in the derived one, although I would have to disable that behavior whenever I actually implement an optimized version. Virtual functions would also add some overhead to the objects of my class.
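The question gives no code for variant 3, so here is a minimal hedged sketch of what it could look like (all class and method names below are invented for illustration; methods return strings only so the dispatch is observable, where real code would do the computation):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of variant 3: the scalar code lives once, as virtual
// defaults in the base class; an optimized class overrides only what it
// actually accelerates.
class AlgoImplBase {
public:
    virtual ~AlgoImplBase() {}
    virtual std::string computeSomeImpl() { return "scalar computeSome"; }
    virtual std::string computeMoreImpl() { return "scalar computeMore"; }
};

class AlgoImplSSE : public AlgoImplBase {
public:
    // Only computeSome is optimized; computeMore falls back to the default.
    virtual std::string computeSomeImpl() { return "SSE computeSome"; }
};
```

The per-call virtual dispatch is exactly the overhead mentioned above, and every object also grows by a vtable pointer.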
4. A form of tag dispatching via enable_if<>
Pros:
Cons:
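Variant 4 is only named, not shown; as a hedged sketch (every name here is invented), tag types can select an overload at compile time, with the active tag driven by the same ALTIVEC/SSE macros as variant 1. Strings are returned only to make the dispatch observable:

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of variant 4: overload-on-tag dispatch.
struct scalar_tag {};
struct sse_tag {};
struct altivec_tag {};

#if defined(ALTIVEC)
typedef altivec_tag active_tag;
#elif defined(SSE)
typedef sse_tag active_tag;
#else
typedef scalar_tag active_tag;
#endif

inline std::string computeSome(scalar_tag)  { return "scalar"; }
inline std::string computeSome(sse_tag)     { return "SSE"; }
inline std::string computeSome(altivec_tag) { return "Altivec"; }

// Public entry point: dispatches on the active tag; the unused overloads
// are never instantiated into the call path.
inline std::string computeSome() { return computeSome(active_tag()); }
```

The enable_if<> flavor would instead constrain a single function template on a trait, but the effect is the same: the choice is made at compile time with no runtime cost.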
What I haven't been able to figure out for any of the variants is how to fall back cleanly and correctly to the plain C++ implementation.
Also, I don't want to over-engineer things; in that respect the first variant seems the most "likeable", even taking its cons into account.
You could use a policy-based approach with templates, similar to how the standard library handles allocators, comparators and the like. Each implementation gets a policy class that defines computeSome() and computeMore(). Your Algo class takes the policy as a parameter and defers to its implementation.
template <class policy_t>
class algo_with_policy_t {
policy_t policy_;
public:
algo_with_policy_t() { }
~algo_with_policy_t() { }
void process()
{
policy_.computeSome();
policy_.computeMore();
}
};
struct altivec_policy_t {
void computeSome();
void computeMore();
};
struct sse_policy_t {
void computeSome();
void computeMore();
};
struct scalar_policy_t {
void computeSome();
void computeMore();
};
// let user select exact implementation
typedef algo_with_policy_t<altivec_policy_t> algo_altivec_t;
typedef algo_with_policy_t<sse_policy_t> algo_sse_t;
typedef algo_with_policy_t<scalar_policy_t> algo_scalar_t;
// let user have default implementation
typedef
#if defined(ALTIVEC)
algo_altivec_t
#elif defined(SSE)
algo_sse_t
#else
algo_scalar_t
#endif
algo_default_t;
This lets you define all the different implementations in the same file (like solution 1) and compile them into the same program (unlike solution 1). It has no performance overhead (unlike virtual functions). You can select the implementation at runtime, or take the default implementation chosen by the compile-time configuration.
template <class algo_t>
void use_algo(algo_t algo)
{
algo.process();
}
void select_algo(bool use_scalar)
{
if (!use_scalar) {
use_algo(algo_default_t());
} else {
use_algo(algo_scalar_t());
}
}
As requested in the comments, here's a summary of what I did:
A policy_list helper template utility. This maintains a list of policies and gives each one a runtime check before calling impl() on the first suitable implementation:
#include <cassert>
template <typename P, typename N=void>
struct policy_list {
static void apply() {
if (P::runtime_check()) {
P::impl();
}
else {
N::apply();
}
}
};
template <typename P>
struct policy_list<P,void> {
static void apply() {
assert(P::runtime_check());
P::impl();
}
};
The policies implement both the runtime test and the actual implementation of the algorithm in question. For my actual problem impl took another template parameter specifying exactly what to implement, although the example here assumes there is only one thing to implement. The runtime test is cached in a static bool for some policies (e.g. the Altivec one I used) where the test was very slow. For others (e.g. OpenCL) the test is really an "is this function pointer NULL?" check after trying to set it up with dlsym().
#include <iostream>
// runtime SSE detection (That's another question!)
extern bool have_sse();
struct sse_policy {
static void impl() {
std::cout << "SSE" << std::endl;
}
static bool runtime_check() {
static bool result = have_sse();
// have_sse lives in another TU and does some cpuid asm stuff
return result;
}
};
// Runtime OpenCL detection
extern bool have_opencl();
struct opencl_policy {
static void impl() {
std::cout << "OpenCL" << std::endl;
}
static bool runtime_check() {
static bool result = have_opencl();
// have_opencl lives in another TU and does some LoadLibrary or dlopen()
return result;
}
};
struct basic_policy {
static void impl() {
std::cout << "Standard C++ policy" << std::endl;
}
static bool runtime_check() { return true; } // All implementations do this
};
A simple policy_list example that sets up one of two possible lists depending on the ARCH_HAS_SSE preprocessor macro. You could generate this from a build script, use a series of typedefs, or hack in support for "holes" in the policy_list that can never be valid on a given architecture, skipping straight to the next entry without even attempting the runtime check. GCC sets some preprocessor macros for you that may help here, e.g. __SSE2__.
#ifdef ARCH_HAS_SSE
typedef policy_list<opencl_policy,
policy_list<sse_policy,
policy_list<basic_policy
> > > active_policy;
#else
typedef policy_list<opencl_policy,
policy_list<basic_policy
> > active_policy;
#endif
You can also use this to compile multiple variants for the same platform, e.g. SSE and non-SSE binaries on x86.
Usage is fairly simple: call apply() on the policy_list and trust that it will call impl() on the first policy that passes its runtime test:
int main() {
active_policy::apply();
}
If you take the "template per operation" approach I mentioned earlier, it might look more like:
int main() {
Matrix m1, m2;
Vector v1;
active_policy::apply<matrix_mult_t>(m1, m2);
active_policy::apply<vector_mult_t>(m1, v1);
}
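The answer shows only these call sites, so here is a hedged sketch of how a templated apply() might be implemented (the name policy_list_op and both policies are invented here to avoid clashing with the earlier policy_list, the argument lists are dropped for brevity, and impl() returns strings only to make the selection observable):

```cpp
#include <cassert>
#include <string>

struct matrix_mult_t {};  // hypothetical operation tags
struct vector_mult_t {};

// Like policy_list above, but apply() is itself templated on the operation,
// and each policy exposes a templated impl<Op>().
template <typename P, typename N = void>
struct policy_list_op {
    template <typename Op>
    static std::string apply() {
        if (P::runtime_check())
            return P::template impl<Op>();
        return N::template apply<Op>();
    }
};

template <typename P>
struct policy_list_op<P, void> {
    template <typename Op>
    static std::string apply() {
        assert(P::runtime_check());
        return P::template impl<Op>();
    }
};

struct sse_policy {
    static bool runtime_check() { return false; }  // pretend this box lacks SSE
    template <typename Op> static std::string impl() { return "SSE"; }
};

struct basic_policy {
    static bool runtime_check() { return true; }   // scalar always works
    template <typename Op> static std::string impl() { return "scalar"; }
};

typedef policy_list_op<sse_policy, policy_list_op<basic_policy> > active_policy;
```

With the SSE check failing, active_policy::apply<matrix_mult_t>() falls through to the basic policy, which is exactly the fallback behavior the question asked about.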
In that case you end up making your Matrix and Vector types aware of the policy_list in order that they can decide how/where to store the data. You can also use heuristics for this too, e.g. "small vector/matrix lives in main memory no matter what" and make the runtime_check() or another function test the appropriateness of a particular approach to a given implementation for a specific instance.
I also had a custom allocator for containers, which produced suitably aligned memory always on any SSE/Altivec enabled build, regardless of if the specific machine had support for Altivec. It was just easier that way, although it could be a typedef in a given policy and you always assume that the highest priority policy has the strictest allocator needs.
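The answer doesn't show that allocator; a minimal sketch of the idea (not the original code, and using the POSIX posix_memalign call as an assumption about the target platform) might look like:

```cpp
#include <stdlib.h>   // posix_memalign, free (POSIX)
#include <new>
#include <vector>
#include <cassert>

// Sketch of an allocator that always returns 16-byte-aligned memory,
// suitable for SSE/Altivec vector loads and stores regardless of whether
// the running machine actually has the vector unit.
template <typename T>
struct aligned_allocator {
    typedef T value_type;
    static const size_t alignment = 16;  // SSE/Altivec vector width

    aligned_allocator() {}
    template <typename U>
    aligned_allocator(const aligned_allocator<U>&) {}  // rebind support

    T* allocate(size_t n) {
        void* p = 0;
        if (posix_memalign(&p, alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, size_t) { free(p); }
};

template <typename T, typename U>
bool operator==(const aligned_allocator<T>&, const aligned_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const aligned_allocator<T>&, const aligned_allocator<U>&) { return false; }
```

In practice you could typedef the allocator inside each policy, or simply always use the one required by the strictest (highest-priority) policy, as described above.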
I've included a sample have_altivec() implementation for completeness, simply because it's the shortest and therefore the most appropriate for posting here. The x86/x86_64 CPUID one is messy because you have to support all the compiler-specific ways of writing inline asm. The OpenCL one is messy because we check some of the implementation limits and extensions too.
#include <csetjmp>
#include <csignal>
#ifdef __APPLE__
#include <sys/sysctl.h>
#endif
#if HAVE_SETJMP_H && !(defined(__APPLE__) && defined(__MACH__))
jmp_buf jmpbuf;
void illegal_instruction(int sig) {
// Bad in general - https://www.securecoding.cert.org/confluence/display/seccode/SIG32-C.+Do+not+call+longjmp%28%29+from+inside+a+signal+handler
// But actually Ok on this platform in this scenario
longjmp(jmpbuf, 1);
}
#endif
bool have_altivec()
{
volatile sig_atomic_t altivec = 0;
#ifdef __APPLE__
int selectors[2] = { CTL_HW, HW_VECTORUNIT };
int hasVectorUnit = 0;
size_t length = sizeof(hasVectorUnit);
int error = sysctl(selectors, 2, &hasVectorUnit, &length, NULL, 0);
if (0 == error)
altivec = (hasVectorUnit != 0);
#elif HAVE_SETJMP_H
void (*handler) (int sig);
handler = signal(SIGILL, illegal_instruction);
if (setjmp(jmpbuf) == 0) {
asm volatile ("mtspr 256, %0\n\t" "vand %%v0, %%v0, %%v0"::"r" (-1));
altivec = 1;
}
signal(SIGILL, handler);
#endif
return altivec;
}
Basically you pay no penalty on platforms that can never support an implementation (the compiler generates no code for them), and only a small penalty (potentially just a test/jmp pair the CPU can predict very well, if your compiler is half-decent at optimising) on platforms that could support something but don't. You pay no extra cost on platforms where the first-choice implementation runs. The details of the runtime tests vary with the technology in question.