How to implement a simple, virtual, observer-sort-of pattern quickly in C++?

Nic*_*rry 3 c++ enums virtual-functions micro-optimization dispatch

I've been struggling to implement an alternative to vtables using enums and a large amount of macro magic, and it's really starting to mess with my brain. I'm beginning to think I'm not on the right path, since the code is getting uglier and uglier and won't be fit for production in any case.

How can the pattern in the code below be implemented with the least amount of redirection/operations?

It must be done in standard C++, up to C++17.

class A{
    public:
    virtual ~A() = default;    // needed to "Free memory" through an A*
    virtual void Update() = 0; // A is so pure *¬*
};

class B: public A
{
    public:
    void Update() override final
    {
        // DO B STUFF
    }
};

class C: public A
{
    public:
    void Update() override final
    {
        // DO C STUFF
    }
};

// class...

int main()
{
    std::vector<A*> vecA{};

    // Insert instances of B, C, ..., into vecA

    for(auto a: vecA) // This for will be inside a main loop
        a->Update(); // Ridiculous amount of calls per unit of time

    // Free memory
}

PS: If enums, switches, and macros really are the best option, I suppose I'll simply try to refresh my cache and come up with a better design.

PSS: I know this is micro-optimization... heck, I need to nano- or even pico-optimize this (figuratively speaking), so I'll simply ignore any utilitarian responses that might come up.

Pet*_*des 7

As the first comment said, you have an XY problem here. Sorting/reordering is fine; you have many objects, not a huge number of different classes, and there's no need to support types your code doesn't know about at compile time. Polymorphism + virtual inheritance is the wrong choice.

Instead, use N different containers, one for each type of object, with no indirection. Letting the compiler inline B::Update() into a loop over all the B objects is much better. (For the trivial example below of incrementing one member int, my static performance analysis from looking at the asm puts it at about 24x faster on Skylake with the data hot in L1D cache. AVX2 auto-vectorization vs. call in a loop is really that huge.)

If there was some required order between the objects, including between different types of objects, then some kind of polymorphism or manual dispatching would be appropriate. (e.g. if it mattered what order you processed vecA in, keeping all the B objects separate from all the C objects wouldn't be equivalent.)


If you care about performance, you have to realize that making the source larger might simplify things for the compiler/in the asm output. Checking/dispatching based on the type of each object inside the inner loop is expensive. Using any kind of function pointer or enum to dispatch on a per-object basis can easily suffer from branch mispredicts when you have a mix of different objects.

Looping separately over multiple containers effectively hoists the type check out of the inner loop and lets the compiler devirtualize. (Or better, it shrinks each object so it doesn't need a vtable pointer, enum, or function pointer in the first place, because its type is statically known.)

Writing out a separate loop for each container holding a different type is a bit like fully unrolling a loop over the different types after hoisting the type dispatch out of the inner loop. This is necessary for the compiler to inline the calls, which you want if there are many objects of each type. Inlining lets it keep constants in registers across objects, enables SIMD auto-vectorization across multiple objects, and simply avoids the overhead of an actual function call. (Both the call itself and the spill/reload of registers.)


You're right that if you do need per-object dispatch, C++ virtual functions are an expensive way to get it when you're using final overrides. You pay the same runtime cost that would let your code support new derived classes of arbitrary size that it didn't know about at compile time, but gain none of the benefit from it.

Virtual dispatch only works with a level of indirection (e.g. a vector of pointers like you're using), which means you need to manage the pointed-to objects somehow, e.g. by allocating them from vector<B> poolB and vector<C> poolC. Although I'm not sure most implementations of vector<> use realloc() when they need to grow; the new/delete API doesn't have a realloc, so vector may actually copy every time it grows, instead of trying to extend the existing allocation in place. Check what your C++ implementation does, since it might suck compared to what you can do with malloc/realloc.
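
For instance, a minimal sketch of that pool idea, reusing the question's A, B, and C (the build function and the up-front reserve() calls are my additions; reserving matters because reallocation on growth would invalidate every pointer already handed out):

#include <cstddef>
#include <vector>

std::vector<B> poolB;   // contiguous storage for every B
std::vector<C> poolC;   // contiguous storage for every C
std::vector<A*> vecA;   // non-owning pointers, if mixed-order iteration is still needed

void build(std::size_t nb, std::size_t nc)
{
    poolB.reserve(nb);  // reserve up front: growth would reallocate and
    poolC.reserve(nc);  // invalidate the pointers already pushed into vecA
    for (std::size_t i = 0; i < nb; ++i) {
        poolB.emplace_back();
        vecA.push_back(&poolB.back());
    }
    for (std::size_t i = 0; i < nc; ++i) {
        poolC.emplace_back();
        vecA.push_back(&poolC.back());
    }
}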

And BTW, this should be doable with RAII and no extra allocation/deallocation overhead over new/delete, as long as all your classes are trivially destructible. (But note that unique_ptr may defeat other optimizations for the vector of pointers.) std::unique_ptr warns that it's UB to destroy it through a pointer to the base class, so you might have to roll your own. Still, with gcc on x86-64, sizeof(unique_ptr<class C>) is only 8, so it has only a single pointer member. But anyway, separately allocating zillions of tiny objects sucks, so don't do it that way in the first place.


If you do need per-object dispatch of some sort:

If the objects are all similar sizes, then you really want to loop over objects, not pointers-to-objects. That avoids the extra cache footprint of a vector of pointers, and it avoids the extra pointer-chasing latency that out-of-order execution has to hide to keep the execution units busy. But C++ virtual inheritance doesn't provide any standards-compliant way to get polymorphism for union upoly { B b; C c; } poly_array[1024];. You can hack this yourself with reinterpret_cast<> in a way that probably works on x86-64 gcc, but you probably shouldn't. See @BeeOnRope's followup: Contiguous storage of polymorphic types. (Also an older Q&A: C++ polymorphism of a object in an array.)

If you do need it, the highest-performance way is probably to build it yourself, using an enum to index a table of function pointers (or use a switch() if your functions can inline). If your functions don't inline, a switch() over a bunch of function-call cases doesn't usually optimize down to a table of function pointers, even though they all have the same args (or no args). You usually get a jump table to a block of call instructions, rather than an actual indirect call, so there's an extra jump in every dispatch.
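
A minimal sketch of that table approach (Obj, kind, payload, and the update_b/update_c functions are made-up names for illustration; a one-byte index also keeps each object small, as discussed further below):

#include <cstdint>
#include <vector>

struct Obj {
    std::uint8_t kind;   // 1-byte type tag instead of an 8-byte vtable pointer
    int payload;
};

using UpdateFn = void (*)(Obj&);

void update_b(Obj& o) { o.payload++; }   // stand-in for B::Update
void update_c(Obj& o) { o.payload--; }   // stand-in for C::Update

constexpr UpdateFn update_table[] = { update_b, update_c };

void update_all(std::vector<Obj>& objs)
{
    for (Obj& o : objs)
        update_table[o.kind](o);   // one table lookup + indirect call per object
}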

C++17 std::visit with std::variant<B, C> (using non-virtual inheritance for B and C) seems to give you dispatch based on an internal enum. std::visit uses its own jump table to dispatch, even with only 2 possible types, instead of inlining them both and using a conditional branch. It also has to check for the "uninitialized" state all the time. You can get good code if you manually work around that with B *tmp = std::get_if<B>(&my_variant), and a __builtin_unreachable() to tell gcc that nullptr isn't a possibility. But at that point you might as well just roll your own struct polymorph { enum type; union { B b; C c; }; }; (with non-virtual functions) if you don't need an "uninitialized" state. Related: C++ polymorphism of a object in an array.
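
For comparison, the std::visit version looks something like this (a sketch in which B and C are plain non-virtual structs, with member increments as stand-in bodies):

#include <variant>
#include <vector>

struct B { int m_b = 0; void Update() { m_b++; } };   // non-virtual, no vtable pointer
struct C { int m_c = 0; void Update() { m_c++; } };

std::vector<std::variant<B, C>> vec;

void visit_all()
{
    for (auto& v : vec)
        std::visit([](auto& obj) { obj.Update(); }, v);   // dispatch on the variant's internal tag
}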

In this case you only have one function, so you can put a function pointer inside each object as a member. Like void (*m_update)(A* this_object). In the caller, pass a pointer to the object as a void* or A*, since it's a non-member function. The implementation of the function will reinterpret_cast<C*>(this_object). (Not dynamic_cast: we're doing our own dispatching, not using C++'s.)

If you want to use B and C in other contexts where the function-pointer member would be taking up space for no benefit, you can keep the function pointers in a separate container instead of in the base class. So it would be for(i=0..n) funcptrs[i]( &objects[i] );. As long as your containers don't get out of sync, you're always passing a pointer to a function that knows what to do with it. Use that with union {B b; C c} objects[] (or a vector<union>).

You can use void* if you want, especially if you make a separate array of function pointers. Then the union members don't need to inherit from a common base.
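
Putting those two paragraphs together, a sketch (all names here are illustrative; the explicit union constructor matches the upoly trick shown further down):

#include <cstddef>
#include <vector>

struct B { int m_b; };
struct C { int m_c; };

union Obj {
    Obj() : b{} {}   // unions with class members need an explicit constructor
    B b;
    C c;
};

using UpdateFn = void (*)(void*);

void update_b(void* p) { static_cast<B*>(p)->m_b++; }   // knows it's really a B
void update_c(void* p) { static_cast<C*>(p)->m_c++; }   // knows it's really a C

std::vector<Obj>      objects;
std::vector<UpdateFn> funcptrs;   // parallel array: funcptrs[i] matches objects[i]

void update_all()
{
    for (std::size_t i = 0; i < objects.size(); ++i)
        funcptrs[i](&objects[i]);   // pass as void*; no common base required
}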

You could use std::function<> to store pointers to instance member functions, but on x86-64 gcc that's a 32-byte object. It's better for your cache footprint to only use 8-byte regular function pointers and write code that knows to pass an explicit pointer equivalent to the this pointer.

Putting a function pointer in each object may take more space than an enum or uint8_t, depending on current size/alignment. A small integer index into a table of function pointers might reduce the size of each instance of your objects vs. a pointer member, especially for 64-bit targets. Smaller objects could easily be worth the couple extra instructions to index an array of function pointers, and the possibly higher mispredict penalty from an extra pointer dereference. Memory/cache misses are often a bottleneck.

I'm assuming you do have some per-instance state, even though you don't show any. If not, then a vector of ordinary function pointers to non-member functions will be much cheaper!


Overhead comparison:

I had a look at the compiler-generated asm (gcc and clang targeting x86-64) for a few ways of doing this.

Source for multiple ways of doing this + asm from x86-64 clang 5.0 on the Godbolt compiler explorer. You can flip it over to gcc, or non-x86 architectures.

class A{
    public:
    virtual void Update() = 0; // A is so pure *¬*
};

struct C : public A {
    int m_c = 0;
    public:
    void Update() override final
    {  m_c++;  }
};
int SC = sizeof(C);  // 16 bytes because of the vtable pointer

C global_c;  // to instantiate a definition for C::Update();

// not inheriting at all gives equivalent asm to making Update non-virtual
struct nonvirt_B //: public A
{
    int m_b = 0;
    void Update() //override final
    {  m_b++;  }
};
int SB = sizeof(nonvirt_B);  // only 4 bytes per object with no vtable pointer

void separate_containers(std::vector<nonvirt_B> &vecB, std::vector<C> &vecC)
{
    for(auto &b: vecB)        b.Update();
    for(auto &c: vecC)        c.Update();   
}

clang and gcc auto-vectorize the loop over vecB with AVX2 to process 8 int elements in parallel, so if you don't bottleneck on memory bandwidth (i.e. hot in L1D cache), this loop can increment 8 elements per clock cycle. This loop runs as fast as a loop over a vector<int>; everything inlines and optimizes away and it's just a pointer increment.

The loop over vecC can only do 1 element per clock cycle, because each object is 16 bytes (8 byte vtable pointer, 4 byte int m_c, 4 bytes of padding to the next alignment boundary because the pointer has an 8B alignment requirement.) Without final, the compiler also checks the vtable pointer to see if it's actually a C before using the inlined C::Update(), otherwise it dispatches. It's like what you'd get for a loop over struct { int a,b,c,d; } vecC[SIZE]; doing vecC[i].c++;

final allowed full devirtualization, but our data is mixed with vtable pointers, so compilers just do a scalar add [mem], 1 which can only run at 1 per clock (bottlenecked on 1 per clock store throughput, regardless of the size of the store, if it's hot in L1D cache). This mostly defeats SIMD for this example. (With -march=skylake-avx512, gcc and clang do some ridiculous shuffling or gather/scatter that's even slower than scalar, instead of just loading/storing the whole object and adding with a vector that only changes the int member. That's allowed because it doesn't contain any volatile or atomic members, and would run at 2 per clock with AVX2, or 4 per clock with AVX512.) Having your objects up to 12 bytes larger is a major downside if they're small and you have a lot of them.

With multiple members per object, this doesn't necessarily defeat SIMD, but it still costs space in each object, just like an enum or function pointer would.

Since you mentioned the separating axis theorem, I hope you're not planning on storing float x,y pairs in each object. Array-of-structs basically sucks for SIMD, because it needs a lot of shuffling to use the x with the y for the same object. What you want is std::vector<float> x, y or similar, so your CPU can load 4 x values into a register and 4 y values into another register. (Or 8 at a time with AVX).
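
That is, a struct-of-arrays layout along these lines (a sketch; Points and translate are illustrative names):

#include <cstddef>
#include <vector>

// Struct-of-Arrays: all x values contiguous, all y values contiguous
struct Points {
    std::vector<float> x, y;
};

// Compilers can auto-vectorize this loop: load 4 (SSE) or 8 (AVX) contiguous
// x values into one register and the matching y values into another.
void translate(Points& p, float dx, float dy)
{
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += dx;
        p.y[i] += dy;
    }
}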

See Slides: SIMD at Insomniac Games (GDC 2015) for an intro to how to structure your data for SIMD, and some more advanced stuff. See also the tag wiki for more guides. Also, the x86 tag wiki has lots of low-level x86 optimization material. Even if you don't manually vectorize anything, with separate arrays for x and y there's a good chance that the compiler can auto-vectorize for you. (Look at the asm output, or benchmark gcc -O3 -march=native vs. gcc -O3 -march=native -fno-tree-vectorize). You may need -ffast-math for some kinds of FP vectorization.


C++ virtual dispatch:

Writing it the way you do in the question, with virtual inheritance and

std::vector<A*> vecA{};

void vec_virtual_pointers() {
    for(auto a: vecA)
        a->Update();
}

We get this loop from clang5.0 -O3 -march=skylake

   # rbx = &vecA[0]
.LBB2_1:                         # do{
    mov     rdi, qword ptr [rbx]   # load a pointer from the vector (will be the this pointer for Update())
    mov     rax, qword ptr [rdi]   # load the vtable pointer
    call    qword ptr [rax]        # memory-indirect call using the first entry in the vtable
    add     rbx, 8                 # pointers are 8 bytes
    cmp     r14, rbx
    jne     .LBB2_1              # }while(p != vecA.end())

So the final function pointer is at the end of a chain of three dependent loads. Out-of-order execution lets this overlap between iterations (if the branch predicts correctly), but that's a lot of overhead just in total instructions for the front-end, as well as in mispredict penalty. (call [m] is 3 uops, so just the loop itself is 8 uops, and can only issue at one per 2 cycles on Skylake. Call/return has overhead too. If the callee is not totally trivial, we probably don't bottleneck on store-forwarding for pushing/popping the return address; see Loop with function call faster than an empty loop. I'm not sure about the throughput of independent store/reload operations on the same address. That would normally require memory renaming, which Skylake doesn't do, to not bottleneck on that if the callee is tiny like here.)

Clang's definition for C::Update() is

C::Update():                         # @C::Update()
    inc     dword ptr [rdi + 8]
    ret

If this needed to set up any constants before calculating something, it would be even more expensive to not have it inlined. So, with virtual dispatch, this probably runs at about one per 3 to 5 clocks, instead of about 1 member per clock, on Skylake. (Or 8 members per clock with AVX2 for the non-virtual class B, which doesn't waste space and makes auto-vectorization work well.) http://agner.org/optimize/ says Skylake has one per 3 clock call throughput, so let's say a 24x performance loss when the data is hot in L1D cache. Different microarchitectures will be different, of course. See the tag wiki for more x86 perf info.


Union hack:

Probably you should never use this, but you can see from the asm that it will work on x86-64 with clang and gcc. I made an array of unions, and looped over it:

union upoly {
    upoly() {}   // needs an explicit constructor for compilers not to choke
     B b;
     C c;
} poly_array[1024];

void union_polymorph() {
    upoly *p = &poly_array[0];
    upoly *endp = &poly_array[1024];
    for ( ; p != endp ; p++) {
        A *base = reinterpret_cast<A*>(p);
        base->Update();           // virtual dispatch
    }
}

A, B, and C all have their vtable pointer at the start of the object, so I think this will generally work. We get asm that's basically the same, with one less step of pointer-chasing. (I used a static array instead of a vector, since I was keeping things simple and C-like while sorting out what to cast.)

    lea     rdi, [rbx + poly_array]       ; this pointer
    mov     rax, qword ptr [rbx + poly_array]   ; load it too, first "member" is the vtable pointer
    call    qword ptr [rax]
    add     rbx, 16                       ; stride is 16 bytes per object
    cmp     rbx, 16384                    ; 16 * 1024
    jne     .LBB4_1

This is better, and touches less memory, but it's only slightly better for overhead.


std::function from #include <functional>

It can hold any kind of callable thing. But it has even more overhead than vtable dispatch, because it's allowed to be in an error-if-used state. So the inner loop has to check every instance for that, and trap if it is. Also, sizeof(std::function<void()>) is 32 bytes (on the x86-64 System V ABI).

#include <functional>
// pretty crappy: checks for being possibly unset to see if it should throw().
std::vector<std::function<void()>> vecF{};
void vec_functional() {
    for(auto f: vecF)     f();
}

                                # do {
.LBB6_2:                                # =>This Inner Loop Header: Depth=1
    mov     qword ptr [rsp + 16], 0       # store a 0 to a local on the stack?
    mov     rax, qword ptr [rbx + 16]
    test    rax, rax
    je      .LBB6_5           # throw on pointer==0  (nullptr)
    mov     edx, 2            # third arg:  2
    mov     rdi, r14          # first arg: pointer to local stack memory (r14 = rsp outside the loop)
    mov     rsi, rbx          # second arg: point to current object in the vector
    call    rax               # otherwise call into it with 2 args
    mov     rax, qword ptr [rbx + 24]    # another pointer from the std::function<>
    mov     qword ptr [rsp + 24], rax    # store it to a local
    mov     rcx, qword ptr [rbx + 16]    # load the first pointer again
    mov     qword ptr [rsp + 16], rcx
    test    rcx, rcx
    je      .LBB6_5           # check the first pointer for null again (and throw if null)
    mov     rdi, r14
    call    rax               # call through the 2nd pointer
    mov     rax, qword ptr [rsp + 16]
    test    rax, rax
    je      .LBB6_12          # optionally skip a final call
    mov     edx, 3
    mov     rdi, r14
    mov     rsi, r14
    call    rax
.LBB6_12:                               #   in Loop: Header=BB6_2 Depth=1
    add     rbx, 32
    cmp     r15, rbx
    jne     .LBB6_2

.LBB6_13:                       # return
    add     rsp, 32
    pop     rbx
    pop     r14
    pop     r15
    ret

.LBB6_5:
    call    std::__throw_bad_function_call()
    jmp     .LBB6_16
    mov     rdi, rax
    call    __clang_call_terminate

So there are up to three call instructions unless the pointer is nullptr. This looks far worse than virtual dispatch.

It looks a bit different with clang -stdlib=libc++, instead of the default libstdc++. (https://libcxx.llvm.org/). But still three call instructions in the inner loop, with conditionals to skip them or throw.

Unless the code-gen is very different for different kinds of function<T>, it's probably not even worth looking at it for pointers to member functions if you can write a more efficient alternative.