How to speed up dynamic dispatch by 20% using computed gotos in standard C++

5 c++ optimization virtual-functions goto branch-prediction

Before you down-vote or start saying that gotoing is evil and obsolete, please read the justification of why it is viable in this case. Before you mark it as duplicate, please read the full question.

I was reading about virtual machine interpreters, when I stumbled across computed gotos . Apparently they allow significant performance improvement of certain pieces of code. The most known example is the main VM interpreter loop.

Consider a (very) simple VM like this:

#include <iostream>

enum class Opcode
{
    HALT,
    INC,
    DEC,
    BIT_LEFT,
    BIT_RIGHT,
    RET
};

int main()
{
    Opcode program[] = { // an example program that returns 10
        Opcode::INC,
        Opcode::BIT_LEFT,
        Opcode::BIT_LEFT,
        Opcode::BIT_LEFT,
        Opcode::INC,
        Opcode::INC,
        Opcode::RET
    };

    int result = 0;

    for (Opcode instruction : program)
    {
        switch (instruction)
        {
        case Opcode::HALT:
            break;
        case Opcode::INC:
            ++result;
            break;
        case Opcode::DEC:
            --result;
            break;
        case Opcode::BIT_LEFT:
            result <<= 1;
            break;
        case Opcode::BIT_RIGHT:
            result >>= 1;
            break;
        case Opcode::RET:
            std::cout << result;
            return 0;
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

All this VM can do is a few simple operations on one number of type int and print it. In spite of its doubtable usefullness, it illustrates the subject nonetheless.

The critical part of the VM is obviously the switch statement in the for loop. Its performance is determined by many factors, of which the most inportant ones are most certainly branch prediction and the action of jumping to the appropriate point of execution (the case labels).

There is room for optimization here. In order to speed up the execution of this loop, one might use, so called, computed gotos.

Computed Gotos

Computed gotos are a construct well known to Fortran programmers and those using a certain (non-standard) GCC extension. I do not endorse the use of any non-standard, implementation-defined, and (obviously) undefined behavior. However to illustrate the concept in question, I will use the syntax of the mentioned GCC extension.

In standard C++ we are allowed to define labels that can later be jumped to by a goto statement:

goto some_label;

some_label:
    do_something();
Run Code Online (Sandbox Code Playgroud)

Doing this isn't considered good code (and for a good reason!). Although there are good arguments against using goto (of which most are related to code maintainability) there is an application for this abominated feature. It is the improvement of performance.

Using a goto statement can be faster than a function invocation. This is because the amount of "paperwork", like setting up the stack and returning a value, that has to be done when invoking a function. Meanwhile a goto can sometimes be converted into a single jmp assembly instruction.

To exploit the full potential of goto an extension to the GCC compiler was made that allows goto to be more dynamic. That is, the label to jump to can be determined at run-time.

This extension allows one to obtain a label pointer, similar to a function pointer and gotoing to it:

    void* label_ptr = &&some_label;
    goto (*label_ptr);

some_label:
    do_something();
Run Code Online (Sandbox Code Playgroud)

This is an interesting concept that allows us to further enhance our simple VM. Instead of using a switch statement we will use an array of label pointers (a so called jump table) and than goto to the appropriate one (the opcode will be used to index the array):

// [Courtesy of Eli Bendersky][4]
// This code is licensed with the [Unlicense][5]

int interp_cgoto(unsigned char* code, int initval) {
    /* The indices of labels in the dispatch_table are the relevant opcodes
    */
    static void* dispatch_table[] = {
        &&do_halt, &&do_inc, &&do_dec, &&do_mul2,
        &&do_div2, &&do_add7, &&do_neg};
    #define DISPATCH() goto *dispatch_table[code[pc++]]

    int pc = 0;
    int val = initval;

    DISPATCH();
    while (1) {
        do_halt:
            return val;
        do_inc:
            val++;
            DISPATCH();
        do_dec:
            val--;
            DISPATCH();
        do_mul2:
            val *= 2;
            DISPATCH();
        do_div2:
            val /= 2;
            DISPATCH();
        do_add7:
            val += 7;
            DISPATCH();
        do_neg:
            val = -val;
            DISPATCH();
    }
}
Run Code Online (Sandbox Code Playgroud)

This version is about 25% faster than the one that uses a switch (the one on the linked blog post, not the one above). This is because there is only one jump performed after each operation, instead of two.

Control flow with switch: 2跳带开关 For example, if we wanted to execute Opcode::FOO and then Opcode::SOMETHING, it would look like this: 在此处输入图片说明 As you can see, there are two jumps being performed after an instruction is executed. The first one is back to the switch code and the second is to the actual instruction.

In contrary, if we would go with an array of label pointers (as a reminder, they are non-standard), we would have only one jump: 在此处输入图片说明

It is worthwhile to note that in addition to saving cycles by doing less operations, we also enhance the quality of branch prediction by eliminating the additional jump.

Now, we know that by using an array of label pointers instead of a switch we can improve the performance of our VM significantly (by about 20%). I figured that maybe this could have some other applications too.

I came to the conclusion that this technique could be used in any program that has a loop in which it sequentially indirectly dispatches some logic. A simple example of this (apart from the VM) could be invoking a virtual method on every element of a container of polymorphic objects:

std::vector<Base*> objects;
objects = get_objects();
for (auto object : objects)
{
    object->foo();
}
Run Code Online (Sandbox Code Playgroud)

Now, this has much more applications.

There is one problem though: There is nothing such as label pointers in standard C++. As such, the question is: Is there a way to simulate the behaviour of computed gotos in standard C++ that can match them in performance?.

Edit 1:

There is yet another down side to using the switch. I was reminded of it by user1937198. It is bound checking. In short, it checks if the value of the variable inside of the switch matches any of the cases. It adds redundant branching (this check is mandated by the standard).

Edit 2:

In response to cmaster, I will clarify what is my idea on reducing overhead of virtual function calls. A dirty approach to this would be to have an id in each derived instance representing its type, that would be used to index the jump table (label pointer array). The problem is that:

  1. There are no jump tables is standard C++
  2. It would require as to modify all jump tables when a new derived class is added.

我将很感激,如果有人想出了某种类型的模板魔术(或万不得已的宏),那将使其编写得更加干净,可扩展和自动化,如下所示:

use*_*198 5

在 MSVC 的最新版本中,关键是为优化器提供所需的提示,以便优化器可以知道仅对跳转表建立索引是一种安全的转换。原始代码有两个约束阻止了这种情况,从而使得对计算标签代码生成的代码的优化成为无效变换。

首先,在原始代码中,如果程序计数器溢出程序,则循环退出。在计算的标签代码中,调用未定义的行为(取消引用超出范围的索引)。因此,编译器必须为此插入一个检查,使其为循环头生成一个基本块,而不是将其内联到每个开关块中。

其次,在原始代码中,没有处理默认情况。虽然 switch 覆盖了所有枚举值,因此没有分支匹配是未定义的行为,但 msvc 优化器不够智能,无法利用这一点,因此会生成一个不执行任何操作的默认情况。检查此默认情况需要条件,因为它处理很大范围的值。在这种情况下,计算出的 goto 代码也会调用未定义的行为。

第一个问题的解决方案很简单。不要使用 C++ 范围 for 循环,使用 while 循环或无条件的 for 循环。不幸的是,第二个解决方案需要平台特定的代码来告诉优化器默认是未定义的行为_assume(0),但大多数编译器(在 clang 和 gcc 中)中都存在类似的东西__builtin_unreachable(),并且当没有等效项时可以有条件地编译为空目前没有任何正确性问题。

所以结果是:

#include <iostream>

enum class Opcode
{
    HALT,
    INC,
    DEC,
    BIT_LEFT,
    BIT_RIGHT,
    RET
};

int run(Opcode* program) {
    int result = 0;
    for (int i = 0; true;i++)
    {
        auto instruction = program[i];
        switch (instruction)
        {
        case Opcode::HALT:
            break;
        case Opcode::INC:
            ++result;
            break;
        case Opcode::DEC:
            --result;
            break;
        case Opcode::BIT_LEFT:
            result <<= 1;
            break;
        case Opcode::BIT_RIGHT:
            result >>= 1;
            break;
        case Opcode::RET:
            std::cout << result;
            return 0;
        default:
            __assume(0);
        }
    }
}
Run Code Online (Sandbox Code Playgroud)

生成的程序集可以在godbolt上验证