Why is my double buffer implementation 8x slower on Linux than Windows?

Gio*_*ani 9 c++ multithreading x86-64

I've written this implementation of a double buffer:

// ping_pong_buffer.hpp
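// A double ("ping/pong") buffer: the producer fills the write buffer
// while the consumer reads the read buffer; end_writing() swaps the two
// under a mutex once the consumer has released its side in end_reading().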

#include <vector>
#include <mutex>
#include <condition_variable>

template <typename T>
class ping_pong_buffer {
public:

    using single_buffer_type = std::vector<T>;
    using pointer = typename single_buffer_type::pointer;
    using const_pointer = typename single_buffer_type::const_pointer;

    ping_pong_buffer(std::size_t size)
        : _read_buffer{ size }
        , _read_valid{ false }
        , _write_buffer{ size }
        , _write_valid{ false } {}

    const_pointer get_buffer_read() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return _read_valid; });
        }
        return _read_buffer.data();
    }

    void end_reading() {
        {
            std::lock_guard<std::mutex> lk(_mtx);
            _read_valid = false;
        }
        _cv.notify_one();
    }

    pointer get_buffer_write() {
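        // Note: _write_valid is only ever touched by the producer thread,
        // so setting it outside the mutex does not race.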
        _write_valid = true;
        return _write_buffer.data();
    }

    void end_writing() {
        {
            std::unique_lock<std::mutex> lk(_mtx);
            _cv.wait(lk, [this] { return !_read_valid; });
            std::swap(_read_buffer, _write_buffer);
            std::swap(_read_valid, _write_valid);
        }
        _cv.notify_one();
    }

private:

    single_buffer_type _read_buffer;
    bool _read_valid;
    single_buffer_type _write_buffer;
    bool _write_valid;
    mutable std::mutex _mtx;
    mutable std::condition_variable _cv;

};

In this dummy test, which does nothing but swap buffers, performance is about 20 times worse on Linux than on Windows:

#include <thread>
#include <iostream>
#include <chrono>

#include "ping_pong_buffer.hpp"

constexpr std::size_t n = 100000;

int main() {

    ping_pong_buffer<std::size_t> ppb(1);

    std::thread producer([&ppb] {
        for (std::size_t i = 0; i < n; ++i) {
            auto p = ppb.get_buffer_write();
            p[0] = i;
            ppb.end_writing();
        }
    });

    const auto t_begin = std::chrono::steady_clock::now();

    for (;;) {
        auto p = ppb.get_buffer_read();
        if (p[0] == n - 1)
            break;
        ppb.end_reading();
    }

    const auto t_end = std::chrono::steady_clock::now();

    producer.join();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t_end - t_begin).count() << '\n';

    return 0;

}

The test environments are:

  • Linux (Debian Stretch): Intel Xeon E5-2650 v4, GCC: 900 to 1000 ms
    • GCC flags: -O3 -pthread
  • Windows (10): Intel i7 10700K, VS2019: 45 to 55 ms
    • VS2019 flags: /O2

You can find the code on godbolt, with the ASM output for both GCC and VS2019 using the compiler flags actually used.

This huge gap has also been observed on other machines and seems to be due to the OS.

What could be the reason for this surprising difference?

UPDATE:

The test has also been run on Linux on the same 10700K, and it is still a factor of 8 slower than on Windows.

  • Linux (Ubuntu 18.04.5): Intel i7 10700K, GCC: 290 to 300 ms
    • GCC flags: -O3 -pthread

If the number of iterations is increased by a factor of 10, I get 2900 ms.

小智 5

As Mike Robinson answered, this is probably related to the different locking implementations on Windows and Linux. We can get a quick feel for the overhead by profiling how often each implementation switches contexts. I can do the Linux profile; I'm curious whether someone else can try profiling this on Windows.


I'm running Ubuntu 18.04 on an Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz.

I compiled with g++ -O3 -pthread -g test.cpp -o ping_pong and recorded context switches with this command:

sudo perf record -s -e sched:sched_switch -g --call-graph dwarf -- ./ping_pong

I then extracted a report from the perf counts with:

sudo perf report -n --header --stdio > linux_ping_pong_report.sched

The report is large, but I'm only interested in this section, which shows that about 200,000 context switches were recorded:

# Total Lost Samples: 0
#
# Samples: 198K of event 'sched:sched_switch'
# Event count (approx.): 198860
#

I think this indicates terrible performance: the test pushes and pops n = 100000 items through the double buffer, so we context-switch on almost every call to end_reading() or end_writing(). That is what I would expect from using std::condition_variable here.
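If the goal is to avoid paying a context switch on every hand-off, one option is to drop the mutex/condition-variable handshake and synchronize through a single atomic flag with a spin-wait. Below is a minimal sketch of that idea, not the class from the question: the names and memory orders are my own, and it trades CPU burned in the spin loops for the scheduler round-trips that perf counted above.

#include <atomic>
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
class spin_ping_pong_buffer {
public:

    explicit spin_ping_pong_buffer(std::size_t size)
        : _read_buffer(size), _write_buffer(size) {}

    const T* get_buffer_read() {
        // Spin until the producer has published a full buffer.
        while (!_read_valid.load(std::memory_order_acquire))
            ; // a std::this_thread::yield() here is a common compromise
        return _read_buffer.data();
    }

    void end_reading() {
        // Hand the buffer back to the producer.
        _read_valid.store(false, std::memory_order_release);
    }

    T* get_buffer_write() {
        return _write_buffer.data();
    }

    void end_writing() {
        // Spin until the consumer has released the read buffer,
        // then swap and publish the freshly written one.
        while (_read_valid.load(std::memory_order_acquire))
            ;
        std::swap(_read_buffer, _write_buffer);
        _read_valid.store(true, std::memory_order_release);
    }

private:

    std::vector<T> _read_buffer;
    std::vector<T> _write_buffer;
    std::atomic<bool> _read_valid{ false };

};

Since the acquire/release pair on _read_valid orders the buffer swap, neither thread ever touches a vector the other side is still using; the cost is that a waiting thread now occupies a core instead of sleeping.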


Mik*_*son 4

A difference this large is probably related to the respective locking implementations. A profiler should be able to break down what the process is forced to wait for. The lock semantics and features of the two operating systems are completely different.
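One concrete difference along these lines: Windows synchronization primitives typically spin briefly in user space before asking the kernel to block (CRITICAL_SECTION even exposes a configurable spin count), while glibc's futex-based waits tend to put the thread to sleep almost immediately. The sketch below shows that "spin, then block" pattern using the same std::mutex and std::condition_variable as the question; the helper name and the spin count are made-up values for illustration, not anything taken from either OS.

#include <condition_variable>
#include <mutex>

// Hypothetical helper: probe the predicate a bounded number of times in
// user space before falling back to a real (context-switching) wait.
template <typename Pred>
void spin_then_wait(std::mutex& mtx, std::condition_variable& cv, Pred pred) {
    for (int spins = 0; spins < 4000; ++spins) {
        std::lock_guard<std::mutex> lk(mtx); // brief probe under the lock
        if (pred())
            return;
    }
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, pred); // give up and let the scheduler put us to sleep
}

Replacing the plain _cv.wait(...) calls in get_buffer_read() and end_writing() with spin_then_wait(_mtx, _cv, ...) should make the Linux build behave more like the Windows one on this micro-benchmark, because most hand-offs would then complete inside the spin phase without a context switch.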