mrg*_*g95 3 c++ performance lua qt multithreading
My Qt/C++ app uses worker threads (QThread) to improve performance for users with multicore processors. Each worker's job is to manipulate some data. Each worker minds its own business and does not need to communicate with any other workers. They also don't perform any IO operations. Perfect use case!
Multithreading this workload has improved performance many times over.
Running on a Ryzen 9 3900X (12 cores)
However, now each worker is also tasked with passing its data through a Lua script. So each worker gets its own Lua script instance (an object containing its own lua_State). The data is passed between the native code and the Lua script through userdata in the form of pointers to these things I call "SharedObjects." All I have to do is derive from this SharedObject class and boom, Lua can talk to it!
All my Lua workload does is some basic logic and calling native functions to allocate new things that derive from SharedObject and return them. Basically, it creates a lot of SharedObjects and connects them to each other in specific ways.
When the script has a light workload the multithreaded performance stays great.
But once the script has a heavy workload the performance drops as the thread count rises above 4.
Here are the results of the tests I ran:
I don't understand why a heavy workload makes performance get worse as the thread count goes up. I would expect it to reach a maximum and flatline.
EDIT: I created a minimal reproducible example project that perfectly simulates the problem. I compiled with MSVC2010 (as per my real application). https://github.com/MRG95/LuaThreads
Explanation of GitHub project files:
QMetaObject implementation. The function void bindObject() sets up the connection.
Worker class, which gets moved to its QThread via moveToThread. The script function call happens in void doWork().
The build folder contains a file, testScript.lua, which is the example workload itself. It is just a simple loop that runs some of the methods found in the tags.h classes.
"Notice the DirectConnection, which means it does not queue the calls."
This may be a mistake. Read more about QThread. Perhaps you should use Qt::QueuedConnection.
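Let us assume that each QThread runs its own Lua interpreter and state (you should study the source code of the Lua interpreter; it may have some GIL, or may actually need one).
We cannot guess your source code, but you probably want a per-thread event loop, with each Lua interpreter living in its own QThread, and some fine-grained QMutex on the globally shared state data. Small, short Lua primitives would then each use some shared QMutex.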
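The "private work, short lock on shared state" idea can be sketched as follows. This is only an illustration under stated assumptions: it uses std::thread and std::mutex (C++11) as stand-ins for QThread and QMutex so that it compiles without Qt, and SharedTotals, workerBody and sumWithWorkers are hypothetical names that do not come from the LuaThreads project.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state; stands in for whatever the workers really share.
struct SharedTotals {
    std::mutex mutex;   // guards `sum` only (fine-grained: one lock per datum)
    long long sum = 0;
};

// Each worker computes on thread-private data and takes the lock exactly
// once, briefly, to publish its result.
void workerBody(SharedTotals& shared, int begin, int end) {
    long long local = 0;                 // private to this thread, no locking
    for (int i = begin; i < end; ++i) {
        local += i;
    }
    std::lock_guard<std::mutex> lock(shared.mutex);  // short critical section
    shared.sum += local;
}

// Splits the range [0, n) across `threadCount` workers and returns the sum.
long long sumWithWorkers(int n, int threadCount) {
    assert(threadCount > 0);
    SharedTotals shared;
    std::vector<std::thread> threads;
    const int chunk = n / threadCount;
    for (int t = 0; t < threadCount; ++t) {
        const int begin = t * chunk;
        // The last worker also takes the remainder of the range.
        const int end = (t == threadCount - 1) ? n : begin + chunk;
        threads.emplace_back(workerBody, std::ref(shared), begin, end);
    }
    for (auto& th : threads) {
        th.join();
    }
    return shared.sum;
}
```

Calling sumWithWorkers(1000, 4) adds the integers 0 to 999 on four threads. The point of the design is that the lock is held for a single addition per worker, not for the duration of the work, so the threads rarely wait on each other.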
Remember that Qt graphical operations are only allowed from the main thread (the one connected to the Xorg server on Linux).
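"But what I cannot understand at all is why a heavy workload makes performance get worse as the thread count goes up???"
It is probably related to the CPU cache and cache coherence. Do not expect magical performance scaling once the total number of active threads and processes exceeds the number of cores.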
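One way cache coherence can hurt scaling is false sharing: threads write to different variables that happen to sit on the same cache line, so the line bounces between cores. Whether that is what happens in the application above is not known; the following is only a sketch of the usual mitigation, giving each thread its own padded slot. It assumes a 64-byte cache line (common on x86-64, but an assumption), and PaddedCounter and countWithPaddedSlots are hypothetical names.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Assumed cache-line size of 64 bytes. alignas(64) also makes
// sizeof(PaddedCounter) equal to 64, so neighbouring `value` fields in an
// array are 64 bytes apart and cannot share a 64-byte line.
struct alignas(64) PaddedCounter {
    long long value = 0;
};

// Every thread increments only its own slot; the slots are summed after
// all threads have been joined, so no locking is needed.
long long countWithPaddedSlots(int threadCount, long long itersPerThread) {
    assert(threadCount > 0);
    std::vector<PaddedCounter> slots(threadCount);
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&slots, t, itersPerThread] {
            for (long long i = 0; i < itersPerThread; ++i) {
                slots[t].value += 1;   // touches this thread's slot only
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    long long total = 0;
    for (const PaddedCounter& slot : slots) {
        total += slot.value;
    }
    return total;
}
```

Replacing PaddedCounter with a plain long long packs eight counters into one assumed 64-byte line; timing both variants with a large iteration count is a cheap way to see whether this effect matters on a given machine.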
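"This clearly shows me that Lua is the bottleneck"
I am not sure that is correct and, without seeing your source code, I believe it may be wrong. The bottleneck could be inside your own code (which you have not shown). To find out, study the Lua source code.
You can use profiling tools (on Linux, gprof(1) or perf(1)). If you compile both your C++ code and the Lua sources with GCC, you may need to invoke it with specific options.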
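Alongside a profiler, a small timing harness makes the scaling curve reproducible: run the same per-thread job on 1, 2, 4, ... threads and record the wall-clock time for each run. The sketch below assumes std::thread and std::chrono (C++11) in place of the Qt classes, and busyWork is a hypothetical placeholder for the real per-worker job, not the Lua workload from the question.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical stand-in for one worker's job. `volatile` keeps the compiler
// from optimising the loop away.
long long busyWork(long long iterations) {
    volatile long long acc = 0;
    for (long long i = 0; i < iterations; ++i) {
        acc = acc + i;
    }
    return acc;
}

// Runs busyWork on `threadCount` threads at once and returns the elapsed
// wall-clock time in milliseconds, thread start-up and join included.
double timeRunMs(int threadCount, long long iterationsPerThread) {
    assert(threadCount > 0);
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([iterationsPerThread] {
            busyWork(iterationsPerThread);
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Since every thread does the same amount of work, ideal scaling shows up as a roughly constant time while the thread count stays at or below the core count. A time that grows with the thread count points to contention somewhere, which is where the profiler comes in to say where.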