Windows thread synchronization performance issue

Question

San*_*era 9 c++ multithreading windows-10

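I have a threading issue under Windows.

I am developing a program that runs complex physical simulations for different conditions. Say one simulation per hour of the year, so 8760 simulations. I group those simulations per thread, such that each thread runs a for loop of about 273 simulations (on average).

I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux, all the threads seem to be between 98% and 100% usage, while under Windows I get this:

(The first bar is the I/O thread reading data; the smaller bars are the processing threads. Red: synchronization, green: processing, purple: I/O.)

This is from Visual Studio's Concurrency Visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows executions.

I did my best to make the objects immutable to avoid issues, and that gave a big gain on my old 8-thread Intel i7. However, with many more threads, this issue arises.

For threading, I have tried a custom parallel for and the Taskflow library. Both perform identically for what I want to do.

Is there something fundamental about Windows threads that produces this behaviour?

The custom parallel for code: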


    /**
     * parallel for
     * @tparam Index integer type
     * @tparam Callable function type
     * @param start start index of the loop
     * @param end one past the final index of the loop (exclusive)
     * @param func function to evaluate
     * @param nb_threads number of threads, if zero, it is determined automatically
     */
    template<typename Index, typename Callable>
    static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {

        // Estimate number of threads in the pool
        if (nb_threads == 0) nb_threads = getThreadNumber();

        // Size of a slice for the range functions
        Index n = end - start;  // end is exclusive, so this is the number of indices
        Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
        slice = std::max(slice, Index(1));

        // [Helper] Inner loop
        auto launchRange = [&func] (Index k1, Index k2) {
            for (Index k = k1; k < k2; k++) {
                func(k);
            }
        };

        // Create pool and launch jobs
        std::vector<std::thread> pool;
        pool.reserve(nb_threads);
        Index i1 = start;
        Index i2 = std::min(start + slice, end);

        for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
            pool.emplace_back(launchRange, i1, i2);
            i1 = i2;
            i2 = std::min(i2 + slice, end);
        }

        if (i1 < end) {
            pool.emplace_back(launchRange, i1, end);
        }

        // Wait for jobs to finish
        for (std::thread &t : pool) {
            if (t.joinable()) {
                t.join();
            }
        }
    }
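
To make the slicing above easier to check, here is a minimal sketch that computes only the ranges handed to each thread, without launching anything. `sliceBounds` is a hypothetical helper that is not part of the original project, and it treats `end` as exclusive, as the doc comment describes.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper (not in the original project): returns the half-open
// [begin, end) ranges that a ParallelFor-style split assigns to each thread.
std::vector<std::pair<int, int>> sliceBounds(int start, int end, unsigned nb_threads) {
    std::vector<std::pair<int, int>> slices;
    if (nb_threads == 0 || start >= end) return slices;

    const int n = end - start;  // number of indices in [start, end)
    int slice = static_cast<int>(std::round(n / static_cast<double>(nb_threads)));
    slice = std::max(slice, 1);  // never hand out empty slices

    int i1 = start;
    // Every thread except the last gets `slice` indices...
    for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
        const int i2 = std::min(i1 + slice, end);
        slices.emplace_back(i1, i2);
        i1 = i2;
    }
    // ...and the last thread takes whatever remains.
    if (i1 < end) slices.emplace_back(i1, end);
    return slices;
}
```

With `sliceBounds(0, 8760, 32)` this gives 31 slices of 274 indices and a final slice of 266, so the work is split almost evenly and the split itself is unlikely to explain the idle threads.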

A complete C++ project illustrating the issue has been uploaded here.

main.cpp:

//
// Created by santi on 26/08/2022.
//
#include <iostream>
#include <vector>

#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"

void fillA(Matrix& A){

    Random rnd;
    rnd.setTimeBasedSeed();

    for(int i=0; i < A.getRows(); ++i)
        for(int j=0; j < A.getCols(); ++j)
            A(i, j) = (int) rnd.randInt(0, 1000);

}


void worker(const InputData& input_data,
            OutputData& output_data,
            const std::vector<int>& time_indices,
            int thread_index){

    std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";


    for(const int& t: time_indices){

        Matrix b = input_data.getAt(t);

        Matrix A(input_data.getDim(), input_data.getDim());
        fillA(A);

        Matrix x = A * b;

        output_data.setAt(t, x);
    }

}


void process(int time_steps, int dim, int n_threads){
    InputData input_data(time_steps, dim);
    OutputData output_data(time_steps, dim);

    // correct the number of threads
    if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }

    // generate indices
    std::vector<int> time_indices = arrange<int>(time_steps);

    // compute the split of indices per core
    std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );

    // run in parallel
    ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
            // run chunk
            worker(input_data, output_data, chunks[k].indices, k );
    } );
}

int main(){

    process(8760, 5000, 0);

    return 0;
}

Answer 1

Gug*_*ugi 11

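The performance problem you see is definitely caused by the many memory allocations, as Matt already suspected in his answer. To expand on this, here is a screenshot from Intel VTune, run on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):

As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the activity timeline of a small selection of the threads: green means the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory accesses a shared resource, which causes the threads to wait for each other.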
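
To see this effect in isolation, a small experiment along the following lines can help. It is only a sketch and not taken from the project: `workerLoop`, `runChurn`, and `ChurnResult` are made-up names, and a `std::vector<double>` stands in for the `Matrix` temporaries. Each thread either creates a fresh buffer in every iteration or reuses one buffer for the whole loop.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// One worker's loop. With reuse == false a fresh buffer is created in every
// iteration (one heap allocation and one free each time); with reuse == true
// the buffer is allocated once and only its contents are overwritten.
double workerLoop(int iters, std::size_t size, bool reuse) {
    double total = 0.0;
    std::vector<double> reused(reuse ? size : 0);
    for (int it = 0; it < iters; ++it) {
        if (reuse) {
            std::fill(reused.begin(), reused.end(), static_cast<double>(it));
            total += std::accumulate(reused.begin(), reused.end(), 0.0);
        } else {
            std::vector<double> fresh(size, static_cast<double>(it));
            total += std::accumulate(fresh.begin(), fresh.end(), 0.0);
        }
    }
    return total;
}

struct ChurnResult {
    double checksum;      // same value in both modes, keeps the work observable
    double milliseconds;  // wall-clock time for all threads
};

// Runs workerLoop on nThreads threads and measures the total wall-clock time.
ChurnResult runChurn(unsigned nThreads, int iters, std::size_t size, bool reuse) {
    std::vector<double> partial(nThreads, 0.0);
    std::vector<std::thread> pool;
    pool.reserve(nThreads);

    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < nThreads; ++i) {
        pool.emplace_back([&partial, i, iters, size, reuse] {
            partial[i] = workerLoop(iters, size, reuse);
        });
    }
    for (std::thread& t : pool) t.join();
    const auto t1 = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = t1 - t0;
    return {std::accumulate(partial.begin(), partial.end(), 0.0), elapsed.count()};
}
```

Comparing `runChurn(n, iters, size, false).milliseconds` with the `true` variant for a growing thread count `n` should give a rough idea of how much the allocator costs on a given system. The actual numbers depend on the allocator, the OS, and the buffer size, so treat them as indicative only.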

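I think you have only two real options:

Option 1: No more dynamic allocations

The most efficient approach is to rewrite the code so that everything is preallocated and all temporaries are gone. Adapted to your example code, you could replace b = input_data.getAt(t); and x = A * b; like this: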

void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x) 
{
  for (int i = 0; i < x.getRows(); ++i) {
    for (int j = 0; j < x.getCols(); ++j) {
      x(i, j) = 0.0;
      for (int k = 0; k < A.getCols(); ++k) {
        x(i,j) += (A(i,k) * b(k,j));
      }
    }
  }
}


void getAt(int t, Matrix const & input_data, Matrix & b) {
  for (int i = 0; i < input_data.getRows(); ++i)
    b(i, 0) = input_data(i, t);
}


void worker(const InputData& input_data,
            OutputData& output_data,
            const std::vector<int>& time_indices,
            int thread_index){

    std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";

    Matrix A(input_data.getDim(), input_data.getDim());
    Matrix b(input_data.getDim(), 1);
    Matrix x(input_data.getDim(), 1);

    for (const int & t: time_indices) {
      getAt(t, input_data.getMat(), b);
      fillA(A);
      MatrixVectorProduct(A, b, x);
      output_data.setAt(t, x);
    }

    std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}

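This fixes the performance problem. Here is a screenshot from VTune, where you can see much better utilization:

Option 2: Use a special allocator

The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I have had very good experience with is mimalloc (there are others, such as hoard or the one from TBB). You do not need to modify your source code; you only need to link against a specific library, as described in the documentation.

I tried mimalloc with your source code, and it gave close to 100% CPU utilization without any code changes. I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).

Additional notes

Matrix::allocSpace() allocates the memory using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements sit next to each other in memory, which allows more efficient access.
In general, though, I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX, etc.) and to get the benefits of a highly optimized library.
Make sure you compile your code with optimizations enabled.
Disable the various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Make sure you actually have enough memory installed.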
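
As an illustration of the first note, a contiguous row-major layout could look roughly like this. `FlatMatrix` is a hypothetical stand-in, not the project's `Matrix` class; only the accessor names `getRows` and `getCols` are borrowed from the code above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a matrix backed by one contiguous buffer instead of an
// array of separately allocated rows. Construction costs a single allocation.
class FlatMatrix {
public:
    FlatMatrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    // Row-major indexing: element (i, j) lives at offset i * cols + j.
    double& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }

    std::size_t getRows() const { return rows_; }
    std::size_t getCols() const { return cols_; }

    // Raw access to the underlying buffer, e.g. for passing to other libraries.
    const double* data() const { return data_.data(); }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};
```

Besides the better cache behaviour, this layout needs one allocation per matrix rather than one per row plus one for the row pointers, which also reduces pressure on the allocator.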

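In this specific case, a dedicated library like Eigen is recommended not primarily because of its manual vectorization, but because it uses [expression templates](https://en.wikipedia.org/wiki/Expression_templates), which minimize the number of allocations. (2 upvotes)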
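
The expression-template idea from the comment can be illustrated with a toy example. This is only a sketch of the general technique and not how Eigen implements it: `Expr`, `Sum`, and `Vec` are made-up names, only element-wise addition is supported, and all operands are assumed to have the same size.

```cpp
#include <cstddef>
#include <vector>

// CRTP base class, so that operator+ below only matches our own types.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Lazy node for "lhs + rhs": it only stores references to its operands and
// computes each element on demand, so building it allocates nothing.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Evaluates any expression element by element into the existing storage.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Builds the lazy node instead of computing a temporary Vec.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}
```

With this, `x = a + b + c;` for preallocated `Vec` objects runs as a single loop that writes straight into `x`, whereas a naive `operator+` returning a `Vec` by value would allocate a temporary for each `+`.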

Answer 2

Mat*_*ans 9

You said that all your memory is preallocated, but in the worker function I see this...

Matrix b = input_data.getAt(t);

which allocates and fills a new matrix b, and this...

Matrix A(input_data.getDim(), input_data.getDim());

which allocates and fills a new matrix A, and this...

Matrix x = A * b;

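which allocates and fills a new matrix x.

The heap is a global data structure, so the thread synchronization time you are seeing is probably contention in the memory allocation and deallocation functions.

All of this happens in a tight loop. You should fix the loop so that b is accessed by reference, and reuse the other two matrices in every iteration.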
