Multithreading increases time in C++

Tags: c++, memory, parallel-processing, multithreading, vector

I am trying to understand why increasing the number of threads (beyond a certain point) increases the CPU time instead of decreasing it.

A summary of what the code does: I have a main that creates a large vector based on a dimension. I fill it with the indices 0..dimension-1 and then shuffle it. Then, to divide and conquer, I partition this vector, giving a slice to each thread. I prepare a vector of solution vectors, handing one entry to each thread. Each thread calls a function on every element of its slice, passing a reference to its prepared solution. This function simply changes the value of the solution at the index given in input, and then it just increments all the elements of the solution (I put that in to raise the computation time a bit). Here is the code:

#include <algorithm>
#include <chrono>
#include <cstdio>    // printf
#include <cstdlib>   // std::srand
#include <ctime>     // std::time
#include <iostream>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

using namespace std;
template<typename T>
inline double getMs(T start, T end) {
    return double(
        std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
        .count()) /
        1000;
}

void fun(const std::vector<int>& slice, int dimension, 
    vector<int>& internal_sol) {
    int size = dimension;

    for (auto const& s : slice) {
        internal_sol[s] = 45;
        internal_sol[int(s/2)] = 45;
        if (s != 0) {
            internal_sol[s - 1] = 45;
        }
        if (s != int(internal_sol.size()) - 1) {
            internal_sol[s + 1] = 45;
        }
        for (auto& i_s : internal_sol)
            i_s += 1;
    }

}


int main() {

    
    std::random_device rd;
    std::mt19937 g(rd());

    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";

    std::srand(unsigned(std::time(nullptr)));

    int number_stops = 50000; 
    int number_transfers = 3; 
    int number_structures = 2; 

    auto dimension = number_stops * (number_transfers + 1) * number_structures;
    std::vector<int> v(dimension);
    std::iota(std::begin(v), std::end(v), 0);
    std::shuffle(std::begin(v), std::end(v), g);

    printf("stops:\t%d\ntransfers:\t%d\nstructures:\t%d\n\n", number_stops, number_transfers, number_structures);

    for (size_t np = 2; np < 17; np++) {

        size_t sz = dimension;
        size_t part = sz / np;
        auto start = std::chrono::steady_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);


        auto start_ext_sol = std::chrono::steady_clock::now();
        vector<vector<int>> external_sol; //As TAUs from velociraptor
        for (size_t i = 0; i < np; i++) {
            external_sol.emplace_back(dimension, -1);
        }
        double elapsed_ext_sol = getMs(start_ext_sol, std::chrono::steady_clock::now());


        auto paraTask = [&](size_t start, size_t end, vector<int>& ext_sol) {
            for (size_t l = start; l < end; ++l)
                fun({v[l]}, dimension, ext_sol);
            for (size_t i = 0; i < ext_sol.size(); i++)
                ext_sol[i] = -1;
        };

        for (size_t i = 0; i < np; i++) {
            size_t start = i * part;
            size_t length = (i + 1 == np) ? sz - i * part : part;
            threads[i] =
                std::thread(paraTask, start, start + length, std::ref(external_sol[i]));
        }
        
        // Join the threads
        for (auto&& thread : threads) thread.join();

        double elapsed = getMs(start, std::chrono::steady_clock::now());

        printf("parallel completed: %.3f sec. preparing all vectors ext_sols: %.3f sec\n",
            elapsed, elapsed_ext_sol);
    }

    return 0;
}

The results I get are:

16 concurrent threads are supported.
stops:  50000
transfers:      3
structures:     2

2 threads: parallel completed: 6.776 sec. preparing all vectors ext_sols: 0.000 sec
3 threads: parallel completed: 5.711 sec. preparing all vectors ext_sols: 0.004 sec
4 threads: parallel completed: 5.285 sec. preparing all vectors ext_sols: 0.003 sec
5 threads: parallel completed: 4.892 sec. preparing all vectors ext_sols: 0.001 sec
6 threads: parallel completed: 4.614 sec. preparing all vectors ext_sols: 0.008 sec
7 threads: parallel completed: 4.726 sec. preparing all vectors ext_sols: 0.001 sec
8 threads: parallel completed: 4.281 sec. preparing all vectors ext_sols: 0.004 sec
9 threads: parallel completed: 4.815 sec. preparing all vectors ext_sols: 0.007 sec
10 threads: parallel completed: 4.791 sec. preparing all vectors ext_sols: 0.008 sec
11 threads: parallel completed: 5.594 sec. preparing all vectors ext_sols: 0.002 sec
12 threads: parallel completed: 5.411 sec. preparing all vectors ext_sols: 0.012 sec
13 threads: parallel completed: 5.213 sec. preparing all vectors ext_sols: 0.010 sec
14 threads: parallel completed: 6.159 sec. preparing all vectors ext_sols: 0.005 sec
15 threads: parallel completed: 8.353 sec. preparing all vectors ext_sols: 0.006 sec
16 threads: parallel completed: 6.769 sec. preparing all vectors ext_sols: 0.012 sec

As you can see, the time goes up. I don't understand why this happens. Here each thread works on its own solution vector of dimension = 50,000 × 4 × 2 = 400,000 integers, i.e. roughly 1.6 MB (at 4 bytes per integer). So, in my understanding, the L1 cache size should not be the issue here. What is going wrong?

Here are some details about my machine:

  • CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  • RAM - 16 GB DDR4
  • OS - Windows 11; compiler - MS Visual Studio 2022

Here is the CPU-Z report on the memory hardware:


PS: I also tried removing the loop

for (auto& i_s : internal_sol)
    i_s += 1;

from the function, but the same thing happens (with shorter times, of course).

Answer (6 votes):

Your observation is consistent with Amdahl's law, combined with the overhead that comes with using multiple threads.

Very roughly speaking, without going into detail: there is always some part of the computation that benefits from adding more threads, call it P (the parallel part), some part that cannot benefit from more threads, call it S (the serial part), and an overhead T that each additional thread brings with it.

Using more threads t shrinks the contribution of P but not of S, and it grows the total overhead T*t:

 total time = P/t + S + T*t
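To make this concrete, here is a minimal sketch that just evaluates the model for t = 1..16; the values of P, S and T are made up for illustration, not measured on your machine:

#include <cstdio>

int main() {
    // Hypothetical model parameters (in seconds) -- purely illustrative,
    // not measured on any real machine.
    const double P = 8.0;  // parallelizable work
    const double S = 0.5;  // serial part
    const double T = 0.2;  // overhead added by each extra thread

    for (int t = 1; t <= 16; ++t)
        std::printf("%2d threads: %6.3f sec (model)\n", t, P / t + S + T * t);
    return 0;
}

With these numbers the modeled time falls until roughly 6 threads and then climbs again, the same U-shape as in your measurements.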

Even if the overhead is small, at some point it will dominate, because adding more threads cannot make P/t smaller than 0. In the limit of a very large number of threads you end up with a total time that only grows:

 total time (for huge t) = 0 + S + T*t
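Assuming this simple model, you can even estimate where the minimum lies by setting the derivative with respect to t to zero:

 d/dt (P/t + S + T*t) = -P/t^2 + T = 0   =>   t_opt = sqrt(P/T)

So the larger the per-thread overhead T is relative to the parallel work P, the earlier adding threads starts to hurt.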

This is only a very basic analysis of the different components contributing to the runtime. But it is enough to see that even for an algorithm that parallelizes well (P >> S) and has small overhead (small T), there is always a point beyond which the overhead of adding more threads dominates.

Amdahl did not consider the overhead, which is why in his model the total time (actually he considered the speedup, i.e. total time (t=1) / total time (t)) merely saturates at some point. Real threads, however, always come with a small overhead. To investigate further and to find out where the overhead actually comes from, I would try to measure the times for different data sizes. Then you may see the effects of the caches.
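One way to do such a measurement is to keep the total amount of work constant while varying the working-set size per thread. The following is only a rough sketch (the sizes, the thread count, and the touch kernel are my own choices, not taken from your code); the time per update should jump once a thread's working set no longer fits into a cache level:

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// One pass over a working set of n ints, repeated `passes` times.
static void touch(std::vector<int>& data, size_t passes) {
    for (size_t p = 0; p < passes; ++p)
        for (auto& x : data)
            x += 1;
}

int main() {
    const int threads = 8;                  // fixed thread count (my choice)
    const size_t touches = size_t(1) << 28; // updates per thread, kept constant

    // Working-set sizes per thread: 16 KB (L1-sized) up to 64 MB (RAM-sized).
    for (size_t n : {size_t(4) << 10, size_t(64) << 10,
                     size_t(1) << 20, size_t(16) << 20}) {
        size_t passes = touches / n;        // same total work for every size
        std::vector<std::vector<int>> sets(threads, std::vector<int>(n, 1));
        std::vector<std::thread> pool;

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < threads; ++i)
            pool.emplace_back(touch, std::ref(sets[i]), passes);
        for (auto& t : pool) t.join();
        double sec = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        std::printf("%9zu ints/thread: %6.3f sec (%5.2f ns/update)\n",
            n, sec, 1e9 * sec / double(touches));
    }
    return 0;
}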

For completeness, I should also mention Gustafson, who observed that the parallel part P typically grows with the problem size N, while the non-parallelizable part S often does not. For example, reading some configuration from a file has to be done once, whether you run an algorithm for 100 or for 10000 iterations. Under the assumptions P = p(N) and S = constant, one can conclude that increasing the workload N improves the utilization of the hardware. You may be seeing this effect: with smaller vectors, the time already increases at a small number of threads, while with bigger vectors it only increases at larger thread counts.
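A tiny numeric illustration of Gustafson's point, again with made-up numbers: holding the serial part S fixed while the parallel work P grows with N, the achievable speedup (ignoring overhead) approaches the thread count:

#include <cstdio>

int main() {
    const double S = 1.0;                  // fixed serial part (assumption)
    const int t = 8;                       // thread count
    for (double P : {1.0, 10.0, 100.0}) {  // parallel work growing with N
        double speedup = (S + P) / (S + P / t);
        std::printf("P = %6.1f: speedup with %d threads = %.2f\n", P, t, speedup);
    }
    return 0;
}

With P = 1 the 8 threads only buy a speedup of about 1.8, while with P = 100 it is already about 7.5, i.e. close to the ideal 8.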