Tags: c++, memory, parallel-processing, multithreading, vector
I am trying to understand why increasing the number of threads (beyond a certain point) increases CPU time instead of decreasing it.
A summary of what the code does: I have a main that creates a large vector based on a dimension. I fill it with the indices 0..dimension-1 and then shuffle it. Then, to divide and conquer, I partition this vector, giving one slice to each thread. I prepare a vector of solution vectors and hand one entry to each thread. Each thread calls a function on every element of its slice, passing a reference to its prepared solution. The function simply changes the value of the solution at the index given in the input, and then increments all elements of the solution (I added that just to raise the computation time a bit). Here is the code:
#include <algorithm>
#include <random>
#include <thread>
#include <iostream>
#include <chrono>
#include <vector>
#include <numeric>
#include <cstdio>
#include <cstdlib>
#include <ctime>

using namespace std;

// Convert a steady_clock interval to seconds (despite the name, the
// returned value is in seconds, built from a millisecond count).
template<typename T>
inline double getMs(T start, T end) {
    return double(
               std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
                   .count()) /
           1000;
}

// Mark a few cells around each index in the slice, then bump every
// element of the solution vector (the inner loop only exists to
// inflate the computation time).
void fun(const std::vector<int>& slice, int dimension,
         vector<int>& internal_sol) {
    for (auto const& s : slice) {
        internal_sol[s] = 45;
        internal_sol[s / 2] = 45;
        if (s != 0) {
            internal_sol[s - 1] = 45;
        }
        if (s != int(internal_sol.size()) - 1) {
            internal_sol[s + 1] = 45;
        }
        for (auto& i_s : internal_sol)
            i_s += 1;
    }
}
int main() {
    std::random_device rd;
    std::mt19937 g(rd());

    unsigned int n = std::thread::hardware_concurrency();
    std::cout << n << " concurrent threads are supported.\n";

    std::srand(unsigned(std::time(nullptr)));

    int number_stops = 50000;
    int number_transfers = 3;
    int number_structures = 2;

    // Build a vector of indices 0..dimension-1 and shuffle it.
    auto dimension = number_stops * (number_transfers + 1) * number_structures;
    std::vector<int> v(dimension);
    std::iota(std::begin(v), std::end(v), 0);
    std::shuffle(std::begin(v), std::end(v), g);

    printf("stops:\t%d\ntransfers:\t%d\nstructures:\t%d\n\n",
           number_stops, number_transfers, number_structures);

    for (size_t np = 2; np < 17; np++) {
        size_t sz = dimension;
        size_t part = sz / np;

        auto start = std::chrono::steady_clock::now();
        std::cout << np << " threads: ";
        std::vector<std::thread> threads(np);

        // One private solution vector per thread.
        auto start_ext_sol = std::chrono::steady_clock::now();
        vector<vector<int>> external_sol; // As TAUs from velociraptor
        for (size_t i = 0; i < np; i++) {
            external_sol.emplace_back(dimension, -1);
        }
        double elapsed_ext_sol = getMs(start_ext_sol, std::chrono::steady_clock::now());

        // Each thread processes one slice of v, then resets its solution.
        auto paraTask = [&](size_t start, size_t end, vector<int>& ext_sol) {
            for (size_t l = start; l < end; ++l)
                fun({v[l]}, dimension, ext_sol);
            for (size_t i = 0; i < ext_sol.size(); i++)
                ext_sol[i] = -1;
        };

        for (size_t i = 0; i < np; i++) {
            size_t start = i * part;
            size_t length = (i + 1 == np) ? sz - i * part : part;
            threads[i] =
                std::thread(paraTask, start, start + length, std::ref(external_sol[i]));
        }

        // Join the threads.
        for (auto&& thread : threads) thread.join();

        double elapsed = getMs(start, std::chrono::steady_clock::now());
        printf("parallel completed: %.3f sec. preparing all vectors ext_sols: %.3f sec\n",
               elapsed, elapsed_ext_sol);
    }
    return 0;
}
The results I get are:
16 concurrent threads are supported.
stops: 50000
transfers: 3
structures: 2
2 threads: parallel completed: 6.776 sec. preparing all vectors ext_sols: 0.000 sec
3 threads: parallel completed: 5.711 sec. preparing all vectors ext_sols: 0.004 sec
4 threads: parallel completed: 5.285 sec. preparing all vectors ext_sols: 0.003 sec
5 threads: parallel completed: 4.892 sec. preparing all vectors ext_sols: 0.001 sec
6 threads: parallel completed: 4.614 sec. preparing all vectors ext_sols: 0.008 sec
7 threads: parallel completed: 4.726 sec. preparing all vectors ext_sols: 0.001 sec
8 threads: parallel completed: 4.281 sec. preparing all vectors ext_sols: 0.004 sec
9 threads: parallel completed: 4.815 sec. preparing all vectors ext_sols: 0.007 sec
10 threads: parallel completed: 4.791 sec. preparing all vectors ext_sols: 0.008 sec
11 threads: parallel completed: 5.594 sec. preparing all vectors ext_sols: 0.002 sec
12 threads: parallel completed: 5.411 sec. preparing all vectors ext_sols: 0.012 sec
13 threads: parallel completed: 5.213 sec. preparing all vectors ext_sols: 0.010 sec
14 threads: parallel completed: 6.159 sec. preparing all vectors ext_sols: 0.005 sec
15 threads: parallel completed: 8.353 sec. preparing all vectors ext_sols: 0.006 sec
16 threads: parallel completed: 6.769 sec. preparing all vectors ext_sols: 0.012 sec
As you can see, the time increases. I don't understand why this happens. Here each thread's solution vector holds dimension = 400,000 integers, which is about 1.6 MB (4 bytes per integer). So, as I understand it, we are not talking about the size of the L1 cache here. What is going wrong?
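For reference, a quick sanity check of the per-thread footprint against cache capacities (the cache sizes in the comments are typical desktop values, assumed for illustration, not measured on this machine):

#include <cstdio>

int main() {
    const long long dimension = 50000LL * (3 + 1) * 2;   // 400,000 ints
    const long long per_thread_bytes = dimension * sizeof(int);
    // Typical desktop caches (assumed values): L1d ~32-48 KB per core,
    // L2 ~0.5-2 MB per core, L3 ~8-32 MB shared across all cores.
    std::printf("per-thread solution vector: %lld bytes (~%.2f MB)\n",
                per_thread_bytes, per_thread_bytes / 1e6);
    std::printf("with 16 threads alive at once: ~%.1f MB of solution data\n",
                16 * per_thread_bytes / 1e6);
    return 0;
}

So one thread's solution vector is roughly L2-sized, while all sixteen together far exceed a typical shared L3.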
Here are some details about my machine:
[CPU-Z memory hardware report]
PS: I also tried removing the loop

    for (auto& i_s : internal_sol)
        i_s += 1;

from the function, but the same thing happens (with shorter times, of course).
Your observation is consistent with Amdahl's law, once you take into account the overhead of using multiple threads.
Very roughly speaking, without going into detail: there is always some part of the computation that benefits from adding more threads, call it P, some part that does not benefit, call it S, and an overhead T that each thread brings with it.
More threads t shrink P/t but not S, and they grow the total overhead T*t:

    total time = P/t + S + T*t

Even if the overhead is small, at some point it dominates, because adding more threads cannot make P/t smaller than 0. In the limit of a huge number of threads the time eventually goes up again:

    total time (for huge t) = 0 + S + T*t
This is only a very basic analysis of the different components that contribute to the runtime. But it is enough to see that even for an algorithm that parallelizes well (P > S) with small overhead (small T), there is always a point where the overhead of adding more threads dominates.
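A minimal sketch of that model (the values of P, S and T below are made-up numbers purely for illustration); minimizing P/t + S + T*t over t puts the optimum near t* = sqrt(P/T):

#include <cmath>
#include <cstdio>

int main() {
    // Made-up model parameters (seconds): parallelizable work P,
    // serial work S, and per-thread overhead T.
    const double P = 10.0, S = 0.5, T = 0.05;

    for (int t = 1; t <= 32; ++t) {
        double total = P / t + S + T * t;
        std::printf("t=%2d  total=%.3f s\n", t, total);
    }
    // Treating t as continuous, d/dt (P/t + S + T*t) = -P/t^2 + T = 0
    // gives the optimal thread count t* = sqrt(P/T).
    std::printf("optimum near t* = %.1f threads\n", std::sqrt(P / T));
    return 0;
}

With these example numbers the total time falls until about 14 threads and then rises again, which is the same U-shape as in your measurements.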
Amdahl did not consider the overhead, which is why his total time (he actually considered the speedup, i.e. total time(t=1) / total time(t)) merely saturates at some point. Real threads, however, always come with a small overhead. To investigate further and find out where the overhead actually comes from, I would try to measure the times for different data sizes. Then you may see the effects of the caches.
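A minimal sketch of such an experiment (the workload here is a simplified stand-in for fun(), and the sizes, thread counts and repetition count are arbitrary illustration values):

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Toy workload: touch every element once (stand-in for fun()).
static void work(std::vector<int>& sol) {
    for (auto& x : sol) x += 1;
}

int main() {
    // Sweep the per-thread data size to see where caches stop helping.
    for (size_t dim : {50'000, 400'000, 3'200'000}) {
        for (unsigned np : {2u, 8u, 16u}) {
            std::vector<std::vector<int>> sols(np, std::vector<int>(dim, -1));
            auto t0 = std::chrono::steady_clock::now();
            std::vector<std::thread> threads;
            for (unsigned i = 0; i < np; ++i)
                threads.emplace_back([&, i] {
                    for (int rep = 0; rep < 100; ++rep) work(sols[i]);
                });
            for (auto& t : threads) t.join();
            double s = std::chrono::duration<double>(
                           std::chrono::steady_clock::now() - t0).count();
            std::printf("dim=%8zu np=%2u: %.3f s\n", dim, np, s);
        }
    }
    return 0;
}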
For completeness I should also mention Gustafson, who observed that the parallel part P usually grows with the problem size, while the non-parallel part S usually does not. For example, reading some configuration from a file has to be done once, no matter whether you run some algorithm for 100 or 10000 iterations. Under this assumption, P = p(N) and S = constant, one can conclude that increasing the workload N improves the utilization of the hardware. You may be seeing this effect when, with a smaller vector, the time already goes up for a small number of threads, while with a bigger vector the time only goes up at larger thread counts.
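To make the two viewpoints concrete, here is a small sketch (the parallel fraction p = 0.95 is an arbitrary example value): Amdahl fixes the problem size and asks how much faster t threads make it, while Gustafson scales the problem with t and asks how much more work gets done in the same time.

#include <cstdio>

int main() {
    const double p = 0.95; // assumed parallel fraction, for illustration

    for (int t = 1; t <= 32; t *= 2) {
        // Amdahl: fixed problem size, speedup saturates at 1/(1-p).
        double amdahl = 1.0 / ((1.0 - p) + p / t);
        // Gustafson: problem grows with t, scaled speedup keeps climbing.
        double gustafson = (1.0 - p) + p * t;
        std::printf("t=%2d  Amdahl=%6.2f  Gustafson=%6.2f\n",
                    t, amdahl, gustafson);
    }
    return 0;
}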