Numba parallelization with prange gets slower when using more threads

Question

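I tried a simple piece of code that parallelizes a loop with numba and prange. But for some reason, it gets slower rather than faster when I use more threads. Why does this happen? (CPU: Ryzen 7 2700X, 8 cores / 16 threads, 3.7 GHz)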

    import numpy as np
    from numba import njit, prange, set_num_threads, get_num_threads

    @njit(parallel=True, fastmath=True)
    def test1():
        x = np.empty((10, 10))
        # Outer loop is split across threads; only 10 x 10 = 100 writes in total
        for i in prange(10):
            for j in range(10):
                x[i, j] = i + j

    Number of threads : 1
    897 ns ± 18.3 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 2
    1.68 µs ± 262 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 3
    2.4 µs ± 163 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 4
    4.12 µs ± 294 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 5
    4.62 µs ± 283 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 6
    5.01 µs ± 145 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 7
    5.52 µs ± 194 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 8
    4.85 µs ± 140 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 9
    6.47 µs ± 348 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 10
    6.88 µs ± 120 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 11
    7.1 µs ± 154 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 12
    7.47 µs ± 159 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 13
    7.91 µs ± 160 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 14
    9.04 µs ± 472 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 15
    9.74 µs ± 581 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
    Number of threads : 16
    11 µs ± 967 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)

Answer 1

Jér*_*ard 6

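This is completely normal. Numba needs to create threads and distribute the work between them so that they can run the computation in parallel. Numba can use different threading backends. The default is generally OpenMP, and the default OpenMP implementation should be IOMP (the OpenMP runtime of ICC/Clang), which tries to create the threads only once. Still, sharing the work between threads is far slower than iterating over 100 values. A modern mainstream processor should be able to execute the 2 nested loops sequentially within 0.1-0.2 µs. Numba should also be able to unroll the two loops. The overhead of calling a Numba function is also generally a few hundred nanoseconds, and the allocation of the NumPy array should be far slower than the actual loops.

Additionally, other overheads make this code significantly slower with multiple threads, even if the previous ones were negligible. For example, false sharing causes the writes to be mostly serialized, and thus slower than if they were done by a single thread (because of a cache-line bouncing effect operating on the LLC on x86-64 platforms).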
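To make the two effects above more concrete, here is a small back-of-the-envelope sketch in plain Python. It assumes 64-byte cache lines, 8-byte float64 items, a C-contiguous 10 x 10 array that starts on a cache-line boundary (none of which is guaranteed for a real allocation), and it reuses the mean timings posted in the question. The helper names are made up for illustration.

```python
CACHE_LINE = 64  # bytes per cache line (assumption, typical for x86-64)
ITEM = 8         # bytes per float64 element
N = 10           # the array in the question is N x N

def cache_lines_of_row(i, base_offset=0):
    """Set of cache-line indices touched by row i of a C-contiguous array."""
    first = (base_offset + i * N * ITEM) // CACHE_LINE
    last = (base_offset + (i + 1) * N * ITEM - 1) // CACHE_LINE
    return set(range(first, last + 1))

# Adjacent rows that share a cache line: if two threads write them,
# the line has to bounce between cores (false sharing).
shared = [i for i in range(N - 1)
          if cache_lines_of_row(i) & cache_lines_of_row(i + 1)]
total_lines = (N * N * ITEM - 1) // CACHE_LINE + 1

# Mean timings from the question (ns) for 1 and 16 threads.
t1_ns, t16_ns = 897, 11_000
extra_per_thread_ns = (t16_ns - t1_ns) / (16 - 1)

print(f"whole array spans {total_lines} cache lines")
print(f"rows i and i+1 share a line for i in {shared}")
print(f"about {extra_per_thread_ns:.0f} ns added per extra thread")
```

Under these assumptions the whole array fits in 13 cache lines and most neighbouring rows overlap on a line, while each extra thread costs several hundred nanoseconds in the posted measurements, which is several times the 0.1-0.2 µs estimated above for the loops themselves.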

Note that creating a thread generally takes significantly more than 1 µs, because it requires a system call.
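As a rough illustration of that cost, the following sketch times the creation of do-nothing threads with Python's standard `threading` module. This is not what Numba's threading layer does internally: Python threads carry interpreter overhead on top of the system call, so the number printed here should be read as a loose upper bound, and the function name is purely illustrative.

```python
import threading
import time

def mean_thread_start_join_seconds(n_threads=200):
    """Mean wall-clock time to create, start and join one do-nothing thread."""
    start = time.perf_counter()
    for _ in range(n_threads):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / n_threads

per_thread = mean_thread_start_join_seconds()
print(f"{per_thread * 1e6:.1f} us per thread (create + start + join)")
```

The exact figure depends on the operating system and the machine, which is why thread pools that create their workers only once are the norm.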

Put shortly: use threads when the work to do is big enough and can be parallelized efficiently.

Archived: 3 years, 8 months ago
Views: 903
Last activity: 3 years, 8 months ago