How do I make multithreading in Julia scale with the number of threads?

Question

pro*_*tos 10 parallel-processing performance multithreading julia chapel

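I know that questions about multithreading performance in Julia have been asked before (for example, here), but they involve fairly complex code in which many things could be at play.

Here, I am running a very simple loop on multiple threads using Julia v1.5.3, and the speedup does not seem to scale well compared with running the same loop in, for instance, Chapel.

I would like to know what I am doing wrong and how I could run multithreaded code in Julia more efficiently.

Sequential code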

using BenchmarkTools

function slow(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 8.034 s

Shared-memory parallelism with `Threads.@threads` on 4 threads

using BenchmarkTools
using Base.Threads

function slow(n::Int, digits::String)
    total = Atomic{Float64}(0)
    @threads for i in 1:n
        if !occursin(digits, string(i))
            atomic_add!(total, 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

Time: 6.938 s
Speedup: 1.2

Shared-memory parallelism with FLoops on 4 threads

using BenchmarkTools
using FLoops

function slow(n::Int, digits::String)
    total = 0.0
    @floop for i in 1:n
        if !occursin(digits, string(i))
            @reduce(total += 1.0 / i)
        end
    end
    println("total = ", total)
end

@btime slow(Int64(1e8), "9")

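Time: 10.850 s
No speedup: slower than the sequential code.

Tests with different numbers of threads (different hardware)

I tested the sequential code and the `Threads.@threads` code on a different machine and experimented with various numbers of threads.

Here are the results:

Number of threads	Speedup
2	1.2
4	1.2
8	1.0 (no speedup)
16	0.9 (the code takes longer to run than the sequential code)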

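For a heavier computation (n = 1e9 in the code above), which should minimize the relative effect of any overhead, the results are very similar:

Number of threads	Speedup
2	1.1
4	1.3
8	1.1
16	0.8 (the code takes longer to run than the sequential code)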

Comparison: the same loop in Chapel shows perfect scaling

Code run with Chapel v1.23.0:

use Time;
var watch: Timer;
config const n = 1e8: int;
config const digits = "9";
var total = 0.0;
watch.start();
forall i in 1..n with (+ reduce total) {
  if (i: string).find(digits) == -1 then
    total += 1.0 / i;
 }
watch.stop();
writef("total = %{###.###############} in %{##.##} seconds\n",
        total, watch.elapsed());

First run (same hardware as the first Julia tests):

Number of threads	Time (s)	Speedup
1	13.33	n/a
2	7.34	1.8

Second run (same hardware):

Number of threads	Time (s)	Speedup
1	13.59	n/a
2	6.83	2.0

Third run (different hardware):

Number of threads	Time (s)	Speedup
1	19.99	n/a
2	10.06	2.0
4	5.05	4.0
8	2.54	7.9
16	1.28	15.6

Answer 1

jli*_*ing 6

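Someone else can give a much more detailed analysis than I can, but the main reason naive Julia threading performs badly here is that the "task" in each iteration is too light. Using an atomic update (effectively a lock) in this situation carries a huge overhead, because all threads end up waiting on the same variable far too often.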
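As a rough sketch of what avoiding the shared atomic can look like (this is not part of the original benchmark, and `partial_sum` and `slow_chunked` are hypothetical names introduced for illustration), each task can accumulate into its own local variable over a contiguous chunk of `1:n`, with the partial results combined only once at the end:

```julia
using Base.Threads: @spawn, nthreads

# Sum 1/i over all i in `r` whose decimal representation does not contain `digits`.
# The accumulator is local to the call, so no synchronization is needed.
function partial_sum(r::UnitRange{Int}, digits::String)
    total = 0.0
    for i in r
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    return total
end

# Hypothetical helper: split 1:n into `nchunks` contiguous ranges and
# run one task per range. Defaults to one chunk per available thread.
function slow_chunked(n::Int, digits::String; nchunks::Int = nthreads())
    chunk = cld(n, nchunks)              # chunk length, rounded up
    tasks = map(1:nchunks) do k
        lo = (k - 1) * chunk + 1
        hi = min(k * chunk, n)           # an empty range if lo > hi
        @spawn partial_sum(lo:hi, digits)
    end
    return sum(fetch, tasks)             # combine the partial sums once
end
```

With this layout the threads never write to a shared variable inside the loop, which is the same idea a `+ reduce` intent in Chapel relies on. Because the partial sums are added in a different order, the last few digits of the result can differ from the sequential version.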

Since your Chapel code is doing a map-reduce, we can try a parallel `mapreduce` in Julia as well:


julia> function slow(n::Int, digits::String)
           total = 0.0
           for i in 1:n
               if !occursin(digits, string(i))
                   total += 1.0 / i
               end
           end
           "total = $total"
       end
slow (generic function with 1 method)

julia> @btime slow(Int64(1e5), "9")
  6.021 ms (200006 allocations: 9.16 MiB)
"total = 9.692877792106202"

julia> using ThreadsX

julia> function slow_thread_thx(n::Int, digits::String)
           total = ThreadsX.mapreduce(+,1:n) do i
               if !occursin(digits, string(i))
                   1.0 / i
               else
                   0.0
               end
           end
           "total = $total"
       end

julia> @btime slow_thread_thx(Int64(1e5), "9")
  1.715 ms (200295 allocations: 9.17 MiB)
"total = 9.692877792106195"

This is with 4 threads. I have tested with other numbers of threads and confirmed that the scaling is pretty much linear.
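When repeating this kind of test, it helps to confirm how many threads the session really has, since the count is fixed when Julia starts (for example with `julia -t 4` on Julia 1.5 or later, or through the `JULIA_NUM_THREADS` environment variable). A minimal check, where `thread_report` is a hypothetical helper name, might look like this:

```julia
using Base.Threads: nthreads

# Hypothetical helper: return the number of threads in this session and
# warn when the session is effectively single-threaded.
function thread_report()
    n = nthreads()
    if n == 1
        @warn "Only one thread is available; start Julia with `-t N` or set JULIA_NUM_THREADS."
    end
    return n
end

println("Running with ", thread_report(), " thread(s)")
```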

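By the way, as a general tip, you should try to avoid printing inside benchmarked code: it makes a mess when the timing is repeated, and if your task is fast, STDIO can take a non-negligible share of the time.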
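One way to follow that tip, sketched here with Base's `@timed` rather than BenchmarkTools so that the snippet has no dependencies, is to have the function return the total and print it once, outside the timed region. The name `slow_return` is hypothetical, and the `stats.value` / `stats.time` field access assumes Julia 1.5 or later:

```julia
# Same loop as in the question, but returning the result instead of printing it.
function slow_return(n::Int, digits::String)
    total = 0.0
    for i in 1:n
        if !occursin(digits, string(i))
            total += 1.0 / i
        end
    end
    return total
end

slow_return(10, "9")                        # warm-up call so compilation is not timed
stats = @timed slow_return(100_000, "9")    # named tuple with the value and the elapsed time

# Print once, after the measurement is finished.
println("total = ", stats.value, " in ", stats.time, " seconds")
```

The same function can be passed to `@btime` unchanged; because it returns the value rather than printing it, repeated samples produce no output.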

Asked:	4 years, 9 months ago
Viewed:	699 times
Last active:	4 years, 9 months ago
