hja*_*jab 2 parallel-processing julia
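I want to use shared-memory multithreading in Julia. As the Threads.@threads macro does, I can use ccall(:jl_threading_run ...) to do this. Although my code now runs in parallel, I am not getting the speedup I expected.
The following code is intended as a minimal example of the approach I am taking and the performance problem I am seeing: [Edit: see the more minimal example further down]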
nthreads = Threads.nthreads()
test_size = 1000000
println("STARTED with ", nthreads, " thread(s) and test size of ", test_size, ".")
# Something to be processed:
objects = rand(test_size)
# Somewhere for our results
results = zeros(nthreads)
counts = zeros(nthreads)
# A function to do some work.
function worker_fn()
    work_idx = 1
    my_result = results[Threads.threadid()]
    while work_idx > 0
        my_result += objects[work_idx]
        work_idx += nthreads
        if work_idx > test_size
            break
        end
        counts[Threads.threadid()] += 1
    end
end
# Call our worker function using jl_threading_run
@time ccall(:jl_threading_run, Ref{Cvoid}, (Any,), worker_fn)
# Verify that we made as many calls as we think we did.
println("\nCOUNTS:")
println("\tPer thread:\t", counts)
println("\tSum:\t\t", sum(counts))
On an i7-7700, a typical single-threaded result is:
STARTED with 1 thread(s) and test size of 1000000.
0.134606 seconds (5.00 M allocations: 76.563 MiB, 1.79% gc time)
COUNTS:
Per thread: [999999.0]
Sum: 999999.0
And with 4 threads:
STARTED with 4 thread(s) and test size of 1000000.
0.140378 seconds (1.81 M allocations: 25.661 MiB)
COUNTS:
Per thread: [249999.0, 249999.0, 249999.0, 249999.0]
Sum: 999996.0
Multithreading slows things down! Why?
Edit: a better minimal example can be built with the @threads macro itself.
a = zeros(Threads.nthreads())
b = rand(test_size)
calls = zeros(Threads.nthreads())
@time Threads.@threads for i = 1 : test_size
    a[Threads.threadid()] += b[i]
    calls[Threads.threadid()] += 1
end
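I had wrongly assumed that the inclusion of the @threads macro in Julia meant it would provide a benefit.
The problem you are running into is most likely false sharing: each thread writes to its own element, but neighbouring elements of a and calls sit on the same cache line, so the cores keep invalidating each other's copy of that line.
You can solve it by spacing the locations you write to far enough apart (this is a "quick and dirty" implementation, meant only to show the essence of the change):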
julia> function f(spacing)
           test_size = 1000000
           a = zeros(Threads.nthreads()*spacing)
           b = rand(test_size)
           calls = zeros(Threads.nthreads()*spacing)
           Threads.@threads for i = 1 : test_size
               @inbounds begin
                   a[Threads.threadid()*spacing] += b[i]
                   calls[Threads.threadid()*spacing] += 1
               end
           end
           a, calls
       end
f (generic function with 1 method)
julia> @btime f(1);
  41.525 ms (35 allocations: 7.63 MiB)
julia> @btime f(8);
  2.189 ms (35 allocations: 7.63 MiB)
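A spacing of 8 Float64 values is 64 bytes, which matches the cache-line size of most current x86-64 CPUs. The sketch below derives the spacing from an assumed line size instead of hard-coding it. The names CACHE_LINE_BYTES, slots_per_line, slot and padded_sums are made up for this illustration, and the 64-byte figure is an assumption rather than something Julia reports. Slots are indexed by chunk number rather than by Threads.threadid(), so the result does not depend on how iterations are scheduled.

```julia
# Assumption: 64-byte cache lines (typical on x86-64, not guaranteed elsewhere).
const CACHE_LINE_BYTES = 64

# Number of elements of type T needed to fill one cache line.
slots_per_line(::Type{T}) where {T} = cld(CACHE_LINE_BYTES, sizeof(T))

# Index of the padded slot that belongs to chunk k.
slot(k, spacing) = (k - 1) * spacing + 1

function padded_sums(b::Vector{Float64}, nchunks::Int = Threads.nthreads())
    spacing = slots_per_line(Float64)
    acc = zeros(nchunks * spacing)
    chunk_len = cld(length(b), nchunks)
    Threads.@threads for k in 1:nchunks
        lo = (k - 1) * chunk_len + 1
        hi = min(k * chunk_len, length(b))
        s = slot(k, spacing)
        @inbounds for i in lo:hi
            acc[s] += b[i]  # each chunk writes only to its own padded slot
        end
    end
    return sum(acc)
end
```

Calling padded_sums(rand(10^6)) should agree with sum on the same vector up to floating-point rounding. Padding only removes the cache-line contention; the loop still writes to memory on every iteration.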
Or accumulate per thread in a local variable, like this (this is the preferred approach, since it should be uniformly faster):
# Split 1:n into nthreads() contiguous ranges and return the one for this thread.
function getrange(n)
    tid = Threads.threadid()
    nt = Threads.nthreads()
    d, r = divrem(n, nt)
    from = (tid - 1) * d + min(r, tid - 1) + 1
    # The first r threads each take one extra element.
    to = from + d - 1 + (tid <= r ? 1 : 0)
    from:to
end
function f()
    test_size = 10^8
    a = zeros(Threads.nthreads())
    b = rand(test_size)
    calls = zeros(Threads.nthreads())
    Threads.@threads for k = 1 : Threads.nthreads()
        # Accumulate in locals; write to the shared arrays only once at the end.
        local_a = 0.0
        local_c = 0.0
        for i in getrange(test_size)
            for j in 1:10
                local_a += b[i]
                local_c += 1
            end
        end
        a[Threads.threadid()] = local_a
        calls[Threads.threadid()] = local_c
    end
    a, calls
end
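The same local-accumulation idea can be written without relying on Threads.threadid() at all, which matters on newer Julia versions where the iterations of @threads are not necessarily pinned one per thread. The following is only a sketch under that assumption: chunked_sum and its ntasks keyword are hypothetical names, and it needs Threads.@spawn (Julia 1.3 or later). Each task sums its own contiguous chunk into a local variable and returns it, so nothing shared is written inside the hot loop.

```julia
# Sum `b` in parallel, one task per contiguous chunk of indices.
function chunked_sum(b::AbstractVector{Float64}; ntasks::Int = Threads.nthreads())
    isempty(b) && return 0.0
    chunk_len = cld(length(b), ntasks)
    tasks = map(Iterators.partition(eachindex(b), chunk_len)) do idxs
        Threads.@spawn begin
            local_a = 0.0              # task-local accumulator
            @inbounds for i in idxs
                local_a += b[i]
            end
            local_a                    # value returned by the task
        end
    end
    # Combine the per-task partial sums.
    return sum(fetch.(tasks))
end
```

For example, chunked_sum(rand(10^7)) should agree with sum on the same vector up to floating-point rounding, because the additions happen in a different order.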
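Also note that you are probably using 4 threads on a machine with 2 physical cores (and only 4 virtual cores), so the gain from threading will not be linear.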
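A quick way to compare the two numbers on your own machine is shown below. Note that Sys.CPU_THREADS reports logical CPUs (hardware threads), not physical cores; Base does not expose the physical core count as far as I know, so that part has to come from the CPU's specification or an external tool.

```julia
# Threads Julia was started with (set via JULIA_NUM_THREADS or the -t flag on recent versions).
println("Julia threads: ", Threads.nthreads())
# Logical CPUs visible to Julia; includes hyper-threads, so it can exceed the physical core count.
println("Logical CPUs:  ", Sys.CPU_THREADS)
```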