For*_*der 5 python optimization performance julia numba
我有一个用于仿真的程序,并且在程序中有一个功能。我已经意识到该功能会消耗大部分仿真时间。因此,我尝试首先优化该功能。功能如下
Julia版本1.1:
function fun_jul(M,ksi,xi,x)
F(n,x) = sin(n*pi*(x+1)/2)*cos(n*pi*(x+1)/2);
K = length(ksi);
Z = zeros(length(x),K);
for n in 1:M
for k in 1:K
for l in 1:length(x)
Z[l,k] += (1-(n/(M+1))^2)^xi*F(n,ksi[k])*F(n,x[l]);
end
end
end
return Z
end
Run Code Online (Sandbox Code Playgroud)
我还用python + numba重写了上面的函数,以进行比较,如下所示
Python + numba
import numpy as np
from numba import prange, jit
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
K = len(ksi);
F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
Z = np.zeros((len(x),K));
for n in range(1,M+1):
for k in prange(0,K):
Z[:,k] += (1-(n/(M+1))**2)**xi*F(n,ksi[k])*F(n,x);
return Z
Run Code Online (Sandbox Code Playgroud)
但是茱莉亚代码很慢,这是我的结果:
朱莉娅结果:
using BenchmarkTools
N=400; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 1.22 MiB
allocs estimate: 2
--------------
minimum time: 25.039 s (0.00% GC)
median time: 25.039 s (0.00% GC)
mean time: 25.039 s (0.00% GC)
maximum time: 25.039 s (0.00% GC)
--------------
samples: 1
evals/sample: 1
Run Code Online (Sandbox Code Playgroud)
Python结果:
N=400;a = -0.5;b = 0.5;x = np.linspace(a,b,N);cc = x;M = 2*N + 100;xi = M/40;
%timeit fun_py(M,cc,xi,x);
1.2 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)
任何帮助改进julia和python + numba的代码的帮助将不胜感激。
更新
基于@Przemyslaw Szufel的回答和其他帖子,我改进了numba和julia代码。现在两者都已并行化。这是时间
Python + Numba时间:
@jit(nopython=True, parallel=True)
def fun_py(M,ksi,xi,x):
K = len(ksi);
F = lambda nn,xx: np.sin(nn*np.pi*(xx+1)/2)*np.cos(nn*np.pi*(xx+1)/2);
Z = np.zeros((K,len(x)));
for n in range(1,M+1):
pw = (1-(n/(M+1))**2)**xi; f=F(n,x)
for k in prange(0,K):
Z[k,:] = Z[k,:] + pw*F(n,ksi[k])*f;
return Z
N=1000; a=-0.5; b=0.5; x=np.linspace(a,b,N); cc=x; M = 2*N+100; xi = M/40;
%timeit fun_py(M,cc,xi,x);
733 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Run Code Online (Sandbox Code Playgroud)
朱莉娅时代
N=1000; a=-0.5; b=0.5; x=range(a,b,length=N); cc=x; M = 2*N+100; xi = M/40;
@benchmark fun_jul2(M,cc,xi,x)
BenchmarkTools.Trial:
memory estimate: 40.31 MiB
allocs estimate: 6302
--------------
minimum time: 705.470 ms (0.17% GC)
median time: 726.403 ms (0.17% GC)
mean time: 729.032 ms (1.68% GC)
maximum time: 765.426 ms (5.27% GC)
--------------
samples: 7
evals/sample: 1
Run Code Online (Sandbox Code Playgroud)
我使用以下代码在单线程上(而不是在我的计算机上的28s)下降到300ms。
您正在为Numba使用多线程。在Julia中,您应该使用并行处理(Julia对多线程支持是实验性的)。看来您的代码正在进行某种参数扫描-这样的代码很容易并行化,但通常需要对计算过程进行一些调整。
这是代码:
function fun_jul2(M,ksi,xi,x)
F(n,x) = sin(n*pi*(x+1))/2;
K = length(ksi);
L = length(x);
Z = zeros(length(x),K);
for n in 1:M
F_im1= [F(n,ksi[k]) for k in 1:K]
F_im2 = [F(n,x[l]) for l in 1:L]
pow = (1-(n/(M+1))^2)^xi
for k in 1:K
for l in 1:L
Z[l,k] += pow*F_im1[k]*F_im2[l];
end
end
end
Z
end
Run Code Online (Sandbox Code Playgroud)
julia> fun_jul2(M,cc,xi,x) ? fun_jul(M,cc,xi,x)
true
julia> @time fun_jul2(M,cc,xi,x);
0.305269 seconds (1.81 k allocations: 6.934 MiB, 1.60% gc time)
Run Code Online (Sandbox Code Playgroud)
**编辑:具有DNF建议的多线程和入站:
function fun_jul3(M,ksi,xi,x)
F(n,x) = sin(n*pi*(x+1))/2;
K = length(ksi);
L = length(x);
Z = zeros(length(x),K);
for n in 1:M
F_im1= [F(n,ksi[k]) for k in 1:K]
F_im2 = [F(n,x[l]) for l in 1:L]
pow = (1-(n/(M+1))^2)^xi
Threads.@threads for k in 1:K
for l in 1:L
@inbounds Z[l,k] += pow*F_im1[k]*F_im2[l];
end
end
end
Z
end
Run Code Online (Sandbox Code Playgroud)
现在是运行时间(set JULIA_NUM_THREADS=4在启动Julia之前,请记住要运行或与Linux等效):
julia> fun_jul2(M,cc,xi,x) ? fun_jul3(M,cc,xi,x)
true
julia> @time fun_jul3(M,cc,xi,x);
0.051470 seconds (2.71 k allocations: 6.989 MiB)
Run Code Online (Sandbox Code Playgroud)
您也可以尝试进一步的实验与计算的并行的F_im1和F_im2。