phd*_*ent 3 performance matlab parfor
我有一个非常低效的matlab代码,我需要多次运行它.代码基本上是一个大的parfor loop,我想几乎不可能绕过.
代码首先加载几个参数和4-D矩阵,然后需要进行一些插值.所有这些都需要完成5000次(因此parfor循环).
这是代码的样子.我没有取出关键成分,我尽可能地简化了代码.
load file
nsim = 5000
T = 12;
N = 1000;
cumQx = cumsum(Qx);
cumQz = cumsum(Qz);
cumQs = cumsum(Qs);
for k=1:nsim
st(k).ksim = kstar*ones(N, T);
st(k).Vsim = zeros(N,T);
st(k).Psim = zeros(N,T);
end
parfor k = 1:nsim
sysrand = rand(T, 1);
idiorand = rand(N, T);
sigmarand = rand(T,1);
xid = zeros(T, 1);
zid = zeros(N, T);
sid = zeros(T,1);
xid(1) = 8;
zid(:, 1) = 5;
sid(1) = 1;
% Initializing the simulation
simx = zeros(T,1);
zsim = ones(N,T)*zbar;
simsx = zeros(T,1);
% Construct 3-D grid using 'ndgrid'
[ks,zs] = ndgrid(kgrid,z);
for j = 2:T
sid(j) = find(cumQs(:, sid(j-1)) >= sigmarand(j), 1);
simsx(j-1) = sigmax(sid(j));
xid(j) = find(cumQx(:, xid(j-1)) >= sysrand(j), 1);
simx(j-1) = x(xid(j));
for n = 1:N
zid(n, j) = find(cumQz(:, zid(n, j-1)) >= idiorand(n, j), 1);
zsim(n,j-1) = z(zid(n, j));
end
st(k).ksim(:,j) = interpn(ks, zs , squeeze(kprime(:,xid(j),:,sid(j))), st(k).ksim(:,j-1),zsim(:,j-1),'linear'); % K
st(k).Vsim(:,j) = interpn(ks, zs , squeeze(V(:,xid(j),:,sid(j))), st(k).ksim(:,j-1),zsim(:,j-1),'linear'); % V
st(k).Psim(:,j) = interpn(ks, zs , squeeze(P(:,xid(j),:,sid(j))), st(k).ksim(:,j-1),zsim(:,j-1),'linear'); % P
end;
end
Run Code Online (Sandbox Code Playgroud)
以下是运行代码所需的矩阵的链接:http://www.filedropper.com/file_406
有没有更好的方法可以大大减少计算时间?我的猜测是否定的...理想情况下,有一种方法可以对k = 1:nsim循环进行矢量化.
为了计算代码的时间和时间,我删除了parfor循环并用for循环替换它,然后使用MATLAB分析器.我用于nsims = 500测试的目的.
使用分析器,我发现代码中存在两个关键瓶颈.第一个是find()最嵌套的for循环(n-loop)中的函数.第二个是对interpn()函数的三次调用.这4行使用+ 88%的计算时间.
find由于函数调用的开销(特别是考虑到它在嵌套循环中接收的调用数),以及内置的错误检查和管理,在这种情况下该函数比理想的要慢.find用硬编码二进制搜索(下面)替换函数可以大大提高性能,而这只是取代find了n循环.使用findfor nsims = 500导致运行时间为29.8秒.使用二进制搜索,运行时间为12.1秒.注意:这只能起作用,因为您的数据已排序,此代码无法替换每个实例中的查找.编辑:在@EBH的另一个答案中使用替代方法是一种更简洁的方法.
%perform binary search (same as your find function)
searchVal=idiorand(n, j);
il = 1;
iu = sizeCumQZ; % should be defined outside the loop as size(cumQz,1)
while (il+1<iu)
lw=floor((il+iu)/2); % split the upper index
if cumQz(lw,zid(n, j-1)) >= searchVal
iu=lw; % decrease lower_index_b (whose x value remains \geq to lower bound)
else
il=lw; % increase lower_index_a (whose x value remains less than lower bound)
end
end
if cumQz(il,zid(n, j-1))>=searchVal
zid(n,j) = il;
else
zid(n,j) = iu;
end
Run Code Online (Sandbox Code Playgroud)
interpn通过方法检查,错误管理和数据格式管理,该功能大大减慢.标准中使用的大约100行代码interpn可以减少到每次调用2行,并且通过知道我们只需要一种类型的插值并且我们的数据符合特定格式来显着提高性能.为此,我们griddedInterpolant直接使用该功能(见下文).同样nsims = 500,使用该interpn功能(仍使用未更改的find代码)的运行时间为29.8秒.使用下面的改进方法,将运行时间减少到20.4秒.
精炼方法取代了interp此处显示的调用
st(k).ksim(:,j) = interpn(ks, zs , squeeze(kprime(:,xid(j),:,sid(j))), st(k).ksim(:,j-1),zsim(:,j-1),'linear'); % K
st(k).Vsim(:,j) = interpn(ks, zs , squeeze(V(:,xid(j),:,sid(j))), st(k).ksim(:,j-1),zsim(:,j-1),'linear'); % V
st(k).Psim(:,j) = interpn(ks, zs , squeeze(P(:,xid(j),:,sid(j))), st(k).ksim(:,j-1),zsim(:,j-1),'linear'); % P
Run Code Online (Sandbox Code Playgroud)
更直接的电话griddedInterpolant,如下所示:
F = griddedInterpolant(ks,zs,squeeze(kprime(:,xid(j),:,sid(j))), 'linear','none');
st(k).ksim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
F = griddedInterpolant(ks,zs,squeeze(V(:,xid(j),:,sid(j))), 'linear','none');
st(k).Vsim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
F = griddedInterpolant(ks,zs,squeeze(P(:,xid(j),:,sid(j))), 'linear','none');
st(k).Psim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
Run Code Online (Sandbox Code Playgroud)
结合二进制搜索而不是find调用griddedInterpolant而不是interpn将总运行时间减少到3.8秒,比初始运行时间提高了近8倍.
另外三点注意事项:
1)我建议遵循先前答案中概述的建议,特别是在可能的情况下远离结构并从循环中移动任何可能的东西(特别是ndgrid,因为这肯定只需要运行一次).
2)我注意到,在使用时nsims=5000,此脚本中使用的总内存接近2.5gig.如果这接近系统的总可用内存,则可能导致显着减速.在这种情况下,我建议执行较小批量的计算,保存结果,然后继续进行进一步的计算.
3)最后,在我的测试中,使用parfor实际上比使用标准for循环慢.我相信这是因为在这种情况下每个单独的循环操作相对较短,并且与指定并行作业和管理集群工作程序相关的开销超过了并行处理的收益.如果您在大型计算机群集上,但在我的单(4核)计算机上,这可能不是这种情况,parfor只会导致速度变慢.我建议您根据自己的实际工作代码测试每个案例,如果您愿意,可以按照上面的建议进行更改.
我所做的完整代码更改如下所示,这些更改不包括结构的使用或仍然建议的其他小优化的任何更改.
load file
tic;
nsim = 500
T = 12;
N = 1000;
searchVal=1;
il = 1;
iu = 1;
cumQx = cumsum(Qx);
cumQz = cumsum(Qz);
cumQs = cumsum(Qs);
sizeCumQZ = size(cumQz,1);
for k=1:nsim
st(k).ksim = kstar*ones(N, T);
st(k).Vsim = zeros(N,T);
st(k).Psim = zeros(N,T);
end
%was parfor
for k = 1:nsim
sysrand = rand(T, 1);
idiorand = rand(N, T);
sigmarand = rand(T,1);
xid = zeros(T, 1);
zid = zeros(N, T);
sid = zeros(T,1);
xid(1) = 8;
zid(:, 1) = 5;
sid(1) = 1;
% Initializing the simulation
simx = zeros(T,1);
zsim = ones(N,T)*zbar;
simsx = zeros(T,1);
% Construct 3-D grid using 'ndgrid'
[ks,zs] = ndgrid(kgrid,z);
for j = 2:T
sid(j) = find(cumQs(:, sid(j-1)) >= sigmarand(j), 1);
simsx(j-1) = sigmax(sid(j));
xid(j) = find(cumQx(:, xid(j-1)) >= sysrand(j), 1);
simx(j-1) = x(xid(j));
for n = 1:N
%perform binary search (same as your find function)
searchVal=idiorand(n, j);
il = 1;
iu = sizeCumQZ;
while (il+1<iu)
lw=floor((il+iu)/2); % split the upper index
if cumQz(lw,zid(n, j-1)) >= searchVal
iu=lw; % decrease lower_index_b (whose x value remains \geq to lower bound)
else
il=lw; % increase lower_index_a (whose x value remains less than lower bound)
end
end
if cumQz(il,zid(n, j-1))>=searchVal
zid(n,j) = il;
else
zid(n,j) = iu;
end
zsim(n,j-1) = z(zid(n,j));
end
F = griddedInterpolant(ks,zs,squeeze(kprime(:,xid(j),:,sid(j))), 'linear','none');
st(k).ksim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
F = griddedInterpolant(ks,zs,squeeze(V(:,xid(j),:,sid(j))), 'linear','none');
st(k).Vsim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
F = griddedInterpolant(ks,zs,squeeze(P(:,xid(j),:,sid(j))), 'linear','none');
st(k).Psim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
end;
end
toc;
Run Code Online (Sandbox Code Playgroud)
编辑:随着时间的推移griddedInterpolant,我可以通过将三个插值组合成一个网格和K,V和P插值的值来实现额外的15%的速度提升.在代码的顶部,最好在循环外完成,我用以下内容替换原始网格创建:
zRange = max(z(:))-min(z(:))+1;
zzzGrid = [z;z+1*zRange;z+2*zRange];% for K, V, and P
[ksBig,zsBig] = ndgrid(kgrid,zzzGrid);
nZ = numel(z); %used below
valGrid = zeros(size(ksBig)); %used below
Run Code Online (Sandbox Code Playgroud)
并将3个调用替换为griddedInterpolant:
valGrid(:,1:nZ) = squeeze(kprime(:,xid(j),:,sid(j)));%K
valGrid(:,nZ+1:2*nZ) = squeeze(V(:,xid(j),:,sid(j)));%V
valGrid(:,2*nZ+1:3*nZ) = squeeze(P(:,xid(j),:,sid(j)));%P
F = griddedInterpolant(ksBig,zsBig,valGrid, 'linear','none');
st(k).ksim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1));
st(k).Vsim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1)+zRange);
st(k).Psim(:,j) = F(st(k).ksim(:,j-1),zsim(:,j-1)+2*zRange);
Run Code Online (Sandbox Code Playgroud)
或者如果我们可以交易一个更复杂的设置,增加19%而不是增加15%,我们可以完全移动griddedInterpolantj循环的外部.在代码的开头,按如下方式设置网格:
zRange = max(z(:))-min(z(:))+1;
zzzGrid = [z;z+1*zRange;z+2*zRange];
zzzRange = max(zzzGrid(:))-min(zzzGrid(:))+1;
zzzTGrid = [];
for j = 2:T
zzzTGrid(end+1:end+numel(zzzGrid)) = zzzGrid+(j-2)*zzzRange;
end
[ksBig,zsBig] = ndgrid(kgrid,zzzTGrid);
nZ = numel(z); %used below
valGrid = zeros(size(ksBig)); %used below
Run Code Online (Sandbox Code Playgroud)
并替换以前的griddedInterpolant如下所示:
for j = 2:T
%%%%%
%...
%Other code in the j loop
%...
%%%%%
valGrid(:,(1:nZ)+3*nZ*(j-2)) = squeeze(kprime(:,xid(j),:,sid(j)));%K
valGrid(:,(nZ+1:2*nZ)+3*nZ*(j-2)) = squeeze(V(:,xid(j),:,sid(j)));%V
valGrid(:,(2*nZ+1:3*nZ)+3*nZ*(j-2)) = squeeze(P(:,xid(j),:,sid(j)));%P
end;
F = griddedInterpolant(ksBig,zsBig,valGrid, 'linear','none');
for j = 2:T
st(k).ksim(:,j) = F(stTest(k).ksim(:,j-1),zsim(:,j-1)+3*zRange*(j-2));
st(k).Vsim(:,j) = F(stTest(k).ksim(:,j-1),zsim(:,j-1)+zRange+3*zRange*(j-2));
st(k).Psim(:,j) = F(stTest(k).ksim(:,j-1),zsim(:,j-1)+2*zRange+3*zRange*(j-2));
end
Run Code Online (Sandbox Code Playgroud)
如果我们想要更加挑剔(2%的速度增加),我们可以将所有调用替换为squeeze不进行错误检查的版本,mySqueeze而是:
function b = mySqueeze(a)
%Trimmed down version of squeeze, a built-in MATLAB function, has no error-managment or case optimization
siz = size(a);
siz(siz==1) = []; % Remove singleton dimensions.
siz = [siz ones(1,2-length(siz))]; % Make sure siz is at least 2-D
b = reshape(a,siz);
Run Code Online (Sandbox Code Playgroud)