I have been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.
Overview of the algorithm:
In brief, the algorithm uses an RBM of the form shown below to solve a reinforcement learning problem by changing its weights such that the free energy of a network configuration equals the reward signal given for that state-action pair.
To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. Given enough time, this produces the action with the lowest free energy, and therefore the highest reward for the given state.
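Since action selection is the step that is easiest to get subtly wrong, here is a minimal sketch of how I read that procedure (not the authors' code): Gibbs sampling with the state units clamped, assuming binary hidden and action units. The names W, hb, vb, T and nGibbs are placeholders for the weights, hidden/visible biases, temperature and number of sampling steps.

function action = select_action_sketch(state, W, hb, vb, T, nGibbs, numactiondims)
% Gibbs-sample an action with the state units clamped (sketch only).
    action = rand(1, numactiondims) > 0.5;            % start from a random action
    for k = 1:nGibbs
        v = double([state(:)' action]);               % state stays clamped throughout
        hprob = 1 ./ (1 + exp(-(v*W + hb) / T));      % P(h = 1 | s, a)
        h = double(hprob > rand(size(hprob)));        % sample binary hidden units
        vprob = 1 ./ (1 + exp(-(h*W' + vb) / T));     % P(v = 1 | h)
        aprob = vprob(numel(state)+1:end);            % only the action part is resampled
        action = aprob > rand(size(aprob));
    end
end

As the temperature is lowered over training, the sampled action should concentrate on the lowest-free-energy (highest expected reward) action for the clamped state.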
Overview of the large action task:
Overview of the authors' implementation guidelines:
A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were selected at random. The network was run for 12,000 actions, with the learning rate going from 0.1 to 0.01 and the temperature decaying exponentially from 1.0 to 0.1 over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.
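For what it's worth, this is how I read those schedules. The guidelines only say "exponentially" for the temperature; treating the learning rate the same way is my own assumption.

numiter = 12000;
lr0   = 0.1;   lr_end   = 0.01;
temp0 = 1.0;   temp_end = 0.1;
for iter = 1:numiter
    frac    = (iter - 1) / (numiter - 1);
    epsilon = lr0   * (lr_end / lr0)^frac;       % learning rate: 0.1 -> 0.01
    temp    = temp0 * (temp_end / temp0)^frac;   % temperature:   1.0 -> 0.1
    % ... one training iteration at learning rate epsilon and temperature temp
end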
Important omitted details:
My implementation:
I initially assumed that the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, which was my first clue that some of the mechanisms used must have been considered "obvious" by the authors and therefore omitted.
I played around with the various omitted mechanisms mentioned above, and got my best results by using:
But even with all of these modifications, my performance on the task was generally an average reward of about 28 after 12,000 iterations.
Code for each iteration:
%%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Clamp the current state and initialize the action units at random
data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];
% Hidden-unit probabilities for the clamped state-action pair
poshidprobs = softmax(data*vishid + hidbiases);
%%%%%%%%% END OF POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
hidstates = softmax_sample(poshidprobs);
%%%%%%%%% START ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Gibbs sampling over the action units with the state clamped
% (temperature 0 at test time, annealed temperature during training)
if test
    [negaction, poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
else
    [negaction, poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
end
% Stochastically binarize the selected action and write it back into data
data(numdims+1:end) = negaction > rand(numcases,numactiondims);
if mod(batch,100) == 1
    disp(poshidprobs);
    disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
end
posprods = data' * poshidprobs;
poshidact = poshidprobs;
posvisact = data;
%%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if batch > 5
    momentum = .9;
else
    momentum = .5;
end
%%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Free energy of the chosen state-action pair; Q is its negation
F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);
Q = -F;
action = data(numdims+1:end);
% Reward = maximum reward minus the Hamming distance to the correct action
reward = maxreward - sum(abs(correct_action(:,(batch))' - action));
if correct_action(:,(batch)) == correct_action(:,1)
    reward_dataA = [reward_dataA reward];
    Q_A = [Q_A Q];
else
    reward_dataB = [reward_dataB reward];
    Q_B = [Q_B Q];
end
reward_error = sum(reward - Q);
rewardsum = rewardsum + reward;
errsum = errsum + abs(reward_error);
error_data(ind) = reward_error;
reward_data(ind) = reward;
Q_data(ind) = Q;
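% As I understand the paper, the weight update has the form (r - Q)*dQ/dw
% with Q = -F; for the hidden units used here dQ/dw_ij is approximated by
% the product of the clamped visible value and the hidden-unit probability,
% which is what posprods/poshidact/posvisact above are holding.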
vishidinc = momentum*vishidinc + ...
epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);
vishid = vishid + vishidinc;
hidbiases = hidbiases + hidbiasinc;
visbiases = visbiases + visbiasinc;
%%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
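One thing that makes the snippet hard to evaluate is that choose_factored_action and calcF_softmax2 are not shown. For reference, this is the free energy I would expect for an RBM with binary hidden units at temperature temp; the softmax-hidden variant the code actually uses would differ, so treat this purely as a sanity check, not as the missing helper.

function F = calcF_binary_sketch(v, vishid, hidbiases, visbiases, temp)
% Free energy of a visible vector v for an RBM with binary hidden units
% at temperature temp (sketch; not the calcF_softmax2 called above):
%   F(v) = -v*visbiases' - temp * sum_j log(1 + exp((v*vishid + hidbiases)_j / temp))
    F = -v*visbiases' - temp * sum(log(1 + exp((v*vishid + hidbiases) / temp)));
end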
What I'm asking for:
So, if any of you can get this algorithm working properly (the authors claim an average reward of ~40 after 12,000 iterations), I would be extremely grateful.
If my code appears to be doing something obviously wrong, then calling attention to that would also make a great answer.
I'm hoping that whatever the authors left out is indeed obvious to someone with more experience in energy-based learning than myself, in which case simply point out what needs to be included in a working implementation.