n-armed bandit simulation in R

Fer*_*ndo 6 simulation r reinforcement-learning

I'm using the Sutton & Barto e-book Reinforcement Learning: An Introduction to study reinforcement learning. I'm having some trouble trying to reproduce the results (the plots) from the action-value methods section.

More specifically, how can I simulate the greedy value for each task? The book says:

...we can plot the performance and behavior of the various methods as they improve with experience over 1000 plays...

So I guess I have to keep track of the exploratory values as better ones are found. The problem is how to do this with the greedy approach: since there are no exploratory moves, how do I know what the greedy action is?

Thanks for all comments and answers!

Update: see my answer below for the code.

Fer*_*ndo 5

I finally got it right. The eps players should beat the greedy player because of the exploratory moves, as pointed out in the book. The code is slow and needs some optimization, but here it is:

(Resulting plots: average reward and optimal-move percentage per play for eps = 0, 0.01 and 0.1.)

get.testbed = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1){

  # true value of each arm, drawn from N(u, sdev.arm)
  optimal = rnorm(arms, u, sdev.arm)
  # plays x arms matrix of pre-drawn rewards: column j ~ N(optimal[j], sdev.rewards)
  rewards = sapply(optimal, function(x) rnorm(plays, x, sdev.rewards))

  list(optimal = optimal, rewards = rewards)
}
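
Just to see what the testbed looks like, a quick check like the following (purely illustrative, not part of the simulation itself) prints the true arm values and the shape of the pre-drawn reward matrix:

set.seed(42)                       # illustrative only
tb = get.testbed(arms = 10, plays = 500)
round(tb$optimal, 2)               # true mean reward of each arm
dim(tb$rewards)                    # 500 x 10: one pre-drawn reward per play and arm
which.max(tb$optimal)              # index of the optimal arm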

play.slots = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = 0.1){

  testbed = get.testbed(arms, plays, u, sdev.arm, sdev.rewards)
  optimal = testbed$optimal
  rewards = testbed$rewards

  optim.index = which.max(optimal)   # index of the truly optimal arm
  slot.rewards = rep(0, arms)        # running sample-average estimate per arm
  reward.hist = rep(0, plays)        # reward received on each play
  optimal.hist = rep(0, plays)       # 1 if the optimal arm was chosen on that play
  pulls = rep(0, arms)               # number of times each arm has been pulled
  probs = runif(plays)               # pre-drawn uniforms for the explore/exploit decision

  # TODO: vectorize this loop
  for (i in 1:plays){

      ## don't use ifelse() in this case; it is meant for vectorized tests,
      ## so a plain if/else is clearer here
      ## idx = ifelse(probs[i] < eps, sample(arms, 1), which.max(slot.rewards))

      # epsilon-greedy: explore a random arm with probability eps,
      # otherwise exploit the arm with the highest current estimate
      idx = if (probs[i] < eps) sample(arms, 1) else which.max(slot.rewards)
      reward.hist[i] = rewards[i, idx]

      # record whether the optimal arm was chosen on this play
      if (idx == optim.index)
        optimal.hist[i] = 1

      # incremental sample-average update: Q <- Q + (reward - Q) / (n + 1)
      slot.rewards[idx] = slot.rewards[idx] + (rewards[i, idx] - slot.rewards[idx])/(pulls[idx] + 1)
      pulls[idx] = pulls[idx] + 1
  }

  list(slot.rewards = slot.rewards, reward.hist = reward.hist, optimal.hist = optimal.hist, pulls = pulls)
}
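
A single player can also be run on its own, e.g. (again just an illustration):

one.run = play.slots(arms = 10, plays = 500, eps = 0.1)
mean(one.run$reward.hist)      # average reward over the 500 plays
mean(one.run$optimal.hist)     # fraction of plays that picked the optimal arm
one.run$pulls                  # how many times each arm was pulled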

do.simulation = function(N = 100, arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = c(0.0, 0.01, 0.1)){

  n.players = length(eps)
  col.names = paste('eps', eps)
  rewards.hist = matrix(0, nrow = plays, ncol = n.players)
  optim.hist = matrix(0, nrow = plays, ncol = n.players)
  colnames(rewards.hist) = col.names
  colnames(optim.hist) = col.names

  # run N independent bandit tasks for each eps setting and accumulate the histories
  for (p in 1:n.players){
    for (i in 1:N){
      play.results = play.slots(arms, plays, u, sdev.arm, sdev.rewards, eps[p])
      rewards.hist[, p] = rewards.hist[, p] + play.results$reward.hist
      optim.hist[, p] = optim.hist[, p] + play.results$optimal.hist
    } 
  }

  # average over the N independent bandit tasks
  rewards.hist = rewards.hist/N
  optim.hist = optim.hist/N
  # running mean over plays of the optimal-action frequency
  optim.hist = apply(optim.hist, 2, function(x) cumsum(x)/(1:plays))

  ### Plot helper ###
  plot.result = function(x, n.series, colors, leg.names, ...){
    for (i in 1:n.series){
      if (i == 1)
        plot.ts(x[, i], ylim = 2*range(x), col = colors[i], ...)
      else
        lines(x[, i], col = colors[i], ...)
      grid(col = 'lightgray')
    }
    legend('topleft', leg.names, col = colors, lwd = 2, cex = 0.6, box.lwd = NA)
  }
  ### Plot helper ###

  #### Plots ####
  require(RColorBrewer)
  colors = brewer.pal(n.players + 3, 'Set2')
  op <- par(mfrow = c(2, 1), no.readonly = TRUE)

  plot.result(rewards.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Average reward', lwd = 2)
  plot.result(optim.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Optimal move %', lwd = 2)
  #### Plots ####

  par(op)
}

To run it, just call

do.simulation(N = 100, arms = 10, eps = c(0, 0.01, 0.1))
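
The defaults use 500 plays averaged over 100 tasks. The passage quoted in the question talks about 1000 plays, and if I remember the book correctly its figures average over 2000 randomly generated tasks, so something like the call below should get closer to the book's plots (it takes considerably longer to run):

do.simulation(N = 2000, plays = 1000, arms = 10, eps = c(0, 0.01, 0.1))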