I wrote a small script in Python to solve various Gym environments with policy gradients.
import gym, os
import numpy as np
#create environment
env = gym.make('CartPole-v0')
env.reset()
s_size = len(env.reset())
a_size = 2
#import my neural network code
os.chdir(r'C:\---\---\---\Python Code')
import RLPolicy
policy = RLPolicy.NeuralNetwork([s_size, a_size], ['softmax'], learning_rate=0.000001) #a 3-layer network might be ([s_size, 5, a_size], ['tanh','softmax'], learning_rate=1)
#it supports the sigmoid activation function also
print(policy.weights)
DISCOUNT = 0.95 #parameter for discounting future rewards
#first step
action = policy.feedforward(env.reset())
state,reward,done,info = env.step(action)
for t in range(3000):
    done = False
    states = [] #lists for …
python reinforcement-learning backpropagation policy-gradient-descent
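For context, the step the DISCOUNT parameter above is meant for can be sketched like this; it is a minimal illustration of computing discounted returns for a REINFORCE-style update, not the RLPolicy code itself:

import numpy as np

def discounted_returns(rewards, discount=0.95):
    # G_t = r_t + discount * r_{t+1} + discount^2 * r_{t+2} + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

# example: rewards from one short CartPole episode -> roughly 2.8525, 1.95, 1.0
print(discounted_returns([1.0, 1.0, 1.0], discount=0.95))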
I have implemented PPO for the CartPole-v0 environment. However, it does not converge in some runs of the game, and sometimes it gets stuck in a local optimum. I have implemented the algorithm with the TD(0) advantage, i.e.
A(s_t) = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)
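As a minimal sketch of this formula (the helper name and example numbers are purely illustrative, not taken from the code below):

def td0_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    # A(s_t) = R_{t+1} + gamma * V(s_{t+1}) - V(s_t); no bootstrap at terminal states
    return reward + (0.0 if done else gamma * value_next) - value_s

print(td0_advantage(reward=1.0, value_s=0.5, value_next=0.6, gamma=0.99))  # ≈ 1.094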
Here is my code:
import numpy as np

def running_average(x, n):
    N = n
    kernel = np.ones(N)
    conv_len = x.shape[0] - N
    y = np.zeros(conv_len)
    for i in range(conv_len):
        y[i] = kernel @ x[i:i+N]  # "@" is the matrix-multiplication operator (np.matmul); for 1-D arrays it is a dot product
        y[i] /= N
    return y
import torch.nn as nn
import torch.optim as optim

class ActorNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, learning_rate=0.0003, epsilon_clipping=0.3, update_epochs=10):
        super().__init__()
        self.n_actions = n_actions
        self.model = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, n_actions),
            nn.Softmax(dim=-1)
        ).float()
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.epsilon_clipping …
python machine-learning reinforcement-learning pytorch policy-gradient-descent
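Since the class is cut off before the update step, here is a minimal PyTorch sketch of the clipped surrogate loss that epsilon_clipping is used for; the function name, arguments, and example tensors are assumptions for illustration, not the asker's actual code:

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon_clipping=0.3):
    # clipped surrogate objective: -E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t = pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon_clipping, 1.0 + epsilon_clipping) * advantages
    return -torch.min(unclipped, clipped).mean()

# illustrative values only
adv = torch.tensor([0.5, -0.2])
print(ppo_clipped_loss(torch.tensor([-0.1, -0.3]), torch.tensor([-0.2, -0.25]), adv))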
I have trained an RL agent using the DQN algorithm. After 20,000 episodes my reward converged. Now when I test this agent, it always takes the same action regardless of the state, which I find strange. Can someone help me with this? Is there any reason anyone can think of why the agent would behave this way?
Reward plot
When I test the agent:
state = env.reset()
print('State: ', state)
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)
action_key = np.argmax(q_values)
print(action_key)
print(index_to_action_mapping[action_key])
print(q_values[0][0])
print(q_values[0][action_key])
q_values_plotting = []
for i in range(0,action_size):
    q_values_plotting.append(q_values[0][i])
plt.plot(np.arange(0,action_size),q_values_plotting)
It gives the same q_values plot every time, even though the initialized state is different each time. Below is the q_value plot.
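As an illustration of the check described above (a sketch reusing the model, env, and state_size from the code; the loop itself is not part of the original script):

for trial in range(5):
    s = env.reset()
    q = model.predict(np.reshape(s, [1, state_size]))
    print(trial, np.argmax(q), q[0])  # the argmax (chosen action) comes out identical for every state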
Testing:
Code
test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while terminal_state == False:
        print('State: ', state)
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = …