I wrote a small script in Python to solve various Gym environments with policy gradients.
import gym, os
import numpy as np
#create environment
env = gym.make('CartPole-v0')
env.reset()
s_size = len(env.reset())
a_size = 2
#import my neural network code
os.chdir(r'C:\---\---\---\Python Code')
import RLPolicy
policy = RLPolicy.NeuralNetwork([s_size, a_size], ['softmax'], learning_rate=0.000001) #a 3-layer network might be ([s_size, 5, a_size], ['tanh','softmax'], learning_rate=1)
#it supports the sigmoid activation function also
print(policy.weights)
DISCOUNT = 0.95 #parameter for discounting future rewards
#first step
action = policy.feedforward(env.reset())
state,reward,done,info = env.step(action)
for t in range(3000):
    done = False
    states = [] #lists for …
python reinforcement-learning backpropagation policy-gradient-descent
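For context, the step the DISCOUNT parameter above is meant for can be sketched like this; it is a minimal illustration of computing discounted returns for a REINFORCE-style update, not the RLPolicy code itself:

import numpy as np

def discounted_returns(rewards, discount=0.95):
    # G_t = r_t + discount * r_{t+1} + discount^2 * r_{t+2} + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

# example: rewards from one short CartPole episode -> roughly 2.8525, 1.95, 1.0
print(discounted_returns([1.0, 1.0, 1.0], discount=0.95))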
I have implemented PPO for the CartPole-v0 environment. However, it does not converge in some runs of the game, and sometimes it gets stuck in a local optimum. I have implemented the algorithm with the TD(0) advantage, i.e.
A(s_t) = R_{t+1} + \gamma V(s_{t+1}) - V(s_t)
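As a minimal sketch of this formula (the helper name and example numbers are purely illustrative, not taken from the code below):

def td0_advantage(reward, value_s, value_next, gamma=0.99, done=False):
    # A(s_t) = R_{t+1} + gamma * V(s_{t+1}) - V(s_t); no bootstrap at terminal states
    return reward + (0.0 if done else gamma * value_next) - value_s

print(td0_advantage(reward=1.0, value_s=0.5, value_next=0.6, gamma=0.99))  # ≈ 1.094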
Here is my code:
import numpy as np

def running_average(x, n):
    N = n
    kernel = np.ones(N)
    conv_len = x.shape[0] - N
    y = np.zeros(conv_len)
    for i in range(conv_len):
        y[i] = kernel @ x[i:i+N]  # "@" is the matrix-multiplication operator (np.matmul); for 1-D arrays it is a dot product
        y[i] /= N
    return y
import torch.nn as nn
import torch.optim as optim

class ActorNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, learning_rate=0.0003, epsilon_clipping=0.3, update_epochs=10):
        super().__init__()
        self.n_actions = n_actions
        self.model = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, n_actions),
            nn.Softmax(dim=-1)
        ).float()
        self.optimizer = optim.Adam(self.model.parameters(), lr=learning_rate)
        self.epsilon_clipping …
python machine-learning reinforcement-learning pytorch policy-gradient-descent
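Since the class is cut off before the update step, here is a minimal PyTorch sketch of the clipped surrogate loss that epsilon_clipping is used for; the function name, arguments, and example tensors are assumptions for illustration, not the asker's actual code:

import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, epsilon_clipping=0.3):
    # clipped surrogate objective: -E[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t = pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon_clipping, 1.0 + epsilon_clipping) * advantages
    return -torch.min(unclipped, clipped).mean()

# illustrative values only
adv = torch.tensor([0.5, -0.2])
print(ppo_clipped_loss(torch.tensor([-0.1, -0.3]), torch.tensor([-0.2, -0.25]), adv))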
I have trained an RL agent using the DQN algorithm. After 20,000 episodes my reward converged. Now when I test this agent, it always takes the same action regardless of the state, which I find strange. Can someone help me with this? Is there any reason anyone can think of why the agent would behave this way?
Reward plot
When I test the agent:
state = env.reset()
print('State: ', state)
state_encod = np.reshape(state, [1, state_size])
q_values = model.predict(state_encod)
action_key = np.argmax(q_values)
print(action_key)
print(index_to_action_mapping[action_key])
print(q_values[0][0])
print(q_values[0][action_key])
q_values_plotting = []
for i in range(0,action_size):
    q_values_plotting.append(q_values[0][i])
plt.plot(np.arange(0,action_size),q_values_plotting)
It gives the same q_values plot every time, even though the initialized state is different each time. Below is the q_value plot.
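As an illustration of the check described above (a sketch reusing the model, env, and state_size from the code; the loop itself is not part of the original script):

for trial in range(5):
    s = env.reset()
    q = model.predict(np.reshape(s, [1, state_size]))
    print(trial, np.argmax(q), q[0])  # the argmax (chosen action) comes out identical for every state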
Testing:
Code
test_rewards = []
for episode in range(1000):
    terminal_state = False
    state = env.reset()
    episode_reward = 0
    while terminal_state == False:
        print('State: ', state)
        state_encod = np.reshape(state, [1, state_size])
        q_values = model.predict(state_encod)
        action_key = …