py_environment "time_step" does not match "time_step_spec"

Boy*_*ang 6 tensorflow-agents

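I created a custom py environment with TF-Agents. However, I can't validate the environment or take steps in it using py_policy.action, and I'm confused about what is expected of the time_step_specs.

I tried converting it to a TF environment via tf_py_environment.TFPyEnvironment and was able to take actions with tf_policy, but I'm still confused about the difference.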

import abc
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.environments import tf_environment
from tf_agents.environments import tf_py_environment
from tf_agents.environments import utils
from tf_agents.specs import array_spec
from tf_agents.environments import wrappers
from tf_agents.trajectories import time_step as ts
from tf_agents.policies import random_tf_policy
import tensorflow as tf
import tf_agents

class TicTacToe(py_environment.PyEnvironment):
    def __init__(self, n):
        super(TicTacToe, self).__init__()
        self.n = n
        self.winner = None
        self._episode_ended = False
        self.inital_state = np.zeros((n, n))
        self._state = self.inital_state
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(n, n), dtype='int32', minimum=-1, maximum=1,
            name='TicTacToe board state spec')
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype='int32', minimum=0, maximum=8,
            name='TicTacToe action spec')

    def observation_spec(self):
        return self._observation_spec

    def action_spec(self):
        return self._action_spec

    def _reset(self):
        return ts.restart(self.inital_state)

    def check_game_over(self):
        for i in range(self.n):
            if (sum(self._state[i, :]) == self.n) or (sum(self._state[:, i]) == self.n):
                self.winner = 1
                return True

            elif (sum(self._state[i, :]) == -self.n) or (sum(self._state[:, i]) == -self.n):
                self.winner = -1
                return True

        if (self._state.trace() == self.n) or (self._state[::-1].trace() == self.n):
            self.winner = 1
            return True
        elif (self._state.trace() == -self.n) or (self._state[::-1].trace() == -self.n):
            self.winner = -1
            return True

        if not (0 in self._state):
            return True

    def _step(self, action):
        self._state[action // 3, action % 3] = 1
        self._episode_ended = self.check_game_over

        if self._episode_ended == True:
            if self.winner == 1:
                reward = 1
            elif self.winner == None:
                reward = 0
            else:
                reward = -1
            return ts.termination(self._state, dtype='int32', reward=reward)
        else:
            return ts.transition(self._state, dtype='int32', reward=0.0, discount=0.9)
env = TicTacToe(3)
utils.validate_py_environment(env, episodes=5)

This is the error I get:

ValueError                                Traceback (most recent call last)
----> 1 utils.validate_py_environment(env, episodes=5)

C:\Users\bzhang\AppData\Local\Continuum\anaconda3\lib\site-packages\tf_agents\environments\utils.py in validate_py_environment(environment, episodes)
     58         raise ValueError(
     59             'Given `time_step`: %r does not match expected `time_step_spec`: %r' %
---> 60             (time_step, time_step_spec))
     61
     62       action = random_policy.action(time_step).action

ValueError: Given `time_step`: TimeStep(step_type=array(0), reward=array(0., dtype=float32), discount=array(1., dtype=float32), observation=array([[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]])) does not match expected `time_step_spec`: TimeStep(step_type=ArraySpec(shape=(), dtype=dtype('int32'), name='step_type'), reward=ArraySpec(shape=(), dtype=dtype('float32'), name='reward'), discount=BoundedArraySpec(shape=(), dtype=dtype('float32'), name='discount', minimum=0.0, maximum=1.0), observation=BoundedArraySpec(shape=(3, 3), dtype=dtype('int32'), name='TicTacToe board state spec', minimum=-1, maximum=1))

Cas*_* Lo 6

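Your observation does not match the spec. The spec declares the observation as int32, but np.zeros((n, n)) creates a float64 array by default, so you need to pass dtype=np.int32 when creating the np array to make the types match.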
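As a rough illustration of why the check fails, here is a NumPy-only sketch. The helper matches_spec is hypothetical: it is a simplified stand-in for the kind of shape, dtype and bounds comparison a bounded array spec implies, not the actual TF-Agents implementation.

```python
import numpy as np


def matches_spec(array, shape, dtype, minimum, maximum):
    """Hypothetical helper: simplified stand-in for a bounded-spec check."""
    array = np.asarray(array)
    return (
        array.shape == shape
        and array.dtype == np.dtype(dtype)   # dtype must match exactly
        and bool(np.all(array >= minimum))
        and bool(np.all(array <= maximum))
    )


n = 3
# Mirrors the observation spec declared in the question.
spec = dict(shape=(n, n), dtype="int32", minimum=-1, maximum=1)

float_board = np.zeros((n, n))                # NumPy defaults to float64
int_board = np.zeros((n, n), dtype=np.int32)  # dtype now agrees with the spec

float_ok = matches_spec(float_board, **spec)
int_ok = matches_spec(int_board, **spec)

print(float_board.dtype, float_ok)
print(int_board.dtype, int_ok)
```

In the environment from the question, the corresponding change would be to create the board with np.zeros((n, n), dtype=np.int32) in __init__. Note also that ts.termination and ts.transition do not take a dtype argument, so the observation array itself has to carry the right dtype.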