Why does TF-Agents' built-in DQN tutorial fail to learn when I swap the CartPole environment for my own simpler environment?
I am trying to train a DQN agent modeled almost exactly on the TF-Agents DQN tutorial. Instead of CartPole, I want it to learn a simple game in which a battery buys and sells electricity as the price alternates between 1 and 2 every 12 timesteps (twelve 1s, twelve 2s, twelve 1s, ...). The battery can hold 10 units of charge. The optimal policy is to buy when the price is 1 and sell when it is 2. All I changed was adding this cell to import the environment I wrote:
# import environment
from storage_environment import StorageEnvironment

# define price signal and max charge
price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365
price_signal = [p * 1 for p in price_signal]
max_charge = 10

# load environment
train_py_env = StorageEnvironment(price_signal, max_charge)
eval_py_env = StorageEnvironment(price_signal, max_charge)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
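(For reference, once the tile `[1]*6 + [2]*12 + [1]*6` is repeated, the trailing 1s of each tile merge with the leading 1s of the next, so the full signal alternates in runs of twelve as described. A quick standalone check, with no TF dependencies:)

```python
# Build the same price signal as above and confirm the alternating pattern
price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365

print(len(price_signal))    # 24 * 365 = 8760 timesteps
print(price_signal[6:18])   # a run of twelve 2s
print(price_signal[18:30])  # 6 trailing 1s of one tile + 6 leading 1s of the next = twelve 1s
```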
Here is the environment:
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
import numpy as np
from tf_agents.trajectories import time_step as ts
from matplotlib import pyplot as plt
import random

# The class for our environment.
# The base class is the standard PyEnvironment;
# in practice, we'll use a wrapper to convert this to a TensorFlow environment.
class StorageEnvironment(py_environment.PyEnvironment):
    # price_signal: a list of prices, one for each timestep. The length of the
    # episodes will be determined by the length of this signal.
    #
    # max_charge: the maximum charge of the battery
    def __init__(self, price_signal, max_charge):
        # Add the price signal and max charge as attributes
        self._price_signal = price_signal
        self._max_charge = max_charge
        # Keep track of the timestep
        self._timestep = 0
        # The charge begins at 0
        self._charge = 0
        # The balance and value begin at 0
        self._balance = 0
        self._value = 0
        # Actions are integers between 0 and 2
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=2, name='action'
        )
        # Observations are floating-point vectors of length 2:
        # the first element is the current price (min: 0, max: inf),
        # the second element is the current battery charge (min: 0, max: max_charge)
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(2,), dtype=np.float32, minimum=[0, 0],
            maximum=[np.inf, self._max_charge], name='observation'
        )

    # Required implementation for inheritance
    def action_spec(self):
        return self._action_spec

    # Required implementation for inheritance
    def observation_spec(self):
        return self._observation_spec

    # Reset environment - required for inheritance
    def _reset(self):
        # Set timestep to 0
        self._timestep = 0
        # Set price to the first element of the price signal
        self._current_price = self._price_signal[self._timestep]
        # Set charge to 0
        self._charge = 0
        # Set balance and value to 0
        self._balance = 0
        self._value = 0
        # Restart environment
        return ts.restart(
            observation=np.array([self._current_price, self._charge], dtype=np.float32)
        )

    # Take a step with an action (integer from 0 to 2)
    def _step(self, action):
        # If the last step was the final timestep, ignore the action and reset
        if self._current_time_step.is_last():
            return self.reset()
        # 1 -> idle: no reward and the charge doesn't change
        if action == 1:
            pass
        # 0 -> discharge (sell one unit at the current price)
        elif action == 0:
            if self._charge > 0:
                self._charge -= 1
                self._balance += self._current_price
        # 2 -> charge (buy one unit at the current price)
        elif action == 2:
            if self._charge < self._max_charge:
                self._charge += 1
                self._balance -= self._current_price
        else:
            raise ValueError('action should be 0, 1, or 2')
        # Calculate the reward: the change in the combined value of the cash
        # balance and the energy currently stored in the battery
        self._new_value = self._balance + self._current_price * self._charge
        self._reward = self._new_value - self._value
        self._value = self._new_value
        # Alternatively:
        # self._reward = self._charge * (self._current_price - self._old_price)
        # If we've reached the end of the price signal, terminate the episode
        if self._timestep == len(self._price_signal) - 1:
            return ts.termination(
                observation=np.array([self._current_price, self._charge], dtype=np.float32),
                reward=self._reward
            )
        # Otherwise, transition to the next timestep
        else:
            self._timestep += 1
            self._current_price = self._price_signal[self._timestep]
            return ts.transition(
                observation=np.array([self._current_price, self._charge], dtype=np.float32),
                reward=self._reward
            )
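As a sanity check on the reward bookkeeping in `_step`, it can be replayed in plain Python without the TF-Agents machinery (`run_policy` here is a helper written only for this check, mirroring the accounting above): the buy-at-1/sell-at-2 policy should accumulate positive total reward, while always idling should accumulate zero.

```python
def run_policy(price_signal, max_charge, policy):
    """Replay StorageEnvironment's reward accounting (cash balance plus
    mark-to-market value of the stored energy) for a hand-written policy."""
    charge, balance, value = 0, 0.0, 0.0
    total_reward = 0.0
    for price in price_signal:
        action = policy(price, charge)
        if action == 0 and charge > 0:             # discharge / sell one unit
            charge -= 1
            balance += price
        elif action == 2 and charge < max_charge:  # charge / buy one unit
            charge += 1
            balance -= price
        new_value = balance + price * charge       # same accounting as _step
        total_reward += new_value - value
        value = new_value
    return total_reward

price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365
optimal = lambda price, charge: 2 if price == 1 else 0  # buy at 1, sell at 2
idle = lambda price, charge: 1                          # never trade

print(run_policy(price_signal, 10, optimal))  # large positive total
print(run_policy(price_signal, 10, idle))     # 0.0
```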
Running the CartPole tutorial in Colab, the algorithm finds the optimal policy within a few hundred iterations. I also extracted the Q-values; the plot shows the last 24 training timesteps:
For my problem, even after 20,000 iterations the Q-values rarely make sense (I would expect the "charge" and "discharge" curves to alternate like mirrored square waves):
I have tried changing the size of the network, using different learning rates, epsilon values, optimizers, etc. Nothing seems to fix it. Even without changing any parameters, every run looks different.
My main question is: why is the algorithm robust enough to solve CartPole but unable to learn in this much simpler environment? Am I missing something fundamental?