
Why won't the TensorFlow Agents built-in DQN tutorial learn when I swap the Cartpole environment for my own, simpler environment?

I'm trying to train a DQN agent modeled almost exactly on the TensorFlow Agents DQN tutorial. Instead of Cartpole, I want it to learn a simple game in which a battery can buy and sell electricity as the price alternates between 1 and 2 every 12 time steps (twelve 1s, twelve 2s, twelve 1s, ...). The battery can hold 10 units of charge. The optimal policy is to buy while the price is 1 and sell while it is 2. All I did was add this cell, which imports the environment I wrote:

#import environment
from storage_environment import StorageEnvironment

# define price signal and max charges
price_signal = ([1] * 6 + [2] * 12 + [1] * 6) * 365
price_signal = [p * 1 for p in price_signal]
max_charge = 10

#load environment
train_py_env = StorageEnvironment(price_signal, max_charge)
eval_py_env = StorageEnvironment(price_signal, max_charge)
train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)
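
As an extra check (not part of the tutorial) I also run TF-Agents' built-in spec validator against the raw Python environment defined below; this is just a minimal sanity-check sketch:

from tf_agents.environments import utils

# Step the environment with random actions for a couple of episodes and
# confirm every observation and reward matches the declared specs
utils.validate_py_environment(train_py_env, episodes=2)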

Here is the environment:

from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
import numpy as np
from tf_agents.trajectories import time_step as ts
from matplotlib import pyplot as plt
import random

# The class for our environment
# The base class is the standard PyEnvironment
# In practice, we'll use a wrapper to convert this to a TensorFlow environment
class StorageEnvironment(py_environment.PyEnvironment):
    
    # price_signal: a list of prices, one for each timestep. The length of the episodes will be determined
    # by the length of this signal
    #
    # max_charge: the maximum charge of the battery
    def __init__(self, price_signal, max_charge):
        
        # Add the price signal and max charge as attributes
        self._price_signal = price_signal
        self._max_charge = max_charge
        
        # Keep track of the timestep
        self._timestep = 0
        
        # The charge begins at 0
        self._charge = 0
        
        # The balance and value begin at 0
        self._balance = 0
        self._value = 0
        
        # Actions are integers between 0 and 2
        self._action_spec = array_spec.BoundedArraySpec(
            shape = (), dtype = np.int32, minimum = 0, maximum = 2, name = 'action'
        )
        # Observations are floating-point vectors of length 2
        # The first element is the current price signal (min: 0, max: inf)
        # The second element is the current battery charge (min: 0, max: max_charge)
        self._observation_spec = array_spec.BoundedArraySpec(
            shape = (2,), dtype = np.float32, minimum = [0, 0], maximum = [np.inf, self._max_charge], name = 'observation'
        )
    
    # required implementation for inheritance
    def action_spec(self):
        return self._action_spec
    
    # required implementation for inheritance
    def observation_spec(self):
        return self._observation_spec
    
    # Reset environment - required for inheritance
    def _reset(self):
        # Set timestep to 0
        self._timestep = 0
        
        # Set price to first element of price signal
        self._current_price = self._price_signal[self._timestep]
        
        # Set charge to 0
        self._charge = 0
        
        # Set balance and value to 0
        self._balance = 0
        self._value = 0
        
        # Restart environment
        return ts.restart(
            observation = np.array([self._current_price, self._charge], dtype = np.float32)
        )
    
    # Take a step with an action (integer from 0 to 2)
    def _step(self, action):
        
        # If the last step was the final time step, ignore action and reset environment
        if self._current_time_step.is_last():
            return self.reset()
        
        # 1 -> idle
        # No reward and charge doesn't change
        if action == 1:
            pass
            
        # 0 -> discharge
        elif action == 0:
            if self._charge > 0:
                self._charge -= 1
                self._balance += self._current_price
                
        # 2 -> charge
        elif action == 2:
            if self._charge < self._max_charge:
                self._charge += 1
                self._balance -= self._current_price

        else:
            raise ValueError('action should be 0, 1, or 2')
        
        # Calculate reward
        # In practice, reward is equal to the change in the value of the energy currently stored by the battery
        self._new_value = self._balance + self._current_price*self._charge
        self._reward = self._new_value - self._value
        self._value = self._new_value
        
        # Alternatively:
        # self._reward = self._charge * (self._current_price-self._old_price)
            
        # If we've reached the end of the price signal, terminate the episode
        if self._timestep == len(self._price_signal) - 1:
            return ts.termination(
                observation = np.array([self._current_price, self._charge], dtype = np.float32),
                reward = self._reward
            )

        # If we've not reached the end of the price signal, transition to the next time step
        else:
            self._timestep += 1
            self._current_price = self._price_signal[self._timestep]

            return ts.transition(
                observation = np.array([self._current_price, self._charge], dtype = np.float32),
                reward = self._reward
            )
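
As a quick sanity check on the reward logic, I can step the environment by hand with the known optimal policy (charge while the price is 1, discharge while it is 2). This is only an illustrative sketch; the two-day price signal below is not the one used for training:

import numpy as np

# Run the hand-coded "buy low, sell high" policy for two 24-step days
env = StorageEnvironment(([1] * 6 + [2] * 12 + [1] * 6) * 2, 10)
time_step = env.reset()
total_reward = 0.0
while not time_step.is_last():
    price, charge = time_step.observation
    # charge (action 2) while the price is 1, discharge (action 0) while it is 2
    action = np.array(2 if price == 1 else 0, dtype = np.int32)
    time_step = env.step(action)
    total_reward += time_step.reward
print('return of the hand-coded policy:', total_reward)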

When I run the Cartpole tutorial in Colab, the algorithm finds the optimal policy within a few hundred iterations. I also extracted the Q-values; this plot shows the last 24 training time steps:

[plot: Q-values for the Cartpole agent over the last 24 training time steps]

On my problem, even after 20,000 iterations the Q-values rarely make sense (I would expect the "charge" and "discharge" curves to alternate like mirror-image square waves):

[plot: Q-values for the storage agent after 20,000 training iterations]
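
For reference, this is roughly how I pull the Q-values out of the network for these plots (a sketch that assumes the agent, q_net and eval_env variables from the tutorial notebook):

# Roll the greedy policy through the eval environment for 24 steps and
# record the Q-value the network assigns to each action at every step.
# Sketch only: `agent`, `q_net` and `eval_env` come from the tutorial cells.
time_step = eval_env.reset()
q_history = []
for _ in range(24):
    q_values, _ = q_net(time_step.observation)   # batched, shape (1, num_actions)
    q_history.append(q_values.numpy()[0])
    action_step = agent.policy.action(time_step)
    time_step = eval_env.step(action_step.action)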

I've tried changing the size of the network and using different learning rates, epsilon values, optimizers, and so on. Nothing seems to fix it. Even without changing any parameters, every run looks different.

My main question is: why is the algorithm robust enough to solve Cartpole, yet unable to learn in this much simpler environment? Am I missing something fundamental?
