How can I get this Double Deep Q-Network to converge to the optimal policy?

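(For a school project) I have been struggling with this problem for a while. I managed to solve many of the issues along the way, but this one has me stumped.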

- Goal: build an RL stock-trading agent that makes optimal trading decisions based on the price movements within a time series.

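- Setup: to simplify the problem, I was advised to use deterministic price movements (simulated from a sine function rather than real, noisy prices). There are 2 inputs: the last action taken by the agent, and a binary indicator variable that is 1 if the price goes up the next day and -1 if it goes down (a "perfect" indicator with 100% accuracy for the next day's price, used to take the prediction problem and its randomness out of the picture). The output is one of 4 possible actions: buy, hold, close the position, or stay in cash (do nothing).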

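The network is built with Keras's Functional API. It is 2 hidden layers deep: the first hidden layer uses a ReLU activation, the second uses a softmax activation, and the output layer uses a linear activation. I have tried many reward signals: the price, the change in portfolio value, and the profit/loss after each action (both in absolute terms and as a percentage), but the convergence problem persists. (I thought about using inverse RL / imitation learning, but I can't start on that before the deadline.) I use an epsilon-greedy scheme and gradually anneal the exploration rate toward 0 at every timestep (the code floors it at 1%). I also use the Double DQN framework, but without experience replay. Each observation (a pair of last action and 1/-1 indicator) is fed into the network on its own (no mini-batches); the network estimates the Q-values, and SGD minimizes the Huber loss between the target Q-value (computed from a target network that is frozen and refreshed every 10 iterations) and the estimated Q-value. Then the next observation, from the next point in the time series, is passed in, and so on.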
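For reference, the Double DQN target as it is usually formulated lets the online network select the next action and the target network evaluate it. The following is only a minimal sketch of that rule in plain Python: the function name `double_dqn_target`, the optional `allowed_actions` argument and the sample Q-values are hypothetical and are not taken from the code in this question.

```python
def double_dqn_target(reward, gamma, q_online_next, q_target_next, allowed_actions=None):
    """Double DQN target for a single transition (illustrative helper).

    q_online_next / q_target_next: Q-values of the next state from the online
    and the target network, given as plain lists with one entry per action.
    allowed_actions: optional list of action indices valid in the next state.
    """
    if allowed_actions is None:
        allowed_actions = range(len(q_online_next))
    # 1) The online network selects the next action ...
    best_next = max(allowed_actions, key=lambda a: q_online_next[a])
    # 2) ... and the target network evaluates that action.
    return reward + gamma * q_target_next[best_next]


# Made-up Q-values, purely for illustration:
q_online = [0.2, 1.0, 0.5, -0.3]
q_target = [0.4, 0.6, 0.9, 0.0]

target = double_dqn_target(0.1, 0.5, q_online, q_target)
# A plain DQN with a target network would instead take the max over q_target:
plain = 0.1 + 0.5 * max(q_target)
# Restricting the choice to the actions available from cash (buy or stay in cash):
restricted = double_dqn_target(0.1, 0.5, q_online, q_target, allowed_actions=[0, 3])
print(target, plain, restricted)
```

The two targets differ whenever the online and target networks disagree about which next action is best, which is the case Double DQN is designed for.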

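- Expected behavior: the learned behavior should be sensible. Ideally, when the input is -1 (the price will fall the next day) and 0 (the last action was "buy"), the Q-value outputs should lead the agent to close the position to avoid a loss. When the agent is in cash and the price is predicted to rise, it should buy to capture the gain, and so on.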
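One way to make this expectation concrete is to write it down as a hand-coded lookup policy and compare the trained agent's greedy actions against it. This is a sketch of my reading of the expected behavior, not a learned result; the function names and the action constants are introduced here for illustration, and the action numbering (0 = buy, 1 = hold, 2 = close, 3 = cash) follows the code further down.

```python
BUY, HOLD, CLOSE, CASH = 0, 1, 2, 3

def allowed_actions(last_action):
    # Same rule as Environment.action_space in the code of this question
    return [HOLD, CLOSE] if last_action in (BUY, HOLD) else [BUY, CASH]

def expected_action(last_action, next_up_down):
    """Hand-written policy the agent is hoped to learn (an assumption)."""
    in_position = last_action in (BUY, HOLD)
    if in_position:
        # Stay in while the price is about to rise, get out before it falls
        return HOLD if next_up_down == 1 else CLOSE
    # Out of the market: enter before a rise, otherwise stay in cash
    return BUY if next_up_down == 1 else CASH

# Enumerate all 8 possible (last action, indicator) inputs
for last in (BUY, HOLD, CLOSE, CASH):
    for signal in (1, -1):
        action = expected_action(last, signal)
        assert action in allowed_actions(last)
        print(f"last_action={last}, indicator={signal:+d} -> action={action}")
```

Since there are only 8 distinct inputs, the whole table can be checked by eye against the network's argmax for each input.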

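- Problem: what actually ends up happening is that the agent either keeps collecting negative rewards or, when it does collect positive rewards, produces similar or identical outputs. Typically one predicted Q-value is always higher than the others regardless of the input, so the agent almost always takes the same action (one that looks optimal but does not generalize to other observations).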

I'm not sure whether the problem lies in how I applied the theory or in the code (this is my first project in Python). Here is the code:

import numpy as np
import pandas as pd
import tensorflow as tf

# Simulated sine price:
price = [np.sin(i) + 10 for i in range(-100, 101)]
df = pd.DataFrame({'Close': price})

# Store Closing price separately for later use (a real copy, not an alias of df):
dt = df.copy()

# Computing the price variation (Day-over-Day return, in percent):
df['DoD return'] = df['Close'].pct_change() * 100

# Replacing first missing value with the mean of the two next values:
df.loc[0, 'DoD return'] = df['DoD return'].iloc[1:3].mean()

# Creating the next-day's "perfect" predictor: 1 if the next day's return is >= 0, else -1.
# The last row has no next day; comparing NaN with 0 is False, so it is filled with -1:
df['Next U/D'] = np.where(df['DoD return'].shift(-1) >= 0, 1, -1)

# Drop DoD from dataframe:
df = df.drop('DoD return',axis=1)

class Environment:
    
    def __init__(self):
        pass
        
    @staticmethod  # takes in 0,1,2 OR 3 for (buy,hold,close or cash) and outputs the next possible actions given that last action:
    def action_space(action):   
        return [1,2] if action in [0,1] else [0,3]
    
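    # Takes a dataframe row index and returns the current state for that trading day.
    # Only the two network inputs described above are returned ('Close' is left out),
    # so the state matches Input(shape=(2,)):
    @staticmethod
    def current_state(index):
        return [df.loc[index, 'Next U/D'], df.loc[index, 'Last action']]
                       
    # Returns the next state given the current state's row index:
    @staticmethod
    def next_state(index):
        return Environment.current_state(index + 1)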

    # takes in an action and a state index (row index in the dataframe) and calculates
    # the change in the portfolio's equity after the action:
    @staticmethod
    def reward2(action,state_id):
        
        global current_equity,old_equity
        
        if action in [0,3]: # buying or staying in cash won't make any immediate effect so equity stays the same
            return 0
        
        elif action in [1,2]: # hold,close
            for t in range(state_id,-1,-1): # walk backwards to find the entry price
                
                if orders[t] == 0:  # most recent buy entry
                    entry_price = dt['Close'][t] # store entry price
                    current_price = dt['Close'][state_id]
                    p_l = (current_price - entry_price)*current_equity*fraction_invested # Profit/Loss
                    old_equity = current_equity
                    current_equity += p_l
                    return current_equity - old_equity

        return 0 # no open position was found, so there is nothing to mark to market

epsilon = 1.0 # exploration probability
decay = 0.999 # factor by which to decay epsilon with each timestep
min_epsilon = 0.01 # decay epsilon to the limit of 1% (so it doesn't go all the way to 0)

last_action = 3 # initially we start with being in cash (action #3)
gamma = 0.5 # discount rate
alpha = 0.01 # learning rate

# Values that will be updated by the algorithm:
current_equity = 10 # current monetary units available to trade
old_equity = 10 # to be later updated
fraction_invested = 0.1 # invest 10% of account at each order
cumulative_reward = 0 # must exist before the training loop's `while` condition is first evaluated


# Building the DQN architecture:
def DQN():
    inputs = tf.keras.layers.Input(shape=(2,),dtype='float64')
    hidd_layer1 = tf.keras.layers.Dense(20,activation=tf.nn.relu,dtype='float64',use_bias=True )(inputs)  # 1st hidden layer with 20 neurons
    hidd_layer2 = tf.keras.layers.Dense(20,activation=tf.nn.softmax,dtype='float64',use_bias=True )(hidd_layer1)  # 2nd hidden layer with 20 neurons
    # output layer with 4 outputs, one for each of the 4 actions (float64 so it matches the one-hot mask used in the loss)
    outputs = tf.keras.layers.Dense(4,activation = None,kernel_initializer = 'zeros',bias_initializer = 'zeros',dtype='float64',use_bias=True)(hidd_layer2)
    return tf.keras.Model(inputs=inputs,outputs=outputs)

# Store built Network
model = DQN()

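# Make a copy of it for the Target Network (Double DQN)
target_dqn = DQN()
target_dqn.set_weights(model.get_weights()) # DQN() alone builds fresh weights, so copy the online network's weights explicitly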

# Optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate = alpha)

# Choose Loss Function:
loss_function = tf.keras.losses.Huber()

# Main training loop:
while cumulative_reward < 100:
    
    last_action = 3  # initial state: being in cash
    states_history = []
    actions_history = []
    rewards_history = []
    Losses = []
    orders = []
    cumulative_reward = 0
    cumul_reward = []
    curr_equity = []  # equity log, appended to at the end of every timestep
    
    current_equity = 10
    old_equity = 10
    
    # This is the second input to the network: last action taken
    # will be added to the original dataframe as the actions are being taken

    df['Last action'] = 3 # we start by being in cash; later rows are overwritten as actions are taken

    
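    for state_index in range(len(df) - 1): # iterate over all time-steps (states); stop one early because each update needs the next state
        
        # get current state from the Environment as a (1, n_features) array:
        state = np.array([Environment.current_state(state_index)], dtype='float64')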
        
        # epsilon-greedy:
        epsilon = max(epsilon,min_epsilon) # keep epsilon from decaying below min_epsilon (1%)
        
        if epsilon >= np.random.rand(1)[0]:
            action = np.random.choice(Environment.action_space(last_action))  # pick a random action from action space
            
        # else take Max Q-value action (best action):
        else:
            # input state into DQN to make a Q value prediction (one value per action)
            q_values = model(tf.convert_to_tensor(state),training=False).numpy()[0]
            
            # Only the actions available after last_action are candidates. Pick the
            # candidate with the highest Q-value by index; matching Q-values by
            # equality can return an unavailable action when several values tie
            # (e.g. with the zero-initialised output layer).
            allowed = Environment.action_space(last_action)
            action = max(allowed, key=lambda a: q_values[a])
                 
        # Log orders:
        orders.append(action)
        
        df.loc[state_index + 1, 'Last action'] = action # .loc avoids chained assignment, which may write to a temporary copy
        
        # Apply the sampled action in our environment
            
        # Reward of taken action:
        action_reward = Environment.reward2(action,state_index)
        
        # Next state given current state index:
        next_state = np.array(Environment.next_state(state_index))
     
        
        # Store states,actions and rewards:
        states_history.append(state)
        actions_history.append(action)
        rewards_history.append(action_reward)

        # Estimate future rewards of next state:
        future_rewards = target_dqn.predict(tf.expand_dims(next_state,axis=0))
        
        
        # Updating target Q-value: Q(s,a) = reward + gamma * max Q(s',a')
        target_q_value = action_reward + gamma * tf.reduce_max(future_rewards,axis=1)
    
        # create one-hot encoding filter with 4 columns (one for each action)
        Filter = tf.one_hot(actions_history,depth=4,dtype='float64')
        
        with tf.GradientTape() as tape:
            # NN-predicted Q-value 
            pred_q_values = model(states_history)
    
            # Multiplying the predicted Q-values to the mask will give us a matrix that has the Q-value for each action taken
            q_action = tf.reduce_sum(tf.multiply(pred_q_values,Filter))
            
            # Calculate loss between target Q-value and predicted Q-value
            loss = loss_function(target_q_value,q_action)  
      
        
        # Backpropagation:
        grads = tape.gradient(loss,model.trainable_variables)
        optimizer.apply_gradients(zip(grads,model.trainable_variables))
        
        # No explicit weight update is needed here: apply_gradients() above
        # has already updated the model's weights in place.
        
        # Update the Target Network every 10 iterations:
        if state_index%10 == 0:
            target_dqn.set_weights(model.get_weights())
        
        # Update last action taken
        last_action = action
        
        # Decay epsilon:
        epsilon = epsilon*decay
        
        # Update running reward:
        cumulative_reward += action_reward
        
        # Empty history lists
        states_history = []
        actions_history = []
        
        # Log loss:
        Losses.append(loss)
        
        # Log cumulative rewards:
        cumul_reward.append(cumulative_reward)
        
        # Log current equity:
        curr_equity.append(current_equity)
        
        # Show iteration,loss and exploration rate:
        print("--------------------------------------")
        print("- Iteration number:",state_index)
        print("- Loss is equal to:",np.array(loss))
        print("- Exploration probability is:",epsilon)
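As a side check on the exploration schedule, the arithmetic below estimates how long epsilon takes to reach its floor. It is a rough sketch using the values of `epsilon`, `decay` and `min_epsilon` from the code above; `steps_per_episode` is my own assumption of roughly one pass over the 201-point series. Note that in the code above epsilon is not reset at the start of each pass, so the decay carries over from one pass to the next.

```python
import math

epsilon, decay, min_epsilon = 1.0, 0.999, 0.01  # values from the code above
steps_per_episode = 200  # assumption: about one pass over the 201-point series

# Smallest n with epsilon * decay**n <= min_epsilon
steps_to_floor = math.ceil(math.log(min_epsilon / epsilon) / math.log(decay))

# Exploration rate left after a single pass over the series
eps_after_one_episode = epsilon * decay ** steps_per_episode

print("steps until epsilon reaches its floor:", steps_to_floor)
print("passes over the series needed:", steps_to_floor / steps_per_episode)
print("epsilon after one pass:", round(eps_after_one_episode, 3))
```

Under these assumptions the agent is still acting mostly at random during the first pass, and it takes on the order of twenty passes before the greedy action dominates.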
