How can I get this Double Deep Q-Network to converge to the optimal policy?
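(For a school project.) I have been struggling with this problem for a while. I have managed to fix many issues, but this one has me stumped.
- Goal: build an RL stock-trading agent that makes the best trading decision given the price movements in a time series.
- Setup: to simplify the problem, I was advised to use deterministic price movements (simulated from a sine function rather than real, noisy prices). The network has 2 inputs: the last action taken by the agent, and a binary indicator that is 1 if the price rises the next day and -1 if it falls (a "perfect" indicator with 100% accuracy on the next day's price direction, used to take the forecasting problem and its randomness out of the picture). The output is one of 4 possible actions: buy, hold, close the position, or stay in cash (do nothing).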
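The network is built with Keras's Functional API: 2 hidden layers, the first with ReLU activation and the second with softmax activation, followed by an output layer with linear activation. I have tried many reward signals: the price, the change in portfolio value, and the profit/loss after each action (both absolute and percentage), but the convergence problem persists. (I considered inverse RL / imitation learning, but I cannot get into that before the deadline.) I use an epsilon-greedy scheme and anneal the exploration rate toward 0 at every time step. I also use the Double DQN framework (but without experience replay). Each observation (a pair of last action and 1/-1 indicator) is fed to the network individually (no mini-batches), the reward is computed, and SGD minimizes the Huber loss between the target Q-value (computed from a target network that is frozen and refreshed every 10 iterations) and the estimated Q-value. The next observation, from the next point in the time series, is then passed through, and so on.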
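For comparison, the usual Double DQN target lets the online network choose the next action and the frozen target network score it, rather than taking the maximum over the target network's own outputs. The following is a minimal NumPy sketch under the assumption that both Q-value vectors for the next state are already available; the function name `double_dqn_target` and the optional `allowed` argument (restricting the choice to legal actions) are illustrative and do not appear in the code of this post.

```python
import numpy as np

def double_dqn_target(reward, gamma, q_online_next, q_target_next, allowed=None):
    """Return reward + gamma * Q_target(s', argmax_a Q_online(s', a)).

    q_online_next, q_target_next: 1-D sequences of Q-values for the next state.
    allowed: optional list of legal action indices (hypothetical helper argument).
    """
    q_online_next = np.asarray(q_online_next, dtype=float)
    q_target_next = np.asarray(q_target_next, dtype=float)
    candidates = list(range(len(q_online_next))) if allowed is None else list(allowed)
    # The online network selects the next action...
    best = candidates[int(np.argmax(q_online_next[candidates]))]
    # ...and the target network evaluates that action.
    return reward + gamma * q_target_next[best]

# Online net prefers action 1, so the target net's value for action 1 is used.
target = double_dqn_target(1.0, 0.5, [0.2, 0.9, 0.1, 0.4], [1.0, 2.0, 3.0, 4.0])
```

With the `allowed` argument set to the legal actions for the next step, the selected action can never be one the environment would reject.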
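- Expected behavior: the agent should act sensibly. Ideally, when the input is -1 (the price falls the next day) and 0 (the last action was "buy"), the Q-value outputs should lead the agent to close the position and avoid the loss. When it is in cash and the price is expected to rise, it should buy to capture the gain, and so on.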
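The expected behavior can be written down as a small lookup, which is handy for comparing against the greedy action of a trained network. This is only a sketch: the two cases named above (close when holding and the price will fall, buy when in cash and the price will rise) come from the description, while the remaining two (hold when holding and the price will rise, stay in cash when flat and the price will fall) are assumptions about what "and so on" covers. The name `expected_action` is hypothetical.

```python
BUY, HOLD, CLOSE, CASH = 0, 1, 2, 3

def expected_action(next_move, last_action):
    """Action a sensible agent would take, given the perfect indicator.

    next_move: 1 if the price rises the next day, -1 if it falls.
    last_action: the previous action (0 buy, 1 hold, 2 close, 3 cash).
    """
    in_position = last_action in (BUY, HOLD)
    if in_position:
        # Holding a position: keep it while the price rises, exit before a fall.
        return HOLD if next_move == 1 else CLOSE
    # Flat: enter before a rise, otherwise stay out (assumed).
    return BUY if next_move == 1 else CASH

# Print the expected action for all 8 possible (indicator, last action) inputs.
for move in (1, -1):
    for last in (BUY, HOLD, CLOSE, CASH):
        print(move, last, expected_action(move, last))
```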
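- Problem: what actually happens is that the agent either keeps collecting negative rewards or, when it does collect positive rewards, the outputs are similar or identical across inputs. Typically one predicted Q-value is always higher than the others regardless of the input, so the agent almost always takes the same action (which looks optimal for some observations but does not generalize to the others).
I'm not sure whether the problem lies in how I apply the theory or in the code (this is my first Python project). Here is the code: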
import numpy as np
import pandas as pd
import tensorflow as tf

# Simulated sine price:
price = []
for i in range(-100, 101):
    price.append(np.sin(i) + 10)
df = pd.DataFrame(price)
df.columns = ['Close']
# Store the closing price separately for later use (a copy, so later changes to df do not affect it):
dt = df[['Close']].copy()
# Compute the price variation (day-over-day return, in percent):
df['DoD return'] = df['Close'].pct_change() * 100
# Replace the first (missing) value with the mean of the next two values:
df.loc[0, 'DoD return'] = df['DoD return'].iloc[1:3].mean()
# Create the next day's "perfect" predictor: 1 if the next day's return is >= 0, else -1.
# The last row has no next day, so it falls through to -1:
df['Next U/D'] = np.where(df['DoD return'].shift(-1) >= 0, 1, -1)
# Drop DoD from the dataframe:
df = df.drop('DoD return', axis=1)
class Environment:
    # The two network inputs: the next-day indicator and the last action taken
    STATE_COLUMNS = ['Next U/D', 'Last action']

    def __init__(self):
        pass

    # Takes the last action (0, 1, 2 or 3 for buy, hold, close or cash) and returns the actions allowed next:
    @staticmethod
    def action_space(action):
        return [1, 2] if action in [0, 1] else [0, 3]

    # Takes a dataframe row index and returns the state for that trading day:
    @staticmethod
    def current_state(state_id):
        return list(df[Environment.STATE_COLUMNS].iloc[state_id])

    # Returns the next state given the current state's row index:
    @staticmethod
    def next_state(state_id):
        return list(df[Environment.STATE_COLUMNS].iloc[state_id + 1])

    # Takes an action and a state index (row in the dataframe) and returns the change
    # in portfolio value after the action:
    @staticmethod
    def reward2(action, state_id):
        global current_equity, old_equity
        if action in [0, 3]:  # buying or staying in cash has no immediate effect, so equity stays the same
            return 0
        # hold or close: walk back through the logged orders to find the entry price
        for i in range(state_id, -1, -1):
            if orders[i] == 0:  # buy entry
                entry_price = dt['Close'][i]  # store entry price
                current_price = dt['Close'][state_id]
                p_l = (current_price - entry_price) * current_equity * fraction_invested  # profit/loss
                old_equity = current_equity
                current_equity += p_l
                return current_equity - old_equity
        return 0  # no open position found
epsilon = 1.0 # exploration probability
decay = 0.999 # factor by which to decay epsilon with each timestep
min_epsilon = 0.01 # decay epsilon to the limit of 1% (so it doesn't go all the way to 0)
last_action = 3 # initially we start with being in cash (action #3)
gamma = 0.5 # discount rate
alpha = 0.01 # learning rate
# Values that will be updated by the algorithm:
current_equity = 10 # current monetary units available to trade
old_equity = 10 # to be later updated
fraction_invested = 0.1 # invest 10% of account at each order
# Building the DQN architecture:
def DQN():
    inputs = tf.keras.layers.Input(shape=(2,), dtype='float64')
    hidd_layer1 = tf.keras.layers.Dense(20, activation=tf.nn.relu, dtype='float64', use_bias=True)(inputs)  # 1st hidden layer with 20 neurons
    hidd_layer2 = tf.keras.layers.Dense(20, activation=tf.nn.softmax, dtype='float64', use_bias=True)(hidd_layer1)  # 2nd hidden layer with 20 neurons
    outputs = tf.keras.layers.Dense(4, activation=None, kernel_initializer='zeros', bias_initializer='zeros', dtype='float64', use_bias=True)(hidd_layer2)  # output layer: one Q-value for each of the 4 actions
    return tf.keras.Model(inputs=inputs, outputs=outputs)
# Store the built network:
model = DQN()
# Build a second network with the same architecture for the target network (Double DQN):
target_dqn = DQN()
# Optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate=alpha)
# Choose the loss function:
loss_function = tf.keras.losses.Huber()
# Main training loop:
cumulative_reward = 0  # must exist before the loop condition is first checked
while cumulative_reward < 100:
    last_action = 3  # initial state: being in cash
    states_history = []
    actions_history = []
    rewards_history = []
    Losses = []
    orders = []
    cumulative_reward = 0
    cumul_reward = []
    curr_equity = []
    current_equity = 10
    old_equity = 10
    # This is the second input to the network: the last action taken.
    # It is written into the dataframe as actions are taken; we start in cash (3):
    df['Last action'] = 3
    for state_index in range(len(df) - 1):  # iterate over the time steps, stopping early so a next state always exists
        # Get the current state from the Environment:
        state = np.array(Environment.current_state(state_index), dtype='float64')
        # Epsilon-greedy:
        epsilon = max(epsilon, min_epsilon)  # keep epsilon at or above the 1% floor
        allowed = Environment.action_space(last_action)
        if epsilon >= np.random.rand():
            action = int(np.random.choice(allowed))  # pick a random action from the action space
        # else take the max-Q-value action (best action):
        else:
            state_tensor = tf.expand_dims(tf.convert_to_tensor(state), axis=0)
            q_values = model(state_tensor, training=False).numpy()[0]  # feed the state into the DQN to predict Q-values
            # Take the allowed action with the highest Q-value:
            action = allowed[int(np.argmax(q_values[allowed]))]
        # Log orders:
        orders.append(action)
        df.loc[state_index + 1, 'Last action'] = action
        # Apply the chosen action in the environment and get its reward:
        action_reward = Environment.reward2(action, state_index)
        # Next state given the current state index:
        next_state = np.array(Environment.next_state(state_index), dtype='float64')
        # Store states, actions and rewards:
        states_history.append(state)
        actions_history.append(action)
        rewards_history.append(action_reward)
        # Estimate the future rewards of the next state with the target network:
        future_rewards = target_dqn(tf.expand_dims(next_state, axis=0), training=False)
        # Target Q-value: Q(s,a) = reward + gamma * max Q(s',a')
        target_q_value = float(action_reward) + gamma * tf.reduce_max(future_rewards, axis=1)
        with tf.GradientTape() as tape:
            # NN-predicted Q-values
            pred_q_values = model(tf.convert_to_tensor(np.array(states_history)))
            # One-hot mask with 4 columns (one per action); multiplying by it keeps only the Q-value of the action taken
            Filter = tf.one_hot(actions_history, depth=4, dtype=pred_q_values.dtype)
            q_action = tf.reduce_sum(tf.multiply(pred_q_values, Filter), axis=1)
            # Loss between the target Q-value and the predicted Q-value
            loss = loss_function(target_q_value, q_action)
        # Backpropagation (apply_gradients updates the weights in place):
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        # Update the target network every 10 iterations:
        if state_index % 10 == 0:
            target_dqn.set_weights(model.get_weights())
        # Update the last action taken
        last_action = action
        # Decay epsilon:
        epsilon = epsilon * decay
        # Update the running reward:
        cumulative_reward += action_reward
        # Empty the history lists
        states_history = []
        actions_history = []
        # Log loss:
        Losses.append(loss)
        # Log cumulative rewards:
        cumul_reward.append(cumulative_reward)
        # Log current equity:
        curr_equity.append(current_equity)
        # Show iteration, loss and exploration rate:
        print("--------------------------------------")
        print("- Iteration number:", state_index)
        print("- Loss is equal to:", np.array(loss))
        print("- Exploration probability is:", epsilon)
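As a sanity check on the exploration schedule, the number of time steps before epsilon reaches its floor follows directly from the multiplicative decay: epsilon after n steps is epsilon_start * decay^n. The sketch below works this out for the values used above (start 1.0, decay 0.999, floor 0.01); the helper name `steps_until_floor` is hypothetical.

```python
import math

def steps_until_floor(epsilon_start, decay, min_epsilon):
    """Smallest n such that epsilon_start * decay**n <= min_epsilon."""
    return math.ceil(math.log(min_epsilon / epsilon_start) / math.log(decay))

steps = steps_until_floor(1.0, 0.999, 0.01)
print(steps)  # 4603
```

Since epsilon is not reset between passes over the data, this corresponds to roughly 23 passes over the 201-point series before the agent acts almost entirely greedily.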