How do I get this Double Deep Q-Network to converge to an optimal policy?

(This is for a school project.) I've been struggling with this problem for a while. I've managed to fix many issues, but this one has me stumped.

- Goal: build an RL stock-trading agent that makes optimal trading decisions based on the price movements within a time series.

- Setup: to simplify the problem, I was advised to use deterministic price movements (simulated from a sine function rather than real, noisy prices). There are 2 inputs: the last action taken by the agent, and a binary indicator variable that is 1 if the price goes up the next day and -1 if it goes down (a "perfect" indicator with 100% accuracy on the next day's price, used to strip the prediction/randomness part out of the problem). The output is 4 possible actions: buy, hold, close the position, or stay in cash (do nothing).

The network is built with Keras' Functional API: 2 hidden layers, the first with ReLU activation and the second with softmax, and a linear output layer. I've tried many reward signals: the price itself, the change in portfolio value, and the profit/loss after each action (both absolute and percentage), but the convergence problem persists (I considered inverse RL / imitation learning, but I can't get into it before the deadline). I use an epsilon-greedy scheme and anneal the exploration rate towards 0 at every time step. I also use the Double DQN framework (but without experience replay). Each observation (the pair of last action and 1/-1 indicator) is fed into the network one at a time (no mini-batches), the reward is computed, and SGD minimizes the Huber loss between the target Q-value (computed from a target network frozen every 10 iterations) and the estimated Q-value. Then the next observation from the next point of the time series is passed in, and so on.
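In case it matters for answers: I know a replay buffer with mini-batch sampling is the standard remedy for correlated single-sample updates; I currently don't have one. A minimal sketch of what it would look like (my own helper names, not part of the code below):

import random
from collections import deque

# Minimal replay-buffer sketch (hypothetical, not in my code below):
replay_buffer = deque(maxlen=2000)   # keep the most recent 2000 transitions

def store_transition(state, action, reward, next_state):
    replay_buffer.append((state, action, reward, next_state))

def sample_minibatch(batch_size=32):
    # Random mini-batch of past transitions, sampled to de-correlate updates
    return random.sample(replay_buffer, min(batch_size, len(replay_buffer)))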

- Expected behaviour: the policy should at least be sensible. Ideally, when the inputs are -1 (the price drops tomorrow) and 0 (the last action was a buy), the Q-values should lead the agent to close the position to avoid the loss. When it is in cash and the price is expected to rise, it should buy to capture the gain, and so on.

- Problem: what actually happens is that the agent keeps collecting negative rewards, or, when it does collect positive rewards, the outputs are similar or identical: one predicted Q-value is usually higher than the others regardless of the input, so the agent almost always takes the same action (which may look optimal on some steps but does not generalize to other observations).
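A quick way to see this symptom is to print the Q-values for every possible input combination (a sketch against the model defined in the code below, assuming the state is fed as [indicator, last action], matching my dataframe columns):

import numpy as np
import tensorflow as tf

# Diagnostic sketch: if one output dominates for all 8 combinations,
# the network is effectively ignoring its inputs.
for indicator in [-1, 1]:
    for last_act in [0, 1, 2, 3]:
        s = np.array([[indicator, last_act]], dtype='float64')
        q = model(tf.convert_to_tensor(s), training=False)
        print('indicator:', indicator, 'last action:', last_act, 'Q-values:', np.array(q)[0])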

I'm not sure whether the problem lies in my application of the theory or in the code (this is my first project in Python). Here is the code:

# Imports:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras

# Simulated sine price:
price = []
for i in range(-100, 101):
    price.append(np.sin(i) + 10)
df = pd.DataFrame(price)
df.columns = ['Close']

# Store the closing prices separately for later use (as a copy, so later changes to df don't affect it):
dt = df.copy()

# Computing the price variation (day-over-day return, in %):
df['DoD return'] = np.nan
for i in range(1, len(df)):
    df.loc[i, 'DoD return'] = ((df['Close'][i] - df['Close'][i-1]) / df['Close'][i-1]) * 100

# Replacing the first missing value with the mean of the next two values:
df.loc[0, 'DoD return'] = np.mean(df['DoD return'][1:3])

# Creating the next-day "perfect" predictor (1 if tomorrow's price goes up, -1 if it goes down):
df['Next U/D'] = 0
for i in range(1, len(df)):
    if df['DoD return'][i] >= 0:
        df.loc[i-1, 'Next U/D'] = 1
    else:
        df.loc[i-1, 'Next U/D'] = -1

df.loc[len(df)-1, 'Next U/D'] = -1  # filling the last missing value

# Drop the columns that are not part of the state (the two network inputs are the
# indicator and the last action; the raw Close price stays available in dt for the reward):
df = df.drop(['DoD return', 'Close'], axis=1)

class Environment:
    
    def __init__(self):
        pass
        
    # Takes in 0, 1, 2 or 3 (buy, hold, close, cash) and outputs the next possible actions given that last action:
    @staticmethod
    def action_space(action):
        return [1, 2] if action in [0, 1] else [0, 3]
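    # Example of the transition rules this encodes:
    #   action_space(0) -> [1, 2]  after a buy you can only hold or close
    #   action_space(1) -> [1, 2]  after a hold you can only hold or close
    #   action_space(2) -> [0, 3]  after closing you can buy again or stay in cash
    #   action_space(3) -> [0, 3]  while in cash you can buy or stay in cash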
    
    # Takes a dataframe index and outputs the state (one row of df) for that trading day:
    @staticmethod
    def current_state(state_id):
        return list(df.iloc[state_id])

    # Outputs the next state given the current state's index:
    @staticmethod
    def next_state(state_id):
        return list(df.iloc[state_id + 1])

    # Takes in an action and a state index (a row in the dataframe) and returns
    # the change in the portfolio caused by that action:
    @staticmethod
    def reward2(action, state_id):

        global current_equity, old_equity

        if action in [0, 3]:  # buying or staying in cash has no immediate effect, so equity stays the same
            return 0

        elif action in [1, 2]:  # hold, close
            for _ in range(state_id, -1, -1):  # walk backwards to find the entry price

                if orders[_] == 0:  # most recent buy order: this is the entry
                    entry_price = dt['Close'][_]           # entry price
                    current_price = dt['Close'][state_id]
                    p_l = (current_price - entry_price) * current_equity * fraction_invested  # profit/loss
                    old_equity = current_equity
                    current_equity += p_l
                    return current_equity - old_equity

epsilon = 1.0 # exploration probability
decay = 0.999 # factor by which to decay epsilon with each timestep
min_epsilon = 0.01 # decay epsilon to the limit of 1% (so it doesn't go all the way to 0)

last_action = 3 # initially we start with being in cash (action #3)
gamma = 0.5 # discount rate
alpha = 0.01 # learning rate

# Values that will be updated by the algorithm:
current_equity = 10 # current monetary units available to trade
old_equity = 10 # to be later updated
fraction_invested = 0.1 # invest 10% of account at each order


# Building the DQN architecture:
def DQN():
    inputs = tf.keras.layers.Input(shape=(2,),dtype='float64')
    hidd_layer1 = tf.keras.layers.Dense(20, activation=tf.nn.relu, dtype='float64', use_bias=True)(inputs)         # 1st hidden layer with 20 neurons (ReLU)
    hidd_layer2 = tf.keras.layers.Dense(20, activation=tf.nn.softmax, dtype='float64', use_bias=True)(hidd_layer1)  # 2nd hidden layer (softmax)
    outputs = tf.keras.layers.Dense(4, activation=None, kernel_initializer='zeros', bias_initializer='zeros', use_bias=True, dtype='float64')(hidd_layer2)  # linear output layer, one Q-value per action
    return tf.keras.Model(inputs=inputs,outputs=outputs)
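
# A variant of this head I might try (sketch, not what I actually run above): a softmax
# hidden layer squashes its outputs onto a probability simplex, and together with the
# zero-initialized output layer it makes every initial Q-value identical, so learning can
# stall. A more conventional DQN head would be (my own sketch, untested here):
# def DQN_alt():
#     inputs = tf.keras.layers.Input(shape=(2,), dtype='float64')
#     h1 = tf.keras.layers.Dense(20, activation='relu', dtype='float64')(inputs)
#     h2 = tf.keras.layers.Dense(20, activation='relu', dtype='float64')(h1)
#     outputs = tf.keras.layers.Dense(4, activation=None, dtype='float64')(h2)  # default (non-zero) kernel init
#     return tf.keras.Model(inputs=inputs, outputs=outputs)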

# Store built Network
model = DQN()

# Make a second instance with the same architecture for the Target Network
# (its weights are synced from the online network every 10 iterations in the loop below):
target_dqn = DQN()

# Optimizer:
optimizer = tf.keras.optimizers.SGD(learning_rate = alpha)

# Choose Loss Function:
loss_function = keras.losses.Huber()

# Main training loop:
cumulative_reward = 0  # must be defined before the while condition is first evaluated
while cumulative_reward < 100:
    
    last_action = 3  # initial state: being in cash
    states_history = []
    actions_history = []
    rewards_history = []
    Losses = []
    orders = []
    cumulative_reward = 0
    cumul_reward = []
    curr_equity = []  # equity log (appended to at every timestep below)
    
    current_equity = 10
    old_equity = 10
    
    # This is the second input to the network: last action taken
    # will be added to the original dataframe as the actions are being taken

    df['Last action'] = 0
    df.loc[0, 'Last action'] = 3  # we start by being in cash

    
    # Iterate over the time steps; stop one step early so that next_state and the
    # 'Last action' shift below never index past the end of the dataframe:
    for state_index in range(0, len(df) - 1):
        
        # Get the current state from the Environment:
        state = np.array(Environment.current_state(state_index), dtype='float64').reshape(1, -1)

        # Epsilon-greedy:
        epsilon = max(epsilon, min_epsilon)  # keep epsilon above its 1% floor

        if epsilon >= np.random.rand(1)[0]:
            action = np.random.choice(Environment.action_space(last_action))  # pick a random action from the allowed action space

        # else take the max Q-value action (best action):
        else:
            state_tensor = tf.convert_to_tensor(state)
            action_probs = model(state_tensor, training=False)  # feed the state into the DQN to estimate the Q-values

            # Restrict the argmax to the actions allowed after the last action, so a
            # disallowed action cannot be picked when Q-values are tied (e.g. right
            # after the zero initialization of the output layer):
            allowed = Environment.action_space(last_action)
            q_values = np.array(action_probs)[0]
            action = allowed[int(np.argmax(q_values[allowed]))]
                 
        # Log orders:
        orders.append(action)
        
        df.loc[state_index + 1, 'Last action'] = action  # record this action as tomorrow's 'last action' input
        
        # Apply the sampled action in our environment
            
        # Reward of taken action:
        action_reward = Environment.reward2(action,state_index)
        
        # Next state given current state index:
        next_state = np.array(Environment.next_state(state_index))
     
        
        # Store states,actions and rewards:
        states_history.append(state)
        actions_history.append(action)
        rewards_history.append(action_reward)

        # Estimate future rewards of next state:
        future_rewards = target_dqn.predict(tf.expand_dims(next_state,axis=0))
        
        
        # Updating target Q-value: Q(s,a) = reward + gamma * max Q(s',a')
        target_q_value = action_reward + gamma * tf.reduce_max(future_rewards,axis=1)
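
        # For reference: in Double DQN as usually described, the online network selects
        # the next action and the target network evaluates it, i.e. something like
        # (sketch only, not what the line above does):
        #     next_q_online = model(tf.expand_dims(next_state, axis=0), training=False)
        #     best_next = int(tf.argmax(next_q_online[0]))
        #     next_q_target = target_dqn(tf.expand_dims(next_state, axis=0), training=False)
        #     target_q_value = action_reward + gamma * next_q_target[0, best_next]
        # The line above instead takes the max over the target network only, which is the
        # vanilla DQN target with a frozen target network.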
    
        # create one-hot encoding filter with 4 columns (one for each action)
        Filter = tf.one_hot(actions_history,depth=4,dtype='float64')
        
        with tf.GradientTape() as tape:
            # NN-predicted Q-values for the stored state(s):
            pred_q_values = model(np.vstack(states_history))

            # Multiplying the predicted Q-values by the one-hot mask and summing
            # leaves only the Q-value of the action that was actually taken:
            q_action = tf.reduce_sum(tf.multiply(pred_q_values, Filter))

            # Calculate the loss between the target Q-value and the predicted Q-value:
            loss = loss_function(target_q_value, q_action)
      
        
        # Backpropagation:
        grads = tape.gradient(loss,model.trainable_variables)
        optimizer.apply_gradients(zip(grads,model.trainable_variables))
        
        # (No separate weight update is needed here: apply_gradients above already
        # updates the online network's weights in place.)
        
        # Update the Target Network every 10 iterations:
        if state_index%10 == 0:
            target_dqn.set_weights(model.get_weights())
        
        # Update last action taken
        last_action = action
        
        # Decay epsilon:
        epsilon = epsilon*decay
        
        # Update running reward:
        cumulative_reward += action_reward
        
        # Empty history lists
        states_history = []
        actions_history = []
        
        # Log loss:
        Losses.append(loss)
        
        # Log cumulative rewards:
        cumul_reward.append(cumulative_reward)
        
        # Log current equity:
        curr_equity.append(current_equity)
        
        # Show iteration,loss and exploration rate:
        print("--------------------------------------")
        print("- Iteration number:",state_index)
        print("- Loss is equal to:",np.array(loss))
        print("- Exploration probability is:",epsilon)
