How can I use a TensorFlow Optimizer without recomputing activations in a reinforcement learning program that returns control after each iteration?

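As of TensorFlow 0.6, what you want to do is very difficult. Your best bet is to bite the bullet and call run() multiple times, at the cost of recomputing the activations. That said, we are well aware of this issue internally. A prototype "partial run" solution is in the works, but there is no timeline for its completion right now. Since a truly satisfactory answer might require modifying TensorFlow itself, you could also open a GitHub issue for this and see whether anyone else has something to say about it.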

Edit: There is now experimental support for partial_run. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/client/session.py#L317

Question

Edit (1/3/16): the corresponding GitHub issue

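I am using TensorFlow (Python interface) to implement a Q-learning agent with function approximation, trained using stochastic gradient descent.

At each iteration of the experiment, a step function in the agent is called. It updates the parameters of the approximator based on the new reward and activation, and then chooses a new action to perform.

Here is the problem (in reinforcement learning jargon):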

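- The agent computes its state-action value predictions to choose an action.
- It then gives control back to another program, which simulates a step in the environment.
- Now the agent's step function is called for the next iteration. I want to use TensorFlow's Optimizer class to compute the gradients for me. However, this requires both the state-action value predictions that I computed in the last step and their graph. So:
  - If I run the optimizer on the whole graph, it has to recompute the state-action value predictions.
  - If I store the prediction (for the chosen action) as a variable and then feed it to the optimizer as a placeholder, it no longer has the graph it needs to compute the gradients.
  - I can't run it all in the same sess.run() statement, because I have to give up control and return the chosen action in order to get the next observation and reward (which are used in the target for the loss function).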
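
To make the control flow above concrete, here is a minimal sketch with a lookup table standing in for the neural network. The `TabularAgent` class, its `start`/`step` methods and the toy one-state caller loop are hypothetical names chosen for illustration; they are not part of RL-Glue or TensorFlow. The only point is that the correction for the previous action cannot happen until the caller comes back with a reward and a new observation.

```python
import random


class TabularAgent:
    """Q-learning agent whose update is split across two calls (illustrative only)."""

    def __init__(self, n_states, n_actions, lr=0.1, discount=0.5, epsilon=0.2, seed=0):
        self.q = [[0.0] * n_actions for _ in range(n_states)]
        self.lr, self.discount, self.epsilon = lr, discount, epsilon
        self.rng = random.Random(seed)
        self.last_state = None
        self.last_action = None

    def _choose(self, state):
        # Epsilon-greedy choice from the current predictions.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.q[state]))
        row = self.q[state]
        return row.index(max(row))

    def start(self, state):
        # First call of an episode: nothing to learn from yet.
        self.last_state, self.last_action = state, self._choose(state)
        return self.last_action  # control goes back to the caller here

    def step(self, reward, state):
        # Only now are the reward and next state known, so only now can the
        # prediction made during the previous call be corrected.
        target = reward + self.discount * max(self.q[state])
        old = self.q[self.last_state][self.last_action]
        self.q[self.last_state][self.last_action] = old + self.lr * (target - old)
        self.last_state, self.last_action = state, self._choose(state)
        return self.last_action


# Toy caller standing in for the environment: action 1 always pays 1.0.
agent = TabularAgent(n_states=1, n_actions=2)
action = agent.start(0)
for _ in range(500):
    reward = 1.0 if action == 1 else 0.0
    action = agent.step(reward, 0)
```

With a table the stored prediction is just a number, so nothing is recomputed. With a network, the same split leaves the forward pass in one call and the gradient computation in the next, which is where the difficulty described above comes from.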

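So, is there a way that I can (without the reinforcement learning jargon):

- Compute part of my graph, returning value1.
- Return value1 to the calling program, which computes value2.
- In the next iteration, use value2 as part of my loss function for gradient descent, without recomputing the part of the graph that computes value1.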
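
Outside of TensorFlow, the three steps above amount to keeping the forward activations around until the target arrives. The sketch below does this by hand with NumPy for a one-hidden-layer network like the one in the code further down. The `TinyQNet` class and its `forward`/`backward` methods are hypothetical names, and the gradients are written out manually rather than produced by an Optimizer, so this only illustrates the data flow being asked for, not a TensorFlow solution.

```python
import numpy as np


class TinyQNet:
    """One hidden ReLU layer, linear outputs; caches activations between calls."""

    def __init__(self, n_in, n_hidden, n_actions, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)
        self.lr = lr
        self.cache = None
        self.forward_calls = 0

    def forward(self, obs):
        # Step 1: compute value1 (the Q-values) and remember the intermediates.
        self.forward_calls += 1
        z1 = obs @ self.W1 + self.b1
        h1 = np.maximum(z1, 0.0)
        q = h1 @ self.W2 + self.b2
        self.cache = (obs, z1, h1, q)
        return q

    def backward(self, action, target):
        # Step 3: combine value2 (the target) with the cached activations only.
        obs, z1, h1, q = self.cache
        dq = np.zeros_like(q)
        dq[action] = 2.0 * (q[action] - target)  # derivative of (target - q[a])^2
        dW2 = np.outer(h1, dq)
        db2 = dq
        dh1 = self.W2 @ dq
        dz1 = dh1 * (z1 > 0.0)
        dW1 = np.outer(obs, dz1)
        db1 = dz1
        for param, grad in ((self.W1, dW1), (self.b1, db1),
                            (self.W2, dW2), (self.b2, db2)):
            param -= self.lr * grad  # in-place SGD step
        return (target - q[action]) ** 2


net = TinyQNet(n_in=27, n_hidden=12, n_actions=4)
obs = np.linspace(0.0, 1.0, 27)
q_before = net.forward(obs)                 # value1, handed back to the caller
action = int(np.argmax(q_before))
target = 1.0                                # value2, supplied by the caller later
loss_before = net.backward(action, target)  # no second forward pass needed
loss_after = (target - net.forward(obs)[action]) ** 2
```

The `forward_calls` counter is there only to show that `backward` never re-runs the network. The price is that every gradient is hand-derived, which is the inconvenience that motivates wanting the Optimizer class in the first place.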

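Of course, I have considered the obvious solutions:

- Just hardcode the gradients: this would be easy for the very simple approximators I am using now, but very inconvenient if I were experimenting with different filters and activation functions in a large convolutional network. I would really like to use the Optimizer class if possible.
- Call the environment simulation from within the agent: this system does this, but it would make mine more complicated and remove a lot of the modularity and structure. So I don't want to do this.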

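I have read through the API and the white paper several times, but can't seem to come up with a solution. I tried to find a way to feed the target into a graph to calculate the gradients, but couldn't find a way to build that graph automatically.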

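If it turns out this isn't possible in TensorFlow yet, do you think it would be very complicated to implement it as a new operator? (I haven't used C++ in a couple of years, so the TensorFlow source looks a little intimidating.) Or am I better off switching to something like Torch, which has imperative differentiation (Autograd) instead of symbolic differentiation?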
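
For readers unfamiliar with the distinction, the sketch below is a toy define-by-run (imperative) reverse-mode differentiator in plain Python. It is not Torch's Autograd or any real library: the `Value` class is hypothetical and supports only the few scalar operations needed here. It shows why this style fits the problem: the forward result can be handed to the caller, and the loss can be built later on top of the same recorded nodes.

```python
class Value:
    """Scalar that records how it was computed so gradients can flow back later."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # Values this one was computed from
        self._grad_fns = grad_fns  # maps upstream gradient to each parent's share

    @staticmethod
    def _wrap(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._wrap(other)
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = Value._wrap(other)
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def __neg__(self):
        return self * -1.0

    def __sub__(self, other):
        return self + (-Value._wrap(other))

    def backward(self):
        # Visit nodes in reverse topological order, accumulating gradients.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, fn in zip(node._parents, node._grad_fns):
                parent.grad += fn(node.grad)


# Iteration t: compute the prediction and hand its number to the caller.
w, b = Value(0.5), Value(0.1)
x = 2.0
prediction = w * x + b    # forward pass recorded once
value1 = prediction.data  # plain float returned to the environment loop

# Iteration t+1: the caller supplies value2; build the loss on the old node.
value2 = 3.0
error = prediction - value2
loss = error * error
loss.backward()           # no second forward pass through w * x + b
```

After `loss.backward()`, `w.grad` and `b.grad` hold the derivatives of the squared error, computed from the nodes recorded during the earlier forward pass.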

Thanks for taking the time to help me with this. I have tried to keep it as concise as I could.

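EDIT: After some further searching, I came across this previously asked question. It is a little different from mine (they are trying to avoid updating an LSTM network every iteration in Torch), and it doesn't have any answers yet.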

Here is some code, if that helps:

'''
-Q-Learning agent for a grid-world environment.
-Receives input as raw RGB pixel representation of the screen.
-Uses an artificial neural network function approximator with one hidden layer

2015 Jonathon Byrd
'''

import random
import sys
#import copy
from rlglue.agent.Agent import Agent
from rlglue.agent import AgentLoader as AgentLoader
from rlglue.types import Action
from rlglue.types import Observation

import tensorflow as tf
import numpy as np

world_size = (3,3)
total_spaces = world_size[0] * world_size[1]

class simple_agent(Agent):

    #Constants
    discount_factor = tf.constant(0.5,name="discount_factor")
    learning_rate = tf.constant(0.01,name="learning_rate")
    exploration_rate = tf.Variable(0.2,name="exploration_rate")  # used to be a constant :P
    hidden_layer_size = 12

    #Network Parameters - weights and biases
    W = [tf.Variable(tf.truncated_normal([total_spaces * 3,hidden_layer_size],stddev=0.1),name="layer_1_weights"),
         tf.Variable(tf.truncated_normal([hidden_layer_size,4],stddev=0.1),name="layer_2_weights")]
    b = [tf.Variable(tf.zeros([hidden_layer_size]),name="layer_1_biases"),tf.Variable(tf.zeros([4]),name="layer_2_biases")]

    #Input placeholders - observation and reward
    screen = tf.placeholder(tf.float32,shape=[1,total_spaces * 3],name="observation") #input pixel rgb values
    reward = tf.placeholder(tf.float32,shape=[],name="reward")

    #last step data
    last_obs = np.array([1,2,3],ndmin=4)
    last_act = -1

    #Last step placeholders
    last_screen = tf.placeholder(tf.float32,name="previous_observation")
    last_move = tf.placeholder(tf.int32,shape = [],name="previous_action")

    next_prediction = tf.placeholder(tf.float32,name="next_prediction")

    step_count = 0

    def __init__(self):
        #Initialize computational graphs
        self.q_preds = self.Q(self.screen)
        self.last_q_preds = self.Q(self.last_screen)
        self.action = self.choose_action(self.q_preds)
        self.next_pred = self.max_q(self.q_preds)
        self.last_pred = self.act_to_pred(self.last_move,self.last_q_preds) # inefficient recomputation
        self.loss = self.error(self.last_pred,self.reward,self.next_prediction)
        self.train = self.learn(self.loss)
        #Summaries and Statistics
        tf.scalar_summary(['loss'],self.loss)
        tf.scalar_summary('reward',self.reward)
        #w_hist = tf.histogram_summary("weights",self.W[0])
        self.summary_op = tf.merge_all_summaries()
        self.sess = tf.Session()
        self.summary_writer = tf.train.SummaryWriter('tensorlogs',graph_def=self.sess.graph_def)


    def agent_init(self,taskSpec):
        print("agent_init called")
        self.sess.run(tf.initialize_all_variables())

    def agent_start(self,observation):
        #print("agent_start called,observation = {0}".format(observation.intArray))
        # divide by 255.0 (a float) so integer pixel values are not floor-divided to 0
        o = np.divide(np.reshape(np.asarray(observation.intArray),(1,total_spaces * 3)),255.0)
        return self.control(o)

    def agent_step(self,reward,observation):
        #print("agent_step called,observation = {0}".format(observation.intArray))
        print("step,reward: {0}".format(reward))
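        # reshape to the (1, total_spaces * 3) shape the "observation" placeholder expects
        o = np.divide(np.reshape(np.asarray(observation.intArray),(1,total_spaces * 3)),255.0)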

        next_prediction = self.sess.run([self.next_pred],feed_dict={self.screen:o})[0]

        if self.step_count % 10 == 0:
            summary_str = self.sess.run([self.summary_op,self.train],feed_dict={self.reward:reward,self.last_screen:self.last_obs,self.last_move:self.last_act,self.next_prediction:next_prediction})[0]

            self.summary_writer.add_summary(summary_str,global_step=self.step_count)
        else:
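            # the train op depends on last_screen and last_move, so they must be fed here too
            self.sess.run([self.train],feed_dict={self.screen:o,self.reward:reward,self.last_screen:self.last_obs,self.last_move:self.last_act,self.next_prediction:next_prediction})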

        return self.control(o)

    def control(self,observation):
        results = self.sess.run([self.action],feed_dict={self.screen:observation})
        action = results[0]

        self.last_act = action
        self.last_obs = observation

        if (action==0):  # convert action integer to direction character
            action = 'u'
        elif (action==1):
            action = 'l'
        elif (action==2):
            action = 'r'
        elif (action==3):
            action = 'd'
        returnAction=Action()
        returnAction.charArray=[action]
        #print("return action returned {0}".format(action))
        self.step_count += 1
        return returnAction

    def Q(self,obs):  #calculates state-action value prediction with feed-forward neural net
        with tf.name_scope('network_inference') as scope:
            h1 = tf.nn.relu(tf.matmul(obs,self.W[0]) + self.b[0])
            q_preds = tf.matmul(h1,self.W[1]) + self.b[1] #linear activation
            return tf.reshape(q_preds,shape=[4])

    def choose_action(self,q_preds):  #chooses action epsilon-greedily
        with tf.name_scope('action_choice') as scope:
            exploration_roll = tf.random_uniform([])
            #greedy_action = tf.argmax(q_preds,0)  # gets the action with the highest predicted Q-value
            #random_action = tf.cast(tf.floor(tf.random_uniform([],maxval=4.0)),tf.int64)

            #exploration rate updates
            #if self.step_count % 10000 == 0:
                #self.exploration_rate.assign(tf.div(self.exploration_rate,2))

            return tf.select(tf.greater_equal(exploration_roll,self.exploration_rate),
                tf.argmax(q_preds,0),  #greedy_action
                tf.cast(tf.floor(tf.random_uniform([],maxval=4.0)),tf.int64))  #random_action

        '''
        Why does this return NoneType?:

        flag = tf.select(tf.greater_equal(exploration_roll,self.exploration_rate),'g','r')
        if flag == 'g':  #greedy
            return tf.argmax(q_preds,0) # gets the action with the highest predicted Q-value
        elif flag == 'r':  #random
            return tf.cast(tf.floor(tf.random_uniform([],maxval=4.0)),tf.int64)
        '''

    def error(self,last_pred,r,next_pred):
        with tf.name_scope('loss_function') as scope:
            y = tf.add(r,tf.mul(self.discount_factor,next_pred)) #target
            return tf.square(tf.sub(y,last_pred)) #squared difference error


    def learn(self,loss): #Update parameters using stochastic gradient descent
        #TODO:  Either figure out how to avoid computing the q-prediction twice or just hardcode the gradients.
        with tf.name_scope('train') as scope:
            return tf.train.GradientDescentOptimizer(self.learning_rate).minimize(loss,var_list=[self.W[0],self.W[1],self.b[0],self.b[1]])


    def max_q(self,q_preds):
        with tf.name_scope('greedy_estimate') as scope:
            return tf.reduce_max(q_preds)  #best predicted action from current state

    def act_to_pred(self,a,preds): #get the value prediction for action a
        with tf.name_scope('get_prediction') as scope:
            return tf.slice(preds,tf.reshape(a,shape=[1]),[1])


    def agent_end(self,reward):
        pass

    def agent_cleanup(self):
        self.sess.close()

    def agent_message(self,inMessage):
        if inMessage=="what is your name?":
            return "my name is simple_agent"
        else:
            return "I don't know how to respond to your message"

if __name__=="__main__":
    AgentLoader.loadAgent(simple_agent())