
Plotting the reward values of a multi-armed bandit


How can I plot the reward value at each iteration for this TensorFlow example of Multi-Arm Bandits with Per-Arm features (full code included)?

The tutorial has a regret metric with a plot:

# Imports needed by this snippet (defined earlier in the tutorial notebook).
import matplotlib.pyplot as plt
import tensorflow as tf

from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics
from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# `agent`, `per_arm_tf_env`, `BATCH_SIZE` and `reward_param` are defined
# earlier in the tutorial.

def _all_rewards(observation, hidden_param):
  """Outputs rewards for all actions, given an observation."""
  hidden_param = tf.cast(hidden_param, dtype=tf.float32)
  global_obs = observation['global']
  per_arm_obs = observation['per_arm']
  num_actions = tf.shape(per_arm_obs)[1]
  tiled_global = tf.tile(
      tf.expand_dims(global_obs, axis=1), [1, num_actions, 1])
  concatenated = tf.concat([tiled_global, per_arm_obs], axis=-1)
  rewards = tf.linalg.matvec(concatenated, hidden_param)
  return rewards

def optimal_reward(observation):
  """Outputs the maximum expected reward for every element in the batch."""
  return tf.reduce_max(_all_rewards(observation, reward_param), axis=1)

regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward)

num_iterations = 40  # @param
steps_per_loop = 1  # @param

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=steps_per_loop)

observers = [replay_buffer.add_batch, regret_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=per_arm_tf_env,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * BATCH_SIZE,
    observers=observers)

regret_values = []

for _ in range(num_iterations):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
  regret_values.append(regret_metric.result())

plt.plot(regret_values)
plt.title('Regret of LinUCB on the Linear per-arm environment')
plt.xlabel('Number of Iterations')
_ = plt.ylabel('Average Regret')

[Plot: Regret of LinUCB on the Linear per-arm environment]

And ultimately I would like a plot like this one, but showing the rewards increasing over the iterations; how can I modify the code to do that?
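For reference, here is a minimal sketch of one possible way to do this, assuming the `driver`, `replay_buffer`, `agent`, and `regret_metric` defined above: read the collected trajectory once per iteration before the buffer is cleared, average its `reward` field, and plot those values. The per-iteration averaging and the plot labels are my own choices, not part of the tutorial.

reward_values = []
regret_values = []

for _ in range(num_iterations):
  driver.run()
  # Trajectory collected in this iteration; its `reward` field holds the
  # observed rewards for the BATCH_SIZE steps just taken.
  trajectory = replay_buffer.gather_all()
  reward_values.append(tf.reduce_mean(trajectory.reward).numpy())
  loss_info = agent.train(trajectory)
  replay_buffer.clear()
  regret_values.append(regret_metric.result())

plt.plot(reward_values)
plt.title('Average observed reward per iteration')
plt.xlabel('Number of Iterations')
_ = plt.ylabel('Average Reward')

As the LinUCB policy improves, the averaged observed reward should trend upward while the regret trends downward; with `steps_per_loop = 1` the curve will be noisy, so increasing `steps_per_loop` or smoothing over a window of iterations would make the trend easier to see.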
