Understanding multi-agent learning in OpenAI Gym and Stable Baselines

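I am trying to develop a multi-agent reinforcement learning model using OpenAI Stable Baselines and Gym, as described in this article.

I am confused about how to specify the opponent agent. It seems the opponent is passed to the environment, like agent2 below: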

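from kaggle_environments import make

class ConnectFourGym:
    def __init__(self, agent2="random"):
        ks_env = make("connectx", debug=True)
        # None marks the slot of the agent being trained; agent2 is the opponent.
        self.env = ks_env.train([None, agent2])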

The ks_env.train() method appears to be a method of kaggle_environments.Environment:

def train(self,agents=[]):
    """
    Setup a lightweight training environment for a single agent.
    Note: This is designed to be a lightweight starting point which can
          be integrated with other frameworks (i.e. gym,stable-baselines).
          The reward returned by the "step" function here is a diff between the
          current and the previous step.
    Example:
        env = make("tictactoe")
        # Training agent in first position (player 1) against the default random agent.
        trainer = env.train([None,"random"])

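Q1. But I am confused. Why does ConnectFourGym.__init__() call the train() method? In other words, why is the environment doing the training? I would have thought train() should be part of the model: the article above uses the PPO algorithm, which contains a train() method. That PPO.train() is called when we call PPO.learn(), which makes sense.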

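Q2. However, after reading the code of PPO.learn(), I cannot see how it trains the current agent against multiple opponent agents. Shouldn't the model algorithm be doing that? Am I misreading it? Or is it that the model does not know the number of agents, which is known only to the environment, and that is why the environment contains train()? In that case, why do we need an explicit Environment.train() method at all? The environment would simply return rewards based on the behavior of the multiple agents, and the model would learn from them.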
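To make that second reading concrete, here is roughly what I imagine is going on, as a toy sketch only. Nothing below is kaggle_environments or Stable Baselines code: SingleAgentView, random_opponent, run_episode, and the counting game itself are names I made up for illustration. The assumption is that the opponent's move is played inside step(), so the learner only ever sees an ordinary single-agent reset/step interface.

```python
import random


class SingleAgentView:
    """Hypothetical two-player game exposed as a single-agent environment.

    Players alternately add 1 or 2 to a running total; whoever brings the
    total to TARGET or above wins. The opponent is owned by the environment.
    """

    TARGET = 10

    def __init__(self, opponent):
        self.opponent = opponent  # callable: total -> 1 or 2 (assumed interface)
        self.total = 0

    def reset(self):
        self.total = 0
        return self.total

    def step(self, action):
        # The learner's move.
        self.total += action
        if self.total >= self.TARGET:
            return self.total, 1.0, True
        # The opponent's move happens here, hidden from the learner.
        self.total += self.opponent(self.total)
        if self.total >= self.TARGET:
            return self.total, -1.0, True
        return self.total, 0.0, False


def random_opponent(total):
    # Stand-in for a built-in opponent such as "random".
    return random.choice([1, 2])


def run_episode(env, policy):
    # The learner's loop never mentions the opponent at all.
    obs, reward, done, steps = env.reset(), 0.0, False, 0
    while not done:
        obs, reward, done = env.step(policy(obs))
        steps += 1
    return reward, steps
```

If this picture is right, then whatever plays the role of policy (PPO in the article) would only need observations and rewards, and the number of agents would be a detail of the environment.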

Or am I completely confused about the basic concepts? Can someone help me?