使用 VowpalWabbit 优化 CTR点击率：如何选择合适的参数？

如何解决使用 VowpalWabbit 优化 CTR点击率：如何选择合适的参数？

我正在尝试使用 VowpalWabbit 优化某些文章或广告（操作）的点击率，给定设备类型（上下文）（遵循本文 vw tutorial）。但是，我无法使其可靠地收敛到最佳动作。

我创建了一个最小的工作示例（抱歉篇幅过长）：

import random
import numpy as np
from matplotlib import pyplot as plt
from vowpalwabbit import pyvw

plt.ion()

action_space = ["article-1","article-2","article-3"]

def running_mean(x,N):
    cumsum = np.cumsum(np.insert(x,0))
    return (cumsum[N:] - cumsum[:-N]) / float(N)

def to_vw_example_format(context,cb_label=None):
    if cb_label is not None:
        chosen_action,cost,prob = cb_label
    example_string = ""
    example_string += "shared |User device={} \n".format(context)
    for action in action_space:
        if cb_label is not None and action == chosen_action:
            example_string += "1:{}:{} ".format(cost,prob)
        example_string += "|Action ad={} \n".format(action)
    # Strip the last newline
    return example_string[:-1]

# deFinition of problem to solve,playing out the article with highest ctr given a context
context_to_action_ctr = {
    "device-1": {"article-1": 0.05,"article-2": 0.06,"article-3": 0.04},"device-2": {"article-1": 0.08,"article-2": 0.07,"article-3": 0.05},"device-3": {"article-1": 0.01,"article-2": 0.04,"article-3": 0.09},"device-4": {"article-1": 0.04,"article-3": 0.045},"device-5": {"article-1": 0.09,"article-2": 0.01,"article-3": 0.07},"device-6": {"article-1": 0.03,"article-2": 0.09,"article-3": 0.04}
}

#vw = f"--cb_explore 3 -q UA -q UU --epsilon 0.1"
vw = f"--cb_explore_adf -q UA -q UU --bag 5 "
#vw = f"--cb_explore_adf -q UA --epsilon 0.2"

actor = pyvw.vw(vw)

random_rewards = []
actor_rewards = []
optimal_rewards = []

for step in range(200000):

    # pick a random context
    device = random.choice(list(context_to_action_ctr.keys()))

    # let vw generate probability distribution
    # action_probabilities = np.array(actor.predict(f"|x device:{device}"))
    action_probabilities = np.array(actor.predict(to_vw_example_format(device)))

    # sample action
    probabilities = action_probabilities / action_probabilities.sum()
    action_idx = np.random.choice(len(probabilities),1,p=probabilities)[0]
    probability = action_probabilities[action_idx]

    # get reward/regret
    action_to_reward_regret = {
        action: (1,0) if random.random() < context_to_action_ctr[device][action] else (0,1) for action in action_space
    }

    actor_action = action_space[action_idx]
    random_action = random.choice(action_space)
    optimal_action = {
        "device-1": "article-2","device-2": "article-1","device-3": "article-3","device-4": "article-3","device-5": "article-1","device-6": "article-2",}[device]

    # update statistics
    actor_rewards.append(action_to_reward_regret[actor_action][0])
    random_rewards.append(action_to_reward_regret[random_action][0])
    optimal_rewards.append(action_to_reward_regret[optimal_action][0])

    # learn online
    reward,regret = action_to_reward_regret[actor_action]
    cost = -1 if reward == 1 else 0
    # actor.learn(f"{action_idx+1}:{cost}:{probability} |x device:{device}")
    actor.learn(to_vw_example_format(device,(actor_action,probability)))


    if step % 100 == 0 and step > 1000:
        plt.clf()
        axes = plt.gca()
        plt.title("Reward over time")
        plt.plot(running_mean(actor_rewards,10000),label=str(vw))
        plt.plot(running_mean(random_rewards,label="Random actions")
        plt.plot(running_mean(optimal_rewards,label="Optimal actions")
        plt.legend()
        plt.pause(0.0001)

本质上，有 3 种可能的操作（文章 1-3）和 6 个上下文（设备 1-6），每种组合都有特定的 CTR（点击率）和给定上下文的最佳操作（给定最高点击率的文章）装置）。每次迭代都会对随机上下文进行采样，并计算每个动作的奖励/遗憾。如果奖励为 1（用户点击），则 VowpalWabbit 用于学习的成本为 -1；如果奖励为 0（用户未点击），则成本为 0。随着时间的推移，该算法应该为每个设备找到最好的文章。

一些结果（随着时间的推移平均奖励）：

问题：

多次启动同一个程序，导致不同的结果（有时根本不收敛，有时找到最佳并坚持下去）
由于起步不佳（早期收敛到次优武器），随着时间的推移，算法仍未转向更好的武器。
该算法似乎很早就陷入了局部最小值，这定义了实验其余部分的性能。（即使有一些探索因素）

由于 CTR 相当小，并且需要大量播放才能收敛，我理解问题的难度。然而，我希望随着时间的推移，算法会找到最佳的。

我是否遗漏了 VowpalWabbit 的配置？

解决方法

如果所有策略彼此一致，仅使用没有 epsilon 参数的“--bag 5”就可以开始产生零概率。

我稍微修改了您的代码以跟踪来自 actor.predict 的概率。

第一幅图：所有“最佳动作”的预测概率的移动平均值。我们可以看到，对于其中的几个，我们实际上保持在 0 左右。带有决策的表的尾部显示所有分布实际上都类似于 [1,0]。所以没有机会从这一点恢复，因为我们基本上已经关闭了探索。

添加少量探索（--bag 5 --epsilon 0.02）有助于最终收敛到全局最小值并给出如下图：

学习似乎并不快，但有问题的上下文实际上是最模糊的，不会造成太大的遗憾。