

cs224n assignment 2: Implementing Word2vec

This post summarizes the theory portion of the cs224n assignment 2 lab.

For the original lab handout and code, see:

Stanford CS 224N | Natural Language Processing with Deep Learning



The model uses the center word to predict the outside word.


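Define two lookup tables, $U$ and $V$. They are also the only parameters of the network.

When processing a center word, look it up in $V$; when processing an outside word, look it up in $U$.

The lookup results ($u_i, v_j$) serve as the word vectors of the outside word and the center word, respectively.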
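
As a minimal sketch of the lookup step, assume each table is stored as a NumPy matrix with one row per word. The vocabulary size, embedding dimension, and word indices below are made up for illustration and do not come from the assignment:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 100, 10            # illustrative sizes (assumption)

U = rng.random((vocab_size, dim))    # outside-word table, row i is u_i
V = rng.random((vocab_size, dim))    # center-word table, row j is v_j

center_idx, outside_idx = 3, 7       # hypothetical word indices
v_c = V[center_idx]                  # center word: look up V
u_o = U[outside_idx]                 # outside word: look up U

score = u_o @ v_c                    # dot product u_o^T v_c used by the model
```

Each word therefore owns two vectors, one per table, and which one is used depends only on the role the word plays in the current (center, outside) pair.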


The probability that center word $c$ predicts outside word $o$ is:
$$P(O=o\mid C=c)=\frac{\exp(u_o^Tv_c)}{\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)}$$

import numpy as np


def softmax(x):
    """Compute the softmax function for each row of the input x.
    It is crucial that this function is optimized for speed because
    it will be used frequently in later code.

    Arguments:
    x -- A D dimensional vector or N x D dimensional numpy matrix.
    Return:
    x -- You are allowed to modify x in-place
    """
    orig_shape = x.shape

    if len(x.shape) > 1:
        # Matrix: subtract each row's max before exponentiating (numerical stability)
        tmp = np.max(x, axis=1)
        x -= tmp.reshape((x.shape[0], 1))
        x = np.exp(x)
        tmp = np.sum(x, axis=1)
        x /= tmp.reshape((x.shape[0], 1))
    else:
        # Vector: same steps on a single row
        tmp = np.max(x)
        x -= tmp
        x = np.exp(x)
        tmp = np.sum(x)
        x /= tmp

    assert x.shape == orig_shape
    return x


outsideWordVecs = np.random.rand(100, 10)  # U
centerWordVecs = np.random.rand(100, 10)  # V
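
To connect the formula to code, here is a small self-contained sketch that evaluates $P(O=w\mid C=c)$ for every word $w$ at once. The function name `outside_word_probs` and the index `3` are hypothetical, and the softmax is inlined so the snippet runs on its own:

```python
import numpy as np

def outside_word_probs(center_vec, outside_vecs):
    """Return P(O = w | C = c) for every word w in the vocabulary."""
    scores = outside_vecs @ center_vec   # u_w^T v_c for all w, shape (N,)
    scores = scores - np.max(scores)     # shift by the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

rng = np.random.default_rng(0)
U = rng.random((100, 10))                # outside-word vectors
V = rng.random((100, 10))                # center-word vectors

probs = outside_word_probs(V[3], U)      # distribution over all 100 words
```

Subtracting the maximum score does not change the result, because the same factor cancels in the numerator and the denominator.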



$P$ is the probability distribution over predicted words. Since this is a classification problem, we use the cross-entropy loss. Let $o$ denote the current prediction target; the loss function is then
$$\mathcal{L}=-\log P(O=o\mid C=c)$$
and the training objective is
$$\arg\min_{U,V}\ \mathcal{L}$$
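
A possible way to compute this loss directly is sketched below. It uses the identity $-\log P(O=o\mid C=c)=\log\sum_w \exp(u_w^Tv_c)-u_o^Tv_c$. The function name `naive_softmax_loss` is chosen here for illustration:

```python
import numpy as np

def naive_softmax_loss(center_vec, outside_idx, outside_vecs):
    """Cross-entropy loss -log P(O = o | C = c) for one (center, outside) pair."""
    scores = outside_vecs @ center_vec               # u_w^T v_c for all w
    m = np.max(scores)
    log_norm = m + np.log(np.sum(np.exp(scores - m)))  # stable log-sum-exp
    return log_norm - scores[outside_idx]

# Sanity check: if all scores are equal, the prediction is uniform over N words,
# so the loss is log(N).
uniform_loss = naive_softmax_loss(np.ones(3), 2, np.zeros((5, 3)))
```

With five words and identical scores, `uniform_loss` equals $\log 5$, which is the loss of a model that has learned nothing yet.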


$$\nabla\mathcal{L}=\left(\frac{\partial\mathcal{L}}{\partial V},\frac{\partial\mathcal{L}}{\partial U}\right)$$
For the first term, we only need to differentiate with respect to $v_c$; the gradient is zero for every other row of $V$. Write $\hat{y}$ for the predicted distribution, with $\hat{y}_w=P(O=w\mid C=c)$, and $y$ for the one-hot vector of the target $o$, both as row vectors, and let row $w$ of $U$ be $u_w^T$. It is then easy to show:
$$\begin{aligned} & \frac{\partial}{\partial v_c}\left(-\log\frac{\exp(u_o^Tv_c)}{\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)}\right)\\ & =\frac{\partial}{\partial v_c}\log\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)-\frac{\partial}{\partial v_c}\log\exp(u_o^Tv_c)\\ & =\sum_{w\in \text{vocab}}\frac{u_w^T\exp(u_w^Tv_c)}{\sum_{x\in \text{vocab}}\exp(u_x^Tv_c)}-u_o^T\\ & =\hat{y}U-yU\\ & =(\hat{y}-y)U \end{aligned}$$
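
One way to gain confidence in the result $(\hat{y}-y)U$ is to compare it against a finite-difference estimate. The sketch below does that on small random data; all names and sizes are illustrative assumptions:

```python
import numpy as np

def softmax_vec(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(v_c, o, U):
    """-log P(O = o | C = c), with row w of U holding u_w."""
    return -np.log(softmax_vec(U @ v_c)[o])

def grad_center(v_c, o, U):
    """Analytic gradient of the loss with respect to v_c: (y_hat - y) U."""
    y_hat = softmax_vec(U @ v_c)
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    return (y_hat - y) @ U

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))
v_c = rng.normal(size=4)
o = 2

analytic = grad_center(v_c, o, U)

# Central differences along each coordinate axis of v_c
eps = 1e-6
numeric = np.array([
    (loss(v_c + eps * e, o, U) - loss(v_c - eps * e, o, U)) / (2 * eps)
    for e in np.eye(4)
])
max_diff = np.max(np.abs(analytic - numeric))
```

If the derivation is right, `max_diff` should be tiny, limited only by floating-point error in the finite-difference estimate.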

For the second term, it suffices to compute $\frac{\partial\mathcal{L}}{\partial u_i}$ for each $i$. If $i\ne o$, the second derivative below vanishes because $u_o^Tv_c$ does not depend on $u_i$:
$$\begin{aligned} &\frac{\partial\mathcal{L}}{\partial u_i}\\ &=\frac{\partial}{\partial u_i}\log\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)-\frac{\partial}{\partial u_i}\log\exp(u_o^Tv_c)\\ &=\frac{v_c\exp(u_i^Tv_c)}{\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)}\\ &=\hat{y}_{i}v_c \end{aligned}$$
If $i=o$:
$$\begin{aligned} &\frac{\partial\mathcal{L}}{\partial u_o}\\ &=\frac{\partial}{\partial u_o}\log\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)-\frac{\partial}{\partial u_o}\log\exp(u_o^Tv_c)\\ &=\frac{v_c\exp(u_o^Tv_c)}{\sum_{w\in \text{vocab}}\exp(u_w^Tv_c)}-v_c\\ &=(\hat{y}_{o}-1)v_c \end{aligned}$$
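It is easy to see that, for each loss evaluation, updating $V$ takes a single step, whereas updating $U$ takes many steps: we must iterate over the whole vocabulary to obtain the full gradient with respect to $U$.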
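
The two cases for $u_i$ can be combined into one outer product, $(\hat{y}-y)^T v_c^T$, whose row $i$ is $(\hat{y}_i-y_i)v_c$. The following sketch (hypothetical names, random data) shows that every row of $U$ receives a non-zero gradient, which is exactly the cost described above:

```python
import numpy as np

def grad_outside(v_c, o, U):
    """Gradient of -log P(O = o | C = c) with respect to every row of U."""
    scores = U @ v_c
    e = np.exp(scores - np.max(scores))
    y_hat = e / e.sum()
    y_hat[o] -= 1.0                  # y_hat - y, where y is one-hot at o
    return np.outer(y_hat, v_c)      # row i is (y_hat_i - y_i) * v_c

rng = np.random.default_rng(0)
U = rng.normal(size=(6, 4))
v_c = rng.normal(size=4)

grad_U = grad_outside(v_c, 2, U)     # same shape as U
rows_touched = int(np.count_nonzero(np.abs(grad_U).sum(axis=1)))
```

Here `rows_touched` equals the vocabulary size, while the gradient with respect to $V$ touches only the single row $v_c$.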


If we reduce the number of lookups into $U$ and query only a few words, the computation shrinks accordingly. Let the correct prediction target (the positive sample) be $o$, the center word be $c$, and the randomly chosen negative samples be $w_s$, $1\le s\le K$. The new loss function is:
$$\mathcal{L}=-\log(\sigma(u_o^Tv_c))-\sum_{1\le s\le K}\log(\sigma(-u_{w_s}^Tv_c))$$


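$\log(\sigma(u_o^Tv_c))$ is the log-probability assigned to the positive sample; adding a minus sign turns it into a quantity to minimize.

$\sum_{1\le s\le K}\log(\sigma(-u_{w_s}^Tv_c))$ is the log-probability that the negative samples are rejected, since $\sigma(-x)=1-\sigma(x)$; adding a minus sign again fits the minimization objective.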
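
A minimal sketch of this loss follows. The function names are hypothetical, and the negatives are drawn uniformly from the vocabulary purely for simplicity; the text above only says they are chosen at random, so the sampling distribution here is an assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, o, neg_indices, U):
    """-log sigma(u_o^T v_c) - sum_s log sigma(-u_{w_s}^T v_c)."""
    pos_term = -np.log(sigmoid(U[o] @ v_c))
    neg_term = -np.sum(np.log(sigmoid(-(U[neg_indices] @ v_c))))
    return pos_term + neg_term

rng = np.random.default_rng(0)
U = rng.normal(size=(100, 10))
v_c = rng.normal(size=10)
o, K = 7, 5

# Uniform sampling without replacement, excluding the positive word (assumption)
candidates = np.delete(np.arange(U.shape[0]), o)
neg_indices = rng.choice(candidates, size=K, replace=False)

loss_value = neg_sampling_loss(v_c, o, neg_indices, U)
```

Only $K+1$ rows of $U$ are read, so the cost per training pair no longer grows with the vocabulary size.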

For a more detailed walkthrough of the idea, see the cs224n assignment 2 lab handout.


