Notation:

$$
\begin{aligned}
X_i &= \sum_{j=1}^{N} X_{i,j} \\
P_{i,k} &= \frac{X_{i,k}}{X_i} \\
ratio_{i,j,k} &= \frac{P_{i,k}}{P_{j,k}}
\end{aligned}
$$

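| Value of $ratio_{i,j,k}$ | Words j, k related | Words j, k unrelated |
| --- | --- | --- |
| Words i, k related | close to 1 | very large |
| Words i, k unrelated | very small | close to 1 |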
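As a small illustration of the definitions above, the following sketch computes $X_i$, $P_{i,k}$ and $ratio_{i,j,k}$ in plain Python. The count matrix `X` and the helper names are made up for this example and do not come from the original text.

```python
# Toy symmetric co-occurrence counts (hypothetical numbers, for illustration only).
# X[i][j] is how often word i and word j appear together.
X = [
    [0, 8, 1, 4],
    [8, 0, 1, 4],
    [1, 1, 0, 4],
    [4, 4, 4, 0],
]


def row_totals(X):
    """X_i = sum over j of X[i][j]."""
    return [sum(row) for row in X]


def cooccurrence_prob(X):
    """P[i][k] = X[i][k] / X_i."""
    totals = row_totals(X)
    return [[x / t for x in row] for row, t in zip(X, totals)]


def ratio(P, i, j, k):
    """ratio_{i,j,k} = P[i][k] / P[j][k]; requires P[j][k] > 0."""
    return P[i][k] / P[j][k]


P = cooccurrence_prob(X)
```

With these toy counts, `ratio(P, 0, 2, 1)` is well above 1 (word 0 co-occurs with word 1 often, word 2 rarely), while `ratio(P, 0, 1, 3)` equals 1 because words 0 and 1 co-occur with word 3 equally often.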

Derivation:
Suppose the word vectors have already been obtained; they should then agree well with the co-occurrence matrix. Let $g(w_i, w_j, w_k)$ be the function that computes $ratio_{i,j,k}$ from the word vectors $w_i, w_j, w_k$. Then:

$$
\frac{P_{i,k}}{P_{j,k}} = ratio_{i,j,k} = g(w_i, w_j, w_k)
$$
The two sides of this equation should be as close as possible, which gives the cost function:

$$
J = \sum_{i,j,k}^{N} \left( \frac{P_{i,k}}{P_{j,k}} - g(w_i, w_j, w_k) \right)^2
$$
However, this model involves three words at a time, so its complexity is $N \times N \times N$.
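To make the cubic cost concrete, here is a minimal sketch of the three-word cost function as a triple loop. The function name `naive_cost`, the choice to pass `g` in as a callable, and the decision to skip triples with $P_{j,k} = 0$ are assumptions of this sketch, not part of the original text.

```python
def naive_cost(P, W, g):
    """Sum of squared errors over all (i, j, k) triples.

    P: N x N matrix of co-occurrence probabilities.
    W: list of N word vectors.
    g: callable g(w_i, w_j, w_k) returning a scalar.
    Returns (cost, number_of_terms).
    """
    n = len(P)
    total = 0.0
    terms = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if P[j][k] == 0:
                    # The ratio is undefined here; skipping is an assumption.
                    continue
                diff = P[i][k] / P[j][k] - g(W[i], W[j], W[k])
                total += diff * diff
                terms += 1
    return total, terms
```

When no entry of `P` is zero, the loop evaluates exactly $N^3$ terms, which is what makes this formulation impractical for a real vocabulary.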
How to simplify:

  1. To capture the relationship between words i and j, g should involve $w_i - w_j$;
  2. $ratio_{i,j,k}$ is a scalar, so g must also be a scalar, which suggests that g contains $(w_i - w_j)^T w_k$;
  3. Finally, wrap this in the exponential $\exp()$, which gives $g(w_i, w_j, w_k) = \exp((w_i - w_j)^T w_k)$

$$
\begin{aligned}
\frac{P_{i,k}}{P_{j,k}} &= g(w_i, w_j, w_k) \\
\frac{P_{i,k}}{P_{j,k}} &= \exp((w_i - w_j)^T w_k) \\
\frac{P_{i,k}}{P_{j,k}} &= \exp(w_i^T w_k - w_j^T w_k) \\
\frac{P_{i,k}}{P_{j,k}} &= \frac{\exp(w_i^T w_k)}{\exp(w_j^T w_k)}
\end{aligned}
$$
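The chain of equalities above can be checked numerically. The sketch below implements $g$ with plain Python lists and compares it with the quotient form; the helper names `dot` and `g` and the sample vectors are illustrative assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def g(w_i, w_j, w_k):
    """g(w_i, w_j, w_k) = exp((w_i - w_j)^T w_k)."""
    diff = [a - b for a, b in zip(w_i, w_j)]
    return math.exp(dot(diff, w_k))


# Arbitrary sample vectors (hypothetical values).
w_i, w_j, w_k = [0.1, 0.2], [0.3, -0.4], [0.5, 0.6]

lhs = g(w_i, w_j, w_k)
rhs = math.exp(dot(w_i, w_k)) / math.exp(dot(w_j, w_k))
```

Here `lhs` and `rhs` agree up to floating-point rounding, which is the factorization that lets the numerator and denominator be matched separately.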
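Matching numerator and denominator, a natural choice is:

$$
P_{i,j} = \exp(w_i^T w_j)
$$

Taking logarithms of both sides and using $P_{i,j} = X_{i,j} / X_i$:

$$
\log(X_{i,j}) - \log(X_i) = w_i^T w_j
$$

The term $\log(X_i)$ depends only on $i$, so it can be absorbed into a bias $b_i$; a second bias $b_j$ is added so that the expression stays symmetric in $i$ and $j$:

$$
\log(X_{i,j}) = w_i^T w_j + b_i + b_j
$$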
The loss function becomes:

$$
J = \sum_{i,j}^{N} \left( w_i^T w_j + b_i + b_j - \log(X_{i,j}) \right)^2
$$
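This is a matrix-factorization approach, and it has one drawback: every term is weighted equally, no matter how often the word pair co-occurs.
Following the principle that more frequent word pairs should receive larger weights, a weighting term is added to the loss function: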
$$
J = \sum_{i,j}^{N} f(X_{i,j}) \left( w_i^T w_j + b_i + b_j - \log(X_{i,j}) \right)^2
$$

$$
f(x) = \begin{cases} (x / x_{max})^{0.75}, & \text{if } x < x_{max} \\ 1, & \text{if } x \ge x_{max} \end{cases}
$$
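The weighted loss can be sketched in a few lines of plain Python. The names `weight` and `weighted_loss`, the default `x_max=100.0` (the text does not fix a value), and the choice to skip zero counts are assumptions of this sketch. Skipping zeros is consistent with the formula, since $f(0) = 0$ while $\log(0)$ is undefined.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """f(x) = (x / x_max)^alpha if x < x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def weighted_loss(X, W, b, x_max=100.0):
    """J = sum_{i,j} f(X_ij) * (w_i^T w_j + b_i + b_j - log X_ij)^2.

    X: N x N co-occurrence counts.
    W: list of N word vectors.
    b: list of N biases.
    """
    total = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            if x <= 0:
                # f(0) = 0 and log(0) is undefined, so the term is skipped.
                continue
            pred = sum(a * c for a, c in zip(W[i], W[j])) + b[i] + b[j]
            total += weight(x, x_max) * (pred - math.log(x)) ** 2
    return total
```

Because the weight saturates at 1 for counts at or above `x_max`, very frequent pairs cannot dominate the sum, while rare pairs contribute proportionally less.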