Main topics of this section

  • word representation
  • word2vec

How do we represent words?

Problems with resources like WordNet

1. Great as a resource but missing nuance
2. Missing new meanings of words
   • Impossible to keep up-to-date
3. Subjective
4. Requires human labor to create and adapt
5. Can't compute accurate word similarity

Problems with words as discrete symbols (one-hot vectors)

1. No natural notion of similarity for one-hot vectors (see the sketch after this list)
2. The vector dimension equals the vocabulary size, so it becomes impractically large
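
A minimal sketch (with a made-up three-word vocabulary; real vocabularies run to hundreds of thousands of words) of why one-hot vectors carry no similarity signal: any two distinct words are orthogonal, so every similarity score between different words is zero.

```python
import numpy as np

# Toy vocabulary -- real ones have hundreds of thousands of entries,
# which is also why the vector dimension blows up.
vocab = ["motel", "hotel", "banana"]

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

motel, hotel = one_hot("motel", vocab), one_hot("hotel", vocab)
print(motel @ hotel)   # 0.0 -- "motel" and "hotel" look totally unrelated
print(motel @ motel)   # 1.0 -- every word is similar only to itself
```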

Word2vec

Word2Vec: objective function

We estimate the parameters by maximum likelihood. For each position $t = 1, \dots, T$, predict context words within a window of fixed size $m$, given center word $w_t$.

The likelihood:

$$\text{Likelihood} = L(\theta)=\prod_{t=1}^{T}\prod_{\substack{-m\leq j\leq m\\ j\neq 0}}P(w_{t+j}\mid w_t;\theta)$$

The objective (loss) function is the average negative log-likelihood:

$$J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m\leq j\leq m\\ j\neq 0}}\log P(w_{t+j}\mid w_t;\theta)$$
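
As a sketch of how this objective is computed, the snippet below walks the double sum over a toy corpus. The `prob` function is a hypothetical placeholder (a uniform distribution) standing in for $P(w_{t+j}\mid w_t;\theta)$, whose actual softmax form is derived below.

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
m = 2  # window size

def prob(context_word, center_word):
    # Placeholder for P(w_{t+j} | w_t; theta); the real model uses a softmax.
    return 1.0 / len(vocab)

T = len(corpus)
J = 0.0
for t in range(T):                       # each position t = 1..T
    for j in range(-m, m + 1):           # each offset within the window
        if j == 0 or not (0 <= t + j < T):
            continue                     # skip the center word and corpus edges
        J -= np.log(prob(corpus[t + j], corpus[t]))
J /= T
print(f"J(theta) = {J:.4f}")
```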

How do we calculate $P(w_{t+j}\mid w_t;\theta)$?

Define two vectors for each word $w$:

  • $v_w$ when $w$ is a center word
  • $u_w$ when $w$ is a context word


Then, for a center word $c$ and a context word $o$:

$$P(o\mid c) = \frac{\exp(u_o^{T}v_c)}{\sum\limits_{w\in V}\exp(u_w^{T}v_c)}$$
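
A minimal numpy sketch of this softmax. The vocabulary size, embedding dimension, and random initialization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                      # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))      # context vectors u_w, one row per word
v_c = rng.normal(size=d)         # center word vector v_c

scores = U @ v_c                                 # u_w^T v_c for every w in V
probs = np.exp(scores) / np.exp(scores).sum()    # softmax normalization
print(probs, probs.sum())        # a valid distribution: entries sum to 1
```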

Let $f(\theta)=\log P(o\mid c)$ and take the partial derivative with respect to $v_c$:

$$\begin{aligned}\frac{\partial f(\theta)}{\partial v_c}&=\frac{\partial}{\partial v_c}\log\frac{\exp(u_o^{T}v_c)}{\sum\limits_{w\in V}\exp(u_w^{T}v_c)}\\&=\frac{\partial}{\partial v_c}\log\exp(u_o^{T}v_c)-\frac{\partial}{\partial v_c}\log\sum\limits_{w\in V}\exp(u_w^{T}v_c)\end{aligned}$$
The first term (using $\log\exp(x)=x$):

$$f_1(\theta)=\frac{\partial}{\partial v_c}\log\exp(u_o^{T}v_c)=\frac{\partial}{\partial v_c}u_o^{T}v_c=u_o$$
The second term (applying the chain rule):

$$\begin{aligned}f_2(\theta)&=\frac{\partial}{\partial v_c}\log\sum\limits_{w\in V}\exp(u_w^{T}v_c)\\&=\frac{1}{\sum\limits_{w\in V}\exp(u_w^{T}v_c)}\sum\limits_{x\in V}\exp(u_x^{T}v_c)\cdot u_x\\&=\sum\limits_{x\in V}\frac{\exp(u_x^{T}v_c)}{\sum\limits_{w\in V}\exp(u_w^{T}v_c)}\cdot u_x\\&=\sum\limits_{x\in V}P(x\mid c)\cdot u_x\end{aligned}$$
Putting the two terms together:

$$\frac{\partial f(\theta)}{\partial v_c}=f_1(\theta)-f_2(\theta)=u_o-\sum\limits_{x\in V}P(x\mid c)\cdot u_x$$

By probability theory, $\sum_{x\in V}P(x\mid c)\,u_x$ is exactly the expected context vector under the model's current distribution $P(x\mid c)$. The gradient $\frac{\partial f}{\partial v_c}$ is therefore the difference between the actually observed context vector $u_o$ and this expectation ("observed minus expected"): it is exactly the correction needed to pull the model's prediction toward the observed context word, which is consistent with the definition of the gradient.
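
To build confidence in the derivation, the sketch below checks the analytic gradient $u_o-\sum_{x\in V}P(x\mid c)\,u_x$ against a finite-difference estimate of $\partial \log P(o\mid c)/\partial v_c$. The toy sizes, the word index `o`, and the random vectors are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, o = 6, 3, 2                      # vocab size, embedding dim, index of word o
U = rng.normal(size=(V, d))            # context vectors u_w
v_c = rng.normal(size=d)               # center word vector v_c

def log_p(v):
    # log P(o|c) = u_o^T v - log sum_w exp(u_w^T v)
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - E_{x ~ P(x|c)}[u_x]
probs = np.exp(U @ v_c) / np.exp(U @ v_c).sum()
analytic = U[o] - probs @ U

# Central finite-difference estimate along each coordinate of v_c
eps = 1e-6
numeric = np.array([
    (log_p(v_c + eps * np.eye(d)[i]) - log_p(v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True: the two gradients agree
```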