Main topics of this section
word representation
word2vec
How do we represent words?
Problems with resources like WordNet
1. Great as a resource but missing nuance
2. Missing new meanings of words
   - Impossible to keep up-to-date
3. Subjective
4. Requires human labor to create and adapt
5. Can't compute accurate word similarity
Problems with words as discrete symbols (one-hot vectors)
1. No natural notion of similarity for one-hot vectors
2. The vector dimension equals the vocabulary size, which is far too large
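The first problem can be seen directly: the dot product between any two distinct one-hot vectors is always zero, regardless of how related the words are. A minimal sketch (the toy vocabulary and `one_hot` helper here are illustrative, not part of any library):

```python
import numpy as np

# Toy vocabulary; in practice |V| can be hundreds of thousands.
vocab = ["hotel", "motel", "cat"]
V = len(vocab)

def one_hot(word):
    """Return the one-hot vector for a word (illustrative helper)."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# The dot product between two distinct one-hot vectors is always 0,
# even for closely related words like "hotel" and "motel".
print(np.dot(one_hot("hotel"), one_hot("motel")))  # 0.0
print(np.dot(one_hot("hotel"), one_hot("hotel")))  # 1.0
```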
Word2vec
Word2Vec: objective function
Maximum likelihood estimation
For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t.
The likelihood:

$$
L(\theta)=\prod_{t=1}^{T}\prod_{\substack{-m\le j\le m \\ j\ne 0}}P(w_{t+j}\mid w_t;\theta)
$$
The loss function (average negative log-likelihood):

$$
J(\theta)=-\frac{1}{T}\log L(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\sum_{\substack{-m\le j\le m \\ j\ne 0}}\log P(w_{t+j}\mid w_t;\theta)
$$
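A minimal sketch of evaluating $J(\theta)$ on a toy corpus, assuming randomly initialized vectors and the softmax definition of $P(o\mid c)$ (the corpus, dimensions, and variable names here are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: vocabulary, window size m, random center/context vectors.
vocab = ["the", "cat", "sat", "on", "mat"]
idx = {w: i for i, w in enumerate(vocab)}
d, m = 8, 2                            # embedding dim, window size
U = rng.normal(size=(len(vocab), d))   # context ("outside") vectors u_w
Vc = rng.normal(size=(len(vocab), d))  # center vectors v_w

def p(o, c):
    """Softmax probability P(o | c) over the vocabulary."""
    scores = U @ Vc[c]
    scores -= scores.max()             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

corpus = ["the", "cat", "sat", "on", "the", "mat"]
T = len(corpus)

# J = -(1/T) * sum_t sum_{-m<=j<=m, j!=0} log P(w_{t+j} | w_t)
J = 0.0
for t in range(T):
    for j in range(-m, m + 1):
        if j == 0 or not (0 <= t + j < T):
            continue
        J -= np.log(p(idx[corpus[t + j]], idx[corpus[t]]))
J /= T
print(J)  # positive; training would adjust U and Vc to reduce it
```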
How do we compute $P(w_{t+j}\mid w_t;\theta)$?
Define two vectors for each word $w$:
- $v_w$ when $w$ is a center word
- $u_w$ when $w$ is a context word
Then

$$
P(o\mid c)=\frac{\exp(u_o^T v_c)}{\sum_{w\in V}\exp(u_w^T v_c)}
$$
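This is just a softmax over the dot products $u_w^T v_c$. A small sketch with random toy vectors (the sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
V, d = 5, 4                      # toy vocabulary size and embedding dim
U = rng.normal(size=(V, d))      # context vectors u_w, one row per word
v_c = rng.normal(size=d)         # center word vector v_c

# P(o|c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)
scores = U @ v_c
shifted = scores - scores.max()  # shift for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()

print(probs)        # one probability per word; probs[o] is P(o|c)
print(probs.sum())  # sums to 1 by construction
```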
Let $f(\theta)=\log P(o\mid c)$ and take the partial derivative with respect to $v_c$:
$$
\frac{\partial f(\theta)}{\partial v_c}
=\frac{\partial}{\partial v_c}\log\frac{\exp(u_o^T v_c)}{\sum_{w\in V}\exp(u_w^T v_c)}
=\frac{\partial}{\partial v_c}\log\exp(u_o^T v_c)-\frac{\partial}{\partial v_c}\log\sum_{w\in V}\exp(u_w^T v_c)
$$
The first term:

$$
f_1(\theta)=\frac{\partial}{\partial v_c}\log\exp(u_o^T v_c)=\frac{\partial}{\partial v_c}\,u_o^T v_c=u_o
$$
The second term (chain rule, then recognizing the softmax):

$$
f_2(\theta)=\frac{\partial}{\partial v_c}\log\sum_{w\in V}\exp(u_w^T v_c)
=\frac{1}{\sum_{w\in V}\exp(u_w^T v_c)}\sum_{x\in V}\exp(u_x^T v_c)\cdot u_x
=\sum_{x\in V}\frac{\exp(u_x^T v_c)}{\sum_{w\in V}\exp(u_w^T v_c)}\cdot u_x
=\sum_{x\in V}P(x\mid c)\cdot u_x
$$
Combining the two terms:

$$
\frac{\partial f(\theta)}{\partial v_c}=f_1(\theta)-f_2(\theta)=u_o-\sum_{x\in V}P(x\mid c)\cdot u_x
$$
By basic probability theory, $\sum_{x\in V}P(x\mid c)\,u_x$ is the model's expected context vector given center word $c$. The gradient $\frac{\partial f}{\partial v_c}$ is therefore the difference between the observed context vector $u_o$ and this expectation: stepping along it pulls the model's expectation toward the actually observed $u_o$, which is exactly what maximizing $f$ should do.
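The derivation above can be checked numerically: the analytic gradient $u_o-\sum_{x\in V}P(x\mid c)\,u_x$ should match a finite-difference estimate of $\frac{\partial}{\partial v_c}\log P(o\mid c)$. A sketch with random toy vectors (sizes, the choice of $o$, and `log_p` are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
V, d = 6, 5
U = rng.normal(size=(V, d))      # context vectors u_w
v_c = rng.normal(size=d)         # center vector v_c
o = 2                            # index of the observed context word

def log_p(o, v):
    """log P(o | c) = u_o . v - log sum_w exp(u_w . v)."""
    scores = U @ v
    return scores[o] - np.log(np.exp(scores).sum())

# Analytic gradient: u_o - sum_x P(x|c) u_x
scores = U @ v_c
probs = np.exp(scores) / np.exp(scores).sum()
analytic = U[o] - probs @ U

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    e = np.zeros(d)
    e[i] = eps
    numeric[i] = (log_p(o, v_c + e) - log_p(o, v_c - e)) / (2 * eps)

# The two should agree to within finite-difference error.
print(np.max(np.abs(analytic - numeric)))
```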