Reference: https://www.kesci.com/home/project/5e78854998d4a8002d2c4696

Overview

word2vec is a model for processing text; at its core it is a two-layer neural network.
The input is the one-hot encoding of the text and the output is the corresponding vector.
Word2vec learns vector representations of words from their context.

Shortcomings of earlier methods

Whether we use count vectors (CountVectorizer) or TF-IDF, both represent text through the global distribution of the vocabulary, so their drawbacks are obvious:

They ignore word order within a single sentence; for example, 'this is bad' has the same BOW representation as 'bad is this' (a concrete sketch follows below).
They ignore the context of words. Suppose we write: "He loved books. Education is best found in books." When processing either sentence we do not take the other one into account, even though there is a relationship between them.
Word2Vec was developed to overcome these two shortcomings.
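
To make the first point concrete, here is a minimal sketch (assuming scikit-learn's CountVectorizer is available; it is not used elsewhere in this note) showing that a bag-of-words representation ignores word order:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
bow = vec.fit_transform(['this is bad', 'bad is this'])
print(vec.get_feature_names_out())  # ['bad' 'is' 'this']
print(bow.toarray())                # both rows are [1 1 1]: word order is lost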

Advantages

Compared with vectors produced by one-hot encoding, the main advantage of word2vec vectors is that they turn sparse, high-dimensional vectors into dense, low-dimensional ones, while also capturing the relationships between words.
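
As a rough sketch of the dimensionality difference (the vocabulary size of 5000 and embedding size of 100 below are made-up numbers; the point is only sparse vs. dense):

import torch
import torch.nn as nn
import torch.nn.functional as F

n_words, dim = 5000, 100           # illustrative sizes only
word_idx = torch.tensor([42])      # an arbitrary word id

one_hot = F.one_hot(word_idx, num_classes=n_words).float()
dense = nn.Embedding(n_words, dim)(word_idx)

print(one_hot.shape, int((one_hot != 0).sum()))  # torch.Size([1, 5000]) 1  -> sparse, high-dimensional
print(dense.shape)                               # torch.Size([1, 100])     -> dense, low-dimensional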

Two models:

CBOW takes a word's context as input and the word itself as output.
Skip-gram takes a word as input and its context as output.
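
A tiny sketch (toy sentence and a window of one word per side, chosen only for illustration) of how the training pairs differ between the two models:

sentence = "the quick brown fox".split()

# CBOW: (context words) -> center word
cbow_pairs = [([sentence[i - 1], sentence[i + 1]], sentence[i])
              for i in range(1, len(sentence) - 1)]
# Skip-gram: center word -> each context word
skipgram_pairs = [(sentence[i], ctx)
                  for i in range(1, len(sentence) - 1)
                  for ctx in (sentence[i - 1], sentence[i + 1])]

print(cbow_pairs)      # [(['the', 'brown'], 'quick'), (['quick', 'fox'], 'brown')]
print(skipgram_pairs)  # [('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ('brown', 'fox')]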

Code walkthrough

  • Build the word-to-integer mapping: use set to get a deduplicated vocabulary, and each word's position in that vocabulary is its mapped value.
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

split_ind = int(len(text) * 0.8)
vocab = set(text)
vocab_size = len(vocab)
print('vocab_size:', vocab_size)

w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for i, w in enumerate(vocab)}
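
A quick sanity check of the mapping (the word 'computer' appears in the text above; its index depends on the set order, but i2w always inverts w2i):

idx = w2i['computer']
print(idx, i2w[idx])          # the exact index varies from run to run
assert i2w[idx] == 'computer'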
  • Build the dataset in the required form, using CBOW as an example
def create_cbow_dataset(text):
    data = []
    # stop CONTEXT_SIZE words before the end so text[i + 2] stays in range
    for i in range(CONTEXT_SIZE, len(text) - CONTEXT_SIZE):
        context = [text[i - 2], text[i - 1],
                   text[i + 1], text[i + 2]]
        target = text[i]
        data.append((context, target))
    return data
cbow_train = create_cbow_dataset(text)
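
A quick look at the first two (context, target) pairs this produces from the text above:

print(cbow_train[:2])
# [(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to')]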
  • Build the model

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CBOW(nn.Module):
        def __init__(self, vocab_size, embd_size, context_size, hidden_size):
            super(CBOW, self).__init__()
            self.embeddings = nn.Embedding(vocab_size, embd_size)
            # the 2*context_size context embeddings are concatenated into a single vector
            self.linear1 = nn.Linear(2 * context_size * embd_size, hidden_size)
            self.linear2 = nn.Linear(hidden_size, vocab_size)

        def forward(self, inputs):
            embedded = self.embeddings(inputs).view((1, -1))  # (1, 2*context_size*embd_size)
            hid = F.relu(self.linear1(embedded))
            out = self.linear2(hid)
            log_probs = F.log_softmax(out, dim=1)             # log-probabilities over the vocabulary
            return log_probs

    As for the nn.Embedding() function, the official source has an example that helps to understand it: https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#Embedding:

    Examples::
    
          >>> # an Embedding module containing 10 tensors of size 3
          >>> embedding = nn.Embedding(10, 3)
          >>> # a batch of 2 samples of 4 indices each
          >>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
          >>> embedding(input)
          tensor([[[-0.0251, -1.6902,  0.7172],
                   [-0.6431,  0.0748,  0.6969],
                   [ 1.4970,  1.3448, -0.9685],
                   [-0.3677, -2.7265, -0.1685]],
    
                  [[ 1.4970,  1.3448, -0.9685],
                   [ 0.4362, -0.4004,  0.9400],
                   [-0.6431,  0.0748,  0.6969],
                   [ 0.9124, -2.3616,  1.1151]]])
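
    As a minimal sanity check of the forward pass (using an untrained model; the hidden size of 64 is simply the value used in the training code below):

    model = CBOW(vocab_size, embd_size=100, context_size=CONTEXT_SIZE, hidden_size=64)
    context, target = cbow_train[0]
    ctx_var = torch.LongTensor([w2i[w] for w in context])  # 4 context word indices

    log_probs = model(ctx_var)
    print(log_probs.shape)        # torch.Size([1, vocab_size]): one log-probability per vocabulary word
    print(log_probs.exp().sum())  # ~1.0, because log_softmax returns log-probabilities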
  • Train the model
import torch
import torch.optim as optim
from torch.autograd import Variable

embd_size = 100
learning_rate = 0.001
n_epoch = 30

def train_cbow():
    hidden_size = 64
    losses = []
    loss_fn = nn.NLLLoss()
    model = CBOW(vocab_size, embd_size, CONTEXT_SIZE, hidden_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    for epoch in range(n_epoch):
        total_loss = .0
        for context, target in cbow_train:
            ctx_idxs = [w2i[w] for w in context]
            ctx_var = Variable(torch.LongTensor(ctx_idxs))

            model.zero_grad()
            log_probs = model(ctx_var)

            loss = loss_fn(log_probs, Variable(torch.LongTensor([w2i[target]])))

            loss.backward()
            optimizer.step()

            total_loss += loss.data
        losses.append(total_loss)
    return model, losses

Note that loss_fn = nn.NLLLoss() is a little special here: log_probs = model(ctx_var) returns log_softmax values, while the label torch.LongTensor([w2i[target]]) is used directly as a class index.
NLLLoss takes a vector of log-probabilities and a target label (which does not need to be one-hot encoded). It does not compute the log-probabilities for us, which is why it suits a network whose last layer is log_softmax.
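
A small sketch to make this concrete: NLLLoss applied to log_softmax outputs gives the same value as CrossEntropyLoss applied to the raw scores, and the target is a plain class index rather than a one-hot vector:

import torch
import torch.nn as nn
import torch.nn.functional as F

scores = torch.randn(1, 10)      # raw, unnormalized outputs of a final linear layer
target = torch.LongTensor([3])   # a plain class index, not one-hot

nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), target)
ce = nn.CrossEntropyLoss()(scores, target)
print(nll.item(), ce.item())     # the two values coincide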