Copyright notice: these posts take real effort to put together; if you repost, please credit the source. Thanks! http://blog.csdn.net/m0_37306360/article/details/79313809

Word Embeddings in PyTorch

Before introducing the N-Gram language model, let's briefly look at how to use embeddings in PyTorch, and in deep learning programming in general. To use word embeddings we first assign an index to each word; the embeddings are then stored as a |V| × D matrix, where D is the embedding dimension and |V| is the vocabulary size. PyTorch provides the torch.nn.Embedding module for this: it stores the word embeddings and takes two arguments, the vocabulary size and the embedding dimension. Usage is as follows:

import torch
import torch.autograd as autograd
import torch.nn as nn

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.LongTensor([word_to_ix["hello"]])  # convert the word's index into a tensor
hello_embed = embeds(autograd.Variable(lookup_tensor))   # wrap the tensor in a Variable and look up its embedding
print(hello_embed)

Output:
Variable containing:
 0.7441  1.4024  0.0598  0.6661  0.8819
[torch.FloatTensor of size 1x5]
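
Several words can be looked up in one call: passing a LongTensor with multiple indices returns one embedding row per index (a small illustrative snippet reusing embeds and word_to_ix from above):

both = torch.LongTensor([word_to_ix["hello"], word_to_ix["world"]])
print(embeds(autograd.Variable(both)))  # Variable of size 2x5, one row per word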

Note: word embeddings are stored in a torch.nn.Embedding module. During training the model repeatedly looks values up from this table and updates them; once training is finished you are left with a complete set of embeddings. This is why word embeddings are often described as a by-product of training a model, yet this by-product has turned out to be a major breakthrough in NLP.
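
To see that this embedding table really is just an ordinary learnable parameter, you can inspect its weight directly (a small sketch reusing the embeds module from above; the actual numbers are random):

print(embeds.weight.size())  # torch.Size([2, 5]): the |V| x D lookup table
# embeds.weight is an nn.Parameter with requires_grad=True, so the optimizer
# updates it during training just like any other layer's weights.
print(embeds.weight)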

N-Gram Language Modeling

In computational linguistics, an n-gram is a contiguous sequence of n words from a given text; n-grams are usually collected from a corpus. An n-gram of size 1 is called a "unigram", size 2 a "bigram", size 3 a "trigram", and so on.
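
For example, the following small snippet (just an illustration, not part of the model code below) extracts n-grams from a tokenized sentence:

sentence = "the quick brown fox jumps".split()

def ngrams(tokens, n):
    # slide a window of width n over the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(sentence, 1))  # unigrams: ('the',), ('quick',), ...
print(ngrams(sentence, 2))  # bigrams:  ('the', 'quick'), ('quick', 'brown'), ...
print(ngrams(sentence, 3))  # trigrams: ('the', 'quick', 'brown'), ...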

Put simply, an N-gram language model assumes that the current word depends only on the N words that precede it. When training the neural network we therefore build training pairs of the form (previous N words, current word); after training on a large corpus, we obtain a word embedding for every word in the vocabulary.

import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

test_sentence = """When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a totter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold.""".split()

# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)

trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]

vocab = set(test_sentence)

word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for i, word in enumerate(vocab)}
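
# Optional sanity check: inspect the first few (context, target) pairs.
# Note that iterating over a Python set has no fixed order, so the indices in
# word_to_ix / ix_to_word can differ between runs.
# print(trigrams[:3])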

class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        # nn.Embedding is a simple lookup table that stores embeddings for a fixed-size dictionary.
        # It is commonly used to store word embeddings and retrieve them by index:
        # the input is a list of indices, the output is the corresponding embeddings.
        # Internally it holds a (vocab_size x embedding_dim) weight matrix.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        # Look up the embedding row for each input index (context_size x embedding_dim)
        # and flatten into a single row vector of shape (1, context_size * embedding_dim).
        embeds = self.embeddings(inputs).view((1, -1))

        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


# model
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)

losses = []
loss_function = nn.NLLLoss()
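# Note: nn.NLLLoss expects log-probabilities as input, which is why forward()
# ends with log_softmax; the combination is equivalent to a cross-entropy loss.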
optimizer = optim.SGD(model.parameters(), lr=0.0001)


for epoch in range(2000):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:
        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in Variables)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))


        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_var)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, autograd.Variable(
            torch.LongTensor([word_to_ix[target]])))

        # Step 5. Do the backward pass and update the gradient
        loss.backward()
        optimizer.step()

        total_loss += loss.data
    losses.append(total_loss)
    if epoch % 100 == 0:
        print("total loss: ", total_loss[0])  # the loss keeps decreasing as training progresses

# After training, look up the learned embedding for each word in the vocabulary.
for i in range(len(vocab)):
    lookup_tensor = torch.LongTensor([i])
    print(ix_to_word[i])
    print(model.embeddings(autograd.Variable(lookup_tensor)))

Output:
total loss:  520.6974487304688
total loss:  495.9122619628906
total loss:  472.0893249511719
total loss:  448.5113525390625
total loss:  424.6984558105469
total loss:  400.19091796875
total loss:  374.609130859375
total loss:  347.8531188964844
total loss:  319.9690856933594
total loss:  291.2785339355469
total loss:  262.2281799316406
total loss:  233.42706298828125
total loss:  205.58128356933594
total loss:  179.31130981445312
total loss:  155.13775634765625
total loss:  133.40785217285156
total loss:  114.27467346191406
total loss:  97.73793029785156
total loss:  83.66402435302734
total loss:  71.83905029296875
total loss:  61.99850845336914
total loss:  53.86662292480469
total loss:  47.167816162109375
total loss:  41.64924621582031
total loss:  37.089263916015625
total loss:  33.29917526245117
total loss:  30.126102447509766
total loss:  27.44985008239746
total loss:  25.175071716308594
total loss:  23.226322174072266
total loss:  21.544204711914062
total loss:  20.08189582824707
total loss:  18.80255126953125
total loss:  17.6766357421875
total loss:  16.680313110351562
total loss:  15.793685913085938
total loss:  15.000688552856445
total loss:  14.288226127624512
total loss:  13.645284652709961
total loss:  13.062820434570312
total loss:  12.533195495605469
total loss:  12.049880981445312
total loss:  11.607430458068848
total loss:  11.201164245605469
total loss:  10.826983451843262
total loss:  10.48138427734375
total loss:  10.161420822143555
total loss:  9.864470481872559
total loss:  9.588332176208496
total loss:  9.330954551696777
total loss:  9.090641021728516
total loss:  8.865730285644531
total loss:  8.654861450195312
total loss:  8.456832885742188
total loss:  8.270547866821289
total loss:  8.095032691955566
total loss:  7.929440021514893
total loss:  7.772965431213379
total loss:  7.624858856201172
total loss:  7.4845476150512695
be
Variable containing:
 0.3002  0.1576  0.0690  1.5771 -0.7548  1.8519  0.3851  0.7977  0.3833 -0.5942
[torch.FloatTensor of size 1x10]

sunken
Variable containing:
 0.5422  0.0363  0.2102  0.7738  1.0226  1.4657  0.3946 -0.6255 -0.4813 -0.2855
[torch.FloatTensor of size 1x10]

praise
Variable containing:
-0.7531  0.2487  1.1879 -1.5669 -1.5215 -1.5482 -0.3227  0.5992  2.1116  0.7919
[torch.FloatTensor of size 1x10]

now,
Variable containing:
-0.5165  0.1584  0.2798  0.8650  0.3367  1.0248  1.1647  0.0435  0.9216 -1.4426
[torch.FloatTensor of size 1x10]
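
Once training is done you can use the learned vectors directly, for example to look for words whose embeddings are most similar. Below is a minimal sketch (the most_similar helper is not part of the original code; it assumes model, vocab, word_to_ix and ix_to_word from the script above are still in scope):

def most_similar(word, topk=3):
    # full (|V| x D) embedding matrix learned during training
    weights = model.embeddings.weight.data                 # len(vocab) x EMBEDDING_DIM
    # L2-normalize each row so that a dot product equals cosine similarity
    normed = weights / weights.norm(2, 1, keepdim=True)
    query = normed[word_to_ix[word]].view(1, -1)           # 1 x EMBEDDING_DIM
    sims = normed.mm(query.t()).squeeze()                  # one similarity score per vocabulary word
    _, idx = sims.sort(descending=True)
    # position 0 is the query word itself, so skip it
    return [ix_to_word[int(idx[j])] for j in range(1, topk + 1)]

print(most_similar("beauty"))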

I still have a few questions here:
1. How exactly is torch.nn.Embedding initialized: randomly, or according to some specific scheme?
Reference: http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
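
For what it's worth: judging from the PyTorch source, nn.Embedding seems to initialize its weight randomly from a standard normal distribution N(0, 1) by default, which is easy to check empirically (illustrative snippet):

e = nn.Embedding(10000, 50)
w = e.weight.data
# for a standard normal initialization, the sample mean should be close to 0
# and the sample standard deviation close to 1
print(w.mean(), w.std())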