Preparing the Input

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
# create the model and train it in one call
model = gensim.models.Word2Vec(sentences, min_count=1)

This way of creating the model loads the entire training corpus into memory up front. If the corpus is large, memory can easily blow up.

It turns out gensim provides a solution: the corpus can be streamed in one sentence at a time via an iterator. Here is the official example:

import os

class MySentences:
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # stream one tokenized sentence (one file line) at a time
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

sentences = MySentences('/some/directory')
model = gensim.models.Word2Vec(sentences)

Also, if the corpus needs any preprocessing, it can all be encapsulated inside the MySentences class, as the sketch below shows.
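For example, a minimal sketch of folding preprocessing into the iterator (the lowercasing and the stop-word filter are illustrative assumptions, not anything gensim requires):

import os

class MyPreprocessedSentences:
    def __init__(self, dirname, stopwords=None):
        self.dirname = dirname
        self.stopwords = stopwords or set()  # hypothetical stop-word list

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                # lowercase and drop stop words before yielding the token list
                tokens = line.lower().split()
                yield [t for t in tokens if t not in self.stopwords]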

To load a plain-text corpus, you can also use a function gensim provides directly:

from gensim.models import word2vec
sentences = word2vec.Text8Corpus("a.txt")  # note: the text must already be tokenized
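If the corpus is one pre-tokenized sentence per line, gensim's LineSentence class works the same way (the file name here is just a placeholder):

from gensim.models import word2vec

sentences = word2vec.LineSentence("a.txt")  # one whitespace-separated sentence per line
model = word2vec.Word2Vec(sentences, min_count=1)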

Training

Relevant Parameters

  • min_count: words occurring fewer than min_count times are ignored
model = Word2Vec(sentences, min_count=10)  # default value is 5
  • size: dimensionality of the word vectors
model = Word2Vec(sentences, size=300)  # default value is 100
  • workers: number of worker threads for parallel training
    Note: parallel training only takes effect if Cython is installed first
model = Word2Vec(sentences, workers=4)  # default = 1 worker = no parallelization
  • sg: choose between CBOW and skip-gram (see the sketch after this list)
  • hs: choose between hierarchical softmax and negative sampling
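A sketch combining these two flags; sg=1 selects skip-gram (sg=0 is CBOW), and hs=0 together with negative > 0 selects negative sampling (hs=1 selects hierarchical softmax). The concrete values are illustrative:

from gensim.models import Word2Vec

# skip-gram with negative sampling (5 noise words per positive sample)
model = Word2Vec(sentences, sg=1, hs=0, negative=5)
# CBOW with hierarchical softmax
model = Word2Vec(sentences, sg=0, hs=1)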

Model Evaluation

Because Word2vec is an unsupervised algorithm, it is hard to evaluate directly.
Google has open-sourced about 20,000 semantic and syntactic test cases, in the style of the Word2vec paper's "A - B = C - D" analogy task.
gensim provides an evaluation function for them (the test cases are all in English, so this does not work for Chinese word vectors).
Download Google's test set first, then:

model.accuracy('/tmp/questions-words.txt')
# prints per-section accuracy like the following:
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)

Saving and Loading Models

Saving

# save the full model (can be trained further later)
model.save(file_path)
# save in word2vec format (text or binary)
model.save_word2vec_format('/tmp/mymodel.txt', binary=False)
model.save_word2vec_format('/tmp/mymodel.bin.gz', binary=True)

Loading

# load a full model
new_model = gensim.models.Word2Vec.load(file_path)

# load word2vec-format files by type
model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = gensim.models.KeyedVectors.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)

Note: the word2vec-format save/load (the latter two methods above) drops information such as the vocabulary tree, so a model loaded that way cannot be trained further.

Continued Training

A saved (full) model can continue training on a new corpus:

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)
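Note that in newer gensim versions (>= 1.0) train() requires explicit counts, and any new vocabulary must be registered first with build_vocab(update=True); a sketch under that assumption:

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.build_vocab(more_sentences, update=True)  # fold new words into the vocabulary
model.train(more_sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)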

Using the Model

Word Similarity

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
# output: [('queen', 0.50882536)]
model.doesnt_match("breakfast cereal dinner lunch".split())
# output: 'cereal'
model.similarity('woman', 'man')
# output: 0.73723527

The Word Vector for a Single Word

model.wv['computer']  # raw NumPy vector of a word
# output: array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

Cosine Similarity Between Two Sets of Words

Note: if any word is not in the training vocabulary, this call raises an error (a KeyError)!

list1 = ['我', '走', '我', '学校']
list2 = ['我', '去', '家']
list_sim1 = model.n_similarity(list1, list2)
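A minimal guard against the KeyError mentioned above: filter both lists down to in-vocabulary words before calling n_similarity. The fallback value is an assumption; pick whatever suits your application:

def safe_n_similarity(model, words1, words2):
    # keep only words the model actually knows
    w1 = [w for w in words1 if w in model.wv]
    w2 = [w for w in words2 if w in model.wv]
    if not w1 or not w2:
        return 0.0  # assumed fallback when one side is entirely out-of-vocabulary
    return model.wv.n_similarity(w1, w2)

print(safe_n_similarity(model, list1, list2))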