Implementing an LDA Topic Model with gensim: A Hands-On Case Study (Analyzing the Topics of Hillary Clinton's Emails)

Dataset download: https://download.csdn.net/download/qq_41185868/10963668

Step 1: Load the required libraries. We use the LDA model from gensim, so the gensim library must be installed.

import pandas as pd
import re
from gensim.models import ldamodel  # doc2vec was imported in the original but never used
from gensim import corpora

Step 2: Take a look at the dataset. It has 20 features, and we keep only two of them: Id and the email body (ExtractedBodyText).

if __name__ == '__main__':
    # Load the data
    df = pd.read_csv('./data/HillaryEmails.csv')
    df = df[['Id', 'ExtractedBodyText']].dropna()  # drop every row that has a missing value in either column
    print(df.head())
    print(df.shape)   # (6742, 2)

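About the code above: we load the dataset with pandas, select the two columns, and drop any row that contains a missing value. Then we print the first five rows and the shape of the data.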

Step 3: Clean the data

The data clearly contains a lot of noise such as email addresses, dates, and times. None of it helps topic generation, so we have to remove it.

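def clean_email_text(text):
    # Clean one email body
    text = text.replace('\n', " ")  # newlines are not needed
    # URLs carry no topical meaning. The original pattern was a JavaScript-style
    # /.../i literal that does not match URLs in Python, so a simpler pattern is used.
    # It runs first, before hyphens are split, so URLs are removed whole.
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    text = re.sub(r"[\w.]+@[\w.]+", " ", text)  # email addresses, no meaning
    text = re.sub(r"-", " ", text)  # split hyphenated words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates mean nothing to a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-5][0-9]", "", text)  # times, no meaning
    # Drop any remaining special characters (digits, punctuation, ...):
    # keep only letters and spaces
    pure_text = ''.join(letter for letter in text if letter.isalpha() or letter == ' ')
    # Removing special characters can leave stray single letters behind.
    # Keep only words that are at least two characters long.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text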

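The function mainly strips special characters and then drops words shorter than two characters, which contribute nothing to topic generation. Remember: LDA is a bag-of-words model, so it has no information about word order or context.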
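To make the effect of the cleaning rules concrete, here is a minimal sketch that applies the same kind of rules to a single made-up string. The function name strip_noise, the URL pattern, and the sample text are assumptions for illustration only; they are not part of the dataset.

```python
import re

# Assumed, deliberately simple URL pattern: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def strip_noise(text):
    """Remove URLs, email addresses, dates and times, then keep letters only."""
    text = text.replace("\n", " ")
    text = URL_PATTERN.sub(" ", text)                    # URLs first, while still intact
    text = re.sub(r"[\w.]+@[\w.]+", " ", text)           # email addresses
    text = re.sub(r"\d+/\d+/\d+", " ", text)             # dates such as 11/05/2015
    text = re.sub(r"[0-2]?[0-9]:[0-5][0-9]", " ", text)  # times such as 10:30
    text = text.replace("-", " ")                        # july-edu -> july edu
    letters = "".join(ch for ch in text if ch.isalpha() or ch == " ")
    # Drop single-letter leftovers
    return " ".join(w for w in letters.split() if len(w) > 1)

sample = ("I will call at 10:30 on 11/05/2015, see http://example.com/a-b "
          "or mail jane.doe@example.org re: july-edu")
cleaned = strip_noise(sample)
print(cleaned)  # will call at on see or mail re july edu
```

Note that the order of the rules matters: if hyphens were split before URLs were removed, a hyphenated URL would be cut in two and only partly removed.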

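We call the function above from main to clean the data, which adds five lines of code. The last step, docs.values, pulls the cleaned texts out of the table as a plain array: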

if __name__ == '__main__':
    # Load the data
    df = pd.read_csv('./data/HillaryEmails.csv')
    df = df[['Id', 'ExtractedBodyText']].dropna()  # drop every row that has a missing value in either column
    print(df.head())
    print(df.shape)   # (6742, 2)

    # Newly added code
    docs = df['ExtractedBodyText']   # get the email bodies
    docs = docs.apply(clean_email_text)   # clean each email

    print(docs.head(1).values)
    doclist = docs.values   # pull the contents out as an array
    print(docs)

Note: each row is one email.

Step 4: Further cleaning: remove stop words

We have a local stop-word list. We load it first and then remove the stop words from the texts.

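def remove_stopword():
    # Despite its name, this function only loads the stop-word list (one word per line).
    # A set is used so that the membership test in main is fast.
    stopword = set()
    with open('./data/stop_words.utf8', 'r', encoding='utf8') as f:
        for line in f:
            word = line.strip()
            if word:
                stopword.add(word)
    return stopword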
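As a small self-contained illustration of loading a stop-word file, the sketch below writes a tiny made-up file to a temporary directory and reads it back into a set. The name load_stopwords and the file contents are hypothetical; the real stop_words.utf8 file is not reproduced here.

```python
import os
import tempfile

def load_stopwords(path):
    """Read one stop word per line into a set, ignoring blank lines."""
    with open(path, "r", encoding="utf8") as f:
        return {line.strip() for line in f if line.strip()}

# Write a tiny, made-up stop-word file so the example runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stop_words.utf8")
    with open(path, "w", encoding="utf8") as f:
        f.write("the\nto\n\nof \n")
    stop_words = load_stopwords(path)

print(sorted(stop_words))  # ['of', 'the', 'to']
```

Stripping each line also removes trailing spaces and carriage returns, which would otherwise make a stop word silently fail to match.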

Call it from main by adding the following code:

    stop_word = remove_stopword()

    texts = [[word for word in doc.lower().split() if word not in stop_word] for doc in doclist]
    print(texts[0])  # what the first text looks like now

This step not only removes the stop words but also splits each sentence into tokens, because gensim's LDA expects tokenized input. Here we print only the first text.
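The combined lowercasing, tokenizing, and stop-word filtering can be seen on a toy input. The stop-word set and the two documents below are made up for illustration.

```python
# Hypothetical stop-word list and documents, for illustration only.
stop_words = {"the", "to", "of", "and", "you", "is"}
docs = ["Thank you to all of you", "The meeting is moved to Friday"]

# Lowercase, split on whitespace, and drop stop words in one pass per document
texts = [[w for w in doc.lower().split() if w not in stop_words] for doc in docs]
print(texts)  # [['thank', 'all'], ['meeting', 'moved', 'friday']]
```

Each document becomes a list of tokens, which is the shape that the dictionary and bag-of-words steps expect.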

Step 5: Prepare the model and train it

Add the following code directly to main:

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print(corpus[0])  # [(36, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

    lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
    print(lda.print_topic(10, topn=5))  # the five most important words of the topic with id 10 (ids start at 0)
    print(lda.print_topics(num_topics=20, num_words=5))  # print all topics to have a look

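A few words about the code above. The first line builds the vocabulary: it collects the words from all documents, removes duplicates, and assigns each remaining word an integer id.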

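The second line converts each text into (word id, count) pairs. As the comment on the third line shows, (36, 1) means that the document contains word number 36 and that this word occurs once in it.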
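The idea behind the vocabulary and the (id, count) pairs can be sketched in plain Python. This is not gensim's implementation: the helper names build_vocab and to_bow are hypothetical, and numbering ids by first appearance is an assumption of this sketch, so the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

texts = [["budget", "meeting", "budget"], ["meeting", "friday"]]
token2id = build_vocab(texts)
bows = [to_bow(t, token2id) for t in texts]

print(token2id)  # {'budget': 0, 'meeting': 1, 'friday': 2}
print(bows)      # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is lost in this representation, which is exactly the bag-of-words assumption LDA relies on. Words that are not in the vocabulary are simply skipped, which matters later when a new email is converted.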

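Next we build and train the LDA model. Note the arguments: the first two need no explanation, and the last one, num_topics, is the number of topics we want the model to produce, just as k-means requires the number of clusters up front. We then print the five words that best represent the topic with id 10 (topic ids start at 0), followed by the top words of all 20 topics.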

The second print statement outputs a list covering all topics; only part of it was captured for this post.

Step 6: Save the model for later use

    # Save the model
    lda.save('zhutimoxing.model')

Step 7: Load the model and, given a new email, let it decide which topic the email belongs to

New email text: 'I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H'

    # Load the model
    lda = ldamodel.LdaModel.load('zhutimoxing.model')

    # New data: infer its topics (reuses stop_word and dictionary from the steps above)
    text = 'I was greeted by this heartwarming display on the corner of my street today. ' \
           'Thank you to all of you who did this. Happy Thanksgiving. -H'
    text = clean_email_text(text)
    texts = [word for word in text.lower().split() if word not in stop_word]
    bow = dictionary.doc2bow(texts)
    print(lda.get_document_topics(bow))  # probabilities for three topics: [(4, 0.6081926), (11, 0.1473181), (12, 0.13814318)]

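Final output: in the run captured for this post, the email most likely belongs to topic 17, with a probability of 0.5645. The numbers in the code comment above (topic 4 at about 0.608) presumably come from a different run. LDA training is randomized, so topic ids and probabilities change from one training run to the next unless a random seed is fixed.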
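Picking the most likely topic from the result is a one-liner, since the result is a list of (topic id, probability) pairs. A minimal sketch, using the pairs from the comment in the script above as sample input:

```python
# Topic distribution copied from the comment in the script above.
doc_topics = [(4, 0.6081926), (11, 0.1473181), (12, 0.13814318)]

# The dominant topic is the pair with the largest probability.
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
print(best_topic, round(best_prob, 4))  # 4 0.6082
```

The listed probabilities do not sum to 1 because get_document_topics leaves out topics whose probability falls below its minimum_probability threshold.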

Full code:

"""
@file   : 010-希拉里邮件进行主题建立之主题模型.py
@author : xiaolu
@time1  : 2019-05-11
"""
import pandas as pd
import re
from gensim.models import ldamodel  # numpy and doc2vec were imported in the original but never used
from gensim import corpora



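def clean_email_text(text):
    # Clean one email body
    text = text.replace('\n', " ")  # newlines are not needed
    # URLs carry no topical meaning. The original pattern was a JavaScript-style
    # /.../i literal that does not match URLs in Python, so a simpler pattern is used.
    # It runs first, before hyphens are split, so URLs are removed whole.
    text = re.sub(r"(?:https?://|www\.)\S+", " ", text)
    text = re.sub(r"[\w.]+@[\w.]+", " ", text)  # email addresses, no meaning
    text = re.sub(r"-", " ", text)  # split hyphenated words (e.g. july-edu ==> july edu)
    text = re.sub(r"\d+/\d+/\d+", "", text)  # dates mean nothing to a topic model
    text = re.sub(r"[0-2]?[0-9]:[0-5][0-9]", "", text)  # times, no meaning
    # Drop any remaining special characters (digits, punctuation, ...):
    # keep only letters and spaces
    pure_text = ''.join(letter for letter in text if letter.isalpha() or letter == ' ')
    # Removing special characters can leave stray single letters behind.
    # Keep only words that are at least two characters long.
    text = ' '.join(word for word in pure_text.split() if len(word) > 1)
    return text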


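def remove_stopword():
    # Despite its name, this function only loads the stop-word list (one word per line).
    # A set is used so that the membership test in main is fast.
    stopword = set()
    with open('./data/stop_words.utf8', 'r', encoding='utf8') as f:
        for line in f:
            word = line.strip()
            if word:
                stopword.add(word)
    return stopword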


if __name__ == '__main__':
    # Load the data
    df = pd.read_csv('./data/HillaryEmails.csv')
    df = df[['Id', 'ExtractedBodyText']].dropna()  # drop every row that has a missing value in either column
    print(df.head())
    print(df.shape)   # (6742, 2)

    docs = df['ExtractedBodyText']   # get the email bodies
    docs = docs.apply(clean_email_text)   # clean each email

    print(docs.head(1).values)
    doclist = docs.values   # pull the contents out as an array
    print(docs)

    stop_word = remove_stopword()

    texts = [[word for word in doc.lower().split() if word not in stop_word] for doc in doclist]
    print(texts[0])  # what the first text looks like now

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print(corpus[0])  # [(36, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

    lda = ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
    print(lda.print_topic(10, topn=5))  # the five most important words of the topic with id 10 (ids start at 0)
    print(lda.print_topics(num_topics=20, num_words=5))  # print all topics to have a look

    # Save the model
    lda.save('zhutimoxing.model')

    # Load the model
    lda = ldamodel.LdaModel.load('zhutimoxing.model')

    # New data: infer its topics
    text = 'I was greeted by this heartwarming display on the corner of my street today. ' \
           'Thank you to all of you who did this. Happy Thanksgiving. -H'
    text = clean_email_text(text)
    texts = [word for word in text.lower().split() if word not in stop_word]
    bow = dictionary.doc2bow(texts)
    print(lda.get_document_topics(bow))  # probabilities for three topics: [(4, 0.6081926), (11, 0.1473181), (12, 0.13814318)]