Suppose we have a collection of emails downloaded from the Internet, each labeled as spam or ham.

This article demonstrates the theory behind a Naive Bayes spam filter and walks through a small Python implementation.

Email Data

There are $n$ labeled emails with corresponding labels $L_i$.

$L_i$ is either spam or ham.

We also have a dictionary of $J$ words.

The indicator $y_{ij}$ records whether word $j$ appears in the $i$-th email.

If word $j$ appears in the $i$-th email, $y_{ij} = 1$.

Otherwise, $y_{ij} = 0$.
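
To make this concrete, here is a minimal sketch of building the feature vector for one email (the dictionary and email text are made up for illustration):

```python
# Minimal sketch: binary word-presence features for one email.
# The dictionary and email text below are made up for illustration.
dictionary = ["free", "meeting", "winner", "report"]

email_text = "free winner ! claim your free prize"
words_in_email = set(email_text.split())

# y[j] = 1 if dictionary word j appears in the email, else 0
y = [1 if word in words_in_email else 0 for word in dictionary]
print(y)  # [1, 0, 1, 0]
```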

Naive Bayes Model

  1. Set $\Pr(L=\text{spam}) = s$ and $\Pr(L=\text{ham}) = 1-s$.
  2. For $j = 1, \dots, J$:
    if $L = \text{ham}$, set $\Pr(Y_j = 1) = p_j$ and $\Pr(Y_j = 0) = 1 - p_j$;
    if $L = \text{spam}$, set $\Pr(Y_j = 1) = q_j$ and $\Pr(Y_j = 0) = 1 - q_j$.

We shall assume that our training pairs $(l_i, y_i)$ are i.i.d. according to the above description.
Moreover, given $L$, all of the $Y_j$ are conditionally independent; this is the "naive" assumption that gives the model its name.
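
To make the generative story concrete, here is a minimal sketch that samples one email's label and word indicators under this model (the parameter values are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary placeholder parameters, chosen only for illustration.
s = 0.4                         # Pr(L = spam)
p = np.array([0.1, 0.6, 0.05])  # Pr(Y_j = 1 | L = ham)
q = np.array([0.7, 0.2, 0.5])   # Pr(Y_j = 1 | L = spam)

# Sample the label first, then each word indicator independently given it.
label = "spam" if rng.random() < s else "ham"
probs = q if label == "spam" else p
y = (rng.random(len(probs)) < probs).astype(int)
print(label, y)
```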

Parameter Learning

We need to estimate the parameters

$$\theta = \{s, p_1, \dots, p_J, q_1, \dots, q_J\}$$

To learn the parameters, we find the $\theta$ that maximizes the likelihood:

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} P_{L,Y_1,\dots,Y_J}(l_i, y_{i1}, \dots, y_{iJ}; \theta)$$

Because the samples are i.i.d., the joint likelihood

$$P(l_1, y_{11}, \dots, y_{1J}, \dots, l_n, y_{n1}, \dots, y_{nJ}; \theta)$$

factorizes over the samples, and each factor further splits by the conditional independence of the $Y_j$:

$$\prod_{i=1}^{n} P_{L,Y_1,\dots,Y_J}(l_i, y_{i1}, \dots, y_{iJ}; \theta) = \prod_{i=1}^{n} \left[ P_L(l_i; \theta) \prod_{j=1}^{J} P_{Y_j \mid L}(y_{ij} \mid l_i; \theta) \right]$$

Given $\theta$, the individual factors are:

$$P_L(l; \theta) = s^{\mathbf{1}\{l=\text{spam}\}} (1-s)^{\mathbf{1}\{l=\text{ham}\}}$$

If $L = \text{ham}$:

$$P_{Y_j \mid L}(y_j \mid l; \theta) = p_j^{y_j} (1-p_j)^{1-y_j}$$

If $L = \text{spam}$:

$$P_{Y_j \mid L}(y_j \mid l; \theta) = q_j^{y_j} (1-q_j)^{1-y_j}$$

Taking the logarithm turns the products into sums:

$$\log\left( \prod_{i=1}^{n} \left[ P_L(l_i; \theta) \prod_{j=1}^{J} P_{Y_j \mid L}(y_{ij} \mid l_i; \theta) \right] \right) = \sum_{i=1}^{n} \log P_L(l_i; \theta) + \sum_{i=1}^{n} \sum_{j=1}^{J} \log P_{Y_j \mid L}(y_{ij} \mid l_i; \theta)$$

In the above equation, the first term expands as:

$$\sum_{i=1}^{n} \log P_L(l_i; \theta) = \left( \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{spam}\} \right) \log(s) + \left( \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{ham}\} \right) \log(1-s)$$

$$= A \log(s) + B \log(1-s)$$

where, for simplification, we define $A = \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{spam}\}$ and $B = \sum_{i=1}^{n} \mathbf{1}\{l_i=\text{ham}\}$.

Then, we simplify the second term. From here on, the superscript $(i)$ indexes the $i$-th sample; it is not an exponent.

$$\sum_{i=1}^{n} \sum_{j=1}^{J} \log P_{Y_j \mid L}(y_j^{(i)} \mid l^{(i)}; \theta)$$

$$= \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\} \sum_{j=1}^{J} \log\left( p_j^{y_j^{(i)}} (1-p_j)^{1-y_j^{(i)}} \right) + \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\} \sum_{j=1}^{J} \log\left( q_j^{y_j^{(i)}} (1-q_j)^{1-y_j^{(i)}} \right)$$

$$= \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\} \sum_{j=1}^{J} \left[ y_j^{(i)} \log(p_j) + (1-y_j^{(i)}) \log(1-p_j) \right] + \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\} \sum_{j=1}^{J} \left[ y_j^{(i)} \log(q_j) + (1-y_j^{(i)}) \log(1-q_j) \right]$$

Swapping the order of summation and grouping the terms by word $j$:

$$= \sum_{j=1}^{J} \left[ \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, y_j^{(i)} \right) \log(p_j) + \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, (1-y_j^{(i)}) \right) \log(1-p_j) + \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, y_j^{(i)} \right) \log(q_j) + \left( \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, (1-y_j^{(i)}) \right) \log(1-q_j) \right]$$

For simplification, we define the per-word counts:

$$A_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, y_j^{(i)}$$

$$B_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\}\, (1-y_j^{(i)})$$

$$C_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, y_j^{(i)}$$

$$D_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\}\, (1-y_j^{(i)})$$

Then, for every $j$, we have:

$$A_j + B_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{ham}\} = B, \qquad C_j + D_j = \sum_{i=1}^{n} \mathbf{1}\{l^{(i)}=\text{spam}\} = A$$

Finding the Optimal Values

Both the first term and each bracketed group above have the same form as the function

$$f(x) = M \ln(x) + N \ln(1-x)$$

whose first derivative is

$$f'(x) = \frac{M}{x} - \frac{N}{1-x} = \frac{M - Mx - Nx}{x(1-x)}$$

To find the maximizer of $f$, we set

$$f'(x) = 0$$

which gives

$$\hat{x} = \frac{M}{M+N}$$
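
As a quick sanity check, we can verify this optimum symbolically (a minimal sketch using sympy):

```python
import sympy as sp

x, M, N = sp.symbols("x M N", positive=True)
f = M * sp.log(x) + N * sp.log(1 - x)

# Solve f'(x) = 0; the unique critical point is the maximizer.
print(sp.solve(sp.diff(f, x), x))  # [M/(M + N)]
```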

So, we can apply this result to each term of the log-likelihood to obtain the maximum-likelihood estimates of $\theta$:

$$\hat{s} = \frac{A}{A+B}$$

$$\hat{p}_j = \frac{A_j}{A_j+B_j}$$

$$\hat{q}_j = \frac{C_j}{C_j+D_j}$$
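
In code, these estimates reduce to simple counting. Here is a minimal sketch, assuming the training data is already available as a binary feature matrix Y (one row per email) and a list of labels:

```python
import numpy as np

def estimate_parameters(Y, labels):
    """Maximum-likelihood estimates of s, p_j, and q_j.

    Y      : (n, J) array of 0/1 word-presence indicators
    labels : length-n sequence of 'spam' / 'ham' strings
    """
    labels = np.asarray(labels)
    spam_mask = labels == "spam"
    ham_mask = ~spam_mask

    s_hat = spam_mask.mean()           # A / (A + B)
    p_hat = Y[ham_mask].mean(axis=0)   # A_j / (A_j + B_j)
    q_hat = Y[spam_mask].mean(axis=0)  # C_j / (C_j + D_j)
    return s_hat, p_hat, q_hat
```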

Prediction

Given a new email with word indicators $y_1, y_2, \dots, y_J$, the probability that it is spam follows from Bayes' rule:

$$P_{L \mid Y_1,\dots,Y_J}(l \mid y_1, \dots, y_J) = \frac{P_L(l)\, P_{Y_1,\dots,Y_J \mid L}(y_1, \dots, y_J \mid l)}{P_{Y_1,\dots,Y_J}(y_1, \dots, y_J)}$$

$$P_{L \mid Y_1,\dots,Y_J}(\text{spam} \mid y_1, \dots, y_J) = \frac{s \prod_{j=1}^{J} q_j^{y_j} (1-q_j)^{1-y_j}}{K}$$

$$P_{L \mid Y_1,\dots,Y_J}(\text{ham} \mid y_1, \dots, y_J) = \frac{(1-s) \prod_{j=1}^{J} p_j^{y_j} (1-p_j)^{1-y_j}}{K}$$

where $K = s \prod_{j=1}^{J} q_j^{y_j} (1-q_j)^{1-y_j} + (1-s) \prod_{j=1}^{J} p_j^{y_j} (1-p_j)^{1-y_j}$.

So, we can determine whether an email is spam or ham from the posterior ratio:

$$Z = \frac{s}{1-s} \cdot \frac{\prod_{j=1}^{J} q_j^{y_j} (1-q_j)^{1-y_j}}{\prod_{j=1}^{J} p_j^{y_j} (1-p_j)^{1-y_j}}$$

If $Z \geq 1$, the email is classified as spam. However, multiplying many small probabilities underflows floating-point arithmetic, so in numerical calculation we should use the logarithm instead:

$$\ln(Z) = \ln(s) - \ln(1-s) + \sum_{j=1}^{J} \left[ y_j \left( \ln(q_j) - \ln(p_j) \right) + (1-y_j) \left( \ln(1-q_j) - \ln(1-p_j) \right) \right]$$

If $\ln(Z) \geq 0$, the email is classified as spam; otherwise, it is ham.
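
Here is a minimal, self-contained sketch of this log-space decision rule (the parameter arrays are assumed to come from the estimates above):

```python
import numpy as np

def classify(y, s, p, q):
    """Return 'spam' if ln(Z) >= 0 for feature vector y, else 'ham'.

    y    : length-J array of 0/1 word indicators
    s    : estimated Pr(L = spam)
    p, q : length-J arrays of Pr(Y_j = 1 | ham) and Pr(Y_j = 1 | spam)
    """
    y = np.asarray(y)
    log_Z = (np.log(s) - np.log(1 - s)
             + np.sum(y * (np.log(q) - np.log(p))
                      + (1 - y) * (np.log(1 - q) - np.log(1 - p))))
    return "spam" if log_Z >= 0 else "ham"
```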

Laplace Smoothing

Have we finished? Not quite. There is a corner case: if a word never appears in the training samples of one class, its estimated probability is exactly 0, and $\ln(Z)$ blows up to $\pm\infty$. To handle this problem, we use a Laplace smoothing coefficient $\mathrm{lap}$.
Then the parameters become:

$$\hat{s} = \frac{A + \mathrm{lap}}{A + B + 2\,\mathrm{lap}}$$

$$\hat{p}_j = \frac{A_j + \mathrm{lap}}{A_j + B_j + 2\,\mathrm{lap}}$$

$$\hat{q}_j = \frac{C_j + \mathrm{lap}}{C_j + D_j + 2\,\mathrm{lap}}$$
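
With $\mathrm{lap} = 1$ this is the classic add-one smoothing, and it is exactly what get_log_probabilities does in the code below. A minimal sketch of the smoothed estimate itself:

```python
def smoothed_estimate(count, total, lap=1.0):
    """Laplace-smoothed estimate of a Bernoulli parameter.

    count : number of files in the class containing the word
    total : number of files in the class
    """
    return (count + lap) / (total + 2 * lap)

print(smoothed_estimate(0, 100))  # ~0.0098, never exactly 0
```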

File Arrangement

There are two .py files and one data directory in the current workspace.

The data directory contains three subdirectories: ham, spam, and testing.

Each of the three subdirectories contains one plain-text file per email.

Within each email file, all words are separated by a space, including punctuation.

Code

naivebayes.py


import sys
import os.path
import collections
import math
import util
import numpy as np

USAGE = "%s <test data folder> <spam folder> <ham folder>"

def get_counts(file_list):
    """ Computes counts for each word that occurs in the files in file_list. Inputs ------ file_list : a list of filenames, suitable for use with open() or util.get_words_in_file() Output ------ A dict whose keys are words, and whose values are the number of files the key occurred in. """
    words = []
    for filename in file_list:
        words.extend(list(set(util.get_words_in_file(filename))))
    counter = collections.Counter(words)
    return counter


def get_log_probabilities(file_list):
    """ Computes log-frequencies for each word that occurs in the files in file_list. Input ----- file_list : a list of filenames, suitable for use with open() or util.get_words_in_file(). Output ------ A dict whose keys are words, and whose values are the log of the smoothed estimate of the fraction of files the key occurred in. Hint ---- The data structure util.DefaultDict will be useful to you here, as will the get_counts() helper above. """
    counter = get_counts(file_list)
    num_files = len(file_list)
    # Laplace smoothing with lap = 1: (count + 1) / (num_files + 2)
    for key in counter:
        counter[key] = math.log((counter[key] + 1) / (num_files + 2))
    return counter



def learn_distributions(file_lists_by_category):
    """ Input ----- A two-element list. The first element is a list of spam files, and the second element is a list of ham (non-spam) files. Output ------ (log_probabilities_by_category, log_prior) log_probabilities_by_category : A list whose first element is a smoothed estimate for log P(y=w_j|c=spam) (as a dict, just as in get_log_probabilities above), and whose second element is the same for c=ham. log_prior_by_category : A list of estimates for the log-probabilities for each class: [est. for log P(c=spam), est. for log P(c=ham)] """
    spam_file_list, ham_file_list = file_lists_by_category
    spam_counter = get_log_probabilities(spam_file_list)
    length_spam = len(spam_file_list)
    length_ham = len(ham_file_list)
    ham_counter = get_log_probabilities(ham_file_list)
    all_set = spam_counter.keys() | ham_counter.keys()
    for word in all_set:
        if word not in spam_counter:
            spam_counter[word] = math.log(1.0/(length_spam+2)) # smooth
        if word not in ham_counter:
            ham_counter[word] = math.log(1.0/(length_ham+2)) # smooth

    n_total = length_spam + length_ham
    return ([spam_counter, ham_counter],
            [math.log(length_spam*1.0/n_total), math.log(length_ham*1.0/n_total)])

def classify_email(email_filename, log_probabilities_by_category, log_prior_by_category):
    """ Uses Naive Bayes classification to classify the email in the given file. Inputs ------ email_filename : name of the file containing the email to be classified log_probabilities_by_category : See output of learn_distributions log_prior_by_category : See output of learn_distributions Output ------ One of the labels in names. """
    words = set(util.get_words_in_file(email_filename))
    spam_counter, ham_counter = log_probabilities_by_category
    # Start from the log-priors estimated from the training data.
    prob_log_spam, prob_log_ham = log_prior_by_category
    spam_log_sum = prob_log_spam
    ham_log_sum = prob_log_ham
    # Each dictionary word contributes log Pr(word present | class) if the
    # email contains it, and log Pr(word absent | class) otherwise.
    for word in spam_counter:
        if word in words:
            spam_log_sum += spam_counter[word]
        else:
            spam_log_sum += math.log(1 - math.exp(spam_counter[word]))
    for word in ham_counter:
        if word in words:
            ham_log_sum += ham_counter[word]
        else:
            ham_log_sum += math.log(1 - math.exp(ham_counter[word]))

    if spam_log_sum >= ham_log_sum:
        return 'spam'
    else:
        return 'ham'

def classify_emails(spam_files, ham_files, test_files):
    '''
    Computes the label of each email in test_files.

    Returns a list such as ['spam', 'ham', 'spam', 'ham', ...].
    '''
    log_probabilities_by_category, log_prior = \
        learn_distributions([spam_files, ham_files])
    estimated_labels = []
    for test_file in test_files:
        estimated_label = \
            classify_email(test_file, log_probabilities_by_category, log_prior)
        estimated_labels.append(estimated_label)
    return estimated_labels

def main():
    '''
    usage: $ python naivebayes.py data/testing/ data/spam/ data/ham/
    '''
    ### Read arguments
    if len(sys.argv) != 4:
        print(USAGE % sys.argv[0])
        sys.exit(1)
    testing_folder = sys.argv[1]
    (spam_folder, ham_folder) = sys.argv[2:4]

    ### Learn the distributions
    file_lists = []
    for folder in (spam_folder, ham_folder):
        file_lists.append(util.get_files_in_folder(folder))
    (log_probabilities_by_category, log_priors_by_category) = \
            learn_distributions(file_lists)

    # Here, columns and rows are indexed by 0 = 'spam' and 1 = 'ham'
    # rows correspond to true label, columns correspond to guessed label
    performance_measures = np.zeros([2, 2])

    ### Classify and measure performance
    for filename in util.get_files_in_folder(testing_folder):
        ## Classify
        label = classify_email(filename,
                               log_probabilities_by_category,
                               log_priors_by_category)
        ## Measure performance
        # Use the filename to determine the true label
        base = os.path.basename(filename)
        true_index = ('ham' in base)
        guessed_index = (label == 'ham')
        performance_measures[true_index, guessed_index] += 1


        # Uncomment this line to see which files your classifier
        # gets right/wrong:
        # print("%s : %s" %(label, filename))

    template = "You correctly classified %d out of %d spam emails, and %d out of %d ham emails."
    # Correct counts are on the diagonal
    correct = np.diag(performance_measures)
    # totals are obtained by summing across guessed labels
    totals = np.sum(performance_measures, 1)
    print(template % (correct[0],
                      totals[0],
                      correct[1],
                      totals[1]))

if __name__ == '__main__':
    main()

util.py

import os

def get_words_in_file(filename):
    """ Returns a list of all words in the file at filename. """
    with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
        # read() reads in a string from a file pointer, and split() splits a
        # string into words based on whitespace
        words = f.read().split()
    return words

def get_files_in_folder(folder):
    """ Returns a list of files in folder (including the path to the file) """
    filenames = os.listdir(folder)
    # os.path.join combines paths while dealing with /s and \s appropriately
    full_filenames = [os.path.join(folder, filename) for filename in filenames]
    return full_filenames