Assignment 4: Spam classification using Naïve Bayes
by Cleo Matzken and Matthieu Moutot - Group 80 (~12 hrs each)
Preprocessing
First, let's import several libraries that will be used throughout our project.
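A minimal set of imports covering the steps below could look like this (the exact list depends on the code that follows):

```python
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, confusion_matrix
```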
We can import the data with Sklearn's load_files function. We will work with two data sets: one containing the easy-ham and spam emails, the other containing the hard-ham and spam emails.
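As a sketch, assuming each set sits in a directory with one subfolder per class (load_files uses subfolder names as labels; the paths below are placeholders):

```python
# Each directory is assumed to contain e.g. an easy_ham/ and a spam/ subfolder;
# load_files reads every file and labels it with its subfolder's name.
easy_set = load_files("data/easy_ham_vs_spam", encoding="latin-1")
hard_set = load_files("data/hard_ham_vs_spam", encoding="latin-1")
```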
Then, we use the train_test_split method to split each set into training and testing data. To picture how the data is now organized, the X-sets contain all the emails (both ham and spam, the .data), and the Y-sets contain the label saying whether each email is ham or spam (the .target).
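A possible split, assuming a 75/25 train/test ratio (the actual proportion used in the notebook may differ):

```python
# X_* hold the raw email texts (.data), y_* the ham/spam labels (.target)
X_easy_train, X_easy_test, y_easy_train, y_easy_test = train_test_split(
    easy_set.data, easy_set.target, test_size=0.25, random_state=0)
X_hard_train, X_hard_test, y_hard_train, y_hard_test = train_test_split(
    hard_set.data, hard_set.target, test_size=0.25, random_state=0)
```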
Classification with Naïve Bayes
We are now ready to write a program that uses the NB classifiers from Sklearn to label the emails as spam or ham. We will use two functions: the main one, 'classify', transforms the emails from text to vectors, trains the classifier and then makes a prediction on the test data; the other one, 'stats', analyzes the results to see how accurate the prediction was.
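A possible shape for these two functions (a sketch, not necessarily the exact implementation used in the notebook):

```python
def classify(clf, X_train, X_test, y_train, binary=False, **vec_kwargs):
    """Vectorize the emails, train the given classifier, and predict the test set."""
    vectorizer = CountVectorizer(binary=binary, **vec_kwargs)
    X_train_vec = vectorizer.fit_transform(X_train)  # learn the vocabulary on the training data
    X_test_vec = vectorizer.transform(X_test)        # reuse the same vocabulary for the test data
    clf.fit(X_train_vec, y_train)
    return clf.predict(X_test_vec)

def stats(y_test, y_pred):
    """Report how accurate the prediction was."""
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(confusion_matrix(y_test, y_pred))  # rows = true class, columns = predicted class
```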
Discuss the differences between these two classifiers
In order to understand the difference between the Multinomial and Bernoulli classifiers, we should first look at how CountVectorizer works. This tool transforms text emails into vectors associating every distinct word with its frequency count. In other words, a text input such as "A B C C C" would be transformed into a vector like [A B C; 1 1 3]. The CountVectorizer is run on the whole training set and creates a sparse matrix whose rows are email indices, whose columns are the distinct words found, and whose cells hold the frequency count of a given word in a given email.
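For instance (note that sklearn's default tokenizer only keeps words of two or more characters, so real words are used here instead of single letters):

```python
vec = CountVectorizer()
counts = vec.fit_transform(["free money money money now"])
print(vec.get_feature_names_out())  # ['free' 'money' 'now']
print(counts.toarray())             # [[1 3 1]]
```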
Then, the Multinomial Naïve Bayes classifier takes into account the frequency of every word in the data set and estimates, for each word, the probability of observing it given that the email is spam, and given that it is a real email. For a new email, the prediction is then made by multiplying the conditional probabilities of its words (equivalently, summing their logarithms) under each class, and choosing the class with the greater score.
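Written out, the standard Multinomial NB decision rule, with n_w the count of word w in the email d, is:

```latex
\hat{c} = \arg\max_{c \in \{\text{spam},\,\text{ham}\}}
          \Big( \log P(c) + \sum_{w \in d} n_w \log P(w \mid c) \Big)
```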
To work with the Bernoulli classifier, we have to activate the 'binary' parameter of the CountVectorizer method. The main difference from the Multinomial case is that the count for each word can only be 0 or 1: whether a given word is present 100 times in an email or only once makes no difference. Thus, we can expect the Bernoulli classifier to give less accurate results, since we are losing the information related to the frequency of every word.
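The difference boils down to one vectorizer flag (BernoulliNB itself also binarizes its input through its binarize parameter, which defaults to 0.0):

```python
count_vec = CountVectorizer()              # Multinomial setup: raw word counts
binary_vec = CountVectorizer(binary=True)  # Bernoulli setup: counts clipped to 0/1 (presence/absence)
```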
Application on data sets
We are now ready to run our program and see how accurate its prediction is. We will first consider the set with easy-hams and spams, and then the hard-ham versus spam one.
Test on easy-ham vs spam data
We run the program with the Multinomial classifier.
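Using the helper functions sketched earlier (names are illustrative):

```python
y_pred = classify(MultinomialNB(), X_easy_train, X_easy_test, y_easy_train)
stats(y_easy_test, y_pred)
```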
Now with the Bernoulli classifier.
As expected, the Multinomial classifier gives more accurate results overall. We can also observe that both classifiers have more trouble identifying spam than ham: more spam emails are classified as real ones than real emails as spam. This is of course an issue for the end user, but it would be even more annoying to have real emails filtered out.
Test on hard-ham vs spam data
Let's see how our program performs with the hard-ham versus spam data set. We first use it with the Multinomial classifier.
And then with the Bernoulli one.
Here again, the Multinomial classifier yields better results, as expected. However, in this case, both classifiers label far more real emails as spam than spam emails as real ones. This can be explained by the sizes of the folders: easy-ham contains 2551 emails, hard-ham 250, and spam 500. In the first data set (easy-ham vs spam), there is a lot of ham data, so the classifier is well trained to recognize ham, which explains more errors when identifying spam. Conversely, in the hard-ham vs spam set, there is more spam data than ham data, so the classifier is better at identifying spam, hence more errors when dealing with hard-hams.
Filtering useless words
We might get a more efficient classification by filtering out certain uninformative words. Indeed, common words such as 'and', 'the' or 'is' are very likely to appear in both spam and real emails, and they do not carry any useful meaning. By removing them, we give more weight to 'useful' words: for a given word that remains, its frequency count is divided by a smaller total (since there are fewer words left in the ham or spam category), so its conditional probability given spam/ham increases. The same reasoning applies to very uncommon words, which appear at such a low frequency that no relevant insight can be drawn from them.
There are several options for filtering such words: the first would be to provide a custom list of very common/uncommon words such as 'from', 'return', 'received', 'content', 'subject', 'mail', 'to', 'and', 'list', 'localhost', 'by', 'for', 'http', the dates, and words appearing fewer than 3-4 times in total.
We could also use the 'stop_words' parameter of CountVectorizer with the built-in 'english' list, which automatically removes a set of words commonly used in English. However, this option could remove certain words that are meaningful in our case, and it lets through other words that would be worth removing.
Thus, we decide to use the max_df and min_df parameters: they automatically ignore words whose document frequency is greater/smaller than a given threshold. Let's implement this in our program and see if we get better results.
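With the thresholds reported below (70% and 0.5% document frequency), the filtered run could look like this; floats passed to max_df/min_df are interpreted as proportions of documents:

```python
# Ignore words appearing in more than 70% or fewer than 0.5% of the training emails,
# passing the thresholds through to CountVectorizer via the classify sketch above.
y_pred = classify(MultinomialNB(), X_easy_train, X_easy_test, y_easy_train,
                  max_df=0.7, min_df=0.005)
stats(y_easy_test, y_pred)
```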
We can see that if we remove every word with a document frequency above 70% or below 0.5%, the accuracy of every classifier improves: +2.13 pts for Multinomial - Easy; +8.62 pts (!) for Bernoulli - Easy; +2.66 pts for Multinomial - Hard; +0.66 pts for Bernoulli - Hard. The filtering is probably more effective on the easy-ham vs spam data set because its number of emails is much larger than in the hard data set (3051 vs 750), so the vocabulary remains quite large even after removing some words.
Note: this way of filtering words is relevant here because the classifiers do not take into account the order in which the words appear - they have no notion of a sentence and its meaning.
Filtering headers & footers
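One simple way to remove the headers is to cut each message at the first blank line, which in the standard email format (RFC 822) separates the header block from the body. A minimal sketch (the function name and splitting rule are illustrative, and mailing-list footers would need their own delimiter-based rule):

```python
def strip_header(email_text):
    # The RFC 822 header block ends at the first blank line;
    # everything after it is the message body.
    parts = email_text.split("\n\n", 1)
    return parts[1] if len(parts) == 2 else email_text
```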
What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages?
In this case, the classifier would be well trained to identify spam messages, and its estimated prior probability for the spam class would also be high. It would therefore have more trouble classifying the real emails: a lot of real emails would be classified as spam.