Assignment 4 by group 30
Authors: Olle Nilsson and Thomas Alexandersson
Thursday: 9:45 - 12:00, 13:30 - 15:45
Friday: 9:00 - 12:00, 13:30 - 15:00
Weekend: 2 hours
Monday: 13:00 - 16:00
Hours worked on this assignment = x hours per member
# DAT405 Introduction to Data Science and AI
## 2020-2021, Reading Period 2
Assignment 4: Spam classification using Naïve Bayes
There will be an overall grade for this assignment. To get a pass grade (grade 5), you need to pass items 1-3 below. To receive higher grades, finish items 4 and 5 as well.
The exercise takes place in a notebook environment where you can choose to use Jupyter or Google Colab. We recommend you use Google Colab as it facilitates remote group work and makes the assignment less technical.
You can execute certain Linux shell commands by prefixing the command with `!`. You can insert Markdown cells and code cells. The former you can use for documenting and explaining your results; the latter for writing the code snippets that execute the required tasks.
In this assignment you will implement a Naïve Bayes classifier in Python that will classify emails into spam and non-spam (“ham”) classes. Your program should be able to train on a given set of spam and “ham” datasets. You will work with the datasets available at https://spamassassin.apache.org/old/publiccorpus/. There are three types of files in this location:
- easy-ham: non-spam messages typically quite easy to differentiate from spam messages.
- hard-ham: non-spam messages more difficult to differentiate
- spam: spam messages
Execute the cell below to download and extract the data into the environment of the notebook -- it will take a few seconds. If you choose to use Jupyter notebooks you will have to run the commands in the cell below on your local computer; on Windows you can use 7zip (https://www.7-zip.org/download.html) to decompress the data.
The data is now in the three folders described above.
- Note that the email files contain a lot of extra information, besides the actual message. Ignore that for now and run on the entire text. Further down (in the higher-grade part), you will be asked to filter out the headers and footers.
- We don't want to train and test on the same data. Split the spam and the ham datasets into a training set and a test set (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`).

Initially, no data was filtered out and the data was split into 70% training data and 30% test data.
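For reference, a minimal sketch of reading the mails and performing such a split (folder and variable names are illustrative assumptions, not necessarily those of the actual notebook):

```python
import os
from sklearn.model_selection import train_test_split

def read_folder(path):
    """Read every email file in a folder into a list of strings."""
    texts = []
    for name in os.listdir(path):
        with open(os.path.join(path, name), encoding='latin-1') as f:
            texts.append(f.read())
    return texts

# Folder names assume the archives were extracted into the
# notebook's working directory as described above
ham = read_folder('easy_ham')
spam = read_folder('spam')

# 70/30 split per class, as used in this report
hamtrain, hamtest = train_test_split(ham, test_size=0.3, random_state=0)
spamtrain, spamtest = train_test_split(spam, test_size=0.3, random_state=0)
```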
2. Write a Python program that:
- Uses the four datasets (`hamtrain`, `spamtrain`, `hamtest`, and `spamtest`)
- Trains a Naïve Bayes classifier (e.g. Sklearn) on `hamtrain` and `spamtrain`, classifies the test sets, and reports True Positive and False Negative rates on the `hamtest` and `spamtest` datasets. You can use `CountVectorizer` to transform the email texts into vectors. Please note that there are different types of Naïve Bayes classifier in Sklearn (see the documentation). Test two of these classifiers that are well suited for this problem:
- Multinomial Naive Bayes
- Bernoulli Naive Bayes.
Please inspect the documentation to ensure input to the classifiers is appropriate. Discuss the differences between these two classifiers.
Naive Bayes classification takes a word-occurrence matrix and treats every word as independent of the others, storing the result as a list of per-word probability values. Multinomial Naive Bayes uses how frequently given words have appeared in previous spam and real mail to predict a probability value from 0 to 1. Bernoulli Naive Bayes instead treats every word as a binary value, looking only at whether a word is seen in spam or real mails at all. The main difference between the two models is that Multinomial counts the number of occurrences of words to classify mails, while Bernoulli looks at the presence or absence of words. In general Bernoulli is supposed to perform better on shorter documents, which emails are.
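As a concrete illustration of the difference, a minimal sketch (continuing the variable names from the split sketch above; this is not the graded notebook code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

X_train = hamtrain + spamtrain
y_train = [0] * len(hamtrain) + [1] * len(spamtrain)  # 0 = ham, 1 = spam

# Multinomial NB consumes raw word counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(X_train)
mnb = MultinomialNB().fit(X_counts, y_train)

# Bernoulli NB only models presence/absence, so binarized features suffice
bin_vec = CountVectorizer(binary=True)
X_binary = bin_vec.fit_transform(X_train)
bnb = BernoulliNB().fit(X_binary, y_train)
```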
3. Run your program on
- Spam versus easy-ham
- Spam versus hard-ham.
Spam vs. ham results
When running the models on easy-ham vs. spam, the multinomial classification model had almost a 100% success rate labeling easy-ham as not spam. However, when predicting spam mail it missed 20%. Bernoulli showed similar results on easy-ham classifications; however, on spam mail it wrongly classified 50% of the mails, which must be considered a major flaw. Running the models on spam versus hard-ham, both models have no trouble labeling spam mails correctly, with Bernoulli mislabeling only one spam mail. Both models lose accuracy in correctly classifying ham when using the hard mails, though, with multinomial achieving 87% accuracy and Bernoulli only reaching 71%.
Looking at a couple of samples from the easy-ham category, most of the mails appear to have a similar format, and with such a large number of samples (2500) compared to the 700 spam mails, both models tend to skew towards labeling the test mails as real mail. The models' classifications on the hard-ham dataset are instead skewed towards spam labels, which makes sense since that dataset contains 250 mails compared to the 700 spam mails.
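The True Positive and False Negative rates discussed above can be read off a confusion matrix; a minimal sketch, reusing the fitted multinomial model and variable names from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix

X_test = hamtest + spamtest
y_test = [0] * len(hamtest) + [1] * len(spamtest)

# Reuse the vectorizer fitted on the training data
y_pred = mnb.predict(count_vec.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# True Positive rate: share of spam correctly flagged
# False Negative rate: share of spam that slipped through
print('TPR:', tp / (tp + fn))
print('FNR:', fn / (tp + fn))
```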
4. To avoid classification based on common and uninformative words, it is common to filter these out.
a. Argue why this may be useful. Try finding the words that are too common/uncommon in the dataset.
b. Use the parameters in Sklearn's `CountVectorizer` to filter out these words. Update the program from point 3, run it on your data, and report your results.
You have two options to do this in Sklearn: either using the words found in part (a) or letting Sklearn do it for you. Argue for your decision-making.
Sorting out common words can help increase the accuracy of the model. By removing words that hold little meaning ("the", "and", "from", etc., and hopefully parts of the headers), predictions can be made on the actual content of the mail. This should be better than using all the text in the mails, since the mails contain a lot of boilerplate.
Running the program again with the optimal values for min and max df yields overall better results than the default values: the arithmetic mean of the accuracies is 0.90 with the optimal values, compared to 0.85 with the defaults. The values calculated are min_df=0.05 and max_df=0.3. In practice this means that words that appear in less than 5% or more than 30% of all the mails are not used by the model when training and predicting. Below are the confusion matrices with the updated values.
At first glance the upper limit may seem low, but during our testing it yielded the best result. It was easier to filter words this way than to manually look up each word's occurrence. One could take the vocabulary from the vectorizer, sort it, and look at the most and least common words, but that is essentially the same thing as using the built-in functionality, only more time consuming.
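A minimal sketch of the built-in filtering with the values reported above (continuing the variable names from the earlier sketches):

```python
from sklearn.feature_extraction.text import CountVectorizer

# min_df/max_df given as fractions of documents: drop words that appear
# in fewer than 5% or in more than 30% of the training mails
vec = CountVectorizer(min_df=0.05, max_df=0.3)
X_filtered = vec.fit_transform(X_train)
```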
Both models achieve greatly improved results labeling spam on easy-ham vs. spam with heavy filtering. They also lose some accuracy labeling ham as real mail, but that is a minor loss compared to the benefit in spam labeling. Applying both models to the hard-ham vs. spam datasets, the results show patterns similar to the original models. Spam labeling loses about 5% accuracy with the Bernoulli model and close to 3% with the multinomial model. Bernoulli improves its ham labeling by 7%, though, while multinomial shows the same results as the original model.
By cleaning out underused and overused words, the models strike a much better balance between labeling mails as real mail and as spam. Since, as previously mentioned, the format of the test samples tends to be quite similar, the big outliers were previously not accounted for, which really shows in the results.
5. Eking out further performance
Filter out the headers and footers of the emails before you run on them. The format may vary somewhat between emails, which can make this a bit tricky, so perfect filtering is not required. Run your program again and answer the following questions:
- Does the result improve from 3 and 4?
- The split of the data set into a training set and a test set can lead to very skewed results. Why is this, and do you have suggestions on remedies?
- What do you expect would happen if your training set were mostly spam messages while your test set were mostly ham messages?
Re-estimate your classifier with the `fit_prior` parameter set to `False`, and answer the following questions:
- What does this parameter mean?
- How does this alter the predictions? Discuss why or why not.
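For reference, re-estimating with uniform class priors is a one-line change (a sketch reusing names from the earlier sketches, not the graded code):

```python
from sklearn.naive_bayes import MultinomialNB

# fit_prior=False makes the classifier assume uniform class priors
# instead of estimating P(spam) and P(ham) from the training data
mnb_uniform = MultinomialNB(fit_prior=False).fit(X_counts, y_train)
```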
To remove headers and other "junk" from the mails, a function was created. It goes through each mail and removes all commas, periods, exclamation points and double quotation marks. It then splits the text at every whitespace and walks through the resulting words in chunks of three. If those three words consist only of digits or alphabetical characters they are kept; otherwise they are discarded. During our testing this sorts out most of the headers from the mails with good results.
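A minimal sketch of this heuristic (the function name and chunking details here are an illustrative reconstruction of the description above, not the exact notebook code):

```python
import re

def strip_junk(text):
    """Heuristic header/footer filter: keep only chunks of three
    consecutive tokens that are purely alphabetic or purely numeric."""
    # Remove commas, periods, exclamation points and double quotes
    cleaned = re.sub(r'[,.!"]', '', text)
    tokens = cleaned.split()
    kept = []
    # Walk the tokens in non-overlapping chunks of three
    for i in range(0, len(tokens) - 2, 3):
        chunk = tokens[i:i + 3]
        if all(t.isalpha() or t.isdigit() for t in chunk):
            kept.extend(chunk)
    return ' '.join(kept)
```

Below is an unfiltered mail and its filtered equivalent to showcase the filtering.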
An unfiltered mail
Its filtered version
Running the same function as earlier to obtain the optimal max and min df, the accuracy was still lower than with the unfiltered mails, although it did improve on the results from part 3. Notably, both the upper and lower limits changed when using the filtered mails. Once again the multinomial model performed better than the Bernoulli.
The same pattern can be seen here as with the original models: both label easy-ham with great accuracy while having trouble labeling spam correctly, and the reverse for hard-ham.
Compared to the unfiltered datasets with underused/overused-word filtering, this model shows worse results. The conclusion must be that our filtering heuristic is too crude for such a large sample size, and that the headers and footers contain a lot of valuable information that the models can use to make accurate predictions.
Overall the multinomial classification model showed better results than the Bernoulli, but not by a large margin. The multinomial model is more robust than the Bernoulli model because it takes the actual word counts into account instead of only the presence of select words, which shows in the results.
To improve the models further, the datasets used to train them should be of equal size so as not to skew the models toward either class; one such remedy is sketched below.
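A sketch of downsampling the larger class to the size of the smaller one (assuming the ham/spam lists from the first sketch):

```python
import random

random.seed(0)
n = min(len(ham), len(spam))
# Downsample each class to the size of the smaller one before splitting
ham_balanced = random.sample(ham, n)
spam_balanced = random.sample(spam, n)
```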
What to report and how to hand in.
- You will need to report all results in the notebook in a clear and appropriate way, either using plots or code output (e.g. print statements).
- The notebook must be reproducible; that means we must be able to use the `Run all` function from the `Runtime` menu and reproduce all your results. Please check this before handing in.
- Save the notebook and share a link to it (press Share in the upper left corner and use the `Get link` option). Please make sure to allow everyone with the link to open and edit.
- Edits made after the submission deadline will be ignored; graders will recover the last saved version before the deadline from the revision history.
- Please make sure all cells are executed and all the output is clearly readable/visible to anybody opening the notebook.