Assignment 4 - Bayesian models
By Julia Jönmark and Hanna Söderström
Hours spent: 10h each
Question 1
To answer this question we needed to fetch the files and put their contents into a data frame.
For the first question we took inspiration from code on Grepper (found on this link: https://www.codegrepper.com/code-examples/shell/unzip+tar.gz+using+python) to access the contents of the .tar files. We then realized that the files were not being read correctly, which we fixed by specifying the 'latin-1' encoding according to this link: https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte.
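The reading step could look roughly like the sketch below. It is not the exact notebook code: the function name `read_emails_from_tar` and the `label` argument are hypothetical, and it assumes each regular file in the archive is one email.

```python
import io
import os
import tarfile


def read_emails_from_tar(archive, label):
    """Return a list of (text, label) pairs, one per regular file in the archive.

    `archive` is a path or an open binary file object; `label` is a
    hypothetical class name such as "ham" or "spam".
    """
    if isinstance(archive, (str, os.PathLike)):
        kwargs = {"name": archive}
    else:
        kwargs = {"fileobj": archive}
    rows = []
    # Mode "r:*" lets tarfile detect the compression (none, gz, bz2, xz).
    with tarfile.open(mode="r:*", **kwargs) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            raw = tar.extractfile(member).read()
            # latin-1 maps every byte to a character, so decoding never
            # raises UnicodeDecodeError the way strict UTF-8 decoding can.
            rows.append((raw.decode("latin-1"), label))
    return rows
```

The returned pairs can then be passed to a pandas `DataFrame` constructor, with one call per archive and the results concatenated.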
Below we can see what the data frame 'df' looks like.
For part B of the first question we split the dataset into training and test sets, using a 70-30 train/test split.
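A 70-30 split can be sketched with scikit-learn's `train_test_split`. The toy `texts` and `labels` lists are stand-ins for the real data, and the `stratify` and `random_state` arguments are optional extras assumed here, not something the assignment states.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the email texts and their labels (assumed names).
texts = [f"email {i}" for i in range(10)]
labels = ["spam" if i < 4 else "ham" for i in range(10)]

# test_size=0.3 gives the 70-30 split; random_state makes it repeatable.
# stratify keeps the ham/spam ratio similar in both parts (optional).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=0, stratify=labels
)

print(len(X_train), len(X_test))
```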
Question 2
We start off with the train and test sets from the previous question and use them for the vectorization of the emails.
To vectorize the email texts we used the code from scikit-learn as inspiration (found on this link: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=count%20vectorizer#sklearn.feature_extraction.text.CountVectorizer).
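As an illustration of the vectorization step, here is a small sketch with made-up texts (the variable names and example sentences are assumptions). The vocabulary is learned from the training texts only and then reused for the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example texts standing in for the emails.
train_texts = ["free money now", "meeting at noon", "free free offer"]
test_texts = ["free meeting tomorrow"]

vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts and count the words.
X_train = vectorizer.fit_transform(train_texts)
# Reuse the same vocabulary for the test texts; unseen words are ignored.
X_test = vectorizer.transform(test_texts)

print(sorted(vectorizer.vocabulary_))
print(X_train.toarray())
```

Each row is one text and each column is one vocabulary word, so a repeated word such as "free" in the third training text gets a count of 2.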
Once the email texts are vectorized, we create the classifiers and run them. The code is taken from naïve_bayes_intro.ipynb provided in the course (with the classifier changed to the right one here).
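Training and running the two classifiers could look like the sketch below. It is not the code from the course notebook; the toy texts, labels and variable names are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up training and test data standing in for the emails.
train_texts = [
    "win free money now",
    "free prize win win",
    "meeting agenda for monday",
    "project meeting notes",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free money prize", "monday project meeting"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Fit each classifier on the same count matrix and store its predictions.
predictions = {}
for clf in (MultinomialNB(), BernoulliNB()):
    clf.fit(X_train, train_labels)
    predictions[type(clf).__name__] = clf.predict(X_test)

print(predictions)
```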
We then want to print the results as the percentage of the ham and spam emails in the test set that were classified correctly. This is done by creating confusion matrices (code for this is taken from this link: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html). To calculate the true positive and true negative rates we took inspiration from Grepper again (code found on this link: https://www.codegrepper.com/code-examples/python/precision+and+recall+from+confusion+matrix+python), which gave us more exact values.
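The percentages can be read off a confusion matrix as sketched below. The labels and predictions are made up, and the variable names are assumptions. Each rate is the number of correctly classified emails of a class divided by the total number of emails of that class.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions: 4 ham and 5 spam emails.
y_true = ["ham"] * 4 + ["spam"] * 5
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "spam", "ham"]

# Rows are true classes, columns are predicted classes, in the given order.
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

# Share of each true class that was predicted correctly.
ham_rate = cm[0, 0] / cm[0].sum()
spam_rate = cm[1, 1] / cm[1].sum()

print(cm)
print(f"ham: {ham_rate:.1%}, spam: {spam_rate:.1%}")
```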
With the help of the picture provided in Lecture 4 (see below), we can analyze the matrices.
From the matrix with the Bernoulli classifiers we get:
correct identification of ham: 99.8%
correct identification of spam: 25.2%

From the matrix with the Multinomial classifiers we get:
correct identification of ham: 99.2%
correct identification of spam: 90.2%
From this we get that the multinomial classifier is better at identifying spam/ham emails for this dataset. We can also see from the matrices above that the number of spam emails falsely identified as ham was far greater with the Bernoulli classifier than with the multinomial classifier. Since the multinomial classifier takes the number of times a word appears into consideration (unlike the Bernoulli classifier, which only considers whether a word appears at all), we can conclude that the number of times a word appears matters for this dataset.
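The difference between the two models can be illustrated with a small sketch on made-up word counts (all names and numbers below are assumptions). With its default `binarize=0.0`, `BernoulliNB` turns every count into a present/absent flag before fitting, so it learns the same model from counts as from flags, while `MultinomialNB` does not.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Made-up counts of two vocabulary words in four emails.
counts = np.array([[5, 0], [3, 1], [1, 2], [0, 4]])
labels = np.array(["spam", "spam", "ham", "ham"])
# The same data reduced to "word present or not".
presence = (counts > 0).astype(int)

# BernoulliNB binarizes its input, so both inputs give the same model.
bern_counts = BernoulliNB().fit(counts, labels)
bern_presence = BernoulliNB().fit(presence, labels)
same_bernoulli = np.allclose(
    bern_counts.feature_log_prob_, bern_presence.feature_log_prob_
)

# MultinomialNB uses the counts themselves, so the models differ.
multi_counts = MultinomialNB().fit(counts, labels)
multi_presence = MultinomialNB().fit(presence, labels)
same_multinomial = np.allclose(
    multi_counts.feature_log_prob_, multi_presence.feature_log_prob_
)

print(same_bernoulli, same_multinomial)
```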
Question 3
We begin by separating the easy and hard ham types into two different data sets, each combined with the spam. These are the data sets that we use later to determine the difference between easy ham vs spam and hard ham vs spam.
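One way to build the two data sets is sketched below with pandas. The column names `text` and `label` and the label values `easy_ham`, `hard_ham` and `spam` are assumptions; the actual data frame may be organized differently.

```python
import pandas as pd

# Made-up data frame; column names and label values are assumptions.
df = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "label": ["easy_ham", "hard_ham", "spam", "easy_ham", "spam"],
})

# Keep one ham type together with all spam in each data set.
easy_vs_spam = df[df["label"].isin(["easy_ham", "spam"])].reset_index(drop=True)
hard_vs_spam = df[df["label"].isin(["hard_ham", "spam"])].reset_index(drop=True)

print(easy_vs_spam["label"].value_counts())
print(hard_vs_spam["label"].value_counts())
```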
We then repeated the same steps as before (train/test split, vectorization, classification and confusion matrices), this time using the two new data frames.
From the above matrices we get the following:
For easy ham:
correct identification of ham with MultinomialNB: 99.8%
correct identification of spam with MultinomialNB: 88.3%
correct identification of ham with BernoulliNB: 100%
correct identification of spam with BernoulliNB: 57.6%

For hard ham:
correct identification of ham with MultinomialNB: 78%
correct identification of spam with MultinomialNB: 99.3%
correct identification of ham with BernoulliNB: 50%
correct identification of spam with BernoulliNB: 97.9%
From this we can conclude that the multinomial classifier is better overall than the Bernoulli classifier for both easy ham and hard ham. For easy ham, the Bernoulli classifier correctly identifies only 57.6% of the spam emails, whereas the multinomial classifier correctly identifies 88.3%. For hard ham, the Bernoulli classifier correctly identifies only 50% of the ham emails, whereas the multinomial classifier correctly identifies 78%. That is a substantial difference in both cases, with the multinomial classifier coming out on top.