Assignment 4 - Bayesian models 

By Julia Jönmark and Hanna Söderström 

Hours spent: 10h each

import tarfile
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split

Question 1

To answer this question we needed to fetch the files and put their contents into a data frame.

For the first question we took inspiration from code on Grepper (found on this link: https://www.codegrepper.com/code-examples/shell/unzip+tar.gz+using+python) to access the contents of the .tar files. We then realized that the files were not decoded correctly, which we fixed by decoding the bytes as 'latin-1', following this link: https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte.

# Importing the files
tar1 = tarfile.open('20030228_spam.tar.bz2', 'r:bz2')
tar2 = tarfile.open('20030228_easy_ham_2.tar.bz2', 'r:bz2')
tar3 = tarfile.open('20030228_hard_ham.tar.bz2', 'r:bz2')

# Lists that gather data for the data frame
newData = []
spam_ham = []
types = []

# List with all the imported archives
tar = [tar1, tar2, tar3]

# Extracting the requested information from all three archives
for t in tar:
    for member in t.getmembers():
        f = t.extractfile(member)
        # extractfile returns None for members that are not regular files (e.g. directories)
        if f is not None:
            content = f.read()
            # Store the member's path (member.name) rather than the TarInfo object itself
            newData.append({'filename': member.name, 'content': content.decode('latin-1')})
            if t is tar1:
                spam_ham.append('spam')
                types.append('spam')
            elif t is tar2:
                spam_ham.append('ham')
                types.append('easy ham')
            else:
                spam_ham.append('ham')
                types.append('hard ham')
    t.close()

# Adding the extracted information to the data frame df
df = pd.DataFrame(newData)
df['class'] = spam_ham
df['types'] = types
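The 'latin-1' fix mentioned above can be illustrated with a minimal sketch. It does not use the real corpus: the archive, the member name 'spam/0001.sample' and the message text are made up for illustration. It shows that a byte such as 0xE9 fails to decode as UTF-8 but decodes fine as Latin-1.

```python
import io
import tarfile

# Hypothetical message bytes: 0xE9 is 'é' in Latin-1 but is not valid UTF-8 here
raw = b"Subject: caf\xe9 offer\n"

# Build a tiny in-memory .tar.bz2 archive as a stand-in for the real corpus files
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
    info = tarfile.TarInfo(name="spam/0001.sample")
    info.size = len(raw)
    archive.addfile(info, io.BytesIO(raw))

# Read the member back the same way as in the assignment code
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:bz2") as archive:
    member = archive.getmembers()[0]
    content = archive.extractfile(member).read()

# Decoding as UTF-8 raises UnicodeDecodeError for this byte sequence
try:
    content.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 maps every byte value to a character, so decoding cannot fail
decoded = content.decode("latin-1")
print(utf8_ok, repr(decoded))
```

Since Latin-1 never fails, it is a safe fallback, but text that was really written in another encoding may still show up with the wrong characters.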

Below we can see what the data frame 'df' looks like:

df

For part B of the first question we split the dataset into training and test sets, using a 70-30 train/test split.

# Note: ham_train/ham_test hold the email texts (features) and
# spam_train/spam_test hold the class labels ('ham' or 'spam')
ham_train, ham_test, spam_train, spam_test = train_test_split(
    df['content'], df['class'], test_size=0.3, random_state=109)
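As a small illustration of what train_test_split returns, here is a sketch on made-up data. The texts, labels and the names X_train, X_test, y_train and y_test are our own choices for this example, not part of the assignment code. The function returns the feature splits first and the label splits second, and each text stays paired with its own label.

```python
from sklearn.model_selection import train_test_split

# Made-up toy data: ten "emails" with alternating labels
texts = [f"email {i}" for i in range(10)]
labels = ["spam" if i % 2 == 0 else "ham" for i in range(10)]

# Return order: features (train, test), then labels (train, test)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=109
)

# Check that every text is still paired with its original label after shuffling
pairs_ok = all(
    labels[int(text.split()[1])] == label
    for text, label in zip(X_train + X_test, y_train + y_test)
)
print(len(X_train), len(X_test), pairs_ok)
```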

Question 2

We start off with the train and test sets from the previous question and use them for the vectorization of the emails.

To vectorize the email texts we used the code from scikit-learn as inspiration (found on this link: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=count%20vectorizer#sklearn.feature_extraction.text.CountVectorizer).

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Learn the vocabulary from the training emails only, then reuse it for the test emails
ham_train_vector = vectorizer.fit_transform(ham_train)
ham_test_vector = vectorizer.transform(ham_test)

When we have vectorized the email texts, we create the classifiers and run them. The code is taken from naïve_bayes_intro.ipynb provided in the course (we have changed to the right classifier here).

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn import metrics

# Multinomial Naïve Bayes Classifier
# Create a Naïve Bayes Classifier and train the model using the training sets
mnb = MultinomialNB().fit(ham_train_vector, spam_train)
# Predict the response for the test dataset
spam_pred = mnb.predict(ham_test_vector)
# Model accuracy: how often is the classifier correct?
print("Accuracy for Multinomial:", metrics.accuracy_score(spam_test, spam_pred))

# Bernoulli Naïve Bayes Classifier
# Create a Naïve Bayes Classifier and train the model using the training sets
# (to binarize the features we set binarize=0)
bnb = BernoulliNB(binarize=0).fit(ham_train_vector, spam_train)
# Predict the response for the test dataset
spam_pred2 = bnb.predict(ham_test_vector)
# Model accuracy: how often is the classifier correct?
print("Accuracy for Bernoulli:", metrics.accuracy_score(spam_test, spam_pred2))

We then want to print out the percentage of the ham and spam test emails that were classified correctly. This is done by creating confusion matrices (code for this is taken from this link: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html). To calculate the true positive and true negative rates we took inspiration from Grepper again (code found on this link: https://www.codegrepper.com/code-examples/python/precision+and+recall+from+confusion+matrix+python) and calculated more exact values.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Creation of confusion matrix for the Bernoulli Naïve Bayes Classifier
cmp1 = ConfusionMatrixDisplay.from_estimator(
    bnb,
    ham_test_vector,
    spam_test,
    display_labels=['ham', 'spam'],
    normalize='true',
)
cmp1.ax_.set_title('Confusion Matrix - BernoulliNB')

# Calculate true positives and true negatives for the Bernoulli Naïve Bayes Classifier.
# The labels are sorted alphabetically, so ravel() returns the counts in the order
# ham->ham, ham->spam, spam->ham, spam->spam. In the names below ham is treated as
# the positive class: 'truepositive' is the share of ham classified as ham and
# 'truenegative' is the share of spam classified as spam.
tn1, fp1, fn1, tp1 = confusion_matrix(spam_test, spam_pred2).ravel()
truepositive = tn1 / (tn1 + fp1)
truenegative = tp1 / (tp1 + fn1)
print('True positive for Bernoulli: ' + str(truepositive))
print('True negative for Bernoulli: ' + str(truenegative))

# Creation of confusion matrix for the Multinomial Naïve Bayes Classifier
cmp2 = ConfusionMatrixDisplay.from_estimator(
    mnb,
    ham_test_vector,
    spam_test,
    display_labels=['ham', 'spam'],
    normalize='true',
)
cmp2.ax_.set_title('Confusion Matrix - MultinomialNB')

# Calculate true positives and true negatives for the Multinomial Naïve Bayes Classifier
tn2, fp2, fn2, tp2 = confusion_matrix(spam_test, spam_pred).ravel()
truepositive2 = tn2 / (tn2 + fp2)
truenegative2 = tp2 / (tp2 + fn2)
print('True positive for Multinomial: ' + str(truepositive2))
print('True negative for Multinomial: ' + str(truenegative2))
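The per-class percentages computed from the confusion matrix can be checked on a small example. The sketch below uses made-up true and predicted labels (not our results) and compares the values from ravel() with scikit-learn's recall_score for each class. The variable names are chosen for this illustration only.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 ham emails and 4 spam emails
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "spam", "ham", "ham"]

# With labels ordered ['ham', 'spam'], ravel() gives:
# ham->ham, ham->spam, spam->ham, spam->spam
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["ham", "spam"]).ravel()

ham_rate = tn / (tn + fp)    # share of ham classified as ham
spam_rate = tp / (tp + fn)   # share of spam classified as spam

# The same numbers are the recall of each class
ham_recall = recall_score(y_true, y_pred, pos_label="ham")
spam_recall = recall_score(y_true, y_pred, pos_label="spam")

print(ham_rate, spam_rate)
```

These two rates correspond to the diagonal of a confusion matrix plotted with normalize='true'.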

According to the picture provided in Lecture 4 (see below), we can analyze the matrices.

From the matrix with the Bernoulli classifier we get:
correct identification of ham: 99.8%
correct identification of spam: 25.2%

From the matrix with the Multinomial classifier we get:
correct identification of ham: 99.2%
correct identification of spam: 90.2%

From this we get that the Multinomial classifier is better at identifying spam/ham emails for this dataset. We can also see from the matrices above that the share of spam falsely identified as ham was far greater with the Bernoulli classifier (74.8%) than with the Multinomial classifier (9.8%). Since the multinomial classifier takes the number of times a word appears into consideration (unlike the Bernoulli classifier), this suggests that the number of times a word appears matters for this dataset.
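The difference in what the two classifiers see can be sketched on a made-up count matrix (the numbers and labels are invented for illustration). With binarize=0, BernoulliNB turns every count above zero into 1, so it behaves the same as a model trained on a presence/absence matrix, and the information about how often a word appears is lost.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up word counts: 4 emails, 3 words
counts = np.array([[3, 0, 1], [0, 2, 0], [5, 1, 0], [0, 0, 4]])
labels = ["spam", "ham", "spam", "ham"]

# Presence/absence version of the same matrix
presence = (counts > 0).astype(int)

# binarize=0 thresholds the counts internally; binarize=None assumes binary input
on_counts = BernoulliNB(binarize=0).fit(counts, labels)
on_presence = BernoulliNB(binarize=None).fit(presence, labels)

same = np.allclose(
    on_counts.predict_proba(counts), on_presence.predict_proba(presence)
)
print(presence)
print(same)
```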

Question 3

We begin by separating the easy and hard ham types into two different data sets, each together with the spam. These are the data sets that we use later to determine the difference between easy ham vs spam and hard ham vs spam.

# Making 2 data frames: one with all easy ham plus spam and one with all hard ham plus spam
df_easy = df.loc[(df['types'] == 'easy ham') | (df['types'] == 'spam')]
df_hard = df.loc[(df['types'] == 'hard ham') | (df['types'] == 'spam')]
df_easy.name = 'Easy'
df_hard.name = 'Hard'
dfs = [df_easy, df_hard]
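The same kind of selection can be sketched on a small made-up data frame. This is an alternative formulation, not the code we ran: it uses isin for the filter and keeps the subsets in a dictionary keyed by name instead of setting a name attribute on each data frame. The names df_demo, subsets and sizes are chosen for this illustration.

```python
import pandas as pd

# Made-up stand-in for the real data frame
df_demo = pd.DataFrame({
    "content": ["a", "b", "c", "d"],
    "class": ["spam", "ham", "ham", "spam"],
    "types": ["spam", "easy ham", "hard ham", "spam"],
})

# Each subset keeps one ham type together with all spam
subsets = {
    "Easy": df_demo[df_demo["types"].isin(["easy ham", "spam"])],
    "Hard": df_demo[df_demo["types"].isin(["hard ham", "spam"])],
}

sizes = {name: len(subset) for name, subset in subsets.items()}
print(sizes)
```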

Then we did the same as before, but with these two data frames instead of the full one.

# Printing easy ham vs spam and hard ham vs spam with both Bernoulli and Multinomial as classifiers
for d in dfs:
    ham_train, ham_test, spam_train, spam_test = train_test_split(
        d['content'], d['class'], test_size=0.3, random_state=109)
    ham_train_vector = vectorizer.fit_transform(ham_train)
    ham_test_vector = vectorizer.transform(ham_test)

    # Multinomial
    mnb = MultinomialNB().fit(ham_train_vector, spam_train)
    spam_pred = mnb.predict(ham_test_vector)
    print(d.name + ": Accuracy for Multinomial:", metrics.accuracy_score(spam_test, spam_pred))

    # Bernoulli
    bnb = BernoulliNB(binarize=0).fit(ham_train_vector, spam_train)
    spam_pred2 = bnb.predict(ham_test_vector)
    print(d.name + ": Accuracy for Bernoulli:", metrics.accuracy_score(spam_test, spam_pred2))

    # Confusion matrix with Multinomial as classifier
    cmp1 = ConfusionMatrixDisplay.from_estimator(
        mnb,
        ham_test_vector,
        spam_test,
        display_labels=['ham', 'spam'],
        normalize='true',
    )
    cmp1.ax_.set_title('Confusion Matrix - MultinomialNB: Spam vs ' + d.name)

    # Calculate true positives and true negatives for the Multinomial Naïve Bayes Classifier
    # (ham is treated as the positive class in these names)
    tn2, fp2, fn2, tp2 = confusion_matrix(spam_test, spam_pred).ravel()
    truepositive2 = tn2 / (tn2 + fp2)
    truenegative2 = tp2 / (tp2 + fn2)
    print(d.name + ': True positive for Multinomial: ' + str(truepositive2))
    print(d.name + ': True negative for Multinomial: ' + str(truenegative2))

    # Confusion matrix with Bernoulli as classifier
    cmp2 = ConfusionMatrixDisplay.from_estimator(
        bnb,
        ham_test_vector,
        spam_test,
        display_labels=['ham', 'spam'],
        normalize='true',
    )
    cmp2.ax_.set_title('Confusion Matrix - BernoulliNB: Spam vs ' + d.name)

    # Calculate true positives and true negatives for the Bernoulli Naïve Bayes Classifier
    tn1, fp1, fn1, tp1 = confusion_matrix(spam_test, spam_pred2).ravel()
    truepositive = tn1 / (tn1 + fp1)
    truenegative = tp1 / (tp1 + fn1)
    print(d.name + ': True positive for Bernoulli: ' + str(truepositive))
    print(d.name + ': True negative for Bernoulli: ' + str(truenegative))

From the above matrices we get the following:

For easy ham:
correct identification of ham with MultinomialNB: 99.8%
correct identification of spam with MultinomialNB: 88.3%
correct identification of ham with BernoulliNB: 100%
correct identification of spam with BernoulliNB: 57.6%

For hard ham:
correct identification of ham with MultinomialNB: 78%
correct identification of spam with MultinomialNB: 99.3%
correct identification of ham with BernoulliNB: 50%
correct identification of spam with BernoulliNB: 97.9%

From this we can conclude that the Multinomial classifier is better than the Bernoulli classifier for both easy ham and hard ham. For easy ham, the Bernoulli classifier correctly identifies only 57.6% of the spam emails, whereas the Multinomial classifier correctly identifies 88.3%. For hard ham, the Bernoulli classifier correctly identifies only 50% of the ham emails, whereas the Multinomial classifier correctly identifies 78%. That is a clear difference in both cases, with the Multinomial classifier coming out on top.
