Discussion
If we compare multinomial Naive Bayes (MnB) and Bernoulli Naive Bayes (BnB) when run on hard ham, we see some differences. BnB classified all hard-ham mails correctly, but it misclassified many spam mails as hard ham. MnB had better accuracy overall: although it misclassified some of the hard-ham mails, it did a better job of classifying the spam ones.
If we look at spam versus easy ham in task 2, MnB outperformed BnB by an even larger margin, and it achieved the highest accuracy of all the models.
To summarize, MnB works better for classifying the emails in this case.
The reason behind this may be that the two models operate in different ways. One difference between the two is that MnB takes the frequency of each word into account, while BnB only checks whether the word appears at all. This matters because word frequency usually does differ between spam emails and regular emails. For example, while the word 'rich' is likely to appear in both ham and spam emails, it will likely appear more times in a spam email to try and lure the reader. This may also be why MnB performed worse on hard ham: hard-ham mails may be characterized by a higher frequency of these 'suspicious' words and are therefore easily misclassified as spam by MnB. To summarize, since frequency is important in this task, MnB is the better model to use.
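The count-versus-presence difference can be made concrete with a small sketch. The toy corpus below is hypothetical (it is not the assignment's data set), and the two scoring functions are simplified class-conditional log-likelihoods with Laplace smoothing rather than a full classifier: the multinomial score adds a term for every occurrence of a word, while the Bernoulli score only looks at which vocabulary words are present or absent. An email that repeats 'rich' three times can therefore lean toward spam under the multinomial model while the Bernoulli model, which sees 'rich' only once, does not.

```python
from collections import Counter
import math

# Hypothetical toy corpus, tokenized: spam repeats 'rich', ham mentions it once.
spam_docs = [["rich", "rich", "rich", "win", "money"],
             ["win", "money", "rich", "rich"]]
ham_docs = [["meeting", "rich", "report"],
            ["report", "meeting", "schedule"]]

vocab = sorted({w for d in spam_docs + ham_docs for w in d})

def multinomial_loglik(doc, docs):
    # Multinomial model: P(w|c) is estimated from total counts of w in the
    # class, so every repeated occurrence in `doc` contributes another term.
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in doc)

def bernoulli_loglik(doc, docs):
    # Bernoulli model: P(w present|c) is the fraction of class documents
    # containing w; the score sums over the vocabulary, present or absent,
    # and ignores how many times a word occurs in `doc`.
    present = set(doc)
    doc_freq = Counter(w for d in docs for w in set(d))
    n = len(docs)
    loglik = 0.0
    for w in vocab:
        p = (doc_freq[w] + 1) / (n + 2)
        loglik += math.log(p) if w in present else math.log(1 - p)
    return loglik

# A test email that repeats 'rich': the multinomial margin credits each
# occurrence, the Bernoulli margin only registers that 'rich' is present.
email = ["rich", "rich", "rich", "meeting"]
mnb_margin = multinomial_loglik(email, spam_docs) - multinomial_loglik(email, ham_docs)
bnb_margin = bernoulli_loglik(email, spam_docs) - bernoulli_loglik(email, ham_docs)
print("MnB spam-minus-ham margin:", mnb_margin)
print("BnB spam-minus-ham margin:", bnb_margin)
```

On this made-up corpus the multinomial margin comes out positive (spam) while the Bernoulli margin is negative (ham), which mirrors the point above: repetition of a suspicious word is evidence only the multinomial model can use.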