--2023-03-01 19:16:49-- https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1677144 (1.6M) [application/x-bzip2]
Saving to: ‘20021010_easy_ham.tar.bz2.4’
20021010_easy_ham.t 100%[===================>] 1.60M --.-KB/s in 0.006s
2023-03-01 19:16:50 (290 MB/s) - ‘20021010_easy_ham.tar.bz2.4’ saved [1677144/1677144]
--2023-03-01 19:16:51-- https://spamassassin.apache.org/old/publiccorpus/20021010_hard_ham.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1021126 (997K) [application/x-bzip2]
Saving to: ‘20021010_hard_ham.tar.bz2.4’
20021010_hard_ham.t 100%[===================>] 997.19K --.-KB/s in 0.004s
2023-03-01 19:16:52 (249 MB/s) - ‘20021010_hard_ham.tar.bz2.4’ saved [1021126/1021126]
--2023-03-01 19:16:52-- https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
Resolving spamassassin.apache.org (spamassassin.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to spamassassin.apache.org (spamassassin.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1192582 (1.1M) [application/x-bzip2]
Saving to: ‘20021010_spam.tar.bz2.4’
20021010_spam.tar.b 100%[===================>] 1.14M --.-KB/s in 0.01s
2023-03-01 19:16:53 (91.1 MB/s) - ‘20021010_spam.tar.bz2.4’ saved [1192582/1192582]
total 20M
drwxrwxrwx 6 root root 21 Mar 1 19:16 .
drwxr-xr-x 1 root root 197 Mar 1 19:16 ..
-rw-r--r-- 1 root root 1.6M Jun 29 2004 20021010_easy_ham.tar.bz2
-rw-r--r-- 1 root root 1.6M Jun 29 2004 20021010_easy_ham.tar.bz2.1
-rw-r--r-- 1 root root 1.6M Jun 29 2004 20021010_easy_ham.tar.bz2.2
-rw-r--r-- 1 root root 1.6M Jun 29 2004 20021010_easy_ham.tar.bz2.3
-rw-r--r-- 1 root root 1.6M Jun 29 2004 20021010_easy_ham.tar.bz2.4
-rw-r--r-- 1 root root 998K Dec 16 2004 20021010_hard_ham.tar.bz2
-rw-r--r-- 1 root root 998K Dec 16 2004 20021010_hard_ham.tar.bz2.1
-rw-r--r-- 1 root root 998K Dec 16 2004 20021010_hard_ham.tar.bz2.2
-rw-r--r-- 1 root root 998K Dec 16 2004 20021010_hard_ham.tar.bz2.3
-rw-r--r-- 1 root root 998K Dec 16 2004 20021010_hard_ham.tar.bz2.4
-rw-r--r-- 1 root root 1.2M Jun 29 2004 20021010_spam.tar.bz2
-rw-r--r-- 1 root root 1.2M Jun 29 2004 20021010_spam.tar.bz2.1
-rw-r--r-- 1 root root 1.2M Jun 29 2004 20021010_spam.tar.bz2.2
-rw-r--r-- 1 root root 1.2M Jun 29 2004 20021010_spam.tar.bz2.3
-rw-r--r-- 1 root root 1.2M Jun 29 2004 20021010_spam.tar.bz2.4
drwxrwxr-x 2 root root 3 Feb 13 20:35 .deepnote
drwx--x--x 2 500 500 2.5K Oct 10 2002 easy_ham
drwx--x--x 2 1000 1000 252 Dec 16 2004 hard_ham
drwxr-xr-x 2 500 500 503 Oct 10 2002 spam
Multinomial Naive Bayes (MnB), spam vs. easy ham:
              precision    recall  f1-score   support
         Ham       0.96      1.00      0.98       766
        Spam       0.98      0.81      0.89       151
    accuracy                           0.97       917
   macro avg       0.97      0.91      0.94       917
weighted avg       0.97      0.97      0.97       917
Bernoulli Naive Bayes (BnB), spam vs. easy ham:
              precision    recall  f1-score   support
         Ham       0.89      0.99      0.94       766
        Spam       0.92      0.37      0.53       151
    accuracy                           0.89       917
   macro avg       0.90      0.68      0.73       917
weighted avg       0.89      0.89      0.87       917
Multinomial Naive Bayes (MnB), spam vs. hard ham:
              precision    recall  f1-score   support
    Hard ham       0.95      0.79      0.86        75
        Spam       0.90      0.98      0.94       151
    accuracy                           0.92       226
   macro avg       0.93      0.88      0.90       226
weighted avg       0.92      0.92      0.91       226
Bernoulli Naive Bayes (BnB), spam vs. hard ham:
              precision    recall  f1-score   support
    Hard ham       1.00      0.67      0.80        75
        Spam       0.86      1.00      0.92       151
    accuracy                           0.89       226
   macro avg       0.93      0.83      0.86       226
weighted avg       0.91      0.89      0.88       226
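For reference, classification reports like the ones above can be produced with a pipeline along these lines. This is a sketch, not the exact code used here: the directory names come from the listing above, but the CountVectorizer defaults, the 80/20 split, and the random seed are assumptions.

```python
import os

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB


def load_dir(path):
    """Read every mail file in a corpus directory as one string each."""
    mails = []
    for name in sorted(os.listdir(path)):
        # latin-1 never fails to decode, which suits old raw mail files
        with open(os.path.join(path, name), encoding="latin-1") as f:
            mails.append(f.read())
    return mails


def compare_models(texts, labels, test_size=0.2, seed=0):
    """Fit MnB and BnB on bag-of-words counts, return a report per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed)
    vec = CountVectorizer()
    X_train_c = vec.fit_transform(X_train)
    X_test_c = vec.transform(X_test)
    reports = {}
    for model in (MultinomialNB(), BernoulliNB()):
        model.fit(X_train_c, y_train)
        reports[type(model).__name__] = classification_report(
            y_test, model.predict(X_test_c))
    return reports


# Only run against the corpus if the extracted directories are present.
if os.path.isdir("easy_ham") and os.path.isdir("spam"):
    ham, spam = load_dir("easy_ham"), load_dir("spam")
    texts = ham + spam
    labels = ["Ham"] * len(ham) + ["Spam"] * len(spam)
    for name, report in compare_models(texts, labels).items():
        print(name)
        print(report)
```

Note that BernoulliNB binarizes the count features internally (its `binarize` parameter defaults to 0.0), so both models can be fed the same count matrix.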
Discussion
If we compare multinomial Naive Bayes (MnB) and Bernoulli Naive Bayes (BnB) on the hard ham set, some differences emerge. BnB classified every spam mail correctly (spam recall 1.00), but it misclassified about a third of the hard ham mails as spam (hard ham recall 0.67). MnB had better accuracy overall (0.92 versus 0.89): although it let a few spam mails through, it recovered considerably more of the hard ham.
On spam versus easy ham in task 2, MnB outperformed BnB by an even wider margin, and it achieved the highest accuracy of all the models (0.97).
To summarize, MnB classifies these emails better than BnB does.
The reason is likely that the two models represent a document in different ways. MnB takes the frequency of each word into account, while BnB only records whether a word occurs at all. This matters because word frequencies tend to differ between spam and regular email: a word like 'rich' may well appear in both ham and spam, but it will probably appear more often in a spam mail trying to lure the reader. The same effect may explain why MnB performed worse on hard ham than on easy ham: hard ham mails are presumably characterised by a higher frequency of such 'suspicious' words, and are therefore more easily misclassified as spam by MnB. To summarize, since word frequency carries real signal in this task, MnB is the better model to use.
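The frequency-versus-presence difference can be illustrated with a small hand computation. The word probabilities below are made-up numbers, not estimates from the corpus; the point is only how the two models weigh a repeated word like 'rich'.

```python
from math import log

# Assumed (hypothetical) probabilities of the word 'rich' under each class,
# used for both models here purely for illustration.
p_rich_spam = 0.02
p_rich_ham = 0.005


def multinomial_loglik(p, count):
    # Every occurrence contributes log p: the count matters.
    return count * log(p)


def bernoulli_loglik(p, count):
    # Only presence/absence matters: the count is reduced to a 0/1 flag.
    return log(p) if count > 0 else log(1 - p)


count = 3  # suppose 'rich' appears three times in the mail

# Log-likelihood gap in favour of spam under each model.
mnb_gap = (multinomial_loglik(p_rich_spam, count)
           - multinomial_loglik(p_rich_ham, count))
bnb_gap = (bernoulli_loglik(p_rich_spam, count)
           - bernoulli_loglik(p_rich_ham, count))

# Under MnB the repeated word pushes the mail toward spam 'count' times as
# hard as under BnB, because each occurrence counts separately.
```

Here `mnb_gap` is exactly three times `bnb_gap`: the repeated 'rich' tilts MnB toward spam three times as strongly, which is the behaviour the discussion above attributes to frequency counting.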