Assignment 4
Johanna Wiberg (jwiberg): 19 hours
Oscar Forsberg (oscfors): 19 hours
1a,b)
2)
The main difference between the two classifiers is that multinomial naive Bayes counts how often each word occurs in an email, while Bernoulli naive Bayes only uses binary values: whether a word is present or not (True or False, 1 or 0). This means that removing stop words with stop_words='english' (a built-in list of words like "and", "then", "or") should help the multinomial classifier more: such words have a high frequency simply because they are common in the English language, not because they indicate spam, so their counts mostly add noise. Removing them has much less effect on Bernoulli naive Bayes, since it only records whether a word occurs at all, and these words occur in almost every email of both classes.
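To make the difference concrete, here is a minimal sketch using scikit-learn's CountVectorizer, MultinomialNB and BernoulliNB. The four emails and their labels are made up for illustration and are not the assignment data, so the output says nothing about real accuracy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy corpus (hypothetical, not the assignment data).
emails = [
    "win money now and win a prize",
    "cheap money offer and free prize",
    "meeting at noon and then lunch",
    "please review the report and then reply",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# stop_words='english' drops common words such as "and", "then", "the".
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

# The count matrix keeps frequencies: "win" occurs twice in the first email.
win_col = vectorizer.vocabulary_["win"]
win_count = X[0, win_col]
print("count of 'win' in email 0:", win_count)

# MultinomialNB uses these counts directly.
multinomial = MultinomialNB().fit(X, labels)

# BernoulliNB binarizes its input (binarize=0.0 by default), so it only
# sees whether each word is present or absent.
bernoulli = BernoulliNB().fit(X, labels)

new_email = vectorizer.transform(["free money prize"])
multinomial_pred = multinomial.predict(new_email)[0]
bernoulli_pred = bernoulli.predict(new_email)[0]
print(multinomial_pred, bernoulli_pred)
```

On such a tiny example both classifiers agree; the difference only shows up on real data, where word frequencies vary a lot between emails.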
3)
4a)
With this method we print the 50 most common words in ham and in spam. If you look carefully at these lists, you can see that some words appear in both. If a word is more common in one class, for example spam, we don't want to use it as a stop word, since removing it might lead to poorer predictions. Instead we compare the two lists and keep the words that are common in both data sets, and use those as stop words. This is the reason for using stop words in the first place: "low-level" words that are equally common in both classes say little about whether an email is spam, so they should not have any impact on the results.
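The comparison of the two lists can be sketched as below. This is only an illustration: the helper top_words, the toy emails and n = 3 are assumptions made here to keep the example short (the report uses the 50 most common words), and the splitting on whitespace is simpler than what CountVectorizer does.

```python
from collections import Counter


def top_words(emails, n):
    """Return the n most frequent lowercase words across the emails."""
    counts = Counter(word for email in emails for word in email.lower().split())
    return [word for word, _ in counts.most_common(n)]


# Toy data (hypothetical, not the assignment corpus).
ham = ["see you at the meeting", "the report is on the desk"]
spam = ["win the prize now", "the offer is free", "free prize for you"]

n = 3  # the report uses 50; 3 keeps the toy output readable
common_ham = top_words(ham, n)
common_spam = top_words(spam, n)

# Words frequent in both classes carry little signal: stop-word candidates.
shared = sorted(set(common_ham) & set(common_spam))

# Words frequent in only one class are kept, since they help the classifier.
spam_only = sorted(set(common_spam) - set(common_ham))

print("stop-word candidates:", shared)
print("frequent in spam only:", spam_only)
```

A list like shared can then be passed to CountVectorizer through its stop_words parameter, which accepts a list of words as well as the string 'english'.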
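As you can see from the least common words for spam and for all data sets, most of them look like names or codes. It is therefore appropriate to remove words that occur only once, using the built-in parameter min_df. Note that min_df counts the number of emails a word appears in, and that min_df = 1 is the default and keeps every word, so removing words that appear in only one email requires min_df = 2.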
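The effect of min_df in scikit-learn's CountVectorizer can be checked with a small sketch. The three emails are made up for illustration; "xk42z" and "johanna" stand in for the codes and names mentioned above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy emails (hypothetical); "xk42z" and "johanna" each occur in one email only.
emails = [
    "free prize code xk42z",
    "free prize for johanna",
    "claim your free prize",
]

# min_df=1 is the default: every word is kept.
keep_all = CountVectorizer(min_df=1).fit(emails)

# min_df=2 keeps only words found in at least two emails.
drop_rare = CountVectorizer(min_df=2).fit(emails)

kept_default = sorted(keep_all.vocabulary_)
kept_min_df_2 = sorted(drop_rare.vocabulary_)
print(kept_default)
print(kept_min_df_2)
```

Since min_df is a document frequency, a word repeated many times inside a single email is still removed with min_df = 2.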