Emil Petterson S141550
Laust Dixen Munck S119059
Then each word's/character's share of the total number of words is calculated and added:
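A minimal sketch of this share calculation, assuming the text has already been split into a list of tokens. The function name `relative_frequencies` and the toy corpus are illustrative and not taken from the original notebook.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Return each token's share of the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}

# Hypothetical toy corpus; the real text is not shown here.
tokens = "the cat sat on the mat".split()
shares = relative_frequencies(tokens)
```

The same function works on characters if it is given a list of characters instead of words.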
Now a number of words is generated using the frequencies calculated above.
The number of words generated is set by the argument passed to the range() function.
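One way such frequency-based generation could look, assuming the frequencies are stored in a dictionary that maps each word to its share. `generate_words`, the `seed` parameter and the toy frequencies are assumptions made for this sketch.

```python
import random

def generate_words(frequencies, n_words, seed=None):
    """Draw n_words words at random, weighted by their frequency."""
    rng = random.Random(seed)
    words = list(frequencies)
    weights = [frequencies[word] for word in words]
    # The argument to range() controls how many words are generated.
    return [rng.choices(words, weights=weights, k=1)[0] for _ in range(n_words)]

# Hypothetical frequencies, for illustration only.
toy_frequencies = {"the": 0.5, "cat": 0.3, "sat": 0.2}
sample = generate_words(toy_frequencies, 10, seed=0)
```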
Generates a random text
Generates a random text based on the trigram model
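A sketch of trigram-based generation under simple assumptions: each next word is sampled in proportion to how often it followed the previous two words in the training tokens, and generation stops early if a two-word context was never seen. The function names and the toy token sequence are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_trigram_counts(tokens):
    """Map each two-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1
    return counts

def generate_trigram_text(counts, start, n_words, seed=None):
    """Extend the two-word start context by up to n_words sampled words."""
    rng = random.Random(seed)
    output = list(start)
    for _ in range(n_words):
        followers = counts.get(tuple(output[-2:]))
        if not followers:
            break  # unseen context: nothing to sample from
        words = list(followers)
        weights = [followers[word] for word in words]
        output.append(rng.choices(words, weights=weights, k=1)[0])
    return output

# Hypothetical toy data, for illustration only.
toy_tokens = "a b c a b d a b c".split()
trigram_counts = build_trigram_counts(toy_tokens)
generated = generate_trigram_text(trigram_counts, ("a", "b"), 5, seed=1)
```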
1.2 0-probability fix
Laplace (add-one) smoothing fixes the 0-probability problem: it adds one to the count of every token found in either the test set or the training set, so no token is left with a count of zero. This can be seen in the formula below, which is an add-one smoothing function, where w corresponds to each word, V to the vocabulary and C to the text document.
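The formula itself is not reproduced here. A standard form of add-one smoothing for the unigram case, written with the symbols defined above and with count(w, C) assumed to denote the number of occurrences of w in C, would be:

$$
P_{\text{add-1}}(w) = \frac{\operatorname{count}(w, C) + 1}{\sum_{w' \in V} \operatorname{count}(w', C) + |V|}
$$

Every word in V receives one extra count, so the denominator grows by |V| and the probabilities still sum to one.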
1.3 Highest probability
The lambdas in the above function are ordered trigram, bigram, then unigram. Running the function with different lambda values suggests that it yields the highest probabilities when the unigram is weighted highest.
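The function referred to above is not shown, so the following is only a sketch of simple linear interpolation with the same lambda order (trigram, bigram, unigram). The function name and the example probabilities are assumptions.

```python
def interpolated_probability(p_trigram, p_bigram, p_unigram, lambdas):
    """Combine three n-gram estimates; lambdas = (trigram, bigram, unigram)."""
    l_tri, l_bi, l_uni = lambdas
    if abs(l_tri + l_bi + l_uni - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    return l_tri * p_trigram + l_bi * p_bigram + l_uni * p_unigram

# Hypothetical estimates: the trigram was never seen, so its estimate is 0.
p_unigram_heavy = interpolated_probability(0.0, 0.1, 0.2, (0.1, 0.2, 0.7))
p_trigram_heavy = interpolated_probability(0.0, 0.1, 0.2, (0.7, 0.2, 0.1))
```

With these made-up numbers the unigram-heavy weighting gives the larger value, because the trigram estimate contributes nothing.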
Now we have the following information which we need later:
Removing unwanted characters
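Which characters count as unwanted is not stated. The sketch below assumes that everything except ASCII letters and whitespace is removed and that the text is lowercased; `clean_text` is a hypothetical helper.

```python
import re

def clean_text(text):
    """Lowercase the text, replace non-letter characters, and tokenize."""
    text = text.lower()
    # Assumption: anything that is not a-z or whitespace is unwanted.
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()

cleaned = clean_text("Great film, great ENDING!")
```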
Removing words from the test corpus that do not exist in either of the training sets:
Removing words from the ent and bor text corpora that do not exist in the test set, and implementing add-1 smoothing:
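The two filtering steps can be sketched as follows, assuming each corpus is a list of tokens. The toy sentences and the helper `filter_to_vocabulary` are illustrative; only the corpus names ent and bor come from the text.

```python
def filter_to_vocabulary(tokens, vocabulary):
    """Keep only the tokens that appear in the given vocabulary."""
    return [token for token in tokens if token in vocabulary]

# Hypothetical toy corpora, for illustration only.
ent_tokens = "fun film great ending".split()
bor_tokens = "dull film slow ending".split()
test_tokens = "great film weird ending".split()

# Step 1: drop test words that occur in neither training set.
training_vocab = set(ent_tokens) | set(bor_tokens)
test_kept = filter_to_vocabulary(test_tokens, training_vocab)

# Step 2: drop training words that do not occur in the filtered test set.
test_vocab = set(test_kept)
ent_kept = filter_to_vocabulary(ent_tokens, test_vocab)
bor_kept = filter_to_vocabulary(bor_tokens, test_vocab)
```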
As can be seen below for two example words, 'film' and 'ending', we have calculated the likelihoods using Eq. 4.14 from SLP-3 book 4, p.68, with add-1 smoothing.
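A small sketch of an add-1 smoothed likelihood, (count(w, c) + 1) / (total tokens in c + |V|), assuming every token of the class corpus is part of the vocabulary. The counts here are made up and will not match the values in the original notebook.

```python
from collections import Counter

def add_one_likelihood(word, class_tokens, vocabulary):
    """Add-1 smoothed estimate of P(word | class)."""
    counts = Counter(class_tokens)
    return (counts[word] + 1) / (len(class_tokens) + len(vocabulary))

# Hypothetical class corpus and vocabulary, for illustration only.
toy_class_tokens = ["film", "great", "ending", "film"]
toy_vocabulary = {"film", "great", "ending"}

p_film = add_one_likelihood("film", toy_class_tokens, toy_vocabulary)
p_ending = add_one_likelihood("ending", toy_class_tokens, toy_vocabulary)
```

Here 'film' occurs twice among four tokens with a vocabulary of three words, giving (2 + 1) / (4 + 3) = 3/7, and 'ending' gives (1 + 1) / (4 + 3) = 2/7.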
To determine which class the test sentence belongs to, we use Eq. 4.9 from SLP-3 book 4, p.68.
The same logic is implemented in the code below:
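The original code is not included here. As a hedged sketch of naive Bayes classification, the function below scores each class by its prior times the add-1 smoothed word likelihoods and picks the highest score. It sums logarithms instead of multiplying probabilities to avoid underflow, which does not change the winning class. The priors, the toy corpora and the name `classify` are assumptions.

```python
import math
from collections import Counter

def classify(test_tokens, class_tokens, priors, vocabulary):
    """Return the best class label and the log score of every class."""
    scores = {}
    for label, tokens in class_tokens.items():
        counts = Counter(tokens)
        denominator = len(tokens) + len(vocabulary)
        score = math.log(priors[label])
        for word in test_tokens:
            if word in vocabulary:  # words outside the vocabulary are skipped
                score += math.log((counts[word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical data, for illustration only.
toy_corpora = {
    "ent": ["great", "film", "great", "ending"],
    "bor": ["dull", "film", "slow", "ending"],
}
toy_vocab = {"great", "film", "ending", "dull", "slow"}
toy_priors = {"ent": 0.5, "bor": 0.5}

label, scores = classify(["great", "film"], toy_corpora, toy_priors, toy_vocab)
```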