NLP Assignment 2 Report
The algorithm used here is an LSTM-based neural network, which is a state-of-the-art approach for language modelling tasks nowadays. The given corpus consisted of ~50,000 sentences, which had earlier been split in the ratio 7:2:1, so that
train.txt consists of 35,000 sentences,
valid.txt consists of 10,000 sentences and
test.txt consists of 5,000 sentences.
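A 7:2:1 split of this kind can be sketched as follows. This is only an illustration: the function name split_corpus, the shuffling step and the seed are assumptions, since the report does not say how the original split was produced.

```python
import random

def split_corpus(sentences, ratios=(7, 2, 1), seed=0):
    """Shuffle sentences and split them into train/valid/test by integer ratios.

    Hypothetical helper: the shuffle and the fixed seed are assumptions,
    not a description of how the files in this report were created.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)

    total = sum(ratios)
    n = len(shuffled)
    n_train = n * ratios[0] // total
    n_valid = n * ratios[1] // total

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test
```

For a corpus of exactly 50,000 sentences, this yields 35,000, 10,000 and 5,000 sentences respectively.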
The model was trained for various numbers of epochs, but due to some issues with Google Colab, not all of the resulting models could be stored. Further, when calculating the perplexity of the sentences, loading all the sentences caused the notebook to crash, so in the perplexity files only the first 30,000 tokens have been considered. Note that all the sentences in train.txt were used to train the model.
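For reference, perplexity over a truncated token stream can be computed as the exponential of the average negative log-probability per token. The sketch below assumes that the model has already produced one natural-log probability per token; the function name perplexity and the max_tokens parameter are hypothetical and are not taken from the notebook.

```python
import math

def perplexity(token_log_probs, max_tokens=None):
    """Return exp(mean negative log-probability) over the given tokens.

    token_log_probs: natural-log probabilities, one per token (assumed input).
    max_tokens: optional cap, e.g. 30000, to evaluate only the first tokens.
    """
    if max_tokens is not None:
        token_log_probs = token_log_probs[:max_tokens]
    if not token_log_probs:
        raise ValueError("need at least one token to compute perplexity")

    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)
```

A model that assigns probability 1/4 to every token has a perplexity of about 4, regardless of how many tokens are evaluated.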
Handling Unknown Words: Unknown words were handled by using the <unk> tag for all words in the training set whose count was less than or equal to 3. Then, while calculating the perplexity of the testing/validation datasets, an out-of-vocabulary word was mapped to the lemma as well. For handling sentence boundaries, the <sent> tag was used.
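The count threshold described above can be sketched as follows. The helper names build_vocab and replace_unknowns are hypothetical, and the sketch assumes that out-of-vocabulary words at evaluation time are mapped to the same <unk> tag, which the text does not state explicitly.

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(train_sentences, max_rare_count=3):
    """Keep only words that occur more than max_rare_count times in training."""
    counts = Counter(tok for sent in train_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c > max_rare_count}

def replace_unknowns(sentence, vocab):
    """Map every token outside the vocabulary to the <unk> tag (assumed behaviour)."""
    return [tok if tok in vocab else UNK for tok in sentence]
```

Because rare training words are also replaced by <unk>, the model sees the tag during training and can assign it a sensible probability at test time.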
LM1-LM4 are different LSTM-based LMs created during different epochs.
As we can see, the perplexities of Train, Test and Validate are of similar order. On the other hand, in the case of the statistical models with Kneser-Ney and Witten Bell smoothing, the perplexities were really low for the training data, but high for the testing data.
Kneser-Ney Training Perplexity: 3.4081123363458143
Kneser-Ney Validation Perplexity: 16778.791735467417
Kneser-Ney Testing Perplexity: 26457.949957689852
Witten Bell Training Perplexity: 2.468052151268716
Witten Bell Validation Perplexity: 293.50395706835184
Witten Bell Testing Perplexity: 462.81717604751435
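To make the gap concrete, the reported values can be compared as ratios of validation/testing perplexity to training perplexity. This is just arithmetic on the numbers listed above; the names reported and generalization_gap are introduced here only for illustration.

```python
# Perplexities copied from the values reported above.
reported = {
    "Kneser-Ney": {
        "train": 3.4081123363458143,
        "valid": 16778.791735467417,
        "test": 26457.949957689852,
    },
    "Witten Bell": {
        "train": 2.468052151268716,
        "valid": 293.50395706835184,
        "test": 462.81717604751435,
    },
}

def generalization_gap(ppl):
    """Ratio of held-out perplexity to training perplexity for each split."""
    return {split: ppl[split] / ppl["train"] for split in ("valid", "test")}

gaps = {name: generalization_gap(ppl) for name, ppl in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: valid/train = {gap['valid']:.1f}, test/train = {gap['test']:.1f}")
```

For both smoothing methods the held-out perplexities are orders of magnitude above the training perplexity, and the ratios for Kneser-Ney are far larger than those for Witten Bell.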
Thus, as we can see, neural models learn from semantic information (by relating similar words to similar vectors), rather than just from the counts of which word occurs after which, as is the case in statistical models. Neural models therefore give comparable performance on training, testing and validation data, as they do not memorize which words occur after which, but rather correlate the prediction with the vector representation learnt through embeddings. We can even see worse training perplexities in the neural LMs compared to validation and testing, as the model is not learning just from the counts in the training data.
We can also see that statistical models might perform really well when there are few or no "unseen" n-grams and the unknown n-grams are handled well. It appears that Witten Bell works really well, but that might be due to an error on my part in penalizing it less for unknown words and unknown contexts, as Kneser-Ney seems to be a better estimate of what we would expect from a statistical model. With Kneser-Ney, the perplexities on the Brown Corpus are really small (single digit) for training, as it KNOWS for the training data which word will most likely occur based on its counts. However, it gives perplexities of ~16k and ~26k for validation and testing, since it has learnt from counts and not from embeddings. Even with Witten Bell, where the perplexities for testing and validation data are comparatively small, there is still a huge difference between the training perplexity and the testing/validation perplexities.
Thus, the basic difference between statistical and neural models is that statistical models work really well on training data, but perform poorly on testing/validation sets. Neural models, on the other hand, give similar performance across all types of data.