Welcome to my Notebook!
I love the transformers library. It is by far the easiest way to get started using transformers for NLP, which is currently the bleeding edge of the field.
The first step is grabbing the model and the tokenizer from the transformers library.
Don't worry if the following cell takes some time! The model just needs a minute to download.
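In case it helps to see that step spelled out, the loading cell looks roughly like this (a minimal sketch assuming the original GPT checkpoint, "openai-gpt"):

```python
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

# Download (and cache) the pretrained GPT tokenizer and language-model head
tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
```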
Note: GPT-2 is a newer transformer model for text generation, but the library currently only has support for the original GPT in this text-generation workflow.
^^ This is the fun part! Go wild! Pick the prompt text!
Since transformers supplies a tokenizer for the GPT model, this was a much easier solution than using the Universal Sentence Encoder. The tensors are returned as PyTorch tensors rather than TensorFlow because I find I run into fewer bugs that way (possibly because the library was originally written for PyTorch).
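Concretely, the encoding step is just one line (prompt_text here stands in for whatever prompt you picked above, and tokenizer comes from the loading cell earlier):

```python
# Encode the prompt into token ids, returned as a PyTorch tensor ("pt")
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")
```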
Let's get some sequences!
There is a lot to unpack in that cell. Credit where credit is due: the library has fantastic documentation, so let's unpack each argument (there is also a sketch of the full call after this list).
input_ids - the tokenized prompt, i.e. what the model will continue from.
do_sample - setting this to True makes the model sample from the word probabilities instead of just greedily picking the most likely word at every step.
max_length - how long do we want the generated sequence to be, at most?
temperature - I find this one really interesting. It is a measure of how risky the model will be when picking words: lower values play it safe, higher values take more chances. Feel free to tweak this!
top_k and top_p are similar in that they limit the number of words the model considers when decoding before randomly sampling from the word probabilities: top_k keeps only the k most likely words, while top_p keeps the smallest set of words whose cumulative probability exceeds p.
repetition_penalty is meant to avoid sentences that repeat themselves without saying anything really interesting.
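Putting those arguments together, the generation call looks roughly like the following. This is a sketch using the model, tokenizer, and input_ids from the cells above; the specific values here are illustrative, not necessarily the ones I used.

```python
# Sample a few continuations of the prompt (uses model, tokenizer, and
# input_ids from the earlier cells; the argument values are just examples)
sample_outputs = model.generate(
    input_ids,
    do_sample=True,          # sample instead of greedy decoding
    max_length=50,           # cap on the generated sequence length
    temperature=0.8,         # <1.0 is more conservative, >1.0 is riskier
    top_k=50,                # keep only the 50 most likely words
    top_p=0.95,              # nucleus sampling: keep words up to 95% cumulative probability
    repetition_penalty=1.2,  # discourage repeating the same phrases
    num_return_sequences=3,  # ask for three different samples
)

for i, output in enumerate(sample_outputs):
    print(i, tokenizer.decode(output, skip_special_tokens=True))
```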
Kudos to both the docs (https://huggingface.co/transformers/main_classes/model.html?highlight=tfpretrained#transformers.TFPreTrainedModel) and this excellent Medium post for helping me understand the nitty-gritty here.
And tada! We have predicted text. However, that is not the only way to generate text. By far the simplest way is with a pipeline.
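For instance, a minimal pipeline version might look like this (the model name and prompt here are just examples; leaving the model out falls back to the pipeline's default):

```python
from transformers import pipeline

# The pipeline bundles tokenizer, model, and the decoding loop into one call
generator = pipeline("text-generation", model="openai-gpt")
print(generator("My prompt text goes here", max_length=50, num_return_sequences=1))
```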
Another great tool in language modeling is the Universal Sentence Encoder. It will take any sentence and return a 512-dimensional vector encoding it. Then you can perform operations like cosine similarity to see how similar two sentences are.
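Here is a minimal sketch of that workflow, assuming the TF Hub module URL for version 4 of the encoder and some made-up example sentences (the actual sentences used below may differ):

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub (downloads on first use)
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Hypothetical example sentences: the first two are related, the third is not
sentences = [
    "The cat sat on the mat.",
    "A cat is resting on a rug.",
    "Stock markets fell sharply on Monday.",
]
embeddings = embed(sentences).numpy()  # shape (3, 512)

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings[0], embeddings[1]))  # related pair: highest score
print(cosine_similarity(embeddings[0], embeddings[2]))  # unrelated pairs: lower scores
print(cosine_similarity(embeddings[1], embeddings[2]))
```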
And out pop our embeddings. Let's check if our first 2 sentences are more closely related than the last 2.
The first two sentences are indeed more closely related than any other combination, beating the other two pairings by factors of roughly 2 and 20 respectively.
Given a dataset to train on and some text prompt, a model like this could use the Universal Sentence Encoder embeddings to generate new text. However, pretrained models are more efficient to use and the results are much stronger, so the transformers library is the preferable option.
References and Credit:
https://github.com/ageron/handson-ml2 - Great textbook with tons of useful code samples for nearly any machine learning project
https://huggingface.co/transformers/ - Previously mentioned but still a great resource for everything NLP
Conclusion
Thanks for reading my code all the way through. If you have any questions regarding the code included within, feel free to reach out to me at josh.zwiebel@uwaterloo.ca
- Josh Zwiebel