In this article we are going to discuss some of the new features available in the latest releases of spaCy and NumPy, which you can test out in Deepnote!
spaCy is my go-to library for all things natural language processing. It can handle every stage of an NLP workflow, from pre-processing all the way up to full natural language understanding systems. Crucially, spaCy is designed for production. That means if you're looking to build a system that can scale and serve users, spaCy might be right up your alley.
spaCy summarizes pretty well what it's able to do, so I will attach that chart here.
In their own words:
spaCy v3.0 features all new transformer-based pipelines that bring spaCy’s accuracy right up to the current state-of-the-art. You can use any pretrained transformer to train your own pipelines, and even share one transformer between multiple components with multi-task learning. Training is now fully configurable and extensible, and you can define your own custom models using PyTorch, TensorFlow and other frameworks. The new spaCy projects system lets you describe whole end-to-end workflows in a single file, giving you an easy path from prototype to production, and making it easy to clone and adapt best-practice projects for your own use cases.
Addition of Transformers
Likely the biggest change in the v3 release is the addition of transformers. spaCy now has access to transformer-architecture models to perform tasks like Named Entity Recognition.
The new models can be used in the same way as in previous versions. To use the new transformer model, you will need to download it first.
Notice how "Madeupville" was labeled as a Geopolitical Entity. As you may have figured out from the name, it isn't a real place, so the model can't know anything about it from the word itself; it must rely entirely on context.
The transformer pipeline uses the roberta-base model in the background, courtesy of Hugging Face (more on them later), and you can read more about the release here.
NumPy is pretty fundamental when it comes to working with numbers in Python. It is used in nearly every scientific Python library, and it offers plenty of simple and efficient ways of working with arrays and matrices. NumPy also keeps your code simple and fast because, in the background, it uses precompiled C code, so you get the simplicity of coding in Python with the speed of C. Oftentimes you'll never consciously decide to work with NumPy because it was already a requirement of your toolset to begin with.
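As a quick illustration of that C-backed speed, a single vectorized expression over a NumPy array replaces what would otherwise be an explicit Python loop:

```python
import numpy as np

values = np.arange(1_000_000, dtype=np.float64)

# One vectorized expression: the elementwise work runs in
# precompiled C rather than a Python-level for loop.
squared = values ** 2

print(squared[:3])
```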
As said previously, anytime you're working with numbers or vectors you'll end up working with NumPy. It is required by transformers, torch, spaCy, SciPy, scikit-learn, pandas, Matplotlib, Keras, and TensorFlow, just to name a few. If you're doing anything scientific in Python, NumPy will be there.
In their own words:
This NumPy release is the largest so far, some 684 PRs contributed by 184 people have been merged. See the list of highlights below for more details. The Python versions supported for this release are 3.7-3.9, support for Python 3.6 has been dropped. Highlights are:
- Annotations for NumPy functions. This work is ongoing and improvements can be expected pending feedback from users.
- Wider use of SIMD to increase execution speed of ufuncs. Much work has been done in introducing universal functions that will ease use of modern features across different hardware platforms. This work is ongoing.
- Preliminary work in changing the dtype and casting implementations in order to provide an easier path to extending dtypes. This work is ongoing but enough has been done to allow experimentation and feedback.
- Extensive documentation improvements comprising some 185 PR merges. This work is ongoing and part of the larger project to improve NumPy’s online presence and usefulness to new users.
- Further cleanups related to removing Python 2.7. This improves code readability and removes technical debt.
- Preliminary support for the upcoming Cython 3.0
NumPy has a few new functions out with the latest release!
One such new function is numpy.broadcast_shapes. If you don't know about array broadcasting, feel free to read more about it here, but the new function tells you the resulting array shape when two shape tuples are broadcast against each other.
From NumPy's own examples:
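A short sketch along the lines of the examples in the NumPy docs: you pass shape tuples, and `broadcast_shapes` returns the shape the broadcast result would have (it also accepts more than two shapes). This function was added in NumPy 1.20.

```python
import numpy as np

# The shape you'd get broadcasting a (3, 1) array against a (1, 4) array:
print(np.broadcast_shapes((3, 1), (1, 4)))  # (3, 4)

# It also works across more than two shapes at once:
print(np.broadcast_shapes((6, 7), (5, 6, 1), (7,)))  # (5, 6, 7)
```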
Hugging Face is a lot of things, but in summary they are a company with some of the most bleeding-edge resources for NLP, with a heavy emphasis on transformers and deep interoperability with TensorFlow and PyTorch.
If you are looking to use transformers, then Hugging Face is a one-stop shop. There are plenty of pretrained models to help you get state-of-the-art results with a very low barrier to entry. The transformers library is incredibly easy to use and meshes nicely with the Hugging Face model hub, which we will discuss in a minute.
Hugging Face recently revamped their models website, making it easier than ever to test and download a variety of models designed for different tasks. You can filter by task, dataset, and language, then sort by most popular to find a model that suits your needs.
As an example of what you can do with these models, one of the most exciting things to try out is zero-shot classification. Zero-shot classification is when a pretrained model is used to classify text against class names it has never been trained on. Some great explanations can be found here.
Zero Shot Classification
Swap out the text and the labels for yourself! The zero-shot classification task will try to classify the text even though the model was never trained on those labels.
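Here's a minimal sketch using the transformers pipeline (the example text and candidate labels are placeholders; swap in your own):

```python
from transformers import pipeline

# The pipeline downloads a default zero-shot model from the
# Hugging Face model hub on first use.
classifier = pipeline("zero-shot-classification")

result = classifier(
    "The new graphics card renders games at a much higher frame rate.",
    candidate_labels=["technology", "cooking", "politics"],
)

# Labels come back ordered from most to least likely, with a score each.
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```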