In this notebook, we'll be training spaCy to identify `FOOD` entities in a body of text, a task known as named-entity recognition (NER). If all goes well, we should be able to identify the foods in sentences like the following:
- I got some `chocolate ice cream` as a little treat for myself.
- I had `chips` for lunch today.
- I bought `cheese` from Tesco yesterday.
spaCy has a reported NER accuracy of 85.85%, so something in that range would be nice for our `FOOD` entities.
We'll use the following approach: generate sentences containing our `FOOD` entities, mix in revision sentences containing the entities spaCy already knows, then train on both and evaluate. Stir until good enough.
See the Evaluation and Results section for a full breakdown.
We'll be using food data from the USDA's Branded Foods dataset.
You can see that the `description` column has the names of the foods we're interested in.
We have way more rows than we need, so let's trim things down: keep only descriptions of three words or fewer, lowercase everything, and drop anything unhelpful (duplicates, special characters, and so on).
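Concretely, the trimming might look like this pandas sketch (the `branded_food.csv` filename is an assumption about how the download is named; the `description` column is the one shown above):

```python
import pandas as pd

# "branded_food.csv" is a hypothetical filename for the Branded Foods download.
foods = pd.read_csv("branded_food.csv", usecols=["description"])

# Lowercase, strip, and deduplicate the food names.
foods["description"] = foods["description"].str.lower().str.strip()
foods = foods.drop_duplicates(subset="description")

# Keep descriptions of at most three words, made of letters and spaces only.
word_counts = foods["description"].str.split().str.len()
foods = foods[(word_counts <= 3) & foods["description"].str.match(r"^[a-z ]+$", na=False)]
```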
Now, we need to think about how we want our data to be distributed. By reducing to 3-worded food items, we effectively have food entities that look like this:
| Food | Length |
| --- | --- |
| cheese | 1-worded |
| grilled cheese | 2-worded |
| chocolate ice cream | 3-worded |
When feeding our training data into spaCy, we should think about the biases we want spaCy to avoid. Because the majority of our food entities are multi-worded, spaCy could develop a bias toward multi-worded foods. Looking back at our example, it's not a big deal if spaCy identifies `cheese` instead of `grilled cheese`. It is a big deal if spaCy fails to identify `cheese` at all.
As an aside, I ran this experiment, and spaCy had only a 10% accuracy when classifying single-worded `FOOD` entities, failing even on common everyday foods.
So let's filter the dataset further, such that 45% are one-worded foods, 30% are two-worded foods, and 25% are three-worded foods.
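A sketch of that resampling, reusing the `foods` DataFrame from the earlier sketch, with a hypothetical total sample size:

```python
# Target distribution: 45% one-word, 30% two-word, 25% three-word foods.
TOTAL = 1000  # hypothetical sample size
targets = {1: 0.45, 2: 0.30, 3: 0.25}

samples = []
for n_words, fraction in targets.items():
    subset = foods[foods["description"].str.split().str.len() == n_words]
    samples.append(subset.sample(n=int(TOTAL * fraction), random_state=0))

foods = pd.concat(samples).reset_index(drop=True)
```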
At this point, we want to create template sentences with placeholders that we can insert our food entities into. I'll come back and update these once I'm testing the project with real users.
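For illustration, the templates might look something like this (the sentences themselves are placeholders I've made up):

```python
# Hypothetical templates; "{}" marks where a food name gets inserted.
food_templates = [
    "I ate my {} for breakfast.",
    "We ordered {} and it was delicious.",
    "Is {} healthy?",
]
```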
We'll break up our food sentences (which contain our entities) into a training set and a test set. We also need the data to be in a specific format for training:
data = [ ("I love chicken", [(8, 13, "FOOD")]), ... ]
Nice, we now have ~500 training sentences, each containing one, two, or three `FOOD` entities. We also have plenty of test data; it doesn't matter too much that it's not evenly distributed.
As mentioned in the overview, we also need to generate sentences that contain the entities spaCy can already recognise. This helps us avoid the situation where the NER model learns to identify `FOOD` entities but forgets how to classify entities like `PERSON`, a problem known as catastrophic forgetting. While `PERSON` isn't important for nutrition-tracking, other entities like `CARDINAL` will help us associate foods with their quantities later on, e.g. the quantity in a phrase like "2 slices of toast".
We'll keep sentences of a similar length to our generated food sentences.
This takes a while. Unfortunately, we need a lot of sentences because the entities we'll be able to identify aren't evenly distributed, as we'll see later.
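The generation itself is straightforward; here's a minimal sketch, assuming spaCy's small English model and a `sentences` list of raw strings:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assuming the small pretrained English model

revision_data = []
for doc in nlp.pipe(sentences):  # nlp.pipe is much faster than calling nlp() per sentence
    entities = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    revision_data.append((doc.text, entities))
```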
In the previous step, we filtered out sentences that were too short or too long, and used spaCy to predict the entities in the filtered sentences. When splitting the train and test data, we'll ensure that the revision training data has at least 100 examples of each entity type.
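One way to check that threshold, assuming the same `(text, entities)` tuple format as before (`revision_train` is a hypothetical name for the split):

```python
from collections import Counter

def label_counts(dataset):
    """Count annotated spans per entity type across a dataset."""
    counts = Counter()
    for _, entities in dataset:
        counts.update(label for _, _, label in entities)
    return counts

# Re-draw the train/test split until every entity type clears the bar.
assert all(count >= 100 for count in label_counts(revision_train).values())
```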
Here are the entities and their counts that were captured in our revision training sentences:
Here are the entities and counts captured in our revision test sentences. This just shows that our initial sentences had a large number of examples for `PERSON`, but very few for some of the rarer entity types.
For every food sentence, I include a fixed number of revision sentences. I haven't actually seen guidance on what this ratio should be, so this is one of those "stir until good enough" moments.
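For reference, the fine-tuning step looks roughly like the following. This is a minimal sketch assuming spaCy v2's training API; the epoch count and dropout are guesses rather than tuned values:

```python
import random
import spacy
from spacy.util import minibatch, compounding

nlp = spacy.load("en_core_web_sm")
ner = nlp.get_pipe("ner")
ner.add_label("FOOD")  # register our new entity type

# Only update the NER weights; leave the other pipes untouched.
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.resume_training()
    for epoch in range(30):  # epoch count is a guess; tune as needed
        random.shuffle(train_data)  # food + revision sentences, shuffled together
        losses = {}
        for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
            texts, spans = zip(*batch)
            # nlp.update expects {"entities": [...]} dicts for the annotations.
            annotations = [{"entities": list(ents)} for ents in spans]
            nlp.update(texts, annotations, drop=0.35, sgd=optimizer, losses=losses)
        print(f"Epoch {epoch}: {losses}")
```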
Initial results seem pretty good; let's evaluate on a wider scale.
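Here, the wider evaluation is just exact-match accuracy over the held-out food sentences; a sketch, assuming the fine-tuned `nlp` from the training sketch above:

```python
def food_accuracy(test_data) -> float:
    """Fraction of sentences where every FOOD span is predicted exactly."""
    correct = 0
    for text, annotations in test_data:
        predicted = {(ent.start_char, ent.end_char)
                     for ent in nlp(text).ents if ent.label_ == "FOOD"}
        expected = {(start, end) for start, end, label in annotations if label == "FOOD"}
        correct += predicted == expected
    return correct / len(test_data)
```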
These results are really positive. We're stumbling a little on `1_worded_foods` accuracy, though that's potentially because we had more testing data for that category. Perhaps with more test examples for `three_worded_foods`, we'd also see its accuracy trend toward ~91%.
These results are a little harder to interpret. After all, we're testing against entities that the original spaCy model predicted for us. Those predicted entities may well be wrong, since spaCy's accuracy is around 86%. If 14% of the entities we're using to verify the accuracy of our new model are wrong, where does that leave us?
A better comparison would be to load spaCy's original model, use it to predict against this same test set, and compare its accuracy to our 71%. We could then use that as a benchmark for measuring how much introducing `FOOD` entities deteriorates the model.
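That benchmark would look something like this sketch (untested here, per the note below; `revision_test` is the hypothetical name of our revision test set):

```python
stock_nlp = spacy.load("en_core_web_sm")  # the unmodified model

def entity_accuracy(model, test_data) -> float:
    """Exact-match accuracy over all annotated entity spans."""
    correct = total = 0
    for text, annotations in test_data:
        predicted = {(ent.start_char, ent.end_char, ent.label_) for ent in model(text).ents}
        for span in annotations:
            total += 1
            correct += tuple(span) in predicted
    return correct / total

print(entity_accuracy(stock_nlp, revision_test))  # the benchmark
print(entity_accuracy(nlp, revision_test))        # our fine-tuned model
```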
Note: the online notebook I used keeps crashing when I attempt this, so I'll do it on my local machine.
The results we arrived at for our `FOOD` entities are as follows:

| Test set | Accuracy |
| --- | --- |
| Sentences with one food | 91.44% |
| Sentences with two foods | 94.76% |
| Sentences with three foods | 96.88% |
The results for our existing entities:
I'm pretty happy with the accuracy of the `FOOD` entities, though it'd be worthwhile increasing the number of test sentences. For the revision entities, I'll first need to test the accuracy with spaCy's original language model. If the results are still poor, I'll experiment with different revision datasets and adjust the ratio of food to revision sentences.
Hope you've enjoyed reading this as much as I've enjoyed creating it! This is an honest look into how I problem-solve. If you want to see what I'm up to in the future, you can find me on Twitter, where I'll keep posting personal changelogs of how I'm doing.