Building a Wine Recommendation System With a Twist
Wine lends itself well to data science projects for several reasons: data is readily available, each wine comes with many features, and plenty of people love the stuff. As a result there is no shortage of brilliant wine-based projects on display, so for this project I wanted to take a slightly different approach to building a recommendation model. The approach I landed on was to build an NLP model that calculates the similarity between wines based on the reviews of expert sommeliers.
Summary
Imports and Data Cleaning
Missing Data
Now that the predictor matrix has been created, let's tackle the missing data. For the first iteration of this recommender system, I will drop every observation with a missing value in any column rather than being more selective. This cuts the available data in half. Further iterations of this project should try a different, less brutal approach to missing data.
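As a rough illustration of the blanket drop described above, here is a minimal sketch on a toy DataFrame. The frame, its values and the `description` column name are assumptions made for the example, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the predictor matrix; values and column names are illustrative.
wines = pd.DataFrame({
    "winery": ["Alpha", "Beta", None, "Delta"],
    "designation": ["Riserva", None, "Classico", "Brut"],
    "description": ["Ripe cherry.", "Dry and crisp.", "Earthy.", "Toasty."],
})

rows_before = len(wines)

# Drop every row that has a missing value in any column.
wines_clean = wines.dropna().reset_index(drop=True)
rows_after = len(wines_clean)

print(f"kept {rows_after} of {rows_before} rows")
```

A more selective version could pass `subset=[...]` to `dropna` so that only the columns the model actually uses decide whether a row is kept.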
Feature Engineering Unique Names for Wines
Let's create a more detailed name for each wine by combining 'winery' and 'designation'.
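One way to do this in pandas is sketched below. The `winery` and `designation` columns come from the text above; the new column name `name`, the space separator and the sample values are assumptions.

```python
import pandas as pd

# Toy data; only the two column names are taken from the text.
wines = pd.DataFrame({
    "winery": ["Alpha", "Beta "],
    "designation": ["Riserva", "Brut"],
})

# Strip stray whitespace, then join winery and designation with a single space.
wines["name"] = wines["winery"].str.strip() + " " + wines["designation"].str.strip()

print(wines["name"].tolist())
```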
Removing Duplicate Values
For the end user to receive recommendations with this model and approach, each wine needs a unique name. With so many duplicates this becomes tricky. For now I will drop the duplicated wines, which significantly reduces the volume of data but lets the project proceed. There are certainly better ways to handle this, but for the sake of moving the project along I am taking the quicker route.
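A minimal sketch of the quick route, assuming the combined name lives in a column called `name` (a hypothetical column name) and that keeping the first occurrence of each wine is acceptable:

```python
import pandas as pd

# Toy data with one duplicated wine name.
wines = pd.DataFrame({
    "name": ["Alpha Riserva", "Alpha Riserva", "Beta Brut"],
    "description": ["Ripe cherry.", "Dark plum.", "Toasty."],
})

# Keep only the first row for each name so every wine name is unique.
unique_wines = wines.drop_duplicates(subset="name", keep="first").reset_index(drop=True)

print(len(wines), "->", len(unique_wines))
```

A gentler alternative would be to merge the reviews of duplicated wines instead of discarding them, at the cost of extra processing.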
Reducing Scope of Data
At this point in the project I had successfully created the recommendation function. However, when I tried to deploy the model using Streamlit I received endless connection timeout errors. After many, many hours of digging and researching I came to the conclusion that the issue must lie with the connection to the AWS S3 bucket hosting my model.pkl file and predictors.csv. The files were simply too large to be loaded into memory and cached on each run of the deployment script.
The solution was to reduce the scope of the project to the wines of one country instead of all countries, which reduces the size of the files involved. As I am a big fan of Italian wines I decided to put this bias to good use and make this an Italian wine recommendation project.
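Narrowing the data to one country is a simple boolean filter. The sketch below assumes a `country` column holding the value "Italy"; both are assumptions about the dataset's layout.

```python
import pandas as pd

# Toy data; the "country" column and its values are illustrative.
wines = pd.DataFrame({
    "name": ["Alpha Riserva", "Beta Brut", "Gamma Bianco"],
    "country": ["Italy", "France", "Italy"],
})

# Keep only the Italian wines and renumber the rows from zero.
italian_wines = wines[wines["country"] == "Italy"].reset_index(drop=True)

print(italian_wines["name"].tolist())
```

Resetting the index matters later on, because the recommender looks wines up by their row position in the predictor matrix.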
Creating a Search for Wine Feature
One of the features I wanted to add to the deployed project was the ability to click a button and immediately search for the wine being recommended. Having looked around for a wine website with a predictable URL syntax for searches, I landed on www.wine-searcher.com, which simply appends the search terms to the URL, joined by + symbols, alongside the name of the country. Knowing this, I could create a search URL for each wine in the dataset.
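The URL construction could look something like the sketch below. The `/find/` path and the helper name `build_search_url` are assumptions made for illustration; check the exact pattern against the live site before relying on it.

```python
from urllib.parse import quote

# Assumed base path for searches; verify against the site itself.
BASE_URL = "https://www.wine-searcher.com/find/"


def build_search_url(wine_name: str, country: str = "Italy") -> str:
    """Join the words of the wine name and the country with '+' symbols."""
    terms = wine_name.split() + country.split()
    # Percent-encode each term so accents and punctuation stay URL-safe.
    return BASE_URL + "+".join(quote(term) for term in terms)


print(build_search_url("Alpha Riserva"))
```

Applied to a whole column, this would be something like `wines["name"].apply(build_search_url)`, again assuming the wine names live in a column called `name`.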
Vectorizing With Tf-idf
Time to turn the sommelier reviews into vectors using Tf-idf so the model can work with them. The parameters were chosen after a few rounds of trial and error. I had a feeling that setting ngram_range to cover 2 or 3 words would work well, since much of the descriptive language in the reviews comes as bigrams and trigrams ("sweet berry", "forest floor", etc.). The regex pattern was also very much trial and error: I took a description as an example and plugged it into regexr.com.
Calculating Similarity Scores
The Recommender
The recommender function works by taking the sigmoid_kernel scores and mapping them against an index pandas Series, which is itself built from the index of the predictor matrix and the name of each wine. The result is that each wine is given a similarity score against every other wine in the predictor matrix. The scores are then sorted and the function returns the 3 wines with the highest similarity score to the wine passed in to the function.
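The similarity and lookup steps can be sketched end to end on toy data as follows. The wine names, reviews, vectorizer settings and the function name `recommend` are all illustrative assumptions; only the use of `sigmoid_kernel` and the name-to-index Series follows the description above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Toy predictor matrix; names and reviews are made up for the example.
wines = pd.DataFrame({
    "name": ["Alpha Riserva", "Beta Rosso", "Gamma Bianco",
             "Delta Brut", "Epsilon Classico"],
    "description": [
        "Sweet berry and forest floor with firm tannins.",
        "Forest floor, sweet berry and dried herbs.",
        "Crisp citrus, green apple and white flowers.",
        "Toasty brioche with lemon zest and fine bubbles.",
        "Sweet berry, leather and firm tannins.",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(wines["description"])

# Pairwise similarity between every pair of reviews (square matrix).
scores = sigmoid_kernel(tfidf_matrix, tfidf_matrix)

# Series mapping each wine name to its row position in the predictor matrix.
indices = pd.Series(wines.index, index=wines["name"])


def recommend(name, n=3):
    """Return the n wine names most similar to the given wine."""
    idx = indices[name]
    # Label the wine's row of scores by name and leave out the wine itself.
    ranked = pd.Series(scores[idx], index=wines["name"]).drop(name)
    return ranked.sort_values(ascending=False).head(n).index.tolist()


recommendations = recommend("Alpha Riserva")
print(recommendations)
```

Dropping the queried wine before sorting keeps it from being recommended to itself, since a wine always scores highest against its own review.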
Exporting the Model
Adapting for Streamlit
Adapting the above code for Streamlit was a case of spending a few days reading through the documentation and familiarising myself with the decorators and Streamlit syntax. Once I had accomplished this, the next step was to move the code out of the notebook format and rewrite it, using the Streamlit decorators, to allow for user input and for loading the model files from an S3 bucket. I highly recommend Streamlit for anyone looking to deploy their models - very user-friendly syntax and intuitive to understand!