Building a Wine Recommendation System With a Twist

Wines seem to lend themselves very well to data science projects for a number of reasons: the availability of data, the number of features available per wine and a general love for the stuff are a few of the main driving factors. As such, there is no shortage of brilliant wine-based projects on display, and for this project I wanted to take a slightly different approach towards creating a recommendation model. The approach I landed on was to build an NLP model which calculates similarity between wines based on the reviews of expert sommeliers.

Summary

Imports and Data Cleaning

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, sigmoid_kernel
import pickle
import regex

df = pd.read_csv('wine_data.csv')
df.info()

df.sample(5)

df.drop('Unnamed: 0', axis=1, inplace=True)
df.drop('region_2', axis=1, inplace=True)
df.sample(5)

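# .copy() gives an independent frame, so the in-place edits below do not trigger SettingWithCopyWarning
predictors = df[['country', 'description', 'designation', 'province', 'region_1', 'variety', 'winery']].copy()
predictors.info()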

Missing Data

Now that the predictor matrix has been created, let's tackle the missing data. For the first iteration of this recommender system, I will drop every observation that has a missing value in any column instead of being more selective. This cuts the available data in half. Further iterations of this project should try a different, less brutal, approach towards missing data.

predictors.dropna(inplace=True)

predictors = predictors.reset_index()
predictors.info()
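As a rough illustration of what a less brutal approach could look like, the sketch below drops rows only when a column the recommender actually relies on is missing. The toy frame and the choice of required columns are assumptions for illustration, not something taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for the wine data; all values are made up.
toy = pd.DataFrame({
    "description": ["ripe cherry", "forest floor", None, "sweet berry"],
    "winery": ["A", "B", "C", "D"],
    "designation": ["Riserva", "Brut", "Classico", "Superiore"],
    "region_1": ["Barolo", None, "Chianti", None],
})

# Dropping across the board keeps only rows with no missing values at all.
strict = toy.dropna()

# Dropping only on the columns needed for the name and the review text keeps more rows.
required = ["description", "winery", "designation"]  # assumed minimum set
selective = toy.dropna(subset=required)

print(len(strict), len(selective))
```

Here the rows that are only missing region_1 survive the selective version, which is the kind of data the across-the-board drop throws away.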

Feature Engineering Unique Names for Wines

Let's create a more detailed name for each wine by combining 'winery' and 'designation'.

predictors['name'] = predictors['winery'] + ', ' + predictors['designation']
predictors.drop('index', axis=1, inplace=True)

predictors.sample(10)

Removing Duplicate Values

In order for the end user to receive recommendations using this model and approach, there needs to be a unique name for each wine. With so many duplicates this becomes tricky. For now I will drop duplicated wine names, keeping only the last occurrence of each, which significantly reduces the volume of data but lets the project proceed. There are certainly much better ways to handle this, but for the sake of moving the project on I am taking the quicker route.

predictors.drop_duplicates(subset='name', keep='last', inplace=True)
predictors.reset_index(inplace=True)
predictors.drop('index', axis=1, inplace=True)

predictors.info()
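One possible alternative to dropping duplicates, sketched here on made-up rows, is to disambiguate only the names that clash by appending another column such as the variety. The unique_name column is hypothetical, and this would not separate two rows that share both name and variety (different vintages, for example).

```python
import pandas as pd

# Made-up rows: two wines share a name but differ in variety.
toy = pd.DataFrame({
    "name": ["Winery A, Riserva", "Winery A, Riserva", "Winery B, Brut"],
    "variety": ["Nebbiolo", "Barbera", "Glera"],
})

# True for every row whose name appears more than once.
clash = toy["name"].duplicated(keep=False)

# Keep the plain name where it is already unique, otherwise append the variety.
toy["unique_name"] = toy["name"].where(
    ~clash, toy["name"] + " (" + toy["variety"] + ")"
)

print(toy["unique_name"].tolist())
```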

Reducing Scope of Data

At this point in the project I had successfully created the recommendation function. However, when I tried to deploy the model using Streamlit I received endless connection timeout errors. After many, many hours of digging and researching I came to the conclusion that the issue must lie with the connection to the AWS S3 bucket hosting my model.pkl file and predictors.csv. The files were simply too large to be loaded into memory and cached on each run of the script compiled for deployment.

The solution was to reduce the scope of the project to focus on the wine of one country instead of all countries, hence reducing the size of the files involved. As I am a big fan of Italian wines I decided to put this bias to good use and make this an Italian wine recommendation project.

# .copy() so the new columns added later do not trigger SettingWithCopyWarning
predictors_ita = predictors[predictors['country'] == 'Italy'].copy()
predictors_ita.reset_index(inplace=True)
predictors_ita.info()

predictors_ita.sample(10)

predictors_ita.province.value_counts()

#predictors_ita.to_csv('wine_pred_matrix_ita.csv')
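To see why cutting the number of wines helps so much, it is worth a quick back-of-the-envelope calculation. Assuming the pickled model is a dense n x n similarity matrix of 64-bit floats (as it is in the export step of this project), its size grows with the square of the number of wines. The row counts below are hypothetical, not figures from the dataset.

```python
import numpy as np

def kernel_size_mb(n_wines, dtype=np.float64):
    """Approximate in-memory size of a dense n x n similarity matrix, in MB."""
    return n_wines ** 2 * np.dtype(dtype).itemsize / 1e6

# Hypothetical row counts, chosen only to show the quadratic growth.
for n in (5_000, 20_000, 50_000):
    print(n, "wines ->", round(kernel_size_mb(n)), "MB")
```

Halving the number of wines cuts the matrix to a quarter of its size, and storing the scores as float32 instead of float64 would halve it again.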

Creating a Search for Wine Feature

One of the features I wanted to add to the deployed project was the ability to click a button and immediately search for the wine being recommended. Having looked around for a wine website with a predictable URL syntax for searches, I landed on www.wine-searcher.com, which simply joins the search terms in the URL with + symbols, alongside the name of the country. Knowing this I could create a search URL for each wine in the dataset.

predictors_ita['search_string'] = predictors_ita.name.str.replace(', ','+')

predictors_ita['search_string'] = predictors_ita['search_string'].str.replace(' ','+')

predictors_ita['search_string'] = predictors_ita['search_string'] + '+italy'

predictors_ita['search_url'] = 'https://www.wine-searcher.com/find/'

predictors_ita['search_url'] = predictors_ita['search_url'] + predictors_ita['search_string']

predictors_ita['search_url'] = predictors_ita['search_url'] + '/-/europe'

predictors_ita.drop('index', axis=1, inplace=True)

predictors_ita['search_url']

predictors_ita.head()

#predictors_ita.to_csv('wine_pred_matrix_ita.csv')
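The same URL construction can be sketched as a single helper, which is handy for checking one name at a time. The function name build_search_url and its parameters are hypothetical. The URL pattern is the one used above, and the percent-encoding of unusual characters via urllib.parse.quote is an extra assumption about what the site accepts.

```python
from urllib.parse import quote

def build_search_url(name, country="italy", region="europe"):
    """Build a wine-searcher.com search URL from a 'winery, designation' name."""
    # Treat the comma separator like a space, then split into individual terms.
    terms = name.replace(", ", " ").split() + [country]
    # Percent-encode each term (accents, stray punctuation) and join with '+'.
    query = "+".join(quote(term) for term in terms)
    return f"https://www.wine-searcher.com/find/{query}/-/{region}"

print(build_search_url("Cantina Terre del Barolo, Riserva"))
```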

Vectorizing With Tf-idf

Time to turn the sommelier reviews into vectors using Tf-idf so the model can work with them. The parameters were chosen after a few rounds of trial and error - I had a feeling ngram_range would work well with an upper bound of 2 or 3, as much of the descriptive language in the reviews comes as bigrams and trigrams ("sweet berry", "forest floor", etc.). The regex pattern was also very much trial and error, using a description as an example and plugging it into regexr.com.

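vectors_ita = TfidfVectorizer(
    min_df=3,                   # ignore terms that appear in fewer than 3 reviews
    max_features=None,
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{2,}',    # raw string so \w is passed to the regex engine as-is
    ngram_range=(1, 3),         # unigrams, bigrams and trigrams
    stop_words='english',
)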

vectors_matrix_ita = vectors_ita.fit_transform(predictors_ita['description'])

vectors_matrix_ita.shape

Calculating Similarity Scores

sig_kern_ita = sigmoid_kernel(vectors_matrix_ita, vectors_matrix_ita)

sig_kern_ita.shape

sig_kern_ita

index = pd.Series(predictors_ita.index, index=predictors_ita['name']).drop_duplicates()

The Recommender

The recommender function works by taking the sigmoid_kernel scores and mapping them against the index pandas Series, which is itself built from the index of the predictor matrix and the name of each wine. The result is that each wine is given a similarity score against every other wine in the predictor matrix. The scores are then sorted and the function returns the 3 wines with the highest similarity score to the wine passed in to the function.

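def recommend_wine(name, sig_kern=sig_kern_ita):
    # Look up the row position of the requested wine
    indx = index[name]
    # Pair every wine's position with its similarity score to the requested wine
    sigmoid_score = list(enumerate(sig_kern[indx]))
    # Sort by score, highest first
    sigmoid_score = sorted(sigmoid_score, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the wine itself) and keep the next 3
    sigmoid_score = sigmoid_score[1:4]
    position = [i[0] for i in sigmoid_score]
    # Use the Italian subset, since the kernel was computed from predictors_ita
    return predictors_ita.iloc[position]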

recommend_wine('Cantina Terre del Barolo, Riserva')
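The whole pipeline can be tried end to end on a toy example. The sketch below uses four made-up wines and a hypothetical recommend helper. Unlike the function above, it removes the query wine explicitly rather than assuming it is the top-scoring row, which is a small design variation and not part of the original project.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import sigmoid_kernel

# Made-up wines and reviews, for illustration only.
wines = pd.DataFrame({
    "name": ["Wine A", "Wine B", "Wine C", "Wine D"],
    "description": [
        "sweet berry and forest floor",
        "forest floor with sweet berry notes",
        "crisp citrus and green apple",
        "green apple and lemon citrus",
    ],
})

tfidf = TfidfVectorizer(stop_words="english").fit_transform(wines["description"])
scores = sigmoid_kernel(tfidf, tfidf)              # dense (4, 4) similarity matrix
lookup = pd.Series(wines.index, index=wines["name"])

def recommend(name, n=1):
    """Return the n wines most similar to the named wine."""
    pos = lookup[name]
    row = scores[pos].copy()
    row[pos] = -np.inf                             # never recommend the query wine itself
    top = np.argsort(row)[::-1][:n]                # positions of the highest scores
    return wines.iloc[top]

print(recommend("Wine A")["name"].tolist())
```

Excluding the query wine by position avoids relying on it always sorting first, which could fail if two wines had identical reviews.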

Exporting the Model

data = {"model": sig_kern_ita}
with open('wine_model_ita.pkl', 'wb') as file:
    pickle.dump(data, file)

Adapting for Streamlit

Adapting the above code for Streamlit was a case of spending a few days reading through the documentation and familiarising myself with the decorators and Streamlit syntax. Once I had accomplished this, the next step was to move the code out of the notebook format and rewrite it, using the Streamlit decorators, to allow for user input and for loading the model files from an S3 bucket. I highly recommend Streamlit for anyone looking to deploy their models - the syntax is very user friendly and intuitive to understand!
