# Machine Learning in Python - Project 1

Due Friday, March 6th by 5 pm.

Ballard Jack, Chauvet Charleston, Macaskill Callum, Tao Haorui

## 0. Setup

## 1. Introduction

*This section should include a brief introduction to the task and the data (assume this is a report you are delivering to a client). If you use any additional data sources, you should introduce them here and discuss why they were included.*

*Briefly outline the approaches being used and the conclusions that you are able to draw.*

## 2. Exploratory Data Analysis and Feature Engineering


*Include a detailed discussion of the data with a particular emphasis on the features of the data that are relevant for the subsequent modeling. Including visualizations of the data is strongly encouraged - all code and plots must also be described in the write up. Think carefully about whether each plot needs to be included in your final draft - your report should include figures but they should be as focused and impactful as possible.*

*Additionally, this section should also implement and describe any preprocessing / feature engineering of the data. Specifically, this should be any code that you use to generate new columns in the data frame d. All of this processing is explicitly meant to occur before we split the data in to training and testing subsets. Processing that will be performed as part of an sklearn pipeline can be mentioned here but should be implemented in the following section.*

*All code and figures should be accompanied by text that provides an overview / context to what is being done or presented.*

### 2.1 Exploratory Analysis

Before starting the data analysis we have to check the integrity and format of the data. Firstly, is it complete? Are there any missing values? And is each column continuous or categorical?

Firstly, we see the data is complete (i.e. no missing values), which makes modelling easier as no imputation will be required.

Secondly, we plot the IMDB rating, the value we are looking to maximise, against the other numerical variables. We do see some patterns here: the rating changes with season, increases with total votes, and shows a possible minor relationship with the number of lines. We'll come back to these relationships later.

Not all our columns are plotted above, due to their datatypes. Let's take a look at the types for each column.

We have a mixture of numbers and strings (in pandas an `object` column is a string or mixed type, but here they're all strings). While our numerical columns will fit well into our models (see the pairs plot above), the object columns will require a certain amount of feature engineering.
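These two checks can be sketched as follows, on a hypothetical miniature of the data frame `d` (the real data has far more rows and columns, but the checks are identical):

```python
import pandas as pd

# Hypothetical miniature of the episode data frame `d`.
d = pd.DataFrame({
    "season": [1, 1, 2],
    "imdb_rating": [7.6, 8.2, 8.4],
    "director": ["Ken Kwapis", "Ken Kwapis", "Greg Daniels"],
})

missing = d.isna().sum()   # per-column count of missing values (all zero here)
types = d.dtypes           # numeric columns vs. pandas `object` (string) columns
```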

The standard method of handling categorical data, and the method we used, is creating indicator columns. But this isn't without its drawbacks. Let's look at the directors.

## 2.2 Feature Engineering

Above we see two things: a list of the top 10 directors for The Office (by number of episodes directed), and a histogram of the number of episodes directed by each director.

We have 61 unique directors across the 183 episodes, of which over 30 directed only one. This isn't great for modelling. If we gave every director an indicator column we would increase our dimensions by 60 (dropping one as reference), harming the reliability of our models. Our model would also end up fitting on many variables for which we have very little data; if the only episode a director was involved in happens to be particularly good, is that evidence we can rely on?

However, we can't simply throw this data away; we want our models to decide whether it is relevant. So, as a compromise, we modelled only the top 10 directors (each gets an indicator column, with every other director lumped together as the reference level); this corresponds to the directors who were involved in 6 or more episodes.
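A minimal sketch of this encoding, on hypothetical mini-data (the real code keeps the top 10 directors rather than the top 2 used here):

```python
import pandas as pd

# Hypothetical mini-data: indicator columns only for the most frequent
# directors; everyone else falls into the implicit reference level.
d = pd.DataFrame({"director": ["A", "A", "B", "C", "A", "B", "D"]})

top_directors = d["director"].value_counts().head(2).index  # top 10 in the real data
for name in top_directors:
    d[f"director_{name}"] = (d["director"] == name).astype(int)
```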

We have other object columns that will need one-hot encoding, such as the writers.

We have a similar problem with the writers: 40 in total, with half writing two or fewer episodes. Again, we pick the top 10 (those who wrote 10 or more episodes) and give them indicator columns. Note that some episodes have multiple writers; to keep dimensionality under control, we count each credited writer separately when an episode has several.
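A sketch of how multi-writer credits might be counted, on made-up rows; we assume here, for illustration, that multiple writers appear as a single ';'-separated string:

```python
import pandas as pd

# Hypothetical mini-data: each credited writer is counted separately,
# assuming multi-writer episodes store a ';'-separated string.
d = pd.DataFrame({"writer": [
    "Mindy Kaling",
    "Greg Daniels;Mindy Kaling",
    "Mindy Kaling",
    "Greg Daniels",
]})

credits = d["writer"].str.split(";").explode()
top_writers = credits.value_counts().head(2).index  # top 10 in the real data
for name in top_writers:
    d[f"writer_{name}"] = d["writer"].str.split(";").apply(lambda ws: int(name in ws))
```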

We have three object columns left, which will be handled in two different ways. Firstly, `main_chars` is a list of the main characters appearing in each particular episode. There are 122 combinations of 17 different main characters in this show, again raising issues of dimensionality. There's also another issue with this data: although writers and directors are a binary choice, there's a lot more to characters than mere presence. What if we could bring in more information without increasing the number of dimensions? The schrutepy package (the original source of our data) contains more than just character names; it also has the lines spoken by each character. We can use this line data, instead of simple indicator columns, to give a more rounded view of how each character took part in an episode.
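The idea can be sketched as follows, with a hypothetical miniature of the per-line transcript data (one row per spoken line, as schrutepy provides it):

```python
import pandas as pd

# Hypothetical miniature of the transcript data: one row per spoken line.
lines = pd.DataFrame({
    "episode": [1, 1, 1, 2, 2],
    "character": ["Michael", "Jim", "Michael", "Jim", "Pam"],
})

# Per-episode line count for each character; absent characters get 0.
line_counts = lines.groupby(["episode", "character"]).size().unstack(fill_value=0)
```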

We now have columns for each of the top 25 characters (expanded slightly from the 17 'main characters'), and just two categorical columns remain: episode name and air date. How do we incorporate these into our model? There does appear to be a slight relationship between air date and IMDB rating, but if our goal is to create a formula for a higher-rated new episode, what use are they? We chose to drop both air date and episode name from our models: although they may help explain the data we already have, they are factors beyond NBC's control when airing a new episode (e.g. if the model shows the most successful episodes were in 2007, we can't act on that without a time machine).

There are also other, non-categorical columns with the same issue, namely season and episode number. For the same reasons as above they were dropped from our model.

Let `the_office` be the final data set we will use for the rest of the report.

To summarise, `n_votes`, `episode` and `season` have all been removed: the number of votes is only known after an episode has aired, and the season and episode number are beyond NBC's control.

We now have our final dataframe: 53 variables for 186 episodes. This is quite a high ratio of variables to observations, and will need to be accounted for in our models.

## 3. Model Fitting and Tuning

*In this section you should detail your choice of model and describe the process used to refine and fit that model. You are strongly encouraged to explore many different modeling methods (e.g. linear regression, regression trees, lasso, etc.) but you should not include a detailed narrative of all of these attempts. At most this section should mention the methods explored and why they were rejected - most of your effort should go into describing the model you are using and your process for tuning and validating it.*

*For example if you considered a linear regression model, a classification tree, and a lasso model and ultimately settled on the linear regression approach then you should mention that the other two approaches were tried but do not include any of the code or any in depth discussion of these models beyond why they were rejected. This section should then detail the development of the linear regression model in terms of features used, interactions considered, and any additional tuning and validation which ultimately led to your final model.*

*This section should also include the full implementation of your final model, including all necessary validation. As with figures, any included code must also be addressed in the text of the document.*

### Helper Functions:

We will be using the following two helper functions in this report:
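As a rough illustration, `model_fit` could look something like the sketch below (our notebook's exact implementation may differ); it returns the RMSE and optionally draws the fit and residual plots referred to later:

```python
import numpy as np

def model_fit(model, X, y, plot=False):
    """Return the RMSE of a fitted `model` on (X, y); optionally plot
    the fit (y vs y-hat) and the residuals (y vs y - y-hat)."""
    y_hat = model.predict(X)
    resid = y - y_hat
    rmse = np.sqrt(np.mean(resid ** 2))
    if plot:
        import matplotlib.pyplot as plt
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.scatter(y, y_hat)
        ax1.set_xlabel("y"); ax1.set_ylabel(r"$\hat{y}$")
        ax2.scatter(y, resid)
        ax2.set_xlabel("y"); ax2.set_ylabel("residual")
        plt.show()
    return rmse
```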

## Baseline Model

Once we have cleaned the data and removed the columns which are beyond NBC's control, we can begin to build our model. To start, we will fit a quick linear regression model as a baseline against which to judge our future results. We can then go on to create more complicated and, hopefully, more accurate predictive models.

First we must split our data into training and testing sets. This is easily done using the `train_test_split` function from the `model_selection` module of `sklearn`. Taking $\mathbf{X}$ as our dataframe `the_office` with the IMDB rating column removed, and setting that column as our outcome vector $\mathbf{y}$, we apply the function with the `test_size` parameter set to $0.2$. This splits $80\%$ of the data into the training set and $20\%$ into the testing set; an 80/20 split is a common choice that leaves enough data to fit the model while retaining a meaningful hold-out set. Shuffling the data (with the `random_state` parameter fixed so the split is reproducible) is important to reduce the impact of any ordering in our original dataframe. As a quick sanity check, the training and testing sizes should sum to the original data size.
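On a small hypothetical stand-in for `the_office`, the split looks like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `the_office` (10 rows; the real data has ~180+).
the_office = pd.DataFrame({
    "n_lines": range(10),
    "n_words": range(0, 100, 10),
    "imdb_rating": [7.5 + 0.1 * i for i in range(10)],
})

X = the_office.drop(columns="imdb_rating")
y = the_office["imdb_rating"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Sanity check: the split partitions the data.
assert len(X_train) + len(X_test) == len(X)
```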

It is now easy to fit a linear regression model to our data with the `LinearRegression` class from `sklearn`.

As we are interested in the predictive properties of our linear regression model, we need to evaluate its accuracy. This can be done by fitting the model on the training data (`X_train` and `y_train`) and using the `model_fit` helper function we defined earlier to obtain the model's RMSE on the validation data `X_test` and `y_test`. A fit plot ($y$ vs $\hat{y}$) and a residual plot ($y$ vs $y-\hat{y}$) of these results are provided.
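This fit-and-evaluate step can be sketched on synthetic data (the RMSE computation here mirrors what our `model_fit` helper reports):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for our features and ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + 8.0 + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
lm = LinearRegression().fit(X_train, y_train)

# Validation RMSE on the held-out data.
rmse = np.sqrt(np.mean((y_test - lm.predict(X_test)) ** 2))
```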

However, the fit plot shows that when the true rating $y$ is greater than 9.5, the predictions are very poor. The residual plot confirms this: the residuals for $y > 9.5$ are large (more than 2 in absolute value). This is concerning and suggests that plain linear regression is not necessarily the best model to select.

Now, let's rescale our features and refit the linear regression model. Standardising the features puts the coefficients on a comparable scale, making them easier to compare, and it also matters for the penalised (lasso) models we fit next, since the $\ell_1$ penalty treats all coefficients equally.


## Lasso Model

It is now time to fit a lasso model. It is provided by the `linear_model` submodule and requires the choice of a tuning parameter, `alpha`, which sets the weight of the $\ell_1$ penalty. We started with an initial value of $\alpha = 0.15$ as a base which we can improve on later.
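A minimal sketch on synthetic data, showing the lasso's characteristic behaviour of shrinking weak coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: only features 0 and 3 truly matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.5, 0.0]) + 8.0

# The l1 penalty zeroes out the irrelevant coefficients.
lasso = Lasso(alpha=0.15).fit(X, y)
```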

We can get the above lasso coefficients by calling the following:

Let's see which features have the largest coefficients in our lasso model; these correspond to the factors that contribute the most to the show's popularity. We define a function `find_order_features()` that takes the features of our model matrix $\mathbf{X}$ that play a role in the success of the show and displays them in decreasing order of the size of their coefficients.
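One possible sketch of such a function (our notebook's implementation may differ in details): keep the non-zero coefficients and sort by absolute size, largest first.

```python
import pandas as pd

def find_order_features(model, feature_names):
    """Return the model's non-zero coefficients, sorted by |coefficient|
    in decreasing order."""
    coefs = pd.Series(model.coef_, index=feature_names)
    coefs = coefs[coefs != 0]
    return coefs.reindex(coefs.abs().sort_values(ascending=False).index)
```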


Based on this we can determine which variable(s) seem to be the most important for predicting `imdb_rating`. The show's success seems to be due to:

- the total number of votes for the episode's rating, then
- the presence of Michael, then
- the presence of Nellie, and finally
- the number of lines in the episode.

We can visually see how the coefficients vary depending on the value of $\alpha$.
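A sketch of how such a coefficient-path plot can be produced, on synthetic data (refitting the lasso over a grid of $\alpha$ values):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: feature 2 is truly irrelevant.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.3]) + 8.0

# Refit the lasso for each alpha and record the coefficients.
alphas = np.logspace(-3, 0, 30)
coef_path = np.array([Lasso(alpha=a).fit(X, y).coef_ for a in alphas])

plt.plot(alphas, coef_path)
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("coefficient")
```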

### Model Tuning

Let's now tune our lasso model. We will use `GridSearchCV` to find the optimal value of the $\alpha$ hyperparameter. We will not include $\alpha = 0$, as the fitting method (coordinate descent) does not converge well without the regularising $\ell_1$ penalty.
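A sketch of this grid search on synthetic data (the grid deliberately starts at $0.01$ rather than $0$):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for our features and ratings.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, 0.5, 0.0, -0.3]) + 8.0 + rng.normal(scale=0.2, size=100)

grid = GridSearchCV(
    Lasso(),
    {"alpha": np.linspace(0.01, 1.0, 100)},  # alpha = 0 excluded on purpose
    cv=5,
    scoring="neg_root_mean_squared_error",
).fit(X, y)

best_alpha = grid.best_params_["alpha"]
mean_scores = grid.cv_results_["mean_test_score"]  # one score per alpha
```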

However, the optimised alpha is very near the smallest value supplied to our grid search, $0.01$. Therefore, our lasso model could be closely approximated by our initial linear regression model.

We investigate this further by plotting $\alpha$ against the `mean_test_score` values from the `cv_results_` attribute.

The model's RMSE decreases monotonically from $\alpha = 0.01$ to the best value $\alpha = 0.03$, then increases monotonically as $\alpha$ grows. The CV procedure therefore prefers the lasso model, though linear regression would also be a reasonable choice.

Now that we know the best value of $\alpha$, we can use it to provide you with a reliable lasso model named `l2`.

We observe that our tuned lasso model's RMSE is substantially lower than that of the linear regression model, so the lasso model is the more appropriate choice.

Above is a table showing the features which most affect the IMDB rating, sorted in descending order with the most important features at the top.

## 4. Discussion & Conclusions

*In this section you should provide a general overview of your final model, its performance, and reliability. You should discuss what the implications of your model are in terms of the included features, predictive performance, and anything else you think is relevant.*

*This should be written with a target audience of a NBC Universal executive who is familiar with the show and university level mathematics but not necessarily someone who has taken a postgraduate statistical modeling course. Your goal should be to convince this audience that your model is both accurate and useful.*

*Finally, you should include concrete recommendations on what NBC Universal should do to make their reunion episode as popular as possible.*

*Keep in mind that a negative result, i.e. a model that does not work well predictively, that is well explained and justified in terms of why it failed will likely receive higher marks than a model with strong predictive performance but with poor or incorrect explanations / justifications.*

In summary, our recommendations for the reunion episode are:

- Writer: Paul Lieberstein
- Director: Greg Daniels
- Main characters (most lines): Michael, Jim, Karen, Holly, Stanley, Jan
- Don't include (or minimise): Nellie, Robert, Erin
- `n_words`: more words, better rating
- `n_speak_char`: more speaking characters, better rating

### Results

Looking at the final dataframe `feats_df` from our lasso model, we see which features of an episode impact the IMDB rating most. The larger the absolute value of a coefficient, the stronger the association between that feature and the rating; a positive coefficient indicates a positive correlation, and a negative value a negative correlation.

#### Writer:

From the dataframe, we can see the writer with the highest positive coefficient is Paul Lieberstein. This indicates that the episodes he has written are associated with the highest ratings. It is therefore our recommendation that NBC hire him to write the reunion episode.

#### Director:

Following the same logic as for the writer, Greg Daniels should be hired to direct the new episode.

#### Main Characters:

Looking at the coefficients for each of the characters mentioned, the strongest positive coefficients we have are (in decreasing order): Michael, Jim, Karen, Jan, Holly and Stanley. It therefore makes sense that these characters should be given the most lines in the episode. The characters Nellie, Robert and Erin all have negative coefficients, suggesting that the episodes in which they feature heavily are generally rated lower than those in which they do not. It is then our recommendation that these characters are either excluded from the reunion episode or at least not given many lines.

#### Other Features:

Our third largest coefficient is for the column `n_words`. This suggests that episodes with higher word counts receive better IMDB scores than those with fewer words. Also, `n_lines` has a positive coefficient, albeit a fairly small one. This indicates that a script with more conversation between characters, rather than individual monologues, is generally received better by audiences. This is something to take into account when writing the episode.