# Machine Learning in Python - Project 1

Due Friday, March 6th by 5 pm.

*Scott Matthews, Isaac Turnbull, Darwin Douglas.*

## 0. Setup

## 1. Introduction

*Briefly outline the approaches being used and the conclusions that you are able to draw.*

### 1.1 Executive Summary

### 1.2 Sources of other information

To make sure that our model considered features that might impact episode rating but were not included in the_office.csv, we considered the following additional features:

#### Proportion of Lines Spoken By Characters

the_office.csv contains the number of lines spoken by each character in an episode; however, raw counts are not particularly useful because they do not allow for relative comparisons between characters. We wanted to suggest how the writer should structure the script by allocating a certain proportion of lines to the main characters, so we divided the lines spoken by a character in an episode by the total number of lines in that episode; see the section Lineshare of Characters for the code that we used. We then chose to include only the top ten characters by total lines spoken, as listed at https://pudding.cool/2017/08/the-office/.

#### Episode Length

Most episodes of The Office are around twenty minutes long; however, there are occasionally special episodes that last longer. We postulated that these longer episodes may increase episode rating, as they allow for more plot points and character exploration. We therefore manually entered the length in minutes of each episode using the episode details on IMDb.

#### Cameos

The Office is well known for including cameos in many of its episodes. We define a cameo to be a celebrity guest star who is credited in an episode. We judged that cameos may have an impact on the episode score and decided to include them as an additional feature: a count variable, where one cameo is valued as 1, two cameos as 2, and so on. Again, we consulted IMDb to quantify the number of cameos per episode.

#### Episode Location

Episodes of The Office frequently take place in locations away from the Dunder Mifflin offices in Scranton. We classified an episode as off-location if at least half of it takes place away from these offices. Episode location is a binary variable, where 1 denotes an off-location episode and 0 an on-location episode.

#### Individual Writers and Directors

We one-hot encoded both writers and directors. Only one director can direct an episode, and we hypothesised that the choice of director will impact episode score. Similarly, we expected the choice of writer to influence episode score. We acknowledge the limitation that occasionally more than one writer works on an episode, but modelling this significantly increases the complexity of the model, and we believe the additional accuracy gains would be marginal.

## 2. Exploratory Data Analysis and Feature Engineering

### 2.1 Episode Score Manipulation

We need to find a way to combine the number of ratings with the average IMDb rating; the combined value will be our response variable. There are many ways to do this: using the IMDb rating alone with no account taken of the number of ratings, taking the geometric mean of the IMDb rating and the number of ratings, multiplying the IMDb rating by the number of ratings, or using a weighted exponential approach. We chose the weighted exponential approach:
$$
\text{Episode Score} = w_{i}\,r + 10(1-w_{i})\left(1-e^{-q/Q}\right)
$$
where $w_{i}$ is the weight we place on the IMDb rating relative to the number of ratings, $r$ is the IMDb average rating, $q$ is the number of ratings for that episode, and $Q = \frac{\text{Mean Number of Ratings}}{\ln 2}$, chosen so that an episode with the mean number of ratings has $1-e^{-q/Q} = \frac{1}{2}$.

Here we set $w_{i}=0.7$, placing somewhat more weight on the average rating than on the number of ratings.
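
A minimal sketch of this score as a Python function (`mean_ratings` is an illustrative stand-in for the mean number of ratings computed over the full dataset):

```python
import math

def episode_score(r, q, mean_ratings, w=0.7):
    """Weighted exponential episode score.

    r: average IMDb rating (0-10), q: number of ratings for the episode,
    mean_ratings: mean number of ratings across all episodes,
    w: weight placed on the average rating.
    """
    # Q is chosen so that an episode with the mean number of ratings
    # has 1 - exp(-q/Q) = 1/2; the term saturates towards 1 as q grows.
    Q = mean_ratings / math.log(2)
    return w * r + 10 * (1 - w) * (1 - math.exp(-q / Q))

# An episode rated 8.0 with exactly the mean number of ratings:
# 0.7 * 8.0 + 10 * 0.3 * 0.5 = 7.1
print(round(episode_score(8.0, 2000, 2000), 2))
```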

### Normality Assumption

Linear regression assumes normality of the residual errors, so it is important to check whether our data fits this assumption. The plot below shows that the distribution of the episode scores resembles a normal distribution.
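
As a complementary numerical check, a Shapiro-Wilk test could be run alongside the plot; this is a sketch on synthetic stand-in scores, not the actual episode scores:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for the episode scores; in the report these come from the
# weighted exponential transformation of the IMDb ratings.
scores = rng.normal(loc=7.0, scale=0.6, size=186)

# A large p-value means we cannot reject the hypothesis of normality.
stat, p_value = stats.shapiro(scores)
print(f"Shapiro-Wilk statistic {stat:.3f}, p-value {p_value:.3f}")
```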

### 2.2 Creating New Features

As discussed in section 1.2, we wanted to consider some new features that we felt could have an effect on the IMDB rating of the episodes. The code used to create these features is included below.

#### Including Cameos

#### Including Episode Length (Minutes) and Episode Location (0: In Office, 1: Out of Office)

#### Directors One-Hot Encoding

#### Writers One-Hot Encoding

After finding the top 10 writers based on score, we encountered a small issue in the data: the writer pairing 'Lee Eisenberg;Gene Stupnitsky' appears many times, but the order of the two names alternates through the dataframe. A small piece of code fixes this by replacing one ordering with the other.
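
The fix amounts to canonicalising the ordering; a sketch (the report's replacement targets this one pairing, but sorting the names generalises to any multi-writer string):

```python
import pandas as pd

df = pd.DataFrame({"writer": [
    "Lee Eisenberg;Gene Stupnitsky",
    "Gene Stupnitsky;Lee Eisenberg",
    "Greg Daniels",
]})

# Sort the names inside each multi-writer string so that both orderings
# of a pairing map to the same category before one-hot encoding.
df["writer"] = df["writer"].map(lambda s: ";".join(sorted(s.split(";"))))

print(df["writer"].nunique())  # the two orderings collapse into one value
```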

Now we add our one-hot encoded directors and writers to our original dataframe.

#### Lineshare of Characters

Now, we want to find the proportion of lines spoken by each character to provide more interpretable data.
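
A sketch of the computation on a toy stand-in (the character columns and episode labels here are illustrative):

```python
import pandas as pd

# Toy stand-in: number of lines spoken per character per episode.
lines = pd.DataFrame(
    {"Michael": [40, 0], "Jim": [10, 30], "Dwight": [50, 70]},
    index=["ep1", "ep2"],
)

# Divide each character's line count by the episode's total line count
# so that each row gives the proportion of lines per character.
lineshare = lines.div(lines.sum(axis=1), axis=0)

print(lineshare.round(2))
```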

#### Dropping Useless Features

Now we drop all features that we can deduce would be useless in a predictive model for a new episode. We consider 'season', 'episode', and 'air_date' to be useless features as they cannot be controlled when producing a reunion episode. We one-hot encoded 'writer' and 'director', so we also drop those columns. We replaced 'main_chars' with 'ten_main_prop', which contains the proportion of lines spoken by each character. Finally, there is no sensible way to judge the influence of 'episode_name', as this depends on the plot points of an episode.
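
A sketch of the drop (the column names and values here are illustrative stand-ins and should be matched to the actual dataframe):

```python
import pandas as pd

df = pd.DataFrame({
    "season": [1], "episode": [1], "air_date": ["2005-03-24"],
    "episode_name": ["Pilot"], "n_lines": [250], "Episode Score": [7.5],
})

# Remove features that cannot be controlled when planning a new episode.
df = df.drop(columns=["season", "episode", "air_date", "episode_name"])

print(list(df.columns))
```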

### 2.3 Testing Correlation of Features

The dimension reduction outlined in section 2.2 means that we now consider a feature space of dimension 39, with 23 binary variables and 16 non-binary variables. Such a high-dimensional feature space can lead to problems, so to avoid complications arising from the curse of dimensionality, we try to reduce the dimension of our feature space by analysing the correlation between features.

We compute the correlation using Pearson's correlation coefficient $r$. Then we define the strength of correlation to be as follows:

- $0 \leq |r| \leq 0.3$: weak correlation
- $0.3 < |r| < 0.7$: moderate correlation
- $0.7 \leq |r| \leq 1$: strong correlation
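
These thresholds can be expressed as a small helper; the boundary cases follow the inequalities above:

```python
def correlation_strength(r: float) -> str:
    """Classify Pearson's r by the thresholds defined above."""
    a = abs(r)
    if a <= 0.3:
        return "weak"
    if a < 0.7:
        return "moderate"
    return "strong"

print(correlation_strength(-0.65))  # moderate
```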

When we make recommendations based on our model at the end, we will link the features 'dropped' in this section to the ones that remain in our model.

Note that it does not make sense to find correlation between our binary variables. Therefore, we first consider the correlation heatmap between non-binary features, excluding proportion of lines spoken by characters.

In the following plot we used a correlation threshold of 0.6 to filter the heatmap. This shows that we need to drop one variable from each of the pairs {n_words, n_lines}, {n_words, Episode Length}, {n_lines, Episode Length}, and {n_directions, n_words}. We therefore drop n_words and n_lines, which lets us keep n_directions and Episode Length without double-counting their effect on episode score.
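
A sketch of the threshold filter on synthetic stand-in columns (here `n_words` is constructed to track `n_lines`, so that pair is flagged; the real features come from the dataframe):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100
n_lines = rng.poisson(250, n).astype(float)
df = pd.DataFrame({
    "n_lines": n_lines,
    "n_words": n_lines * 12 + rng.normal(0, 50, n),  # tied to n_lines
    "Episode Length": rng.normal(22, 3, n),          # independent
})

corr = df.corr()
# Keep only the entries above the 0.6 threshold used for the heatmap;
# everything below the threshold becomes NaN and is blank in the plot.
flagged = corr.abs().where(lambda c: c > 0.6)
print(flagged.round(2))
```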

We now produce a heatmap of correlation between the main characters.

Here we can see that the following pairs of characters are moderately negatively correlated: {Michael, Oscar}, {Michael, Erin}, {Michael, Kevin}, {Michael, Andy}, {Michael, Dwight}. We also see that {Andy, Erin} are moderately positively correlated. Therefore we can include at most one character from each of these pairs, as otherwise the effect of a change in these variables on Episode Score would be over- or under-counted.

### 2.4 Testing Degree of Relationship between Episode Score and Features

Following our exploration of correlation between features, we have reduced the dimension of our non-binary features from 16 to 8, and hence the dimension of our whole feature space is now 31.

To help identify which of the variables in our extended feature space have an impact on the score, we plotted a pairs plot of the data using the built-in function of seaborn. However, since our feature space has 31 variables, the pairs plot contained 961 subplots, and any attempt to draw conclusions from it was futile. In addition, we were unable to make any clear observations relating the binary variables to the episode score.

As a result, we produce a much clearer subset of the pairs plot, showing only the 8 remaining non-binary features against the episode score. To judge the degree of the relationship between a feature and episode score, we fitted simple polynomial models of degree 1 to 4 and assessed how well each fitted the scatterplot.
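
The degree comparison can be sketched numerically (synthetic stand-in data with a genuinely quadratic signal; the residual sum of squares drops sharply up to the true degree and then plateaus):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(10, 60, 120)                              # stand-in feature
y = 0.002 * (x - 35) ** 2 + 7 + rng.normal(0, 0.2, 120)   # stand-in score

# Fit polynomials of degree 1-4; a sharp drop followed by a plateau in
# the residual sum of squares suggests the lower degree is sufficient.
rss = {}
for deg in range(1, 5):
    coeffs = np.polyfit(x, y, deg)
    rss[deg] = float(((y - np.polyval(coeffs, x)) ** 2).sum())
    print(deg, round(rss[deg], 2))
```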

With regard to the non-character-specific features, the plots for 'n_directions', 'n_speak_char', and 'Number of Cameos' show potential for a higher-degree relationship. Examining this further, for 'n_directions' the degree-4 polynomial is very similar to the degree-3 one, so we will only consider up to degree 3 for this variable to simplify the model. For 'n_speak_char' the degree-4 polynomial is overfitted, and the degree-2 and degree-3 polynomials are similar, so we again consider only up to degree 3. Finally, for 'Number of Cameos', the degree-2 and degree-4 polynomials have a similar shape, so we ignore the degree-4 polynomial. We note that each of these variables is somewhat correlated with 'Episode Score', so we cannot drop any at this stage.

For Michael, the plot suggests a positive linear relationship between the proportion of lines spoken and episode score; all of the curves of degree greater than one oscillate about the red (linear) line. We also observe no evidence of any relationship between Jim or Ryan and the episode score, so we will remove these characters from our model. For Angela and Pam we can also see that, excluding a few outliers where their proportion of lines is larger, the regression line is flat; the Episode Score does not appear to change based on them, so we will also drop them from our model from now on.

### 2.5 EDA Conclusions

Our analysis of heatmap correlations and pairwise plots has led to a reduction in the dimension of the feature space. We have also determined the range of polynomial degrees with which we will build polynomial models. Making these deductions at this stage is vital, as leaving all features and degrees in until the model-building stage would lead to significant run-times.

## 3. Model Fitting and Tuning

We are now ready to fit a model to the data. We begin by splitting our data into training and test sets, where the training set is a random selection of 80% of the data points and the remaining 20% is kept aside to validate our model.
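
A sketch of the split (assuming a feature matrix `X` and episode scores `y`; the arrays and `random_state` here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)   # stand-in feature matrix
y = np.arange(100, dtype=float)      # stand-in episode scores

# 80% of rows for training, 20% held out for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (80, 2) (20, 2)
```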

Next, we centre and scale all features to a common scale before fitting any models, so that coefficient sizes are comparable and not driven by differences in units of measurement.
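
A sketch of the scaling step (fit the scaler on the training data only, then reuse the fitted scaler on the test data so no information leaks from the held-out set; the array here is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[20.0, 250.0], [25.0, 300.0], [22.0, 180.0], [30.0, 270.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Each column now has mean 0 and unit variance on the training data.
print(X_train_scaled.mean(axis=0).round(6))
```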

We then fit a baseline scaled linear model against which we compare the other models we produce. Following this, we apply ridge regression and LASSO to the scaled linear model to penalise unnecessary coefficients. It is important that the test data does not in any way inform our choice of tuning parameter; instead, we always used KFold cross-validation on the training data to obtain the metrics needed to optimise the $\alpha$ hyperparameter. Using the complete data X would have slightly improved our RMSE estimates, due to the slightly larger sample sizes in the cross-validation splits, but it would mean the validation data was used in determining $\alpha$, putting us at risk of overfitting and therefore of an overly optimistic view of our model's uncertainty.
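
A sketch of the tuning loop described above, on synthetic stand-in training data (the alpha grid and fold count are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(3)
X_train = rng.normal(size=(80, 5))
true_beta = np.array([1.0, 0.5, 0.0, -0.3, 2.0])
y_train = X_train @ true_beta + rng.normal(0, 0.1, 80)

# Tune alpha with K-fold CV on the training data only; the held-out
# test set plays no part in choosing the hyperparameter.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
search.fit(X_train, y_train)
print(search.best_params_)
```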

Following our analysis in section 2.4, we discovered that there may be a polynomial relationship between some features and the episode score. Therefore, it was important that we considered a polynomial model as well. Again, we performed ridge regression and LASSO and tuned the hyperparameters.

### 3.1 Building Models on Training Data

#### Train Data

#### Linear Model

#### Ridge Regression on Scaled Linear Model

#### Lasso on Scaled Linear Model

#### Standard Polynomial Model

#### Lasso regression on polynomial model

#### Ridge Regression on Polynomial Model


### 3.2 Testing all models on Test Data

#### Non-scaled Linear Model

#### Scaled Linear Model

#### Ridge Regression on Scaled Linear Model

#### Lasso on Scaled Linear Model

#### Standard Polynomial Model

#### Lasso Polynomial Model

#### Ridge Polynomial Model

#### RMSE Comparison

We choose the scaled linear model: although it has only the second-lowest RMSE, behind the ridge regression on the scaled linear model, it is the most interpretable.

### 3.3 Chosen Model Coefficients

## 4. Discussion & Conclusions

### 4.1 Interpretation of our Chosen Model

Our chosen model is a linear one with an intercept included, selected for its combination of low Root Mean Square Error (RMSE) and interpretability. The y-intercept can be thought of as our 'baseline' score, and each change we make to the episode shifts the prediction by the variable's value multiplied by its coefficient. This gives a predicted Episode Score with an error based on the RMSE.

To extract recommendations from the model, we should first compare the coefficients of the binary variables for writer, director, and episode location. The variables in each of these categories are mutually exclusive, so we should choose the one with the largest coefficient, which maximises the episode score. We then need to decide on values for each of our continuous variables that give a high predicted Episode Score; there are many possible combinations of these.

### 4.2 Choosing Binary Variables

#### Choosing Actors to include

As we discussed in the section Lineshare of Characters, our model only considers the influence of four of the main characters: Michael, Jim, Angela, and Ryan. We first note that Ryan has a small negative coefficient and therefore we choose not to include him in any reunion show. It remains to allocate the proportion of lines between the other characters.

Michael has the highest positive coefficient and therefore has the most positive influence on episode score. Jim has the second highest positive coefficient, while Angela has the least positive influence on episode score. We chose to find the episode where the proportion of lines spoken by Michael is highest; this happened to be episode 131, in which Michael speaks over 70% of the lines, Jim speaks 6%, and Angela speaks 0%. To recommend the proportion of lines spoken by other characters, we return to our analysis of the correlation between characters.

Let us first propose the characters we believe that you should include:

- Michael: Since he has the highest positive coefficient, we recommend giving the highest proportion of lines to him. We do not want to extrapolate beyond the highest proportion of lines he has previously been given, because a reunion episode based solely on Michael might disappoint viewers who prefer to see interaction between characters. A case in point is 'David Brent: Life on the Road', the catch-up show of the British Office, which focusses solely on David Brent (the equivalent of Michael) and received considerably poorer reviews. Therefore we recommend giving 70% of lines to Michael.
- Jim: Jim is moderately positively correlated with Dwight and Pam. Jim also has the second highest coefficient within our model; therefore, we suggest giving Jim the second highest proportion of lines. The proportions of lines given to Jim, Dwight, and Pam in episode 131 are 6%, 5%, and 8% respectively. We believe that we could increase the proportions given to these characters, since we recommend excluding some of the other characters.
- Angela: Since Angela has the lowest coefficient, she benefits the episode score prediction the least. However, her coefficient is of a similar order of magnitude to Jim's; Jim's coefficient is approximately 1.5 times larger than Angela's. Therefore, we expect that her presence would still be well received in a reunion episode. Angela is moderately positively correlated with both Oscar and Kevin, which is unsurprising as they are all part of the accounting department. As a result, including Angela in an episode suggests that you should also include Oscar and Kevin.

Now, let us give justifications for excluding the other main characters:

- Ryan: His negative coefficient, together with the fact that he is only weakly correlated with other characters, suggests that you should not include him in a reunion episode.
- Andy and Erin: There is moderate negative correlation between Michael and Andy as well as between Michael and Erin. Hence, we propose that you should exclude Andy and Erin since they do not usually appear in episodes in which Michael has a large proportion of lines.

##### Actor Choice Summary

Our model predicts that giving Michael the highest proportion of lines is most beneficial. Because of the problems posed by extrapolating the data, we recommend that he receives 70% of lines, which leaves 30% to be distributed among the remaining characters. Since Jim's coefficient is roughly 1.5 times larger than Angela's, we recommend that his proportion of lines is also 1.5 times larger. We must also account for the characters moderately positively correlated with Jim and Angela respectively, and since each of them has two such characters, we propose the following line proportions:

| Character | Lines (%) |
| --- | --- |
| Michael | 70 |
| Dwight | 6 |
| Jim | 6 |
| Pam | 6 |
| Andy | 0 |
| Angela | 4 |
| Kevin | 4 |
| Erin | 0 |
| Oscar | 4 |
| Ryan | 0 |
| Total | 100 |

#### Choosing Writer and Director

We can see that Greg Daniels has the largest coefficients amongst both writers and directors and therefore has the biggest positive impact on episode score. Consequently, our model suggests that you should pick Greg Daniels to both write and direct a reunion episode. If we compare both his writing and directing coefficient in size to the character coefficients, we find that the choice of director or writer has less of an impact than giving Michael a high proportion of lines. However, choosing Greg Daniels has a greater impact than giving Jim or Angela a high proportion of lines.

An interesting further avenue of exploration would be to examine whether there is an interaction effect when one individual both writes and directs an episode; this would allow us to determine whether doing so has a positive or negative association with episode score. Furthermore, it is possible that the style of a director or the plot of an episode works well with certain characters; to investigate this we would need to model the interactions between characters, writers, and directors.

#### Episode Location

### 4.3 Analysing Continuous Variables

### 4.4 New Episode Reccomendation

After eliminating all the variables explained in section 4.3, we are left with the following model:

### 4.5 Prediction Accuracy / Limitations

http://www-stat.wharton.upenn.edu/~stine/stat621/lecture3.621.pdf

If the residuals are roughly normal then most of them lie within about $\pm 2\,\text{RMSE}$ of their mean. This means that we can form approximate 95% prediction intervals as $\hat{y} \pm 2\,\text{RMSE}$.
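
A sketch of this interval and its empirical coverage on synthetic stand-in predictions (the true scores and predictions below are illustrative, not model output):

```python
import numpy as np

rng = np.random.default_rng(5)
y_test = rng.normal(7.0, 0.5, 200)            # stand-in true scores
y_pred = y_test + rng.normal(0.0, 0.3, 200)   # stand-in predictions

rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))

# Approximate 95% prediction interval: y_hat +/- 2 * RMSE.
lower, upper = y_pred - 2 * rmse, y_pred + 2 * rmse
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
print(f"RMSE {rmse:.2f}, empirical coverage {coverage:.2f}")
```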