Machine Learning in Python - Project 1
Due Friday, March 5th by 5 pm.
Wenzhuo Wang, Qiushi Xu and Shanlin Li
0. Setup
1. Introduction
In this report, we use data about the American TV show The Office, derived from the data available in the ‘schrutepy’ package. The dataset includes 13 columns covering the main information about every episode. It records the creative team and cast, such as the directors, the writers and the main characters, as well as information about the script: the number of spoken lines, the number of lines containing a stage direction, the number of dialogue words and the number of different characters with spoken lines. The season number, episode number, episode name and air date are also included. The number of ratings on IMDB gives a sense of an episode's audience size, and the most important column is the episode's IMDB rating. We want to understand what makes some episodes more popular than others: the rating may be influenced by the director, the writers or other features. We therefore use this dataset to build a predictive model that captures the underlying relationships between these features and the IMDB rating, and then use the model to advise on the creation of a special reunion episode of The Office, with the aim of improving the popularity of the show.
To conduct the research, we try several different methods: a linear regression model, a kernel ridge regression model, a Lasso regression model and a polynomial regression model. We aim for the smallest root mean squared error, which indicates the best fit, and we use the optimal model to give concrete suggestions for the new episode.
2. Exploratory Data Analysis and Feature Engineering
Exploratory data analysis plays an important role in uncovering the relationships between the explanatory variables and the response variable. In this part, visualization is used to help build a rough baseline model and choose reasonable explanatory variables.
A pair plot is a convenient way to reveal latent correlations between the response variable and the numerical explanatory variables. As the pair plot shows, the apparent curvature in the relationship between total_votes and imdb_rating indicates a nonlinear association between them. The correlations between the remaining numerical variables and the response are not obvious, so it is better to use Lasso regression to judge whether they have a significant impact on imdb_rating.
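A rough sketch of how this plot can be produced is shown below (the file name and the exact numerical column names are assumptions for illustration; the DataFrame is called office here):

```python
import pandas as pd
import seaborn as sns

# Load the episode-level data (file name assumed for illustration).
office = pd.read_csv("the_office.csv")

# Numerical columns of interest (names assumed to match the dataset).
plot_cols = ["n_lines", "n_directions", "n_words", "n_speak_char",
             "total_votes", "imdb_rating"]

# Pairwise scatter plots: the curved cloud of total_votes vs imdb_rating
# is the clearest pattern; the other panels show no obvious correlation.
sns.pairplot(office[plot_cols], corner=True)
```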
This dataset contains several categorical variables, such as director, writer and main_chars. If an episode has multiple directors or multiple main characters, we expand the corresponding row into one row per value. In addition, we find some recording errors in director and correct them.
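A minimal sketch of this expansion step, assuming the multi-valued cells are separated by a fixed delimiter and using hypothetical examples of the name corrections:

```python
# Split multi-valued cells into lists and expand to one row per value
# (delimiter assumed to be ";").
office["director"] = office["director"].str.split(";")
office["main_chars"] = office["main_chars"].str.split(";")
office = office.explode("director").explode("main_chars")

# Correct recording errors in director names (hypothetical examples).
office["director"] = office["director"].replace({
    "Greg Daneils": "Greg Daniels",
    "Charles McDougal": "Charles McDougall",
})
```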
For the categorical variables, box plots can show the distribution of imdb_rating across the levels of a feature. From the plots below, the effect on the rating appears to vary across directors and across writers, since the distributions of the ratings within the director groups and the writer groups are quite disparate. Therefore it is worth introducing director and writer into the baseline model. Additionally, characters are the core of a TV show and largely determine whether it is attractive and memorable, so it is natural to include main_chars in the baseline model as well.
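A sketch of one of these box plots, assuming the office DataFrame from above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of imdb_rating for each director; the same call with
# x="writer" or x="main_chars" produces the other panels.
fig, ax = plt.subplots(figsize=(14, 5))
sns.boxplot(data=office, x="director", y="imdb_rating", ax=ax)
ax.tick_params(axis="x", labelrotation=90)
plt.tight_layout()
```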
The air date of an episode might be expected to affect the rating, but most episodes aired on a Thursday and the average ratings on the different weekdays are very close. Therefore air_date is of little use in predicting the rating and is excluded from the model.
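A small sketch of this weekday check, assuming air_date is stored as a string that pandas can parse:

```python
import pandas as pd

# Average rating by weekday of the air date: most episodes aired on a
# Thursday and the weekday means are nearly identical.
office["air_date"] = pd.to_datetime(office["air_date"])
weekday = office["air_date"].dt.day_name()
print(office.groupby(weekday)["imdb_rating"].agg(["count", "mean"]))
```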
Finally, we encode the levels of the categorical variables as dummy variables. This yields a new data frame with 6 numerical variables and 112 dummy variables, and the expanded dataset grows to 3100 rows. Besides, since the regularization approaches are not scale equivariant and the estimates and predictions may be affected by the scale of the inputs, the data are standardised by centering and scaling. The explicit code is shown in section 3.
3. Model Fitting and Tuning
In this section, to seek a relatively optimal model, we try linear regression, regularization approaches, kernel ridge regression and polynomial regression. The basic linear model simply retains all features, leading to an extremely complex model that is difficult to interpret and apply. Besides, compared with the other models without feature selection, the root mean squared error (rmse) of the linear model is slightly higher, indicating that its performance is not very good.
We also fit Lasso and Ridge regression models. Using grid search, we find that the optimal value of the penalty parameter alpha for both the Lasso and the Ridge model is very close to zero, so at these optimal values the two models are essentially equivalent to the linear regression model because the penalty terms are very small. Their rmse values are 0.25076 and 0.25069 respectively, which are larger than that of the polynomial regression model with all features (0.20940). We therefore drop the Lasso and Ridge models.
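A minimal sketch of this tuning step, assuming X_train and y_train hold the standardized training features and ratings described later in this section; the alpha grid and fold count are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": np.logspace(-5, 1, 30)}  # illustrative grid

for name, model in [("Lasso", Lasso(max_iter=10000)), ("Ridge", Ridge())]:
    gs = GridSearchCV(model, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
    gs.fit(X_train, y_train)
    print(f"{name}: best alpha = {gs.best_params_['alpha']:.5f}, "
          f"cv rmse = {-gs.best_score_:.5f}")
```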
Afterwards, two nonlinear models are considered. For kernelized ridge regression, the rmse of the model without feature selection is about 2.39740, much higher than that of the other models. After feature selection, the kernelized ridge model with 11 explanatory variables achieves a very low rmse of about 0.00068; however, it has no explicit expression for the coefficients, which makes the relationship between the response variable and the covariates difficult to interpret.
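A sketch of how a kernelized ridge model can be fitted, using an RBF kernel and illustrative parameter grids (not necessarily the exact settings behind the numbers quoted above):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# RBF-kernel ridge regression; alpha and gamma grids are illustrative.
param_grid = {"alpha": [0.01, 0.1, 1.0], "gamma": [0.001, 0.01, 0.1]}
gs = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5,
                  scoring="neg_root_mean_squared_error")
gs.fit(X_train, y_train)

pred = gs.predict(X_test)
print("test rmse:", np.sqrt(mean_squared_error(y_test, pred)))
```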
Therefore, the final choice is the polynomial regression model, which offers a good trade-off between prediction error and interpretability.
To implement the polynomial regression model, we first need to carry out feature selection, because the expanded data set has many variables while the total number of observations is only 3100. We want to keep the variables that have the greatest effect on imdb_rating. To measure each variable's influence, we use a Lasso with a fixed small alpha as a filter: the larger alpha is, the fewer variables remain. We center and scale all numerical features and then combine them with the dummy variables. Next, we split the data into training and test sets, and use cross validation on the training set to tune the model parameters.
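A sketch of this preprocessing and splitting step; the exact six numerical column names and the test-set fraction are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Column names assumed for illustration.
cat_cols = ["director", "writer", "main_chars"]
num_cols = ["season", "n_lines", "n_directions", "n_words",
            "n_speak_char", "total_votes"]

# One-hot encode the categorical variables and keep the numerical ones.
X = pd.get_dummies(office[num_cols + cat_cols], columns=cat_cols)
y = office["imdb_rating"]

# Centre and scale the numerical columns; the dummies stay 0/1.
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

# Hold out a test set; cross-validation is run on the training part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
```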
Before processing the data, two helper functions are built: one extracts the estimated coefficients and the rmse, and the other visualizes the quality of the fit.
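A sketch of what such helper functions might look like, assuming the fitted model exposes intercept_ and coef_ (for a pipeline, the final linear step can be passed):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import mean_squared_error

def get_coefs_rmse(model, X, y):
    """Return the estimated coefficients (intercept first) and the rmse."""
    coefs = np.append(model.intercept_, model.coef_)
    rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    return coefs, rmse

def plot_fit(model, X, y):
    """Plot fitted vs. observed ratings and the residuals."""
    pred = model.predict(X)
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(y, pred, alpha=0.5)
    axes[0].plot([y.min(), y.max()], [y.min(), y.max()], "k--")
    axes[0].set(xlabel="observed rating", ylabel="fitted rating", title="Fit")
    axes[1].scatter(pred, y - pred, alpha=0.5)
    axes[1].axhline(0, color="k", linestyle="--")
    axes[1].set(xlabel="fitted rating", ylabel="residual", title="Residuals")
    plt.tight_layout()
```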
When we carry out feature selection with grid search and Lasso regression, the root mean squared error increases as the penalty parameter alpha increases, implying that the cross-validation procedure prefers the plain linear regression. As the figures below show, the rmse at first rises rapidly as alpha increases, and the growth rate slows down once alpha exceeds 0.01. In addition, the Lasso shrinks a large number of regression coefficients to zero even for extremely small values of alpha. Ultimately, after some experimentation, fixing alpha = 0.005 and selecting 22 variables is appropriate; the corresponding rmse is about 0.34, which is still relatively low.
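A sketch of this alpha sweep and of extracting the retained features at alpha = 0.005 (the alpha grid is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Trace the cv rmse and the number of retained variables as alpha grows.
for alpha in [0.001, 0.005, 0.01, 0.05, 0.1]:
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    rmse = -cross_val_score(lasso, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    kept = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha}: cv rmse={rmse:.4f}, variables kept={kept}")

# Fix alpha = 0.005 and keep the features with non-zero coefficients.
lasso = Lasso(alpha=0.005, max_iter=10000).fit(X_train, y_train)
selected = X_train.columns[lasso.coef_ != 0]
```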
When we fix alpha = 0.005, the Lasso model keeps the 22 most important explanatory variables: all of the numerical variables, 8 dummy variables for directors and 9 dummy variables for writers. All of the main-character dummies are removed, indicating that the main characters might have little impact on the rating.
Afterwards, we feed the selected 22 features into our polynomial regression model. To avoid an overly complicated model, we build a polynomial regression model without interaction terms. To start with, we use only the selected features and the rating to generate the training and test sets, and then five cross-validation folds are produced from the training set.
After these steps, a grid search is applied to determine the best degree for each numerical feature based on the cross-validation rmse. The best degrees are then passed to a column transformer that creates the corresponding polynomial terms, and the transformed data are fed automatically into the linear regression step through a pipeline. Finally, the results are summarised with the helper functions that extract the estimated coefficients and the rmse. The final test rmse is about 0.2772, which is lower than that of the Lasso regression (about 0.34). Overall, the polynomial regression model outperforms the other models.
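A sketch of this degree search and pipeline, continuing from the feature-selection sketch above; the per-feature degree grid of 1 to 3 is illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Numerical survivors get their own polynomial expansion (no interactions);
# the selected dummy variables are passed through unchanged.
num_selected = [c for c in selected if c in num_cols]
poly = ColumnTransformer(
    [(col, PolynomialFeatures(degree=1, include_bias=False), [col])
     for col in num_selected],
    remainder="passthrough")
model = Pipeline([("poly", poly), ("lm", LinearRegression())])

# Grid-search the degree of each numerical feature separately.
param_grid = {f"poly__{col}__degree": [1, 2, 3] for col in num_selected}
gs = GridSearchCV(model, param_grid, cv=5,
                  scoring="neg_root_mean_squared_error")
gs.fit(X_train[selected], y_train)

print("best degrees:", gs.best_params_)
print("test rmse:", -gs.score(X_test[selected], y_test))
```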
4. Discussion & Conclusions
Finally, we choose the polynomial model. We first use Lasso regression to select the features that are most strongly related to the IMDB rating, and then fit the polynomial model on these selected features. The expression of the polynomial model is shown below.
$$
\begin{aligned}
y ={} & 8.4829 + 0.555x_1 - 2.4883x_1^2 + 0.0362x_1^3 + 0.232x_2 + 0.0057x_2^2 - 0.0187x_2^3 \\
      & + 0.0851x_3 - 0.0169x_3^2 - 0.122x_4 - 0.0634x_4^2 + 0.0357x_4^3 + 0.0888x_5 - 0.0358x_5^2 - 0.012x_5^3 \\
      & + 0.1916x_6 - 0.8871x_7 - 0.0592x_8 + 0.1233x_9 + 0.1967x_{10} - 0.0976x_{11} + 0.0927x_{12} + 0.4427x_{13} \\
      & - 0.4743x_{14} + 0.0429x_{15} + 0.0206x_{16} - 0.2159x_{17} + 0.0604x_{18} + 0.244x_{19} + 0.0679x_{20} \\
      & - 0.0106x_{21} - 0.1942x_{22}
\end{aligned}
$$
The root mean squared error of our model is 0.2772. We can also evaluate the model by checking the fit plot and the residual plot. Generally speaking, it is a good model, but it has some shortcomings: it performs poorly when predicting low ratings. In the lower-left corner of the fit plot, the points lie further from the diagonal line than elsewhere, and the same pattern appears in the residual plot. This is most likely because low-rated episodes are rare in the dataset, whereas ratings between 7.5 and 9 are plentiful, so the prediction error in that range is small.
Because the formula contains many polynomial terms, it is difficult to explain intuitively how the numerical features influence the response variable. Therefore, we plot the response variable against each numerical variable in turn to show how each one affects the predicted rating.
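One way to produce such plots is to vary one standardized feature over a grid while holding the remaining features at their training means and predicting with the fitted pipeline; a sketch, continuing from the earlier code:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

best_model = gs.best_estimator_
X_sel = X_train[selected]

n = len(num_selected)
fig, axes = plt.subplots(1, n, figsize=(3 * n, 3), squeeze=False)
for ax, col in zip(axes[0], num_selected):
    # Vary one standardized feature over its observed range while holding
    # every other feature at its training mean.
    grid = np.linspace(X_sel[col].min(), X_sel[col].max(), 100)
    profile = pd.concat([X_sel.mean().to_frame().T] * len(grid),
                        ignore_index=True)
    profile[col] = grid
    ax.plot(grid, best_model.predict(profile))
    ax.set(xlabel=f"{col} (standardised)", ylabel="predicted imdb_rating")
plt.tight_layout()
```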
From the figures above, we can read off the standardized value of each numerical feature at which the response variable imdb_rating is maximised, and then transform these values back to the original scale.
In summary, we suggest choosing Steve Carell and Harold Ramis as directors, while Claire Scanlon as director is associated with lower ratings in our model. Lee Eisenberg and Gene Stupnitsky are a good choice of writers, whereas Allison Silverman and Charlie Grandy should be avoided. For the script, we suggest keeping the number of spoken lines around 490, the number of lines containing a stage direction around 110, and the number of dialogue words around 2700; around 27 different characters with spoken lines would be a good choice.