Predicting Increase in Gross for Movie Distributors
MA346 Data Science Final Project - Data Analysis
Cassidy Gorsky & Vivian Xia
The project's goals are, first, to explore the relationships among distributor gross, share, and ticket sales, both with one another and across different years. Using the data gathered, the project also aims to build a machine learning model that predicts whether movie distributors will make more money per movie in the following year based on that year's ticket sales and distributor shares.
The data used in this project is sourced from https://www.the-numbers.com/. This website contains various tables of information on movies from different years, including distributor share, gross, and ticket sales information for each year.
The model we created was trained on 2015, 2016, 2017, and 2018 data. It achieved a good F1 score, a value representing a balance of precision and recall, when used to predict whether distributors made more per movie the following year on the test dataset of 2016, 2017, 2018, and 2019 data. However, the model scored lower when predicting with 2017, 2018, 2019, and 2020 data than with the 2016 through 2019 data. This outcome was expected, given the difficulties the 2020 pandemic created both for filming and for people going to the movies.
For additional information, visit the following:
- The associated repository link with the data and code is the following: https://github.com/vivian-xia/Predicting-Increase-in-Gross-for-Movie-Distributors.
- An additional exploratory resource of the data is in the following link: https://dry-dusk-65793.herokuapp.com/.
First, we need to install and import the relevant libraries used in our analysis.
We also need to read in the data cleaning notebook to obtain the data frames and global variables needed to complete our goals for this project.
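A minimal sketch of this setup; the exact import list and the cleaning notebook's filename are assumptions, not taken from the project:

```python
# Core libraries assumed for this analysis; the notebook may use more
# (e.g. seaborn for the heat map, scikit-learn for the model).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In Jupyter, the data cleaning notebook can be executed in place to pull
# its data frames and global variables into this session, e.g.:
# %run "data_cleaning.ipynb"   # hypothetical filename
```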
Creating a scatter plot
The data set df_merge_gross contains the 2019 and 2020 Gross/Movies values for each distributor, where Gross/Movies was computed by dividing each distributor's annual gross by its number of movies in the original uploaded data for each year.
By creating a scatter plot, we can see the relationship between how much a distributor made per movie in 2019 compared to that of 2020.
Looking at this trend line, we can see that the slope is less than one. This means that movie distributors generally made more money per movie in 2019 than in 2020.
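The scatter plot and trend line can be sketched as below. The df_merge_gross values here are synthetic stand-ins for the real cleaned data, and the column names are assumed from the description above.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_merge_gross; the real frame comes from the
# data cleaning notebook.
df_merge_gross = pd.DataFrame({
    "2019 Gross/Movies": [5e6, 2e7, 8e7, 1.5e8, 3e8],
    "2020 Gross/Movies": [1e6, 5e6, 3e7, 6e7, 1e8],
})

x = df_merge_gross["2019 Gross/Movies"]
y = df_merge_gross["2020 Gross/Movies"]

# Least-squares trend line: slope < 1 means less gross per movie in 2020
slope, intercept = np.polyfit(x, y, 1)

plt.scatter(x, y)
plt.plot(x, slope * x + intercept)
plt.xlabel("2019 Gross/Movies")
plt.ylabel("2020 Gross/Movies")
plt.title("Gross per movie: 2019 vs. 2020")
```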
Creating a heat map
We created a heat map of the df_merge_gross data frame, including the 2019 Gross Max and 2020 Gross Max columns. This will allow us to see the correlations between the data columns.
The high correlation of 0.93 between 2020 Gross/Movies and 2020 Gross Max makes sense because the highest-grossing movie contributes to the gross/movies ratio: if the distributor's highest-grossing movie is larger, then the average gross per movie will also be larger.
Between 2020 Gross/Movies and 2019 Gross/Movies, there is a correlation of 0.6, a weak positive correlation. This means the money made per movie in 2019 has only a small effect on how much money per movie the distributor made in 2020. This trend can also be seen in the scatter plot above.
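A correlation heat map of this kind can be sketched as follows; the data here is synthetic, and seaborn's sns.heatmap(corr, annot=True) would give the same picture if seaborn is installed.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical slice of df_merge_gross; real values come from the cleaned data.
df = pd.DataFrame({
    "2019 Gross/Movies": [5e6, 2e7, 8e7, 1.5e8],
    "2020 Gross/Movies": [1e6, 9e6, 2e7, 9e7],
    "2019 Gross Max":    [2e7, 9e7, 3e8, 6e8],
    "2020 Gross Max":    [4e6, 3e7, 9e7, 3e8],
})

corr = df.corr()  # pairwise Pearson correlations

# Render the correlation matrix as a heat map
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)), corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)), corr.columns)
fig.colorbar(im)
```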
Distribution of Tickets Sold
We created a histogram that compares the tickets sold in 2019 and 2020, using df_2020_top_movies and its Tickets Sold column.
There is a significant difference between the number of tickets sold in 2019 and 2020. The frequency of ticket purchases in 2019 in each bin is evidently greater than in 2020. For the zero to one million tickets sold bin, the 2019 count is more than 500 while the 2020 count is a little more than 300.
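A side-by-side histogram of the two years can be sketched as below; the ticket counts and bins here are illustrative, not the project's actual data.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Hypothetical ticket counts; the notebook uses the Tickets Sold column of
# df_2020_top_movies and its 2019 counterpart.
tickets_2019 = pd.Series([0.4e6, 0.9e6, 2e6, 5e6, 12e6])
tickets_2020 = pd.Series([0.1e6, 0.3e6, 0.8e6, 1.5e6, 3e6])

bins = [0, 1e6, 5e6, 10e6, 15e6]  # illustrative bin edges
plt.hist([tickets_2019, tickets_2020], bins=bins, label=["2019", "2020"])
plt.xlabel("Tickets Sold")
plt.ylabel("Frequency")
plt.legend()
```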
If interested in exploring more of this data, the following is a link to our dashboard: https://dry-dusk-65793.herokuapp.com/.
A supervised machine learning approach will be used to create the model, because both data inputs and outputs are available and the computer can learn their relationship from them. The model will be trained on 2015, 2016, 2017, and 2018 data from The Numbers website. These distributor datasets, merged together, will form our training data set.
In machine learning, a model must first be created from a training dataset. Once the model is fitted to the training dataset, it can be tested on a testing dataset.
The 2015, 2016, 2017, and 2018 distributor datasets are merged on Distributor to create the training data set. This data set has a column Gross/Movies Difference that was created by subtracting 2017 Gross/Movies from 2018 Gross/Movies. This will help determine whether the distributor made more money per movie in 2018 or 2017. This column will be the boolean-type response variable for the model: it is true (set equal to 1) when a distributor made more money per movie in 2018 than in 2017.
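Building this response variable can be sketched as below, assuming the merged training frame already has the Gross/Movies columns; the data and the "Response" column name are illustrative.

```python
import pandas as pd

# Synthetic stand-in for the merged training data
ml_dist = pd.DataFrame({
    "2017 Gross/Movies": [4e7, 1e7, 9e7],
    "2018 Gross/Movies": [6e7, 8e6, 1.2e8],
})

ml_dist["Gross/Movies Difference"] = (
    ml_dist["2018 Gross/Movies"] - ml_dist["2017 Gross/Movies"]
)
# 1 when the distributor made more per movie in 2018 than in 2017, else 0
ml_dist["Response"] = (ml_dist["Gross/Movies Difference"] > 0).astype(int)
```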
The predictors in our training data will be the yearly Share and Tickets columns. Some of the values in the Tickets columns are very large compared to the data for smaller distributors, so the log of those columns was taken in the data cleaning notebook. This step reduces the influence of outliers. The response variable is the Gross/Movies Difference column.
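The log transform can be sketched as below; the column name and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Ticket counts spanning several orders of magnitude (synthetic example)
df = pd.DataFrame({"2017 Tickets": [1.2e4, 3.5e6, 8.0e8]})

# Replace the raw counts with their natural log, compressing the scale
# so that very large distributors do not dominate the fit
df["2017 Tickets"] = np.log(df["2017 Tickets"])
```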
The model is fitted with the aforementioned predictors and response variable. Since the response variables are boolean values, the model we fitted is a classification model.
The model was used to predict the response, and that result was used to find the model's prediction rate on the data. This is a moderate prediction rate, which means that the model fits the data set fairly adequately.
The success of our model is then measured.
In order to test the quality of the model, the precision, recall, and F1 score are found. TP, calculated below, represents the number of true positives among our model's predictions; FP represents the number of false positives; and FN represents the number of false negatives. These measurements are used to calculate the quality of the model.
With a precision value of 0.572, we can interpret that if the test says "positive," the probability that it is a true positive is 0.572. A moderate precision of 0.572 is fair for this model. With a recall of 0.429, we can interpret that if the reality is "positive," the probability that the test will detect it is about 0.429.
The F1 score combines precision and recall using their harmonic mean, so F1 is also a measure for comparing models that prioritizes a balance of both precision and recall. An F1 score of 0.471 is below average for a model.
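These three metrics follow directly from the TP, FP, and FN counts; the counts below are illustrative, not the project's actual confusion-matrix values.

```python
# Illustrative counts of true positives, false positives, false negatives
TP, FP, FN = 20, 10, 20

precision = TP / (TP + FP)  # of predicted positives, the share that is correct
recall = TP / (TP + FN)     # of actual positives, the share that is detected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```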
The number of rows in the data set is needed so that we can split the training dataset into two parts, 80% and 20%, where the 80% part will be the training data and the 20% part will be the validation data.
Keep the predictor and response variables in the merged dataset.
The training data is chosen through a random pick of 80% of the original training dataset, and the selected rows are put into the dataframe df_training. The validation data takes the remaining 20% of rows that were not picked and puts them into the dataframe df_validation.
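The random 80/20 split can be sketched as below on a synthetic frame; the seed and data are assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed assumed for reproducibility

# Synthetic stand-in for the merged training data
df = pd.DataFrame({"x": range(100), "y": rng.integers(0, 2, 100)})

n = len(df)
train_idx = rng.choice(n, size=int(0.8 * n), replace=False)  # random 80%
df_training = df.iloc[train_idx]
df_validation = df.drop(df.index[train_idx])  # the remaining 20%
```

scikit-learn's train_test_split(df, test_size=0.2) would achieve the same split in a single call.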
The previous steps of fitting a model to the data set and measuring the quality of the model are made into their respective functions so that they can be more easily called as we test and score other models.
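A sketch of the two helper functions just described; LogisticRegression stands in for whichever classifier the notebook uses, and the column names in the demonstration are assumed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_model(df, predictors, response):
    """Fit a classification model on the given predictor/response columns."""
    model = LogisticRegression(max_iter=1000)
    model.fit(df[predictors], df[response])
    return model

def score_model(model, df, predictors, response):
    """Score a fitted model on a data set using the F1 metric."""
    preds = model.predict(df[predictors])
    return f1_score(df[response], preds)

# Tiny synthetic demonstration of the two helpers
rng = np.random.default_rng(0)
demo = pd.DataFrame({"2017 Tickets": rng.normal(size=100)})
demo["Response"] = (demo["2017 Tickets"] > 0).astype(int)

model = fit_model(demo, ["2017 Tickets"], "Response")
score = score_model(model, demo, ["2017 Tickets"], "Response")
```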
After fitting the model to the df_training dataset, the model's coefficients for each of the predictors are examined to see the relationship between each variable and the response variable. The coefficients (in absolute value) are similar, except for 2017 Tickets, which has a very large coefficient, and 2015 Share, which has the smallest coefficient.
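Comparing coefficient magnitudes can be sketched as below; the data is synthetic (the response is driven mainly by one column), and only three of the predictor names are used for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-in for df_training with three of the assumed predictors;
# the response depends mostly on "2017 Tickets" by construction
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["2015 Share", "2016 Share", "2017 Tickets"])
y = (X["2017 Tickets"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs.abs().sort_values())  # predictors ordered by influence
```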
It may be interesting to see how well the model does if
2015 Share is omitted from the model.
The model does not fit the df_validation data set well, since the F1 scores are low. The model scored the same after omitting 2015 Share, so that predictor is not significant in improving the model's precision and recall.
Observing the coefficients again, 2016 Share has a lower coefficient value than the other predictors. Let's see the score when 2016 Share is omitted.
Similarly, the score for the model does not change, so the predictor 2016 Share is not significant in improving the model.
It would also be interesting to see the scores and corresponding coefficients of the predictors without the 2015 year data.
The remaining predictors, including the 2016 Tickets and 2017 Tickets columns, are used to see the score of a model without 2015 data.
The model scored worse, so 2015 data is useful in training the model to predict the response. As seen previously when using all predictors except 2015 Share, though, only 2015 Tickets may be significant in predicting the response.
We fit the model back to the original six predictors from
df_training to test it on the testing data sets.
Testing our model on new data
To test our classification model on 2016, 2017, 2018, and 2019 distributor data, df_2019_distributor and the 2016, 2017, and 2018 data are merged on Distributor to create ml_dist_2019.
The same procedures as the previous merge are performed to obtain a dataset with just the predictors and response variable.
As before, the log of tickets sold was taken for the years to reduce the influence of outliers.
Then the model is scored. This F1 score means the model is a good fit for the data in ml_dist_2019.
Repeating the same process as above, the 2020, 2019, 2018, and 2017 distributor data is merged into a dataframe called ml_dist_2020 in order to score how well the model predicts using the 2017, 2018, 2019, and 2020 data.
The model fits less well to the testing data of ml_dist_2020; its F1 score is lower than the ml_dist_2019 score. This was expected, as the pandemic caused many productions to halt filming and prevented people from going to the movies.
The model better predicted whether a distributor would end up making more money per movie in 2019 than whether it would in 2020. It may be useful to look at predictor variables other than Share and Ticket sales.