# Predicting Increase in Gross for Movie Distributors

#### MA346 Data Science Final Project - Data Analysis

#### Cassidy Gorsky & Vivian Xia

The project's goals are, first, to explore the relationships among distributor gross, share, and ticket sales, both with one another and across different years. Using the data gathered, the project also aims to create a machine learning model that predicts whether movie distributors will make more money per movie in the following year, based on that year's ticket sales and distributor shares.

The data used in this project is sourced from https://www.the-numbers.com/. This website contains various tables of information on movies from different years, including distributor share, gross, and ticket sales information for each year.

The model that we created was trained on 2015, 2016, 2017, and 2018 data. It achieved a good *F1* score, a value representing a balance of precision and recall, when used to predict whether distributors made more per movie the following year on the test dataset of 2016, 2017, 2018, and 2019 data. However, the model scored lower when predicting with 2017, 2018, 2019, and 2020 data than with the 2016, 2017, 2018, and 2019 data. This outcome was expected, given the difficulties with filming and with people being unable to go to the movies during the pandemic in 2020.

For additional information, visit the following:

- The associated repository link with the data and code is the following: https://github.com/vivian-xia/Predicting-Increase-in-Gross-for-Movie-Distributors.
- An additional exploratory resource of the data is in the following link: https://dry-dusk-65793.herokuapp.com/.

First, we need to install and import the relevant libraries used in our analysis.
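The specific libraries aren't listed here; a typical set for this kind of analysis (pandas for data frames, matplotlib for plots, scikit-learn for the model) might look like:

```python
# Core libraries assumed throughout this notebook.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render plots off-screen (no display needed)
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
```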

We need to read in the data cleaning notebook, `DataCleaning.ipynb`, to obtain the data frames and global variables needed to complete our goals for this project.

### Creating a scatter plot

The data set `df_merge_gross` contains the 2019 and 2020 Gross/Movies for each distributor, where Gross/Movies was computed by dividing each distributor's annual gross by its number of movies for that year, using the original uploaded data.

By creating a scatter plot, we can see the relationship between how much a distributor made per movie in 2019 compared to that of 2020.
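A minimal sketch of such a scatter plot with a fitted trend line, using a small hypothetical stand-in for `df_merge_gross` (the real data frame comes from the cleaning notebook):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_merge_gross.
df_merge_gross = pd.DataFrame({
    "2019 Gross/Movies": [80e6, 45e6, 12e6, 5e6, 60e6],
    "2020 Gross/Movies": [20e6, 15e6, 4e6, 2e6, 18e6],
})

x = df_merge_gross["2019 Gross/Movies"]
y = df_merge_gross["2020 Gross/Movies"]
slope, intercept = np.polyfit(x, y, 1)  # least-squares trend line

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.plot(x, slope * x + intercept)
ax.set_xlabel("2019 Gross/Movies")
ax.set_ylabel("2020 Gross/Movies")
fig.savefig("gross_scatter.png")
```

With the stand-in numbers above, the fitted slope comes out well below one, mirroring the trend described next.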

Looking at this trend line, we can see that the slope is less than one. This relationship means that movie distributors generally made more money per movie in 2019 than in 2020.

### Creating a heat map

We created a heat map containing `2019 Gross/Movies`, `2020 Gross/Movies`, `2019 Gross Max`, and `2020 Gross Max` from the `df_merge_gross` data frame. This will allow us to see the correlations between the data columns.
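One way to build such a heat map from the pairwise correlations; the data here is a hypothetical stand-in for `df_merge_gross` with the four columns named above:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_merge_gross.
df = pd.DataFrame({
    "2019 Gross/Movies": [80e6, 45e6, 12e6, 5e6, 60e6],
    "2020 Gross/Movies": [20e6, 15e6, 4e6, 2e6, 18e6],
    "2019 Gross Max":    [300e6, 150e6, 40e6, 10e6, 250e6],
    "2020 Gross Max":    [60e6, 40e6, 9e6, 3e6, 55e6],
})

corr = df.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.tight_layout()
fig.savefig("gross_heatmap.png")
```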

A high correlation of 0.93 between `2020 Gross/Movies` and `2020 Gross Max` makes sense because the highest-grossing movie contributes to the gross/movies ratio; if a distributor's highest-grossing movie is larger, then its average gross per movie will also be larger.

Between `2020 Gross/Movies` and `2019 Gross/Movies`, there is a correlation of 0.6, a moderate positive correlation. This means the money made per movie in 2019 has only a limited effect on how much money per movie the distributor will make in 2020. This trend can also be seen in the scatter plot above.

### Distribution of Tickets Sold

We created a histogram comparing the tickets sold in 2019 and 2020, using the `Tickets Sold` column of the dataframes `df_2019_top_movies` and `df_2020_top_movies`.

There is a significant difference between the number of tickets sold in 2019 versus 2020. The frequency in each bin is evidently greater for 2019 than for 2020. In the zero-to-one-million-tickets bin, for example, 2019 has more than 500 movies while 2020 has a little more than 300.
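The comparison above can be sketched with overlaid histograms; the two data frames here are hypothetical stand-ins for `df_2019_top_movies` and `df_2020_top_movies`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the two top-movies tables.
rng = np.random.default_rng(346)
df_2019_top_movies = pd.DataFrame(
    {"Tickets Sold": rng.integers(10_000, 5_000_000, 600)})
df_2020_top_movies = pd.DataFrame(
    {"Tickets Sold": rng.integers(10_000, 1_500_000, 320)})

fig, ax = plt.subplots()
bins = np.linspace(0, 5_000_000, 6)  # 1M-wide bins
ax.hist([df_2019_top_movies["Tickets Sold"],
         df_2020_top_movies["Tickets Sold"]],
        bins=bins, label=["2019", "2020"])
ax.set_xlabel("Tickets Sold")
ax.set_ylabel("Number of movies")
ax.legend()
fig.savefig("tickets_hist.png")
```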

If interested in exploring more of this data, the following is a link to our dashboard: https://dry-dusk-65793.herokuapp.com/.

### Machine learning

A supervised machine learning approach will be used to create the model, since both data inputs and outputs are available for the computer to learn their relationship. The model will be based on 2015, 2016, 2017, and 2018 data from The Numbers website. These distributor data sets, merged together, will form our training data set.

For the model used in Machine Learning, the model must be created based on a training dataset.
Once the model is fitted to the training dataset, the model can be tested on a testing dataset.
The 2015, 2016, 2017, and 2018 distributor datasets are merged on `Distributor` to create `ml_dist_2018`.
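A hedged sketch of that merge, using tiny hypothetical yearly tables in place of the real ones from the cleaning notebook:

```python
from functools import reduce
import pandas as pd

# Tiny hypothetical yearly distributor tables.
tables = [
    pd.DataFrame({"Distributor": ["A", "B"], "2015 Share": [0.28, 0.09]}),
    pd.DataFrame({"Distributor": ["A", "B"], "2016 Share": [0.31, 0.11]}),
    pd.DataFrame({"Distributor": ["A", "B"], "2017 Share": [0.30, 0.10]}),
    pd.DataFrame({"Distributor": ["A", "B"], "2018 Share": [0.29, 0.12]}),
]

# Chain inner joins on Distributor, keeping only distributors
# present in every year's table.
ml_dist_2018 = reduce(lambda l, r: l.merge(r, on="Distributor"), tables)
```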

The data set has a column `Gross/Movies Difference` that was created by subtracting `2017 Gross/Movies` from `2018 Gross/Movies`. This helps determine whether the distributor made more money per movie in 2018 or 2017. This column will be the boolean-type response variable for the model: it is true (set equal to 1) when a distributor made more money per movie in 2018 than in 2017.
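The response column can be built in one line; the gross values here are a hypothetical slice of `ml_dist_2018`:

```python
import pandas as pd

# Hypothetical slice of ml_dist_2018 with the two gross columns.
ml_dist_2018 = pd.DataFrame({
    "2017 Gross/Movies": [50e6, 8e6, 12e6],
    "2018 Gross/Movies": [55e6, 6e6, 20e6],
})

# 1 when the distributor made more per movie in 2018 than in 2017.
ml_dist_2018["Gross/Movies Difference"] = (
    ml_dist_2018["2018 Gross/Movies"]
    - ml_dist_2018["2017 Gross/Movies"] > 0
).astype(int)
```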

The predictors in our training data will be the following variables: `2015 Share`, `2016 Share`, `2017 Share`, `2015 Tickets`, `2016 Tickets`, and `2017 Tickets`.

Some values in the `Tickets` columns are very large compared to those of smaller distributors, so the log of those columns was taken in the data cleaning notebook. This step reduces the influence of outliers.
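The transformation itself is simple; the ticket counts below are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical raw ticket counts; large distributors dwarf small ones.
df = pd.DataFrame({"2015 Tickets": [120_000_000, 900_000, 15_000],
                   "2016 Tickets": [110_000_000, 750_000, 22_000],
                   "2017 Tickets": [130_000_000, 640_000, 18_000]})

# Natural log compresses the range, damping the outliers' influence.
for col in ["2015 Tickets", "2016 Tickets", "2017 Tickets"]:
    df[col] = np.log(df[col])
```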

The response variable is the `Gross/Movies Difference`.

The model is fitted with the aforementioned predictors and response variable. Since the response variables are boolean values, the model we fitted is a classification model.
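The write-up doesn't name the classifier; a logistic-regression sketch under that assumption, fitted on hypothetical training data with the six predictors and the boolean response:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical training frame with the six predictors and the response.
rng = np.random.default_rng(0)
predictors = ["2015 Share", "2016 Share", "2017 Share",
              "2015 Tickets", "2016 Tickets", "2017 Tickets"]
df = pd.DataFrame(rng.normal(size=(40, 6)), columns=predictors)
df["Gross/Movies Difference"] = rng.integers(0, 2, 40)

# Fit a classifier: boolean response -> classification, not regression.
model = LogisticRegression(max_iter=1000)
model.fit(df[predictors], df["Gross/Movies Difference"])
preds = model.predict(df[predictors])
```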

The model was used to predict the response, and that result was used to find the model's prediction rate on the data.

This is a moderate prediction rate, which means that the model fits the data set fairly adequately.

The success of our model is then measured. To test the quality of the model, the precision, recall, and *F1* are found. The `TP` calculated below represents the number of true positives among our model's predictions, `FP` represents the number of false positives, and `FN` represents the number of false negatives. These measurements are used to calculate the quality of the model.
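The three metrics follow directly from those counts; the counts below are illustrative, while the real `TP`, `FP`, and `FN` come from comparing predictions against the actual responses:

```python
# Illustrative confusion counts.
TP, FP, FN = 12, 9, 16

precision = TP / (TP + FP)  # of predicted positives, how many are real
recall = TP / (TP + FN)     # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```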

With a precision value of 0.572, we can interpret that if the test said “positive,” the odds that it’s a true positive are 0.572. A moderate precision of 0.572 is fair for this model.

With a recall of 0.429, we can interpret that if the reality is “positive,” the odds that the test will detect that is about 0.429.

The *F1* score combines precision and recall using the harmonic mean,
so *F1* is also a measure for comparing models, prioritizing a balance of both precision and recall.
An *F1* score of 0.471 is below average for a model.

The number of rows in the data set is needed so that we can split the training dataset into two parts of 80% and 20%, where the 80% part will be the training data and the 20% will be the validation data.

Keep the predictor and response variables in the merged dataset.

The `training` data is chosen through a random pick of 80% of the original training dataset. The selected `training` data is put into the dataframe `df_training`.

The `validation` data takes the remaining 20% of rows that were not picked and puts them into the dataframe `df_validation`.
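In pandas this split can be done with `sample` and `drop`; the table below is a hypothetical stand-in for the merged training data, and the seed is an arbitrary choice for repeatability:

```python
import numpy as np
import pandas as pd

# Hypothetical merged training table with 50 rows.
df = pd.DataFrame({"x": np.arange(50), "y": np.arange(50) % 2})

# Random 80% for training; the untouched 20% becomes validation.
df_training = df.sample(frac=0.8, random_state=346)
df_validation = df.drop(df_training.index)
```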

The previous steps of fitting a model to the data set and measuring the quality of the model are made into its respective functions so that they can be more easily called upon as we test and score other models.
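A hedged sketch of what those helpers might look like: `fit_to_model()` is named in the text, while `score_model()`, its signature, and the logistic-regression choice are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

PREDICTORS = ["2015 Share", "2016 Share", "2017 Share",
              "2015 Tickets", "2016 Tickets", "2017 Tickets"]
RESPONSE = "Gross/Movies Difference"

def fit_to_model(df, predictors=PREDICTORS, response=RESPONSE):
    """Fit a classifier to the given predictors and response."""
    model = LogisticRegression(max_iter=1000)
    model.fit(df[predictors], df[response])
    return model

def score_model(model, df, predictors=PREDICTORS, response=RESPONSE):
    """Return (precision, recall, F1) of the model on df."""
    pred = model.predict(df[predictors])
    actual = df[response]
    tp = int(((pred == 1) & (actual == 1)).sum())
    fp = int(((pred == 1) & (actual == 0)).sum())
    fn = int(((pred == 0) & (actual == 1)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```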

After fitting the model to the `df_training` dataset using the `fit_to_model()` function, the model's coefficients for each of the predictors are observed to see the relationship between the variables and the response variable.

The coefficients (in absolute value) are similar except for `2017 Tickets`, which has a very large coefficient, and `2015 Share`, which has the smallest coefficient.
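One way to view the coefficients next to the predictor names, continuing the logistic-regression assumption on hypothetical data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

predictors = ["2015 Share", "2016 Share", "2017 Share",
              "2015 Tickets", "2016 Tickets", "2017 Tickets"]
rng = np.random.default_rng(7)
X = pd.DataFrame(rng.normal(size=(40, 6)), columns=predictors)
y = rng.integers(0, 2, 40)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each predictor with its coefficient, sorted by magnitude.
coefs = pd.Series(model.coef_[0], index=predictors)
print(coefs.reindex(coefs.abs().sort_values().index))
```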

It may be interesting to see how well the model does if `2015 Share` is omitted from the model. The model does not fit the `df_training` and `df_validation` data sets well, since the *F1* scores are low.

The model scored the same after omitting `2015 Share`. The predictor is not significant in improving the model's precision and recall.

Observing the coefficients again, `2016 Share` has a lower coefficient value than the other predictors. Let's see the score when `2016 Share` is omitted.

Similarly, the score for the model does not change, so the predictor `2016 Share` is not significant in the model.

It would also be interesting to see the scores and corresponding coefficients of the predictors without the 2015 year data.

The `2016 Share`, `2017 Share`, `2016 Tickets`, and `2017 Tickets` columns are used to see the score of a model without 2015 data.

The model scored worse, so the 2015 data is useful in training the model to predict the response. But as seen previously when using all predictors except `2015 Share`, only `2015 Tickets` may be significant in predicting the response.

We fit the model back to the original six predictors from `df_training` to test it on the testing data sets.

### Testing our model on new data

To test our classification model on 2016, 2017, 2018, and 2019 distributor data, `df_2019_distributor` and the 2016, 2017, and 2018 data are merged on `Distributor` to create `ml_dist_2019`. The same procedures as in the previous merge are performed to obtain a dataset with just the predictors and response variable. As before, the log of tickets sold was taken for each year to reduce the influence of outliers.

With the `ml_dist_2019` dataframe, the model is scored.

This *F1* score means the model is a good fit for the data in `ml_dist_2019`.

Repeating the same process as above, the 2020, 2019, 2018, and 2017 distributor data is merged into a dataframe called `ml_dist_2020` in order to score how well the model predicts using the 2017, 2018, and 2019 data.

This model fits the testing data of `ml_dist_2020` less well than that of `ml_dist_2019`: its *F1* score is lower than `ml_dist_2019`'s. This was expected, as the pandemic caused many productions to halt filming and prevented people from going to the movies.

The model better predicted whether a distributor would end up making more money per movie in 2019 than in 2020. It may be useful to look at predictor variables other than Share and Ticket sales.