# MA 346 Final Project- Data Analysis

## Cassidy Gorsky & Vivian Xia

This project has two goals: first, to explore the relationships among distributor gross, share, and ticket sales, both with one another and across different years; and second, to use the gathered data to build a machine learning model that predicts whether movie distributors will make more money per movie in the following year, based on the current year's ticket sales and distributor shares.

The data used in this project is sourced from https://www.the-numbers.com/. This website contains various tables of information on movies from different years, including distributor share, gross, and ticket sales information for each year.

Our model was trained on 2017 and 2018 data.
It achieved a strong *F1* score (a value representing a balance of precision and recall)
when used to predict, on a test dataset of 2018 and 2019 data,
whether distributors made more per movie the following year.
However, the model scored lower when predicting on the
2019 and 2020 data than on the 2018 and 2019 data. This outcome was expected
given the difficulties with filming, and with people being unable to go to the movies,
during the pandemic in 2020.

Our data and code are available in the following repository: https://github.com/vivian-xia/MA346_Final_Project. An additional resource for exploring our data interactively is available at: https://dry-dusk-65793.herokuapp.com/.

First, we need to install and import the relevant libraries used in our analysis.
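As a sketch, the imports might look like the following (the exact set of libraries is an assumption; the notebook may also use seaborn or scikit-learn for the plots and model described below):

```python
# Core libraries assumed for this analysis: pandas for data frames,
# numpy for numeric transforms, matplotlib for the plots below.
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the code also runs headless
import matplotlib.pyplot as plt
```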

We need to read in the data cleaning notebook, `DataCleaning.ipynb`, to obtain the data frames and global variables needed to complete our goals for this project.

### Creating a scatter plot

The data set `df_merge_gross` contains the 2019 and 2020 Gross/Movies values for each distributor, where Gross/Movies was computed by dividing each distributor's annual gross by its number of movies for that year, using the originally uploaded data.

By creating a scatter plot, we can see the relationship between how much a distributor made per movie in 2019 compared to that of 2020.

Looking at the trend line, we can see that its slope is less than one. This means that movie distributors generally made more money per movie in 2019 than in 2020.
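A minimal sketch of this plot, with hypothetical numbers standing in for the real `df_merge_gross` (which comes from the cleaning notebook):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Hypothetical stand-in for df_merge_gross; the real values come from the cleaning notebook.
df_merge_gross = pd.DataFrame({
    "2019 Gross/Movies": [50e6, 20e6, 5e6, 1e6],
    "2020 Gross/Movies": [30e6, 8e6, 2e6, 0.5e6],
})

x = df_merge_gross["2019 Gross/Movies"]
y = df_merge_gross["2020 Gross/Movies"]

fig, ax = plt.subplots()
ax.scatter(x, y)

# Least-squares trend line; a slope below 1 means distributors
# generally made less per movie in 2020 than in 2019.
slope, intercept = np.polyfit(x, y, 1)
ax.plot(x, slope * x + intercept)
ax.set_xlabel("2019 Gross/Movies")
ax.set_ylabel("2020 Gross/Movies")
```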

### Creating a heat map

We created a heat map containing `2019 Gross/Movies`, `2020 Gross/Movies`, `2019 Gross Max`, and `2020 Gross Max` from the `df_merge_gross` data frame.
This will allow us to see the correlations between the data columns.
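A sketch of such a heat map using matplotlib and synthetic stand-in data (the notebook may instead use seaborn's `heatmap`; the data here is made up):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

cols = ["2019 Gross/Movies", "2020 Gross/Movies", "2019 Gross Max", "2020 Gross Max"]

# Synthetic stand-in for df_merge_gross; the real values come from the cleaning notebook.
rng = np.random.default_rng(0)
base = rng.lognormal(mean=15, sigma=1, size=40)
df_merge_gross = pd.DataFrame({
    cols[0]: base,
    cols[1]: base * rng.uniform(0.2, 0.8, size=40),
    cols[2]: base * rng.uniform(1.5, 4.0, size=40),
    cols[3]: base * rng.uniform(1.0, 3.0, size=40),
})

corr = df_merge_gross[cols].corr()  # pairwise Pearson correlations

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(cols)))
ax.set_xticklabels(cols, rotation=45, ha="right")
ax.set_yticks(range(len(cols)))
ax.set_yticklabels(cols)
fig.colorbar(im, ax=ax)
```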

A high correlation of 0.93 between `2020 Gross/Movies` and `2020 Gross Max` makes sense
because the highest-grossing movie contributes to the gross-per-movie ratio:
if a distributor's highest-grossing movie grosses more, then its average gross per movie will also be larger.

Between `2020 Gross/Movies` and `2019 Gross/Movies`, there is a correlation of 0.6, a moderate positive correlation.
This means that the money made per movie in 2019 has only a modest effect on
how much money per movie the distributor made in 2020.
This trend can also be seen in the scatter plot above.

### Distribution of Tickets Sold

We created a histogram comparing the tickets sold in 2019 and 2020, using the `Tickets Sold` column of the dataframes `df_2019_top_movies` and `df_2020_top_movies`.
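A sketch of the overlaid histograms, with made-up ticket counts standing in for the real dataframes:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Made-up ticket counts standing in for the `Tickets Sold` columns of
# df_2019_top_movies and df_2020_top_movies.
rng = np.random.default_rng(0)
tickets_2019 = rng.lognormal(mean=13, sigma=1.5, size=600)
tickets_2020 = rng.lognormal(mean=11, sigma=1.5, size=330)

fig, ax = plt.subplots()
bins = np.linspace(0, 5e6, 6)  # 0 to 5 million tickets, in 1-million-wide bins
counts_2019, _, _ = ax.hist(tickets_2019, bins=bins, alpha=0.5, label="2019")
counts_2020, _, _ = ax.hist(tickets_2020, bins=bins, alpha=0.5, label="2020")
ax.set_xlabel("Tickets Sold")
ax.set_ylabel("Number of movies")
ax.legend()
```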

There is a significant difference between the number of tickets sold in 2019 versus 2020. The count in each bin is evidently greater for 2019 than for 2020. In the zero-to-one-million bin, for example, 2019 has more than 500 movies while 2020 has a little more than 300.

If interested in exploring more of this data, the following is a link to our dashboard: https://dry-dusk-65793.herokuapp.com/.

### Machine learning

We will be using supervised learning to create our model because we have both the data's inputs and outputs, and the computer will learn the relationship between them. We will build our model on 2017 and 2018 data from The Numbers website. These two distributor datasets, merged, will form our training data set.

For machine learning, we need to build the model from a training dataset.
Once we fit the model to the training dataset, we will be able to test it on a testing dataset.
We merged the 2018 and 2017 distributor datasets on `Distributor` to create `ml_dist_2018`.

The data set has a column `Gross/Movies Difference` that was created by subtracting `2017 Gross/Movies` from `2018 Gross/Movies`.
This helps us determine whether the distributor made more money per movie in 2018 or in 2017.
This column will be our model's response variable, and we changed it to a boolean data type.
The response variable is true, and set equal to 1, when a distributor made more money in 2018 compared to 2017.
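A sketch of this merge and response-variable construction, using hypothetical numbers in place of the real distributor tables:

```python
import pandas as pd

# Hypothetical distributor tables; the real ones come from The Numbers data.
df_2017 = pd.DataFrame({
    "Distributor": ["Walt Disney", "Warner Bros.", "A24"],
    "2017 Gross/Movies": [10e6, 5e6, 7e6],
})
df_2018 = pd.DataFrame({
    "Distributor": ["Walt Disney", "Warner Bros.", "A24"],
    "2018 Gross/Movies": [12e6, 3e6, 7e6],
})

# Merge on the shared Distributor column (an inner join by default).
ml_dist_2018 = df_2018.merge(df_2017, on="Distributor")

# Response: True (1) when the distributor made more per movie in 2018 than in 2017.
ml_dist_2018["Gross/Movies Difference"] = (
    ml_dist_2018["2018 Gross/Movies"] - ml_dist_2018["2017 Gross/Movies"]
) > 0
```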

The predictors in our training data will be the following variables: `2017 Share`, `2018 Share`, `2017 Tickets`, and `2018 Tickets`.

Some values in the `2018 Tickets` and `2017 Tickets` columns are
very large compared to those of the smaller distributors, so
we took the log of those columns in our data cleaning notebook.
This step reduces the influence of outliers.
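The log transform might look like this (a sketch, with hypothetical ticket counts; the column names follow the text):

```python
import numpy as np
import pandas as pd

# Hypothetical ticket counts spanning several orders of magnitude.
df = pd.DataFrame({
    "2017 Tickets": [150_000_000, 2_000_000, 40_000],
    "2018 Tickets": [120_000_000, 3_500_000, 55_000],
})

# Replace the raw counts with their natural log to compress the scale
# and reduce the influence of the largest distributors.
for col in ["2017 Tickets", "2018 Tickets"]:
    df[col] = np.log(df[col])
```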

The response variable is `Gross/Movies Difference`.

The model is fitted with the aforementioned predictors and response variable. Since our response variable is boolean, the model we fitted is a classification model.

The model was used to predict the response, and that prediction was used to compute the model's prediction rate on the data.

This prediction rate is high, which means that the model predicts the data set well.
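As a sketch of the fit-and-score step, assuming scikit-learn's `LogisticRegression` (the classifier choice is an assumption, as is the synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the four share/ticket predictors and the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # response depends on the first two columns

model = LogisticRegression()
model.fit(X, y)

# "Prediction rate": the fraction of rows the model classifies correctly.
rate = model.score(X, y)
```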

We now want to measure the success of our model.
To test the quality of the model, we found the precision, recall, and *F1*.
The `TP` calculated below represents the number of true positives among our model's predictions,
`FP` represents the number of false positives, and `FN` represents the number of false negatives.
We can then use these counts to calculate the quality of our model.
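The calculation can be sketched as follows; the counts are hypothetical, chosen only so that they reproduce the precision (0.975) and recall (about 0.83) reported in the text:

```python
# Hypothetical counts of true positives, false positives, and false negatives.
TP, FP, FN = 39, 1, 8

precision = TP / (TP + FP)  # of the predicted positives, how many are real
recall = TP / (TP + FN)     # of the real positives, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```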

With a precision value of 0.975, we can interpret that if the test says “positive,”
the odds that it is a true positive are 0.975.
A precision this high is very good for this model.
With a recall of 0.83, we can interpret that if the reality is “positive,”
the odds that the test will detect it are about 0.83.
The *F1* score combines precision and recall using their harmonic mean,
so *F1* is also a measure for comparing models that prioritizes a balance of both precision and recall.
An *F1* score of 0.897 is good for a model.

The number of rows in the data set is needed so that we can split the training dataset into two parts of 80% and 20%, where the 80% part will be the training data and the 20% part the validation data.

We only want to keep the predictor and response variables in the merged dataset.

The `training` data is chosen through a random pick of 80% (or 70 rows)
of the original training dataset. The selected rows are put into the dataframe `df_training`.

The `validation` data takes the remaining 20% (or 18 rows)
that were not picked and puts them into the dataframe `df_validation`.
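A sketch of this split using `DataFrame.sample`, with a stand-in frame of 88 rows so that the 70/18 split matches the counts above:

```python
import pandas as pd

# Stand-in frame with 88 rows, matching the 70/18 split described above.
df = pd.DataFrame({"x": range(88)})

n_train = round(len(df) * 0.8)                      # 70 rows
df_training = df.sample(n=n_train, random_state=1)  # random 80%
df_validation = df.drop(df_training.index)          # remaining 20% (18 rows)
```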

The previous steps of fitting a model to a data set and measuring the model's quality are wrapped into their own functions so that they can be called more easily as we test and score other models.
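One way these helpers could look (the notebook defines `fit_to_model()`; its exact body, the classifier choice, and the scoring helper's name are assumptions here):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_to_model(df, predictors, response):
    """Fit a classification model to the given columns (sketch)."""
    model = LogisticRegression()
    model.fit(df[predictors], df[response])
    return model

def model_f1(model, df, predictors, response):
    """Hypothetical scoring helper: F1 computed from TP/FP/FN counts."""
    pred = model.predict(df[predictors])
    actual = df[response].to_numpy()
    TP = int(((pred == 1) & (actual == 1)).sum())
    FP = int(((pred == 1) & (actual == 0)).sum())
    FN = int(((pred == 0) & (actual == 1)).sum())
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

# Quick demonstration on synthetic data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=60)})
demo["y"] = (demo["a"] > 0).astype(int)
f1 = model_f1(fit_to_model(demo, ["a"], "y"), demo, ["a"], "y")
```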

After fitting our model to the `df_training` dataset using the `fit_to_model()` function, we first looked at our model's coefficients for each of the predictors to decide whether we needed to omit a predictor from the model.
All of the coefficients (in absolute value) are similar, so we decided not to omit any of the predictors from our model.
Our model is a good fit for the `df_training` and `df_validation` data sets, since the *F1* scores are high.

We then experimented with various subsets of the predictor columns to find a subset that generalizes well to unseen data.
We first used the `2018 Share` and `2018 Tickets` columns.

The coefficients for `2018 Share` and `2018 Tickets` show
the relationship between these two variables and the response variable.

If the coefficient is positive, for every unit that variable increases, the
distributor is *more* likely to have made more money per movie in 2018 than 2017.
If the coefficient is negative, for every unit that variable increases, the
distributor is *less* likely to have made more money per movie in 2018 than 2017.
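This sign interpretation can be demonstrated on synthetic data (a sketch; the classifier choice and the data are assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Build data where the response rises with the first feature and falls with the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)
coef = model.coef_[0]  # the first coefficient comes out positive, the second negative

# Raising a positive-coefficient predictor raises the predicted
# probability of the positive class (all else held equal).
p_low = model.predict_proba([[0.0, 0.0]])[0, 1]
p_high = model.predict_proba([[1.0, 0.0]])[0, 1]
```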

We then used the `2017 Share` and `2017 Tickets` columns.

Similarly, the coefficients for `2017 Share` and `2017 Tickets` show
the relationship between these two variables and the response variable.

We fit the model back to `df_training` to test it on the testing data sets.

### Testing our model on new data

We now want to test our classification model on 2018 and 2019 distributor data.
We merged `df_2019_distributor` and `df_2018_distributor` on `Distributor` to create `ml_dist_2019`.
We then performed the same procedures as the previous merge to obtain a dataset with just our predictors and response variable.
As before, we also took the log of tickets sold for both years to reduce the influence of outliers.

With our `ml_dist_2019` dataframe, we score the data with our model.

A high *F1* score means this model is a good fit for our data in `ml_dist_2019`.

Repeating the same process as above, we gathered the 2020 and 2019 distributor
data into a merged dataframe called `ml_dist_2020` in order to score how well the model
predicts using the 2019 and 2020 data.

This model fits the testing data of `ml_dist_2020` less well than
that of `ml_dist_2019`: its *F1* score is lower than `ml_dist_2019`'s score.
This was expected, as the pandemic caused many productions to halt
filming and prevented people from going to the movies.

The model better predicted whether the distributor would end up making more money per movie in 2019 than in 2020.