# MA 346 Final Project - Data Cleaning

## Cassidy Gorsky & Vivian Xia

Before exploring our data and the relationships between variables across different years, the data needs to be imported and cleaned.

First, we need to download our relevant data sets, which contain various information about movies in 2019 and 2020.
Our data source, https://www.the-numbers.com/, contains various information on movies from different years.
The information is displayed in tables throughout the site.
Since the data was stored in tables, it was easiest for us to import our data using `pd.read_html` instead of copying and pasting the tables into a CSV file.
Our goal is to see how the pandemic impacted the movie industry by comparing different measurements of 2019 and 2020.

#### Importing the data

We made a function to import the specific tables we needed from The Numbers site. This allows us to grab as many tables as we need by changing the input variables. The `year` and `table_title` input variables for our function help distinguish which tables we want from the website.
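A minimal sketch of such a helper is below. The URL pattern and the way the site labels its tables are assumptions here, not the notebook's exact code.

```python
import pandas as pd

# Assumed URL pattern for The Numbers yearly market summary pages.
BASE_URL = "https://www.the-numbers.com/market/{year}/summary"

def import_table(year, table_title):
    """Return the table on the given year's page whose first header matches table_title."""
    tables = pd.read_html(BASE_URL.format(year=year))  # one DataFrame per <table> on the page
    for df in tables:
        if str(df.columns[0]) == table_title:
            return df
    raise ValueError(f"no table titled {table_title!r} on the {year} page")
```

`pd.read_html` returns a list of dataframes, one per HTML table, so the helper only needs to pick out the one whose heading matches.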

Because the two dataframes `df_2019_top_movies` and `df_2020_top_movies` had rows of extraneous words rather than table data, we dropped the last two rows of both dataframes.
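On a toy stand-in for `df_2019_top_movies` (the movie names and values are invented), dropping the trailing footer rows looks like this:

```python
import pandas as pd

# Toy stand-in: the last two scraped rows are page footer text, not movie data.
df_2019_top_movies = pd.DataFrame({
    "Movie": ["Avengers: Endgame", "The Lion King", "More information...", "Contact us"],
    "2019 Gross": ["$858,373,000", "$543,638,043", None, None],
})

# iloc[:-2] keeps everything above the two footer rows.
df_2019_top_movies = df_2019_top_movies.iloc[:-2]
```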

Looking at the data types, we realized that the `2019 Gross` column in the dataframes with 2019 data and the `2020 Gross` column in the dataframes with 2020 data have an object data type. For example, the `2019 Gross` column of `df_2019_distributor` is an object because every row contains a dollar sign and commas.

We want to change these columns to a float data type so that we can further analyze the numbers associated with the movie distributors.

We then replaced the extra characters with `''` so that they were deleted from the columns of data. This step allowed us to change the `2019 Gross` and `2020 Gross` columns to a float data type.
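The cleaning step can be sketched with `str.replace` on toy data (the gross figures below are invented):

```python
import pandas as pd

# Toy version of df_2019_distributor with the object-typed gross column.
df_2019_distributor = pd.DataFrame({
    "Distributor": ["Walt Disney", "Warner Bros."],
    "2019 Gross": ["$11,120,000,000", "$4,367,000,000"],
})

# The character class [$,] strips both the dollar sign and the commas,
# after which astype(float) succeeds.
df_2019_distributor["2019 Gross"] = (
    df_2019_distributor["2019 Gross"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)
```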

Now that the `2019 Gross` and `2020 Gross` columns have a float data type, we were able to use our 2019 and 2020 distributor data frames to create a new column holding the average gross per movie for each distributor in each of those years. The new column was created by dividing the annual gross by the number of movies produced by a distributor. We added each column to `df_2019_distributor` and `df_2020_distributor` by their corresponding year.
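The per-movie average is a plain elementwise division of two columns; here is a sketch on invented numbers (the column name `Movies` is an assumption about the table's heading):

```python
import pandas as pd

# Toy df_2019_distributor after the gross column has been converted to float.
df_2019_distributor = pd.DataFrame({
    "Distributor": ["Walt Disney", "Warner Bros."],
    "2019 Gross": [3.72e9, 2.10e9],
    "Movies": [13, 21],
})

# Average gross per movie for each distributor in 2019.
df_2019_distributor["2019 Gross/Movies"] = (
    df_2019_distributor["2019 Gross"] / df_2019_distributor["Movies"]
)
```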

To find the maximum gross from a movie of each distributor in 2019, we grouped the data frame with all the movies of 2019, `df_2019_top_movies`, by distributor to find the maximum `2019 Gross`. We then made that Series into a data frame and merged it with `df_2019_distributor`, matching on the `Distributor` column. The resulting data frame `df_dist_2019` has each distributor and its corresponding `2019 Gross Max` and `2019 Gross/Movies`.

Similarly, we used groupby to find the maximum gross from a movie of each distributor in 2020 and merged to create the corresponding data frame, `df_dist_2020`.
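The groupby-then-merge pattern can be sketched on toy data (movie names and grosses invented):

```python
import pandas as pd

# Toy per-movie table: several movies per distributor.
df_2019_top_movies = pd.DataFrame({
    "Movie": ["A", "B", "C", "D"],
    "Distributor": ["Disney", "Disney", "Warner", "Warner"],
    "2019 Gross": [8.0e8, 5.0e8, 3.0e8, 3.5e8],
})
df_2019_distributor = pd.DataFrame({
    "Distributor": ["Disney", "Warner"],
    "2019 Gross/Movies": [4.0e8, 2.0e8],
})

# groupby + max gives a Series indexed by distributor; rename + reset_index
# turns it back into a mergeable dataframe.
max_gross = (
    df_2019_top_movies.groupby("Distributor")["2019 Gross"]
    .max()
    .rename("2019 Gross Max")
    .reset_index()
)
df_dist_2019 = df_2019_distributor.merge(max_gross, on="Distributor")
```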

### Merging two dataframes

Once we created `df_dist_2019` and `df_dist_2020`, we merged the two data frames on `Distributor`.

There were more movie distributors that made money in 2019 than in 2020, so fewer distributors were included in the 2020 dataframe. We used a right merge because `df_dist_2019` contains more rows than `df_dist_2020`. We then replaced all `np.nan` values with zeros to represent that those movie distributors did not make any money in 2020.
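A sketch of the right merge plus `fillna`, on invented distributor values:

```python
import pandas as pd

df_dist_2019 = pd.DataFrame({
    "Distributor": ["Disney", "Warner", "A24"],
    "2019 Gross Max": [8.0e8, 3.5e8, 5.0e7],
})
df_dist_2020 = pd.DataFrame({
    "Distributor": ["Disney", "Warner"],
    "2020 Gross Max": [2.0e8, 1.5e8],
})

# how="right" keeps every distributor in df_dist_2019; distributors with no
# 2020 row get NaN, which fillna(0) replaces (no 2020 earnings).
df_dist = df_dist_2020.merge(df_dist_2019, on="Distributor", how="right").fillna(0)
```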

Now that these two datasets are merged, we can explore the relationships between the variables and across different years.

Refer to the scatter plot, heat map, and histogram in DataAnalysis.ipynb for the code, output of the graphs, and analysis on the relationships shown.

### Machine learning

First, we load our training dataset by importing the 2017 and 2018 distributor dataframes from The Numbers website using our earlier function `import_table()`. Later, we will merge these two datasets to form our training dataset.

Before merging the two datasets, we need to clean the data. To start, the `2018 Gross` column in `df_2018_distributor` was not a float data type because of the extra characters included in the data, so we removed them in order to change the column to float.

The `Share` column also had extra characters. We performed the same procedure to remove the characters and change the column to a float data type. The `Tickets` column must also be converted to a float for a later purpose.

Similar to the previous step, the `df_2017_distributor` columns `2017 Gross`, `Share`, and `Tickets` are cleaned and converted to a float data type.

The average gross of each distributor was found by dividing the corresponding year's total gross by the number of movies made by that distributor in the same year. The average gross was stored in a new column in the corresponding dataframe.

The two datasets share some column headings. Before merging the two, we need to change the column headers so we can distinguish between the 2017 data and the 2018 data.
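The renaming step can be sketched with `DataFrame.rename` (the toy values and the exact new headings are assumptions):

```python
import pandas as pd

# Toy 2017 frame: prefix the shared headings with the year before merging
# so the 2017 and 2018 columns stay distinguishable.
df_2017_distributor = pd.DataFrame({
    "Distributor": ["Disney"], "Share": [26.1], "Tickets": [3.1e8],
})
df_2017_distributor = df_2017_distributor.rename(
    columns={"Share": "2017 Share", "Tickets": "2017 Tickets"}
)
```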

For machine learning, we need to build our model on a training dataset. Once we fit the model to the training dataset, we will be able to test it on a testing dataset. We merged the 2017 and 2018 distributor datasets on `Distributor` to create `ml_dist_2018`. We didn't include a `how=` argument, and therefore opted for the default inner join, because we only want distributors with both 2017 and 2018 data.

After the merge, we created a new column called `Gross/Movies Difference`, calculated by subtracting `2017 Gross/Movies` from `2018 Gross/Movies`. This helps us determine whether the distributor made more money per movie in 2018 or 2017, and this column will be the response variable for our model. Since we want the response column to hold boolean values, we overwrote `Gross/Movies Difference` with 1's and 0's: a row was changed to 1 if the difference was greater than zero and to 0 if not. We then dropped the extra columns from `ml_dist_2018` that we won't need for our model.
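Building the binary response can be sketched on a toy `ml_dist_2018` (values invented):

```python
import pandas as pd

ml_dist_2018 = pd.DataFrame({
    "Distributor": ["Disney", "Warner"],
    "2017 Gross/Movies": [3.0e8, 1.2e8],
    "2018 Gross/Movies": [3.5e8, 1.0e8],
})

# 1 if the distributor earned more per movie in 2018 than in 2017, else 0.
diff = ml_dist_2018["2018 Gross/Movies"] - ml_dist_2018["2017 Gross/Movies"]
ml_dist_2018["Gross/Movies Difference"] = (diff > 0).astype(int)

# Drop the raw per-movie columns the model won't use.
ml_dist_2018 = ml_dist_2018.drop(columns=["2017 Gross/Movies", "2018 Gross/Movies"])
```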

Since some rows in the `2018 Tickets` and `2017 Tickets` columns are very large compared to the data for smaller distributors, we decided to take the log of the ticket columns, overwriting the two columns with the log of the original ticket sales. This step reduces the influence of outliers when fitting a classification model.
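The log transform is a one-liner per column with `np.log` (toy ticket counts below):

```python
import numpy as np
import pandas as pd

ml_dist_2018 = pd.DataFrame({
    "2017 Tickets": [3.1e8, 1.2e6],
    "2018 Tickets": [3.3e8, 9.0e5],
})

# Overwrite the ticket columns with their natural log to shrink the gap
# between the largest and smallest distributors.
for col in ["2017 Tickets", "2018 Tickets"]:
    ml_dist_2018[col] = np.log(ml_dist_2018[col])
```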

The `scikit-learn` module was loaded in order to import logistic regression, which will be used to fit a model to the training data.

The predictors in our training data are the following variables: `2017 Share`, `2018 Share`, `2017 Tickets`, and `2018 Tickets`. The response variable is `Gross/Movies Difference`.

The model is fitted with the aforementioned predictors and response variable. Since our response variable holds boolean values, the fitted model is a classification model.
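A sketch of the fit (and the scoring used later) on synthetic data in the same shape as `ml_dist_2018` is below; all values and the synthetic response rule are invented for illustration, not the notebook's real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training frame shaped like ml_dist_2018 (ticket columns assumed
# already log-transformed; all numbers invented).
rng = np.random.default_rng(0)
n = 40
ml_dist_2018 = pd.DataFrame({
    "2017 Share": rng.uniform(0, 30, n),
    "2018 Share": rng.uniform(0, 30, n),
    "2017 Tickets": rng.uniform(10, 20, n),
    "2018 Tickets": rng.uniform(10, 20, n),
})
# Synthetic response: 1 when the 2018 share beat the 2017 share.
ml_dist_2018["Gross/Movies Difference"] = (
    ml_dist_2018["2018 Share"] > ml_dist_2018["2017 Share"]
).astype(int)

predictors = ["2017 Share", "2018 Share", "2017 Tickets", "2018 Tickets"]
X = ml_dist_2018[predictors]
y = ml_dist_2018["Gross/Movies Difference"]

model = LogisticRegression(max_iter=1000)
model.fit(X, y)
train_accuracy = model.score(X, y)  # mean accuracy on the training data
```

The same `model.score(X_test, y_test)` call is what scores the model on a testing frame later.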

Refer to the data analysis notebook for fitting a model to our data and the conclusion we drew from it.

### Testing our model on new data

We now want to test our classification model on the 2018 and 2019 distributor data. The `df_2019_distributor` dataset was imported earlier, so we first go through the same data-cleaning procedures as before. The `Share` column needs to be changed to a float data type.

Before merging with `df_2018_distributor`, the columns in `df_2019_distributor` need to be renamed in order to distinguish between the years.

We merged `df_2019_distributor` and `df_2018_distributor` on `Distributor` to create `ml_dist_2019`. We then performed the same procedures as in the previous merge to obtain a dataset with just our predictors and response variable.

As before, we took the log of the tickets columns to reduce the influence of outliers.

Now that we have our data from 2018 and 2019 in one dataframe, we can score the model created in our data analysis notebook on this testing data set. Refer to the data analysis notebook for the model's score.

Repeating the same process, we gathered the 2019 and 2020 distributor data into a merged dataframe in order to score the model. Since we already had `df_2020_distributor` imported, we first need to clean the data and rename the columns.

We then merged `df_2020_distributor` with `df_2019_distributor` to obtain `ml_dist_2020`. Once again, we performed the same procedures as in the first merge so that our new dataframe only contains our predictors and response variable.

As before, we took the log of the tickets columns to reduce the influence of outliers.

The model is then scored in the data analysis notebook using the testing dataset `ml_dist_2020`.