Netflix User Data Analysis
Introduction
The purpose of this project is to analyze how Netflix user ratings relate to box office results. This repo contains a Jupyter Notebook that processes, analyzes, and visualizes streaming metadata for the Netflix 2021 dataset. The dataset contains 100,000 user records, and I will focus on processing the genres for a large assortment of titles.
It contains relevant details such as:
In this notebook, I use pandas and standard data manipulation practices for pre-processing, such as cleaning, slicing, and aggregating. Afterwards, I analyze the dataset for insights and findings.
Imported Libraries
DATA PRE-PROCESSING
Data Cleaning
First, reformat the column names to a more 'pythonic' form.
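As a minimal sketch of this step, assuming 'pythonic' means lowercase snake_case: the column names below are hypothetical, since the real ones are not listed in this README.

```python
import pandas as pd

# Hypothetical frame; the actual column names are not shown in this README.
df = pd.DataFrame({"Title": ["A"], "IMDb Score": [7.1], "Release Date ": ["2019-01-01"]})

# Trim, lowercase, and collapse every run of non-alphanumeric characters into "_".
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

print(list(df.columns))
```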
Next, convert each column to the data type appropriate for aggregation.
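One way this conversion might look, assuming numeric fields arrive as formatted strings and dates as text. The column names (`imdb_votes`, `boxoffice`, `netflix_release_date`) are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the real columns and formats may differ.
raw = pd.DataFrame({
    "imdb_votes": ["1,200", "85", None],
    "boxoffice": ["$1,500,000", None, "$20,000"],
    "netflix_release_date": ["2019-04-05", "2020-11-20", "not a date"],
})

clean = raw.copy()

# Strip currency symbols and thousands separators, then parse as numbers.
for col in ["imdb_votes", "boxoffice"]:
    clean[col] = pd.to_numeric(
        clean[col].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )

# Unparseable dates become NaT instead of raising an error.
clean["netflix_release_date"] = pd.to_datetime(
    clean["netflix_release_date"], errors="coerce"
)

print(clean.dtypes)
```

Using `errors="coerce"` keeps bad values as NaN or NaT, so they can be handled together with the other missing values later.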
Reduce the dataframe to only the necessary columns.
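A short sketch of the column selection, with a hypothetical list of columns to keep (the notebook's actual selection is not shown here):

```python
import pandas as pd

# Hypothetical frame with one column that the analysis does not need.
df = pd.DataFrame({
    "title": ["A", "B"],
    "genre": ["Drama", "Comedy"],
    "imdb_score": [7.1, 6.4],
    "poster": ["p1", "p2"],
})

# Keep only the columns used later; .copy() avoids chained-assignment warnings.
keep = ["title", "genre", "imdb_score"]
df = df.loc[:, keep].copy()
```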
Calculating Weighted Average
Instead of a plain average, weighted averages are computed, since there are many NaNs in this dataset. I found a script online that implements this formula.
The formula and its terms are explained below:
Weighted Rating for a row (WR) = [(v + 1) / (v + m)] * R + [m / (m + v)] * C
v: number of votes for the movie
m: minimum votes required to be listed in the chart (the 0.75 quantile of the vote counts)
R: average rating of the movie
C: mean vote across the whole report
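The formula above can be sketched in pandas as follows. This is not the script referenced earlier, only an illustration: the function name `weighted_rating` and the column names are assumptions. The `+ 1` in the first numerator is kept exactly as the formula is written here; the commonly cited IMDb-style form uses plain `v` there.

```python
import pandas as pd

def weighted_rating(df, rating_col, votes_col, quantile=0.75):
    """Apply the weighted-rating formula as written above to every row."""
    v = df[votes_col]           # number of votes per title
    R = df[rating_col]          # average rating per title
    m = v.quantile(quantile)    # minimum votes threshold (0.75 quantile)
    C = R.mean()                # mean rating across all titles (NaNs are skipped)
    return (v + 1) / (v + m) * R + m / (m + v) * C

# Hypothetical sample data.
titles = pd.DataFrame({
    "imdb_score": [8.0, 6.0, 7.0, 9.0],
    "imdb_votes": [100, 10, 50, 1000],
})
titles["imdb_weighted"] = weighted_rating(titles, "imdb_score", "imdb_votes")
print(titles)
```

Titles with few votes are pulled toward the overall mean C, while titles with many votes stay close to their own rating R.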
Then fill in the NaN rows with the most frequent genre. Since the genres for each title are stored as a list, the column is first unstacked, and then the occurrences of each genre are counted.
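A minimal sketch of this fill step, using `Series.explode` as a stand-in for the unstacking described above. The `genre` column name and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical data: each title holds a list of genres, and one entry is missing.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": [["Drama", "War"], ["Drama"], None, ["Comedy", "Drama"]],
})

# One row per (title, genre) pair, then count how often each genre occurs.
genre_counts = df["genre"].explode().value_counts()
top_genre = genre_counts.idxmax()

# Replace missing entries with a one-item list holding the most frequent genre.
df["genre"] = df["genre"].apply(
    lambda g: g if isinstance(g, list) else [top_genre]
)

print(genre_counts)
```

`value_counts` drops missing values by default, so the empty entry does not affect the counts.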
A quick look at the top genres in the dataset.
Creating Dummies From Categorical Data
Next, to calculate the weighted average for each genre, dummy variables need to be created.
In short, this converts categorical data into dummy or indicator variables (1 or 0) for the presence or absence of a category.
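As an illustration of the idea, the sketch below builds one indicator column per genre from a list-valued column. It assumes the hypothetical `genre` column holds lists, and it may differ from what the notebook actually does.

```python
import pandas as pd

# Hypothetical data with list-valued genres.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": [["Drama", "War"], ["Drama"], ["Comedy"]],
})

# Explode to one genre per row, one-hot encode, then collapse back
# to one row per title by taking the max of each indicator.
dummies = (
    pd.get_dummies(df["genre"].explode())
    .groupby(level=0)
    .max()
    .astype(int)
)
df = df.join(dummies)

print(df)
```

Grouping on the original index after `explode` is what lets a title with several genres end up with several 1s in the same row.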
DATA ANALYSIS AND VISUALIZATION
Rating w.r.t Genres
What is the average rating on IMDb for each genre?
What is the average rating on Rotten Tomatoes for each genre?
What is the average rating on Metacritic for each genre?
What is the average rating on Hidden Gem for each genre?
What is the average number of awards received for each genre?
What is the average box office revenue for each genre?
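Each of the per-genre questions above follows the same pattern: average a metric over the titles flagged with a given genre. A rough sketch, assuming indicator columns like those created earlier and a hypothetical metric column `imdb_weighted`:

```python
import pandas as pd

# Hypothetical frame: one metric column plus one 0/1 column per genre.
df = pd.DataFrame({
    "imdb_weighted": [8.0, 6.0, 7.0],
    "Drama": [1, 1, 0],
    "War": [1, 0, 0],
    "Comedy": [0, 0, 1],
})
genre_cols = ["Drama", "War", "Comedy"]

# For each genre, average the metric over the titles that carry that genre.
avg_by_genre = pd.Series(
    {g: df.loc[df[g] == 1, "imdb_weighted"].mean() for g in genre_cols}
).sort_values(ascending=False)

print(avg_by_genre)
```

Swapping `imdb_weighted` for another metric column (awards, box office, and so on) would answer the other questions in the same way, and `avg_by_genre.plot.bar()` is one way to visualize the result.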
A quick look at the countries with the most content.
Rotten Tomatoes ratings for each genre for the top countries.
Average box office amount for each genre for the top countries.
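The country breakdowns can be sketched with a pivot table. The example assumes the data has already been reshaped to one genre per row, and the column names (`country`, `genre`, `rotten_tomatoes`) are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per (title, genre).
df = pd.DataFrame({
    "country": ["US", "US", "IN", "IN", "UK"],
    "genre": ["Drama", "Comedy", "Drama", "Drama", "Drama"],
    "rotten_tomatoes": [80.0, 60.0, 70.0, 90.0, 50.0],
})

# Countries with the most rows (here the top 2).
top_countries = df["country"].value_counts().head(2).index

# Mean rating per genre (rows) and country (columns), top countries only.
table = df[df["country"].isin(top_countries)].pivot_table(
    index="genre", columns="country", values="rotten_tomatoes", aggfunc="mean"
)

print(table)
```

Genre and country combinations with no titles show up as NaN in the table.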
Insights
These visualizations show that Biography and War titles have higher ratings, which suggests that non-fiction titles are better received. However, Adventure, Sci-Fi, and Action titles bring in the most revenue, which suggests they are the most popular.
Content Types
What is the makeup of the content?
What type of content has been added the most over the years?
In which month is the most content added?
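The three questions above reduce to simple counts. A minimal sketch, assuming hypothetical `type` and `date_added` columns, with the dates already parsed:

```python
import pandas as pd

# Hypothetical data; the real column names may differ.
df = pd.DataFrame({
    "type": ["Movie", "Series", "Movie", "Movie"],
    "date_added": pd.to_datetime(
        ["2019-04-01", "2020-04-15", "2020-07-03", "2020-04-20"]
    ),
})

# Share of each content type.
type_share = df["type"].value_counts(normalize=True)

# Titles added per year, split by content type.
per_year = (
    df.groupby([df["date_added"].dt.year, "type"])
    .size()
    .unstack(fill_value=0)
)

# Month with the most additions across all years.
busiest_month = df["date_added"].dt.month_name().value_counts().idxmax()

print(type_share, per_year, busiest_month, sep="\n\n")
```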
Insights
These visualizations show that movies are the most common type of content. The amount of content added also increased steadily year over year until 2021, during the COVID-19 pandemic. The most content is added in April.
What is the film rating for each type of content?
Most of the content added is rated either R or mature.
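A cross-tabulation is one way to get the counts behind this observation. The column names `type` and `view_rating` and the rating labels below are placeholders, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical data; labels are placeholders for the dataset's film ratings.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "Series", "Movie", "Series"],
    "view_rating": ["R", "R", "Mature", "PG", "Mature"],
})

# Number of titles for each (film rating, content type) pair.
rating_by_type = pd.crosstab(df["view_rating"], df["type"])

# Most common film rating within each content type.
top_rating = rating_by_type.idxmax()

print(rating_by_type)
```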
Conclusion & Future Considerations
In this notebook, I showed how to use simple data manipulation methods and visualizations to do an analysis. Breaking down the relationships between genres and other metadata revealed some interesting insights. In the future, I can improve on this notebook by diving deeper into the tags associated with each genre to explore a recommendation mechanism based on them.
Thank you for reading! 👋🏽