Importing the data:
Dropping Columns
The only column we need to drop is the ID column, since it is repeated. All the other columns are useful for classifying movies and building keywords.
We also include each movie's overview.
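As a rough sketch of the column drop described above, assuming the repeated ID shows up as a second column after merging two tables. The column names (`id`, `movie_id`, `title`, `overview`) and the sample rows are placeholders for illustration, not taken from the dataset:

```python
import pandas as pd

# Hypothetical frame with a duplicated ID column; names and values are placeholders.
movies = pd.DataFrame({
    "id": [1, 2],
    "movie_id": [1, 2],  # repeats the information in "id"
    "title": ["Movie A", "Movie B"],
    "overview": ["A crew explores space.", "Two strangers meet in Paris."],
})

# Drop only the repeated ID column and keep everything else.
movies = movies.drop(columns=["movie_id"])
print(movies.columns.tolist())
```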
Converting JSON to Lists
Let's write a function that parses the JSON and returns only the names of the top 4 cast members.
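A minimal sketch of such a function, assuming each cast entry is a JSON object with a `name` key and that the list is already ordered with the top-billed cast first. The function name `top_cast_names` and the sample data are made up for illustration:

```python
import json

def top_cast_names(cast_json, limit=4):
    """Parse a JSON string of cast entries and return the first `limit` names."""
    members = json.loads(cast_json)
    # Assumption: entries are already in billing order, so the first ones are the "top" cast.
    return [member["name"] for member in members[:limit]]

# Placeholder data in the assumed shape: a list of objects with a "name" field.
sample_cast = json.dumps([
    {"character": "Lead", "name": "Actor One"},
    {"character": "Rival", "name": "Actor Two"},
    {"character": "Friend", "name": "Actor Three"},
    {"character": "Mentor", "name": "Actor Four"},
    {"character": "Extra", "name": "Actor Five"},
])

print(top_cast_names(sample_cast))
```

With a pandas column, a function like this would typically be applied row by row, for example with `Series.apply`.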
Great, now let's convert the cast column into lists of names:
Let's convert the genres column into a list of genres.
Let's do the same for the crew column, but retrieve only the director's name:
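One way this could look, assuming each crew entry is a JSON object with `job` and `name` keys and that the director is the entry whose `job` equals `"Director"`. The function name and sample data are illustrative only:

```python
import json

def director_names(crew_json):
    """Parse a JSON string of crew entries and return the directors' names as a list."""
    crew = json.loads(crew_json)
    # Keep only entries whose job is exactly "Director" (assumed field names).
    return [member["name"] for member in crew if member.get("job") == "Director"]

# Placeholder data in the assumed shape.
sample_crew = json.dumps([
    {"job": "Editor", "name": "Person A"},
    {"job": "Director", "name": "Person B"},
    {"job": "Producer", "name": "Person C"},
])

print(director_names(sample_crew))
```

Returning a list, even for a single name, keeps the column in the same shape as the cast and genres columns.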
Now we split the overview column into words and store them as a list.
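A small sketch of the split, using plain whitespace splitting. Returning an empty list for missing overviews is an added assumption, and `overview_to_words` is a hypothetical helper name:

```python
def overview_to_words(overview):
    """Split an overview string into a list of words."""
    # Assumption: missing overviews (e.g. NaN or None) become an empty list.
    if not isinstance(overview, str):
        return []
    return overview.split()

print(overview_to_words("A crew explores deep space."))
```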
Making a new dataframe
Machine Learning Imports
Making a function for stemming words
Tokenization and Stemming of Words
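A possible stemming helper, assuming NLTK's `PorterStemmer` is the stemmer in use (the notebook text does not say which one). The text is tokenized by splitting on whitespace, each word is stemmed, and the stems are joined back into one string. `stem_text` is a hypothetical name:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def stem_text(text):
    """Split text on whitespace, stem each word, and join the stems back together."""
    return " ".join(stemmer.stem(word) for word in text.split())

print(stem_text("loved loving loves"))
```

Stemming maps variants such as "loved" and "loving" to the same token, so they count as one word when the tags are vectorized.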
Now, let's vectorize the words, converting them into numbers.
We keep the 5000 most frequent words from the tags.
Vectorization of Words
We fit scikit-learn's CountVectorizer on the tags, then vectorize each movie's tags and convert the result into a NumPy array.
Most of the array will consist of 0s, because no single movie contains all 5000 words in its tags; each movie only has a small selection of them.
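The vectorization step might be sketched as follows, assuming the tags are stored as one space-separated string per movie. The three tag strings are toy examples, and `max_features=5000` mirrors the 5000-word limit mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy tag strings, one per movie (placeholders, not real data).
tags = [
    "space adventur alien crew",
    "space war robot",
    "love stori pari",
]

# Keep at most the 5000 most frequent words across all tags.
vectorizer = CountVectorizer(max_features=5000)

# fit_transform returns a sparse matrix; toarray() makes it a dense NumPy array.
vectors = vectorizer.fit_transform(tags).toarray()

print(vectors.shape)
print(sorted(vectorizer.vocabulary_))
```

Each row is one movie and each column is one word of the vocabulary, which is why most entries are 0 even in this tiny example.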
Calculating Similarity using cosine distance
We use cosine distance to measure how far apart two movies are.
Cosine distance: 1 minus the cosine of the angle between two vectors, so it depends on their direction rather than their length.
Distance and similarity move in opposite directions -> high similarity, low distance
Here we have a matrix, that is, an array of arrays, where each inner array holds the distances between a given movie and all the other movies. So the shape of the matrix is 4808x4808.
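To illustrate the pairwise matrix on a small scale, here is a sketch using scikit-learn's `cosine_distances` on four toy count vectors. Whether the notebook calls this function or computes similarity and converts it is not stated, so treat the choice as an assumption:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# Toy count vectors, one row per movie (placeholders).
vectors = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],  # same words as movie 0 -> distance 0
    [0, 0, 1, 1],  # no words shared with movie 0 -> distance 1
    [1, 0, 1, 0],  # shares one of two words with movie 0 -> distance 0.5
])

# Entry [i, j] is the cosine distance between movie i and movie j.
distances = cosine_distances(vectors)

print(distances.shape)
print(distances[0])
```

With n movies the result is an n x n matrix, which is where the 4808x4808 shape comes from.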
Let's see the distances of the other movies from the first movie.
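A sketch of how the first movie's row could be inspected, sorting the other movies from nearest to farthest. The titles and distance values are placeholders:

```python
import numpy as np

titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

# Hypothetical distances from the first movie to every movie, itself included.
first_row = np.array([0.0, 0.8, 0.3, 0.55])

# argsort gives indices in ascending order of distance, so nearest movies come first.
order = np.argsort(first_row)

# Skip index 0, which is the first movie compared with itself.
nearest = [(titles[i], float(first_row[i])) for i in order if i != 0]

for title, distance in nearest:
    print(title, distance)
```

The smallest distances in this row point to the movies most similar to the first one.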