HW3_DataStructure

# Import the numpy and pandas packages
import numpy as np
import pandas as pd
# Only needed by the commented-out server check in the unzip cell
# from notebook import notebookapp

The "Movies" dataset is a tabular dataset containing information about movies. Each row of the dataset represents a single movie and each column contains a different attribute of the movie. Some of the attributes include the movie title, the director's name, the actors, the movie genres, the duration, the budget, the gross revenue, and the IMDb rating.

-> By analyzing this dataset, you can answer various questions about movies, such as which movie has the highest gross revenue, which director has directed the most movies, which genre is the most popular among the audience, and so on.

# Unzip the zip file containing the data. Be sure to specify the correct path to the zip file.
# The path below is specific to my own Google Drive.
#!unzip /content/drive/Othercomputers/My\ Mac/Documents/WSU/movies_data.zip
#!unzip /content/drive/Othercomputers/My\/work/movies_data.zip
#servers = list(notebookapp.list_running_servers())
#if servers:
#    url = servers[0]['url']
#    print(f"Jupyter server running at {url}")
#else:
#    print("No running servers found")
!unzip movies_data.zip hw3_file

# List the contents of the present working directory (also called pwd)
!ls

# pd.read_csv reads the data and already returns a DataFrame
movies = pd.read_csv("./MovieAssignmentData.csv")
# .head() quickly displays the first rows to get a feel for the data
movies.head()

# Number of rows and columns in the data
movies.shape

# Concise summary of the DataFrame
movies.info()

# Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution,
# excluding NaN values
movies.describe()
num_nans = movies['gross'].isna().sum()
print("Number of NaN values in column 'gross':", num_nans)
#movies.dropna()

movies.director_name.describe()
longest_duration = movies['duration'].max()
print("The longest duration of any movie in the dataset is:", longest_duration)

# TODO [1 line]: Write code to find how many different movie genres there are [10]
# count the number of unique directors
num_directors = movies['director_name'].nunique()
# print the result
print("Number of unique directors in the dataset:", num_directors)
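Counting genres needs a little more care than counting directors if several genres are packed into one cell. The sketch below uses made-up data and assumes a column named `genres` whose values are joined with `|`; both the column name and the separator are assumptions and should be checked against `movies.columns` before reuse.

```python
import pandas as pd

# Toy data standing in for the real file. The 'genres' column name and the
# '|' separator are assumptions, not taken from the dataset itself.
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "genres": ["Action|Adventure", "Drama", "Action|Adventure", "Comedy"],
})

# nunique() on the raw column counts distinct genre *combinations*
num_genre_combinations = toy["genres"].nunique()

# Split each string into a list, flatten with explode(), then count
# distinct individual genres
num_individual_genres = toy["genres"].str.split("|").explode().nunique()

print("Distinct combinations:", num_genre_combinations)
print("Distinct individual genres:", num_individual_genres)
```

Which of the two numbers answers the question depends on whether a "genre" means the whole label or each component of it.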

# TODO [1 line]: Write code to find number of languages across all the movies in the dataset [10]
num_languages = movies['language'].nunique()
print("Number of languages across all the movies:", num_languages)

# Code for column-wise null count here
movies.isnull().sum()

# TODO [1 line]: Write your code for row-wise null count here;
# Use the .sum() function, paying attention to the axis argument [10]
# Count the number of null values in each row
null_counts = movies.isnull().sum(axis=1)
print("Number of null values in each row:", null_counts)

# TODO [1-2 lines]: Write your code for column-wise null percentages here rounded to 2 decimal places [10]
# The mean of the boolean null mask down each column is that column's null fraction
null_pct = (movies.isnull().mean() * 100).round(2)
print("Column-wise null percentages, rounded to 2 decimal places:")
print(null_pct)
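To make the role of the axis argument concrete, here is a minimal sketch on a small made-up DataFrame (the values are invented for illustration). It computes row-wise null counts with `axis=1` and column-wise null percentages from the mean of the boolean mask.

```python
import numpy as np
import pandas as pd

# Made-up data: 4 rows, with a few missing values
toy = pd.DataFrame({
    "gross": [100.0, np.nan, 300.0, np.nan],
    "budget": [50.0, 60.0, np.nan, 80.0],
    "language": ["English", "French", "English", "Hindi"],
})

# axis=1 sums across the columns of each row: one count per row
row_null_counts = toy.isnull().sum(axis=1)

# The default axis=0 works down each column; the mean of a boolean
# column is the fraction of True values, i.e. the null fraction
col_null_pct = (toy.isnull().mean() * 100).round(2)

print(row_null_counts.tolist())  # [0, 1, 1, 1]
print(col_null_pct.to_dict())    # {'gross': 50.0, 'budget': 25.0, 'language': 0.0}
```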

# Fill missing language values with 'English'
movies['language'] = movies['language'].fillna('English')
print(movies['language'])

# This should print 0 if your code above is correct
movies.language.isnull().sum()

# TODO [1 line]: Write your code for dropping these rows here that have > 5 NaN values [10]
# Keep only the rows with at most 5 NaN values, i.e. drop rows with more than 5.
# (dropna(thresh=5) would instead keep rows with at least 5 non-NaN values.)
movies_dropped = movies[movies.isnull().sum(axis=1) <= 5]
print(movies_dropped)
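The `thresh` argument of `dropna` is easy to misread because it counts non-NaN values, not NaN values. The following sketch, on an invented 3-row, 8-column frame, compares `dropna(thresh=5)` with an explicit "at most 5 NaN" filter and with the equivalent `thresh` value.

```python
import numpy as np
import pandas as pd

# Invented frame: 3 rows x 8 columns, all ones to start with
toy = pd.DataFrame(np.ones((3, 8)), columns=list("abcdefgh"))
toy.iloc[1, :4] = np.nan  # row 1: 4 NaN, 4 valid
toy.iloc[2, :6] = np.nan  # row 2: 6 NaN, 2 valid

# thresh=5 keeps rows with at least 5 NON-NaN values
by_thresh_5 = toy.dropna(thresh=5)

# Explicit filter: keep rows with at most 5 NaN values
by_mask = toy[toy.isnull().sum(axis=1) <= 5]

# Same filter expressed with thresh: at least (n_columns - 5) valid values
by_equiv_thresh = toy.dropna(thresh=toy.shape[1] - 5)

print(list(by_thresh_5.index))      # [0]
print(list(by_mask.index))          # [0, 1]
print(list(by_equiv_thresh.index))  # [0, 1]
```

Row 1 has only 4 NaN values, so it should survive a "more than 5 NaN" rule, yet `thresh=5` discards it because it has fewer than 5 valid entries.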

# TODO [1 line]: Write your code for checking the fraction of retained rows here [10]
fraction_retained = len(movies_dropped) / len(movies)
print('Fraction of retained rows:', fraction_retained)

len(movies)-len(movies.drop_duplicates())

# TODO [1 line]: Write your code for dropping duplicate values here [10]
movies.drop_duplicates(inplace=True)
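As a quick check of how duplicate detection behaves, the sketch below uses made-up rows (the column names are only illustrative). It shows that a row counts as a duplicate only when every column matches an earlier row, and that `duplicated().sum()` gives the same count as the length difference before and after `drop_duplicates()`.

```python
import pandas as pd

# Made-up rows: row 2 repeats row 0 exactly; row 3 shares the title only
toy = pd.DataFrame({
    "movie_title": ["A", "B", "A", "A"],
    "title_year": [2001, 2002, 2001, 2010],
})

# Count of fully identical repeated rows, computed two ways
n_dupes_by_len = len(toy) - len(toy.drop_duplicates())
n_dupes_by_flag = int(toy.duplicated().sum())

# drop_duplicates keeps the first occurrence of each repeated row
deduped = toy.drop_duplicates()

print(n_dupes_by_len, n_dupes_by_flag, len(deduped))  # 1 1 3
```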

# TODO [1 line]: Write your code for creating the profit column here [10]
movies['profit'] = movies['gross'] - movies['budget']

# TODO [1 line]: Write your code for sorting the dataframe here [10]
movies_by_profit = movies.sort_values(by='profit', ascending=False)
print(movies_by_profit)
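Since `gross` can contain NaN values, the derived profit can be NaN as well. This small sketch on invented numbers illustrates how the subtraction propagates NaN and where `sort_values` places such rows by default.

```python
import numpy as np
import pandas as pd

# Invented numbers; movie B has no recorded gross
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "gross": [300.0, np.nan, 120.0],
    "budget": [100.0, 50.0, 200.0],
})

# NaN - number is NaN, so B gets a missing profit
toy["profit"] = toy["gross"] - toy["budget"]

# With the default na_position='last', NaN profits sort to the bottom
# regardless of the ascending flag
ranked = toy.sort_values(by="profit", ascending=False)

print(ranked["movie_title"].tolist())  # ['A', 'C', 'B']
```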

# print the top 10 movies data
top10 = movies_by_profit.head(10)
top10

# TODO [3 lines]: Write your code for extracting the top 250 movies as per the IMDb score here. Make sure that you store it in a new dataframe
# and name that dataframe as 'IMDb_Top_250' [30]
IMDb_Top_250 = movies[movies.num_voted_users > 25000].sort_values(by=['imdb_score'], ascending=False)[:250].copy()
# add 'Rank' column with the values 1~250 (fewer if under 250 movies pass the vote filter)
IMDb_Top_250['Rank'] = range(1, len(IMDb_Top_250) + 1)
# print a sample of the dataset
IMDb_Top_250
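The filter, sort, slice, rank pattern can be tried on a tiny made-up table first. In this sketch `top_n = 2` stands in for 250, the data is invented, and the rank is sized from the result so the assignment also works when fewer than `top_n` rows pass the vote filter.

```python
import pandas as pd

# Invented data; movie D has the best score but too few votes
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "imdb_score": [7.1, 9.0, 8.2, 9.5],
    "num_voted_users": [30000, 500000, 26000, 100],
})

top_n = 2  # stands in for 250 in the notebook

top = (
    toy[toy["num_voted_users"] > 25000]             # vote-count filter
    .sort_values(by="imdb_score", ascending=False)  # best score first
    .head(top_n)                                    # keep the top rows
    .copy()                                         # own copy, safe to modify
)

# Size the rank from the result instead of hard-coding the upper bound
top["Rank"] = range(1, len(top) + 1)

print(top[["Rank", "movie_title", "imdb_score"]])
```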

# print the top 5
IMDb_Top_250.head()

# TODO [1 line]: Write code to do the above [20]
Top_Foreign_Lang_Film = IMDb_Top_250[IMDb_Top_250['language'] != 'English']
print(Top_Foreign_Lang_Film)

# print the top 5 foreign language films
Top_Foreign_Lang_Film.head()['movie_title']