Homework 2: Data Cleaning Basics
In this homework, we will go over a couple of concepts we learned in Lecture 2! Everything in this homework was covered in lecture, so feel free to reference the slides, and remember you can also Google!
Lecture slides: https://docs.google.com/presentation/d/1TIML5REJKThU4XVHe28m2-zy-NGByi-9Y8upwGGPGrs/edit#slide=id.g215ffca71a4_0_1
import pandas as pd
import numpy as np
from datascience import *
movies = pd.read_csv("movies_dataset.csv")
movies.head(10)
Question 1:
#In the table above, there are NaN values. Normally we would investigate further before
#removing these rows, but for the sake of this homework let us just remove them! Write a line of code below
#in "nans_removed" to remove all rows that contain NaN values.
nans_removed = movies.dropna()
nans_removed
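As the prompt notes, missing values would normally be investigated before being dropped. A minimal sketch of that check follows, using a small hypothetical table (`toy` and its values are made up for illustration and are not taken from `movies_dataset.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the movies data
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "IMDb-rating": [7.1, np.nan, 4.2, 8.0],
    "writer": ["x", "y", None, "z"],
})

# Count missing values per column before deciding what to drop
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame with every row containing a NaN removed;
# the original table is left unchanged
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Looking at the per-column counts first shows whether the missing values are concentrated in one column, in which case dropping that column might lose less data than dropping rows.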
Question 2:
#Now that we've removed the nan values let's remove some columns. Suppose the columns: {"writer", "storyline",
# "industry"} don't interest us. Write a line of code in "columns_dropped" that removes these columns
columns_dropped = nans_removed.drop(columns = ['writer', 'storyline', 'industry'])
columns_dropped
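A small sketch of how `drop(columns=...)` behaves, again on a hypothetical table (`toy` and its values are illustrative only). It shows that the call returns a new DataFrame and that selecting the columns to keep gives the same result:

```python
import pandas as pd

# Hypothetical toy table with the three columns we want to remove
toy = pd.DataFrame({
    "name": ["A", "B"],
    "writer": ["x", "y"],
    "storyline": ["s1", "s2"],
    "industry": ["Hollywood", "Bollywood"],
})

unwanted = ["writer", "storyline", "industry"]

# drop() returns a new DataFrame; toy itself still has all four columns
trimmed = toy.drop(columns=unwanted)
print(list(trimmed.columns))

# Equivalent alternative: select the columns to keep instead
kept = toy[[c for c in toy.columns if c not in unwanted]]
```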
Question 3:
#We've decided to focus on more popular movies. So we want to only keep movies that have an
#IMDb-rating of at least 5. Write a line of code in "popular_only" to do this!
#Reference the lecture slides if you're stuck!
popular_only = columns_dropped[(columns_dropped['IMDb-rating'] >= 5.0)]
popular_only
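The filter above works by building a boolean mask and using it to index the table. A minimal sketch on a hypothetical table (`toy` and its ratings are made up) makes the intermediate mask visible, including how a missing rating is treated:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings, including one missing value
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "IMDb-rating": [7.1, 5.0, 4.9, np.nan],
})

# The comparison yields one True/False per row; NaN compares as False
mask = toy["IMDb-rating"] >= 5.0
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True
popular = toy[mask]
print(popular["name"].tolist())
```

Because a NaN rating never satisfies `>= 5.0`, rows with a missing rating would be filtered out here even if they had not already been dropped.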
Question 4:
#For this question please respond with a text cell outlining the steps involved in Data Cleaning.
#Please also include steps for what you would do with outliers, missing/NA values and wrong type values.
The first step of data cleaning is to remove any duplicates or irrelevant data.
The next is to fix any wrong data types.
For example, this means making sure numbers are stored as numerical data types, checking values for readability, etc.
Third is to filter out unwanted outliers, removing them only when there is a legitimate reason to do so.
Fourth is to handle missing data, recognizing that you can either drop the affected rows or fill the gaps with the mean or median.
The fifth and final step is to validate; this means making sure the data makes sense and is well structured.
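The five steps above can be sketched in pandas roughly as follows. This is only one possible reading of them: the table `raw`, its column names, the 1.5 × IQR outlier rule, and the median fill are all assumptions chosen for illustration, not methods prescribed by the lecture.

```python
import numpy as np
import pandas as pd

# Hypothetical messy table: a duplicate row, a non-numeric rating,
# an implausible runtime, and a missing runtime
raw = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D", "E"],
    "rating": ["7.0", "7.0", "6.5", "not rated", "8.0", "6.0"],
    "runtime": [100, 100, 95, 110, 9000, np.nan],
})

# Step 1: remove exact duplicate rows
deduped = raw.drop_duplicates().copy()

# Step 2: fix data types; values that cannot be parsed become NaN
deduped["rating"] = pd.to_numeric(deduped["rating"], errors="coerce")

# Step 3: filter outliers with the 1.5 * IQR rule (an assumed choice),
# keeping rows whose runtime is missing so step 4 can handle them
q1, q3 = deduped["runtime"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = deduped["runtime"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = deduped[in_range | deduped["runtime"].isna()].copy()

# Step 4: fill remaining missing values with each column's median
filled = no_outliers.fillna({
    "rating": no_outliers["rating"].median(),
    "runtime": no_outliers["runtime"].median(),
})

# Step 5: validate that the result is complete and within sensible bounds
assert not filled.isna().any().any()
assert filled["rating"].between(0, 10).all()
print(filled)
```

The order matters in this sketch: types are fixed before outliers are measured, and medians are computed only after the outlier is gone so that it does not distort the fill value.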
CONGRATS! You've finished the coding part of your homework. For the last part of your homework include a summary of 5 things you learned from this week's DSS's Article:
Even if certain platforms, such as ChatGPT, are banned, people will always find a way to get around the ban and use them.
ChatGPT and other AI chatbots are not always negative and can even be more useful than the traditional school setting.
In a world that is becoming more and more AI-focused, using these AI chatbots will give students early exposure to the technology.
AI chatbots can find personalized study methods for every individual student.
AI chatbots can be useful to teachers, as AI and machines can provide feedback faster than a person can.