Homework 2: Data Cleaning Basics
In this homework, we will go over a couple of concepts we learned in Lecture 2. Everything in this homework was covered in lecture, so feel free to reference the slides, and remember that you can also Google!
Lecture slides: https://docs.google.com/presentation/d/1TIML5REJKThU4XVHe28m2-zy-NGByi-9Y8upwGGPGrs/edit#slide=id.g215ffca71a4_0_1
import pandas as pd
import numpy as np
from datascience import *
movies = pd.read_csv("movies_dataset.csv")
movies.head(10)
Question 1:
#In the table above, there are NaN values. Normally we would investigate further before
#removing these rows, but for the sake of this homework let us just remove them! Write a line of code below
#in "nans_removed" to remove all rows that contain NaN values.
nans_removed = movies.dropna()
nans_removed
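As a quick illustration of the "investigate before removing" idea, here is a minimal sketch on a small made-up table. The DataFrame `toy` and its values are hypothetical and are not taken from movies_dataset.csv; only the column names "IMDb-rating" and "writer" echo the homework.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the movies table.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "IMDb-rating": [7.1, np.nan, 4.9, 8.3],
    "writer": ["X", "Y", None, "Z"],
})

# Count missing values per column before deciding what to drop.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
toy_clean = toy.dropna()
print(len(toy), "rows before,", len(toy_clean), "rows after")
```

Checking the per-column counts first shows whether one column is responsible for most of the dropped rows, which may change the decision about what to remove.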
Question 2:
#Now that we've removed the nan values let's remove some columns. Suppose the columns: {"writer", "storyline",
# "industry"} don't interest us. Write a line of code in "columns_dropped" that removes these columns
columns_dropped = nans_removed.drop(columns=["writer", "storyline", "industry"])
columns_dropped
Question 3:
#We've decided to focus on more popular movies. So we want to only keep movies that have an
#IMDb-rating of at least 5. Write a line of code in "popular_only" to do this!
#Reference the lecture slides if you're stuck!
# "At least 5" includes movies rated exactly 5, so use >= rather than >.
popular_only = columns_dropped.loc[columns_dropped["IMDb-rating"] >= 5]
popular_only
Question 4:
#For this question please respond with a text cell outlining the steps involved in Data Cleaning.
#Please also include steps for what you would do with outliers, missing/NA values and wrong type values.
Remove duplicate values and irrelevant data/observations.
Fix data types: check them with type() or the DataFrame's dtypes attribute, and convert them with astype().
Filter any unwanted outliers, which are values over Q3 + 1.5 * IQR or under Q1 - 1.5 * IQR.
Handle all missing data: drop the observations if the dataset is large; if the dataset is not large, impute values based on the mean, the median, a linear regression, or another chosen value.
Validate the cleaned data.
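The steps above can be sketched in pandas as follows. This is only an illustrative sketch under stated assumptions: the DataFrame `df`, its columns ("title", "year", "rating"), and all values are hypothetical, and median imputation is just one of the options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; all values are made up for illustration.
df = pd.DataFrame({
    "title": ["A", "A", "B", "C", "D", "E", "F"],
    "year": ["2001", "2001", "2005", "2010", "2012", "2015", "2020"],
    "rating": [7.0, 7.0, 6.5, 8.0, np.nan, 7.5, 99.0],
})

# 1. Remove exact duplicate rows.
df = df.drop_duplicates()

# 2. Fix data types: "year" was read as text, so convert it with astype().
df["year"] = df["year"].astype(int)

# 3. Filter outliers with the 1.5 * IQR rule.
#    Rows with a missing rating are kept here and handled in step 4.
q1 = df["rating"].quantile(0.25)
q3 = df["rating"].quantile(0.75)
iqr = q3 - q1
in_range = df["rating"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range | df["rating"].isna()].copy()

# 4. Handle missing data: the table is small, so impute with the median.
df["rating"] = df["rating"].fillna(df["rating"].median())

# 5. Validate: no missing ratings and no duplicate rows remain.
assert df["rating"].notna().all()
assert not df.duplicated().any()
print(df)
```

Filtering outliers before imputing keeps the extreme value from influencing the fill value; with a mean-based imputation that ordering would matter even more.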
CONGRATS! You've finished the coding part of your homework. For the last part of your homework, include a summary of 5 things you learned from this week's DSS article:
Ultimately, we can't prevent students from using A.I. chatbots, as students will always find a way to take advantage of them.
ChatGPT can actually be used as a teaching aid, much like a calculator, and can help students in a beneficial way.
People can accept ChatGPT with open arms, as there are academic uses for a chatbot while learning.
It can be important to give students exposure to AI, as it is almost guaranteed that they will encounter AI in the real world.
Students can often teach themselves by using ChatGPT.