### Assignment 9

For this assignment, we will use the College Scorecard dataset.

# 1. Prep the Environment/Data

# 1 - Import necessary libraries
import os, random
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

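# 2 - Retrieve the names of the 22 files and assign them to a list variable
# Other files in the folder need to be filtered out
# Display the names of these 22 files
# Assumption: the 22 data files are the only .csv files in the folder
Files = sorted(f for f in os.listdir('CollegeScorecard_Raw_Data') if f.endswith('.csv'))
Files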

# 3 - Write code to randomly select one file name out of the 22 file names.
# Hint: generate a random integer between 0 and 21 first
# Display the file name
random_index = random.randint(0, len(Files) - 1)  # randint includes both endpoints
random_file = Files[random_index]
print("File Name:", random_file)

# 2. Data Cleaning

# 4 - Read only two columns, college name ("INSTNM") and in-state tuition ("TUITIONFEE_IN"),
# from this file into a data frame, and use the info() function to display summary information
file_path = os.path.join('CollegeScorecard_Raw_Data', random_file)  # the file lives inside the data folder
df = pd.read_csv(file_path, index_col=None, header=0, usecols=["INSTNM", "TUITIONFEE_IN"])
df.info()

# 5 - Find out how many observations have missing values.
print("\nCount of NaN values in each column of the DataFrame:\n\n", df.isnull().sum())

# 6 - Drop those observations with missing values
# Display the number of observations afterward
df = df.dropna()  # drop the rows instead of filling them with 0
print("Number of observations after dropping missing values:", len(df))

# 7 - Find out how many observations have 0 tuition.
print("Observations with 0 tuition:", (df["TUITIONFEE_IN"] == 0).sum())

# 8 - Drop those observations with 0 tuition
# Display the number of observations afterward
df = df[df['TUITIONFEE_IN'] != 0]
print("Number of observations after dropping 0 tuition:", len(df))
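
The cleaning steps above can be checked on a tiny hand-made frame where the expected counts are obvious. This is only a sketch: the `toy` frame and its values are invented for illustration and are not taken from the scorecard files.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the two scorecard columns (all values are made up).
toy = pd.DataFrame({
    "INSTNM": ["A College", "B University", "C Institute", "D College"],
    "TUITIONFEE_IN": [12000.0, np.nan, 0.0, 8500.0],
})

n_missing = int(toy["TUITIONFEE_IN"].isnull().sum())  # rows with no tuition reported
cleaned = toy.dropna(subset=["TUITIONFEE_IN"])         # drop the missing rows
n_zero = int((cleaned["TUITIONFEE_IN"] == 0).sum())    # rows that report 0 tuition
cleaned = cleaned[cleaned["TUITIONFEE_IN"] != 0]       # drop the zero-tuition rows

print(n_missing, n_zero, len(cleaned))  # prints: 1 1 2
```

Counting missing rows and zero rows separately keeps the two problems distinct; filling NaN with 0 first would merge them into a single count.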

# 3. Analyze data stats:

# 9 - Calculate and display the mean (average) of the tuitions of all the remaining observations
df["TUITIONFEE_IN"].mean()

# 10 - Randomly select just 1 observation from the data frame.
# Display the tuition of that observation.
dfr= df.sample()
dfr

# 11 - Calculate the difference between the tuition of this observation and
# the mean tuition of all observations calculated earlier.
# Display the difference
TF1= dfr["TUITIONFEE_IN"]
Diff= TF1-(df["TUITIONFEE_IN"].mean())
Diff

# 12 - Repeat 10 to 11 several times to get a feel for the size of the difference in means
dfr1= df.sample()
TF2= dfr1["TUITIONFEE_IN"]
Diff= TF2-(df["TUITIONFEE_IN"].mean())
Diff

dfr2= df.sample()
TF3= dfr2["TUITIONFEE_IN"]
Diff= TF3-(df["TUITIONFEE_IN"].mean())
Diff

dfr3= df.sample()
TF4= dfr3["TUITIONFEE_IN"]
Diff= TF4-(df["TUITIONFEE_IN"].mean())
Diff

dfr4= df.sample()
TF5= dfr4["TUITIONFEE_IN"]
Diff= TF5-(df["TUITIONFEE_IN"].mean())
Diff

# 13 - Randomly select 20 observations and display the tuition of these observations.
dfr20= df.sample(20)
dfr20

# 14 - Calculate and display the mean tuition of these 20 observations
df_mean20= dfr20["TUITIONFEE_IN"].mean()
df_mean20

# 15 - Calculate the difference between the average tuition of these 20 observations
# and the mean tuition of all observations calculated earlier.
# Display the difference
Diff2= df_mean20 -(df["TUITIONFEE_IN"].mean())
Diff2

# 16 - Repeat 13 to 15 several times to get a feel for the size of the difference in means
# Compare these differences with the differences calculated earlier with just 1 observation
# Describe your hunch/conclusion

dfr20= df.sample(20)
df_mean20= dfr20["TUITIONFEE_IN"].mean()
Diff3= df_mean20 -(df["TUITIONFEE_IN"].mean())
Diff3

dfr20= df.sample(20)
df_mean20= dfr20["TUITIONFEE_IN"].mean()
Diff4= df_mean20 -(df["TUITIONFEE_IN"].mean())
Diff4

dfr20= df.sample(20)
df_mean20= dfr20["TUITIONFEE_IN"].mean()
Diff5= df_mean20 -(df["TUITIONFEE_IN"].mean())
Diff5

The difference in means varies depending on which random sample of schools is chosen. With 20 observations, the differences tend to be smaller than with a single observation.
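
To put a number on that hunch, the sketch below repeats the sampling many times for sample sizes 1 and 20 and compares the average absolute gap from the overall mean. It uses made-up tuition values rather than the scorecard data, and `typical_gap` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up tuition values standing in for the cleaned data frame.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"TUITIONFEE_IN": rng.uniform(1000, 50000, size=2000)})
pop_mean = toy["TUITIONFEE_IN"].mean()

def typical_gap(n, repeats=500):
    """Average absolute difference between a size-n sample mean and pop_mean."""
    gaps = [
        abs(toy["TUITIONFEE_IN"].sample(n, random_state=seed).mean() - pop_mean)
        for seed in range(repeats)
    ]
    return float(np.mean(gaps))

gap_1 = typical_gap(1)    # single-observation samples
gap_20 = typical_gap(20)  # 20-observation samples
print(round(gap_1), round(gap_20))
```

Averaging over many repeats removes most of the luck involved in any single draw, so the two numbers can be compared directly.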

# 4. Visualization

Write a function that takes two input parameters:

- x (number of observations)
- y (the name of the dataframe).

The function will perform the following:

- Randomly select x observations from y
- Calculate the mean of these x observations
- Calculate the mean of all observations
- Calculate the difference between the two means
- Return the difference

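# 17 - Write the function code here. DiffM returns the difference in means, given the sample size x and the data frame y
def DiffM(x, y):
    df_x = y.sample(x)                      # random sample of x observations from y
    mean_x = df_x["TUITIONFEE_IN"].mean()   # mean of the sample
    df_mean = y["TUITIONFEE_IN"].mean()     # mean of all observations in y
    return mean_x - df_mean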

# 18 - Test the function by passing a sample size and the name of the data frame
# Display the return value of the function. Use sample size 10 and the original data frame to find the difference in means
DiffM(10,df)

# 19 - Create a list of sequence numbers from 1 to 50, name it "sample_sizes";
# display it to make sure its members are from 1 to 50
sample_sizes= list(range(1, 51))
print(sample_sizes)

# 20 - Create an empty list and name it "means_diff" for differences of population mean and sample means;
# Loop through the list sample_sizes:
# For each element in the sample size list, obtain a random sample of that size from the data frame;
# Calculate the sample mean
# Calculate the difference between the population mean and the sample mean
# Append the difference to the list "means_diff"
# Display the list "means_diff" after the loop is completed
means_diff=[]
for num in sample_sizes:
    df_num = df.sample(num)
    mean_num = df_num['TUITIONFEE_IN'].mean()
    difference = mean_num - (df["TUITIONFEE_IN"].mean())
    means_diff.append(difference)
print(means_diff)

# 21 - Make a scatter plot with sample size on the x-axis and mean difference on the y-axis
# Observe that as the sample size increases, the sample means converge to the population mean.
# Make sure you make the plot large enough
plt.figure(figsize=(15, 8))  # set the figure size before plotting
plt.scatter(sample_sizes, means_diff)
plt.xlabel('Sample size')
plt.ylabel('Mean difference')
plt.title('Sample Size vs. Differences in Means')
plt.show()

# 22 - Repeat 18 to 21, replacing 50 with a larger number, for example 300 or even 1000,
# and see how the plot looks.
sample_sizes2= list(range(1, 301))
print(sample_sizes2)

means_diff=[]
for num in sample_sizes2:
    df_num = df.sample(num)
    mean_num = df_num['TUITIONFEE_IN'].mean()
    difference = mean_num - (df["TUITIONFEE_IN"].mean())
    means_diff.append(difference)
print(means_diff)

plt.figure(figsize=(15, 8))  # set the figure size before plotting
plt.scatter(sample_sizes2, means_diff)
plt.xlabel('Sample size')
plt.ylabel('Mean difference')
plt.title('Sample Size vs. Differences in Means')
plt.show()
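
The narrowing of the scatter as sample size grows is consistent with the standard error of the mean, which is roughly the population standard deviation divided by the square root of the sample size. The sketch below checks that rule of thumb numerically on made-up values. It is an illustration under that assumption, not part of the assignment, and the `population` array and `results` dictionary are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.uniform(1000, 50000, size=5000)  # made-up tuition values
sigma = population.std()                          # population standard deviation

results = {}
for n in (1, 25, 100):
    # Draw many samples of size n (without replacement) and record each sample mean.
    means = [rng.choice(population, size=n, replace=False).mean() for _ in range(2000)]
    observed = float(np.std(means))        # spread of the sample means
    predicted = float(sigma / np.sqrt(n))  # rule-of-thumb standard error
    results[n] = (observed, predicted)
    print(n, round(observed), round(predicted))
```

If the two columns track each other, the funnel shape in the scatter plot is what the rule predicts: quadrupling the sample size roughly halves the typical gap.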