import pandas as pd
df = pd.read_csv( 'practice-project-dataset-1.csv' )
df = df[['interest_rate','property_value','state_code','tract_minority_population_percent','derived_race','derived_sex','applicant_age']]
df.info()
import numpy as np
df['interest_rate'] = df['interest_rate'].replace( 'Exempt', np.nan )
df['interest_rate'] = df['interest_rate'].astype( float )
df['property_value'] = df['property_value'].replace( 'Exempt', np.nan )
df['property_value'] = df['property_value'].astype( float )
df.info()
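The cleaning pattern used for both columns can be tried on a small example first. The sketch below uses a made-up Series (the values are assumptions, not rows from the dataset) and compares the replace-then-cast approach with pd.to_numeric, which is one possible alternative rather than what this notebook uses.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a column such as 'interest_rate'.
raw = pd.Series(['4.5', 'Exempt', '3.875', 'Exempt', '5.0'])

# Approach used above: swap the 'Exempt' marker for NaN, then cast to float.
cleaned = raw.replace('Exempt', np.nan).astype(float)

# Alternative: turn anything that does not parse as a number into NaN.
coerced = pd.to_numeric(raw, errors='coerce')

print(cleaned.dtype, cleaned.isna().sum())
print(coerced.dtype, coerced.isna().sum())
```

Note that errors='coerce' would also silently convert any other non-numeric entry to NaN, whereas replacing 'Exempt' explicitly makes the cast fail loudly if an unexpected string is present.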
df['derived_race'] = df['derived_race'].astype( 'category' )
df['derived_sex'] = df['derived_sex'].astype( 'category' )
df['applicant_age'] = df['applicant_age'].astype( 'category' )
df.info()
Let's take a look at the properties valued below 500,000. Within this group of lower property values, let's filter the data into two new sets: one for areas with a high minority population (above 75%) and another for areas with a low minority population (below 25%).
lower_prices = df[df['property_value'] < 500000]
high_minority = lower_prices[lower_prices['tract_minority_population_percent'] > 75]
low_minority = lower_prices[lower_prices['tract_minority_population_percent'] < 25]
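To see what these boolean filters keep and drop, here is a minimal sketch on a made-up DataFrame. Only the column names come from the dataset; the rows and the names toy, lower, high, and low are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; None becomes NaN in a float column.
toy = pd.DataFrame({
    'property_value': [150000.0, 450000.0, 750000.0, 250000.0, None],
    'tract_minority_population_percent': [80.0, 10.0, 90.0, 50.0, 85.0],
})

# A comparison on a column gives a boolean Series; indexing with it keeps the True rows.
lower = toy[toy['property_value'] < 500000]
high = lower[lower['tract_minority_population_percent'] > 75]
low = lower[lower['tract_minority_population_percent'] < 25]

print(len(lower), len(high), len(low))
```

Two things are worth noticing: rows with a missing property value fail the comparison and are dropped, and tracts between 25% and 75% minority population land in neither group.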
The assignments to lower_prices, high_minority, and low_minority produce no output, which tells us only that the code ran without errors. Now we can plot a normalized histogram of property values for mortgage applications in the high and low minority areas and compare them.
import matplotlib.pyplot as plt
plt.hist( [ high_minority['property_value'], low_minority['property_value'] ], bins=20, density=True )
plt.legend( [ 'High % minority', 'Low % minority' ] )
plt.title( 'Sample of 2018 Home Mortgage Applications' )
plt.xlabel( 'Property Value' )
plt.ylabel( 'Proportion' )
plt.show()
The histogram above shows that the plotting code ran as intended. For most property values, low % minority areas have a similar proportion to high % minority areas. However, high % minority areas tend to have slightly greater proportions for property values of around 200,000 or less, while the opposite tends to be true for property values greater than 200,000, with the exception of the last two bins, which lie just below 500,000.
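One detail about the vertical axis: with density=True, the bar heights are densities, scaled so that the bar areas sum to 1, rather than raw proportions. Because all bins have the same width, the comparison of shapes still holds. The sketch below uses np.histogram on made-up values (an assumption, not data from this project) to show how a height converts to the share of values in a bin.

```python
import numpy as np

# Made-up property values, not taken from the dataset.
values = np.array([120000, 180000, 190000, 260000, 310000, 480000], dtype=float)

# density=True rescales the counts so that the bar areas sum to 1.
density, edges = np.histogram(values, bins=4, density=True)
widths = np.diff(edges)

# Height times bin width gives the share of values falling in each bin.
proportions = density * widths

print(density)
print(proportions)
print(proportions.sum())
```

The densities themselves are very small numbers because the bins are tens of thousands of units wide, which is why the tick labels on the vertical axis of a density histogram can look surprisingly tiny.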
Now that we know our data has been coded as we desired, let's compute the mean of the property values for the high % minority areas and the low % minority areas.
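A minimal sketch of that computation is shown below. So that it runs on its own, it uses two small made-up DataFrames (high_demo and low_demo are hypothetical stand-ins, and their numbers are not results from the dataset). With the real data, the same .mean() call would be applied to high_minority['property_value'] and low_minority['property_value'].

```python
import pandas as pd

# Hypothetical stand-ins for the high_minority and low_minority subsets.
high_demo = pd.DataFrame({'property_value': [150000.0, 200000.0, 250000.0]})
low_demo = pd.DataFrame({'property_value': [180000.0, 240000.0, 300000.0]})

# Series.mean() returns the arithmetic mean and skips missing values by default.
high_mean = high_demo['property_value'].mean()
low_mean = low_demo['property_value'].mean()

print(high_mean, low_mean)
```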
The output shows the mean of each group. The mean property value of high % minority areas is less than that of low % minority areas, though not by much. Let's test whether the difference between these two means is statistically significant.
from scipy import stats
alpha = 0.05
statistic, pvalue = stats.ttest_ind( high_minority['property_value'], low_minority['property_value'], equal_var=False )
pvalue < alpha   # reject H_0?
Since the output is True (assuming the inputs above were entered correctly), we have sufficient evidence at the 0.05 significance level to reject the null hypothesis that the two means are equal. In other words, we can conclude that, based on our data and among properties valued below 500,000, the population of high % minority areas has a different mean property value than the population of low % minority areas.
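Passing equal_var=False asks ttest_ind for Welch's t-test, which does not assume the two groups share a variance. To make the calculation less of a black box, the sketch below recomputes the statistic and two-sided p-value by hand on made-up samples (a and b are assumptions, not mortgage data) and checks them against scipy.

```python
import numpy as np
from scipy import stats

# Made-up samples; not drawn from the mortgage data.
a = np.array([150.0, 180.0, 200.0, 210.0, 260.0])
b = np.array([190.0, 230.0, 240.0, 300.0, 320.0, 350.0])

# Per-group variance of the sample mean (sample variance divided by n).
va = a.var(ddof=1) / len(a)
vb = b.var(ddof=1) / len(b)

# Welch's statistic: difference in means over the unpooled standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

# Two-sided p-value from the t distribution's survival function.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

With samples as large as a full mortgage dataset, even a small difference in means can produce a p-value below 0.05, so a significant result says the difference is unlikely to be due to chance alone, not that it is large.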