import pandas as pd
df = pd.read_csv( 'practice-project-dataset-1.csv' )
df = df[['interest_rate','property_value','state_code','tract_minority_population_percent','derived_race','derived_sex','applicant_age']]
df.info()
import numpy as np
df['interest_rate'] = df['interest_rate'].replace( 'Exempt', np.nan )
df['interest_rate'] = df['interest_rate'].astype( float )
df['property_value'] = df['property_value'].replace( 'Exempt', np.nan )
df['property_value'] = df['property_value'].astype( float )
df.info()
df['derived_race'] = df['derived_race'].astype( 'category' )
df['derived_sex'] = df['derived_sex'].astype( 'category' )
df['applicant_age'] = df['applicant_age'].astype( 'category' )
df.info()
One question the data raises is which applications have a property value below $500,000. We are also interested in which of those applications come from tracts where the minority population percentage is greater than 75 or less than 25.
lower_prices = df[df['property_value'] < 500000]
high_minority = lower_prices[lower_prices['tract_minority_population_percent'] > 75]
low_minority = lower_prices[lower_prices['tract_minority_population_percent'] < 25]
The code above answers these questions by creating three data frames. The first, lower_prices, contains only the rows of df with property values below $500,000. The second, high_minority, contains only the rows of lower_prices with a tract minority population percentage greater than 75. The third, low_minority, contains only the rows of lower_prices with a tract minority population percentage less than 25. There is no output from these lines because we did not write a line to display any of the new data frames. No output is fine here: it again indicates that the code ran without error.
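To check what each filtered data frame contains, one option is to print its row count. The sketch below repeats the same boolean-mask filtering on a small made-up table; the toy frame and every value in it are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df; values are made up for illustration.
toy = pd.DataFrame({
    'property_value': [150000.0, 620000.0, 310000.0, np.nan, 480000.0],
    'tract_minority_population_percent': [80.0, 90.0, 10.0, 50.0, 40.0],
})

# Same filtering steps as in the text, applied to the toy frame.
toy_lower = toy[toy['property_value'] < 500000]
toy_high = toy_lower[toy_lower['tract_minority_population_percent'] > 75]
toy_low = toy_lower[toy_lower['tract_minority_population_percent'] < 25]

# Row counts show how many rows survive each filter.
print(len(toy_lower), len(toy_high), len(toy_low))
```

Note that the row with a missing property value is dropped by the first filter, because a comparison against NaN evaluates to False.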
Next, we would like a histogram of property values that shows, for each range of values, the proportion of applications from tracts with high and low minority percentages. This lets us see which property values are more common in each group. To make the histogram, we need to import matplotlib.pyplot, which we abbreviate as plt.
import matplotlib.pyplot as plt
plt.hist( [ high_minority['property_value'], low_minority['property_value'] ], bins=20, density=True )
plt.legend( [ 'High % minority', 'Low % minority' ] )
plt.title( 'Sample of 2018 Home Mortgage Applications' )
plt.xlabel( 'Property Value' )
plt.ylabel( 'Proportion' )
plt.show()
The output above is the resulting histogram. In the code, we selected the property_value column from the two data frames we created, high_minority and low_minority. We then set the number of bins to 20 and passed density=True so that the bars show a probability density rather than raw counts, which makes the two groups comparable even though they contain different numbers of rows. Then we added a legend and titled the graph and its axes. The histogram shows that the high-minority group has the larger proportions in roughly the $130,000 to $230,000 range. As property value rises, high minority percentages become less likely.
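To illustrate what density=True does, here is a minimal sketch using NumPy's histogram function, which accepts the same option. The sample values are hypothetical, chosen only for illustration. With density=True, the bar heights are rescaled so that the total area of the bars (height times bin width) is 1.

```python
import numpy as np

# Hypothetical sample of property values, made up for illustration.
values = np.array([120000, 150000, 180000, 200000, 210000, 260000, 340000, 450000])

# density=True rescales the counts so that the bar areas sum to 1.
density, edges = np.histogram(values, bins=4, density=True)

# Area of each bar is its height times its bin width.
widths = np.diff(edges)
total_area = float(np.sum(density * widths))
print(total_area)
```

The printed total should be 1 up to floating-point rounding, whatever the sample size. This is why two groups of different sizes can be compared on the same axes.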
Based on what we found above, we are now interested in the mean property values for the high_minority and low_minority data frames.
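One way to compute such means is the mean method of a pandas column. The sketch below uses two small hypothetical frames (toy_high and toy_low, with made-up values) rather than the real data; it is an illustration of the call, not necessarily the exact code used here.

```python
import pandas as pd

# Hypothetical stand-ins for the two filtered data frames; values are made up.
toy_high = pd.DataFrame({'property_value': [100000.0, 200000.0, 300000.0]})
toy_low = pd.DataFrame({'property_value': [150000.0, 250000.0, 350000.0]})

# Series.mean() averages the column, skipping missing values by default.
high_mean = toy_high['property_value'].mean()
low_mean = toy_low['property_value'].mean()
print(round(high_mean, 2), round(low_mean, 2))
```

Applied to the real data, the same call would be made on high_minority['property_value'] and low_minority['property_value'].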
The means were calculated. The data frame high_minority, which has minority percentages greater than 75, had a mean property value of $229,579.65 (rounded). The data frame low_minority, which has minority percentages less than 25, had a mean property value of $240,573.25 (rounded). So the mean property value for tracts with less than 25% minorities is higher than the mean property value for tracts with greater than 75% minorities. To check whether this difference is statistically significant, we run a two-sample t-test with the null hypothesis that the two means are equal.
from scipy import stats
alpha = 0.05
statistic, pvalue = stats.ttest_ind( high_minority['property_value'], low_minority['property_value'], equal_var=False )
pvalue < alpha   # reject H_0?
We set alpha to 0.05 and calculated the p-value. The expression pvalue < alpha evaluates to True, so the p-value is less than alpha. We can therefore reject the null hypothesis and conclude that the mean property values for the two data frames are significantly different. Because the low_minority mean is the larger of the two, the mean property value for tracts with minority percentages less than 25 is significantly higher than the mean property value for tracts with minority percentages greater than 75. We can infer from this that higher minority percentages may indicate lower property values. (Remember that this applies only to property values less than $500,000.)
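The test itself only says that the means differ; the direction comes from comparing the means or, equivalently, from the sign of the test statistic. As a minimal sketch, the example below runs the same unequal-variance (Welch) t-test on two synthetic samples. The names group_a and group_b and the generated values are hypothetical, used only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical samples, generated only for illustration.
group_a = rng.normal(loc=200000, scale=10000, size=50)
group_b = rng.normal(loc=250000, scale=10000, size=50)

alpha = 0.05
# equal_var=False requests Welch's t-test, which does not assume equal variances.
statistic, pvalue = stats.ttest_ind(group_a, group_b, equal_var=False)

reject = bool(pvalue < alpha)    # True means we reject the null hypothesis
a_lower = bool(statistic < 0)    # a negative statistic means group_a has the lower mean
print(reject, a_lower)
```

Because the statistic is computed from the first sample's mean minus the second's, a negative value indicates that the first sample has the lower mean.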