Atlantis Datathon 2024
By Nikolaj Kim
This app highlights my process of analyzing the lack of affordable housing in Orange County using data from Melissa. The notebook uses the Consumer Data and ZIP Data sets provided for this competition.
Motivation
In response to the pressing scarcity of affordable housing in Orange County, this project conducts an analysis aimed at uncovering the sources of the problem. With home prices outpacing income growth and widening the socioeconomic gap, there is a need to understand the factors contributing to this disparity. Using data-driven methods and visualization techniques, this notebook seeks to highlight patterns and trends within Orange County's housing landscape. Our ultimate goal is to empower stakeholders and help them develop effective strategies in the right places, ensuring a more equitable future for all residents of Orange County.
Data Exploration
We started by creating a base map of Orange County using a california_zipcodes.json file from https://github.com/OpenDataDE/State-zip-code-GeoJSON/blob/master/ca_california_zip_codes_geo.min.json.
After trimming the JSON file down to the geographical boundaries of Orange County zip codes, we mapped consumer distributions by zip code.
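The trimming step can be sketched as follows. This is a minimal illustration, not the notebook's original code: the helper name filter_geojson_by_zip, the property name ZCTA5CE10, and the tiny in-memory sample are assumptions made for the example.

```python
import json


def filter_geojson_by_zip(geojson, zip_codes, zip_property="ZCTA5CE10"):
    """Return a copy of a FeatureCollection that keeps only the listed zip codes.

    zip_property is an assumed name for the property holding the zip code;
    check the actual file before relying on it.
    """
    wanted = {str(z) for z in zip_codes}
    kept = [
        feature
        for feature in geojson.get("features", [])
        if str(feature.get("properties", {}).get(zip_property)) in wanted
    ]
    return {**geojson, "features": kept}


# Tiny stand-in for the statewide file (geometry omitted for brevity).
california = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"ZCTA5CE10": "92617"}, "geometry": None},
        {"type": "Feature", "properties": {"ZCTA5CE10": "90740"}, "geometry": None},
        {"type": "Feature", "properties": {"ZCTA5CE10": "94102"}, "geometry": None},
    ],
}

orange_county = filter_geojson_by_zip(california, ["92617", "90740"])
kept_zips = [f["properties"]["ZCTA5CE10"] for f in orange_county["features"]]

# The trimmed collection can be serialized back to JSON text for later reuse.
trimmed_text = json.dumps(orange_county)
```

With the real file, the statewide dictionary would come from json.load on the downloaded GeoJSON, and the list of zip codes would be the Orange County zip codes present in the ZIP Data set.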
Median Household Income by Zip Code in Orange County:
The first step in visualizing which areas face the most intense disparities between income and home value is to see which areas are struggling the most financially relative to the rest of Orange County.
This map shows that the zip codes with the lowest median household incomes in Orange County are mostly clustered in the northwestern portion of the county. We should keep an eye on this sector moving forward.
Home value vs Median Household Income / Per Capita Income
The next question is: what is the ratio of home value to household income in each zip code?
In this map, we are looking for zip codes with high home value to income ratios, which highlight areas where home values are significantly larger than the median income. The highest disparity appears in the beach cities, and the next highest cluster appears in the same northwestern sector of Orange County noted in the previous map.
The map shows the ratio of home value to household income within each zip code. To set a bar, let's look at zip codes with a Home Value to Income ratio greater than 10. This leaves us with the following zip codes:
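The ratio and the cutoff of 10 can be sketched with pandas as below. The column names and the three sample rows are hypothetical placeholders, not values from the Melissa data.

```python
import pandas as pd

# Hypothetical sample rows; zip codes and dollar amounts are placeholders.
zip_data = pd.DataFrame(
    {
        "zip_code": ["00001", "00002", "00003"],
        "median_home_value": [900_000, 600_000, 1_200_000],
        "median_household_income": [60_000, 100_000, 150_000],
    }
)

# Home Value to Income ratio: how many years of income one home costs.
zip_data["home_value_to_income"] = (
    zip_data["median_home_value"] / zip_data["median_household_income"]
)

RATIO_THRESHOLD = 10  # the bar set in the text
high_ratio = zip_data[zip_data["home_value_to_income"] > RATIO_THRESHOLD]
```

In this toy sample only the first row (ratio 15) clears the bar; the other two rows have ratios of 6 and 8.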
These 53 zip codes have the highest Home Value to Income ratios in Orange County. However, what causes these high ratios, and is the cause really the same for every zip code?
Simplify the Data Pt. 1
To find out, we compared the Home Value to Income ratio with the actual median household income of each zip code using a scatterplot:
We can see that the main cluster of zip codes with a high Home Value to Income ratio lies to the left of the example threshold line at 140k. Although 140k is an arbitrary threshold, we can infer that zip codes to the right of it are likely not suffering from poverty; rather, they contain very high-income households, so their high ratios reflect even higher home values rather than low incomes.
An even better way to visualize this is to use per capita income instead of median household income, as in the following scatterplot:
Let's eliminate any zip codes with a per capita income greater than 75k and add the city name corresponding to each zip code.
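A minimal sketch of this filter and the city-name lookup, assuming hypothetical column names and placeholder rows (the city labels are deliberately generic):

```python
import pandas as pd

# Hypothetical per-zip metrics; all values are placeholders.
zips = pd.DataFrame(
    {
        "zip_code": ["00001", "00002", "00003"],
        "home_value_to_income": [15.0, 12.0, 8.0],
        "per_capita_income": [30_000, 90_000, 40_000],
    }
)

# Hypothetical zip-to-city lookup table.
city_lookup = pd.DataFrame(
    {
        "zip_code": ["00001", "00002", "00003"],
        "city": ["City A", "City B", "City C"],
    }
)

# Keep zip codes with a ratio above 10 and a per capita income below 75k.
mask = (zips["home_value_to_income"] > 10) & (zips["per_capita_income"] < 75_000)

# A left merge keeps every filtered zip code even if its city name is missing.
zips_of_interest = zips[mask].merge(city_lookup, on="zip_code", how="left")
```

Here the second row is dropped for its high per capita income and the third for its low ratio, leaving a single zip code labeled "City A".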
Now we have a visualization of the zip codes that have both a Home Value to Income ratio greater than 10 and a per capita income below $75,000. Leaving the other zip codes blank makes it easy to see where the zip codes of interest are located geographically. Once again, the cluster of zip codes in the northwestern sector of the county shows up.
Simplify the Data Pt. 2
While we're at it, let's also consider median age in our analysis.
We can visualize the median age by creating a bar graph.
Here, we see 3 outliers: zip codes 92617, 90743, and 90740.
The median age in zip code 92617 is very low because it contains UCI, where a larger percentage of younger people live. Zip codes 90743 and 90740 contain senior homes [1], which significantly increases the median age. Full-time students and individuals in senior homes bring down the median and per capita income and drastically increase the home value to income ratio. We will exclude these 3 outliers moving forward.
[1] Leisure World has almost 10,000 residents in its communities, over 20% of the population of Seal Beach. https://en.wikipedia.org/wiki/Leisure_World,_Seal_Beach,_California#:~:text=Leisure%20World%20houses%20approximately%209%2C600,two%2Dbedroom%20apartments%20and%20condominiums.
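Excluding the three outlier zip codes named above can be sketched as follows. The data frame is a hypothetical stand-in that holds only a zip_code column; the two placeholder zip codes are not real.

```python
import pandas as pd

# The three outliers identified in the text.
OUTLIER_ZIPS = ["92617", "90743", "90740"]

# Hypothetical stand-in for the per-zip table.
zip_data = pd.DataFrame(
    {"zip_code": ["92617", "90743", "90740", "00001", "00002"]}
)

# Compare zip codes as strings so that "90740" and 90740 cannot silently mismatch.
zip_data["zip_code"] = zip_data["zip_code"].astype(str)

cleaned = zip_data[~zip_data["zip_code"].isin(OUTLIER_ZIPS)].reset_index(drop=True)
```

Casting to strings first is a defensive choice: if the zip codes were loaded as integers, a membership test against string values would match nothing and the outliers would remain in the data.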
With these zip codes removed and some extra data cleaning, we can recreate our map of zip codes that have a Home Value to Income ratio greater than 10 and a per capita income below 75k.
The map above is an updated version of the Home Value to Income Ratio by Zip Code in Orange County map with the outliers removed.
Data Analysis... Why are things the way they are?
Education
Let's take a look at education. Do these areas have such an uneven ratio between income and home value because of a lack of education?
Note: it is imperative to drop duplicates above. If this is not done, the duplicate rows are summed and the "yes college" rates can reach 500-600 percent.
Now that we've created a data frame that contains the college attendance rate, let's create a bar graph that displays this data.
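The effect of duplicates on the rate can be illustrated with a toy example. The mechanism shown here is an assumption about how the inflation arises: consumer rows are counted per zip code and divided by a fixed population, so repeated rows inflate the numerator. The column names, the helper college_rate, and all values are hypothetical.

```python
import pandas as pd


def college_rate(consumers, population):
    """Percent of each zip code's population flagged as having attended college."""
    yes_counts = consumers[consumers["attended_college"]].groupby("zip_code").size()
    return (yes_counts / population * 100).round(1)


# Four hypothetical consumers in one placeholder zip code; two attended college.
consumers = pd.DataFrame(
    {
        "consumer_id": [1, 2, 3, 4],
        "zip_code": ["00001"] * 4,
        "attended_college": [True, True, False, False],
    }
)
population = pd.Series({"00001": 4})

# Simulate a table in which every consumer row appears three times.
duplicated = pd.concat([consumers] * 3, ignore_index=True)

inflated = college_rate(duplicated, population)                 # counts each person 3x
clean = college_rate(duplicated.drop_duplicates(), population)  # counts each person once
```

With duplicates left in, the toy rate comes out at 150 percent; after drop_duplicates it falls to 50 percent, which matches the kind of impossible value described in the note.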
As the bar graph shows, there are no major differences between the zip codes in the inquiry set and those outside it. The percentage of the population that went to college varies by a similar amount in both groups.
Let's see whether high school completion rates change anything:
We see that there is not much disparity in high school completion rates either. This does not support the theory that a lack of education explains high home values relative to income.
Homeowner/Renter Rates
Let's look at whether residents of the inquiry zip codes have a different homeowner/renter rate than residents of the other zip codes.
Let's create a new dataframe called zipcode_data_for_ownership.
Let's visualize this with a stacked bar chart.
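The table behind such a stacked bar chart can be sketched as a normalized cross-tabulation. The column names, group labels, and rows below are hypothetical and do not reproduce the contents of zipcode_data_for_ownership.

```python
import pandas as pd

# Hypothetical household records: which group the zip code belongs to,
# and whether the household owns or rents.
households = pd.DataFrame(
    {
        "group": ["of_interest"] * 4 + ["other"] * 4,
        "tenure": [
            "owner", "renter", "renter", "owner",
            "owner", "owner", "owner", "renter",
        ],
    }
)

# Row-normalized cross-tabulation: each row sums to 100 percent,
# which is exactly the shape a stacked bar chart needs.
ownership_shares = (
    pd.crosstab(households["group"], households["tenure"], normalize="index") * 100
)
```

If matplotlib is installed, calling ownership_shares.plot(kind="bar", stacked=True) would draw one bar per group with the owner and renter shares stacked on top of each other.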
Once again, it can be argued that there is no large difference between the zip codes we are interested in and the remaining zip codes. Although zip codes with lower home value to income ratios do seem to have higher renter rates in some cases, there are numerous potential causes for this, and it is a much more complex topic.