"Selling California": Are House Prices Equivalent to House Values?
Background and Hypothesis
Data Collection
We started by obtaining Zillow data on the median prices of houses in ten California metropolitan statistical areas (MSAs). We cleaned the median price data by dropping columns that were irrelevant to the analysis, such as the region ID and size rank.
Next, we obtained Zillow house values for the same ten MSAs and cleaned that data by dropping the same columns we had dropped from the median price table.
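The cleaning step described above could look roughly like the following pandas sketch. It is not the project's actual code: the column names (RegionID, SizeRank, RegionName), the helper name clean_zillow, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for a Zillow export. Column names and values are assumptions,
# not taken from the real files.
raw = pd.DataFrame({
    "RegionID": [1, 2],
    "SizeRank": [2, 12],
    "RegionName": ["Los Angeles, CA", "San Francisco, CA"],
    "2018-01-31": [600000.0, 900000.0],
    "2018-02-28": [605000.0, 910000.0],
})

def clean_zillow(df, drop_cols=("RegionID", "SizeRank")):
    """Drop identifier columns and index the table by MSA name (hypothetical helper)."""
    # errors="ignore" keeps the call safe if a column is already absent
    cleaned = df.drop(columns=list(drop_cols), errors="ignore")
    return cleaned.set_index("RegionName")

prices = clean_zillow(raw)
```

Because the same columns are dropped from both tables, one helper like this could be applied to the median price data and to the house value data alike.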
Visualizations
We created two scatterplots to compare the trends in median house price and house value across the ten California MSAs. The two visualizations show that the MSAs with higher house values over the years also have higher median prices overall.
We converted the house value data into a list for each MSA and graphed each list against months spanning January 2000 to January 2022. The house value visualization shows a similar trend for all the MSAs: values increase until about 2007, decrease until 2012, and then increase steadily after that.
For the median prices, we followed the same steps: we converted the median price data into a list for each MSA and graphed each list against months spanning January 2018 to January 2022.
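As a rough sketch of the list conversion and plotting steps, assuming a wide table with one row per MSA and one column per month: the function names series_by_msa and plot_msas, the axis labels, and the toy values are hypothetical, and matplotlib is assumed as the plotting library.

```python
import pandas as pd

# Toy wide table: one row per MSA, one column per month (values are made up).
values = pd.DataFrame(
    {
        "2000-01-31": [200.0, 150.0],
        "2000-02-29": [202.0, 151.0],
        "2000-03-31": [205.0, 149.0],
    },
    index=["MSA A", "MSA B"],
)

def series_by_msa(df):
    """Return the month axis and a dict mapping each MSA to its list of values."""
    months = pd.to_datetime(df.columns)
    return months, {msa: row.tolist() for msa, row in df.iterrows()}

def plot_msas(df, title):
    """Scatter each MSA's values against the month axis."""
    import matplotlib.pyplot as plt  # imported here so the conversion works without it
    months, series = series_by_msa(df)
    fig, ax = plt.subplots()
    for msa, ys in series.items():
        ax.scatter(months, ys, s=8, label=msa)
    ax.set_xlabel("Month")
    ax.set_ylabel("Value (USD)")
    ax.set_title(title)
    ax.legend()
    return fig

months, series = series_by_msa(values)
```

Calling plot_msas once on the house value table and once on the median price table would give the two scatterplots, each with its own date range on the x-axis.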
A/B Testing
We test whether there is a statistically significant difference between the house value and the median price of houses in ten major regions of California, using a significance level of 0.05. To do so, we use A/B testing. A/B testing starts from a labeled data set and repeatedly randomizes the labels to build a distribution of the test statistic under the null hypothesis. The original observed statistic is then compared to that distribution.
The null hypothesis is that there is no difference between the median price and the house value for the specified region; the alternative hypothesis is that there is a difference between the two.
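The label-shuffling procedure can be sketched as a permutation test. The version below is a minimal illustration, not the project's function: the choice of test statistic (absolute difference in group means), the number of shuffles, the name permutation_test, and the sample numbers are all assumptions.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Shuffling the pooled values is equivalent to shuffling the labels
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (extreme + 1) / (n_perm + 1)

# Made-up numbers for illustration only (thousands of dollars)
price = [610, 640, 655, 700, 720, 750]
value = [600, 635, 660, 690, 715, 745]

observed, p_value = permutation_test(price, value)
reject_null = p_value < 0.05  # compare against the 0.05 significance level
```

The p-value is the share of shuffles whose statistic is at least as extreme as the observed one; the null hypothesis is rejected only when that share falls below 0.05.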
Machine Learning Model
We combined the San Francisco and Fresno data into one data frame so that we could plot both cities on the same graph.
As with our A/B testing function, we wrote a few helper functions to compare other cities to San Francisco. The function citydatacollect selects one city from the Kaggle data and removes any irrelevant values.
The concatenate function takes two cities and makes a data frame with the city name, house price, and year built.
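A minimal sketch of how two such helpers might be written is shown below. Only the function names come from the text; the column names (city, price, yr_built), the definition of "irrelevant" as missing or non-positive values, and the toy rows are assumptions, so the real implementations may differ.

```python
import pandas as pd

# Toy stand-in for the Kaggle listings table; column names and values are assumptions.
listings = pd.DataFrame({
    "city": ["San Francisco", "San Francisco", "Fresno", "Fresno", "Fresno"],
    "price": [1200000.0, None, 310000.0, 0.0, 295000.0],
    "yr_built": [1925, 1960, 1988, 2001, None],
})

def citydatacollect(df, city):
    """Select one city's rows and drop rows with missing or non-positive values."""
    rows = df.loc[df["city"] == city, ["city", "price", "yr_built"]]
    rows = rows.dropna(subset=["price", "yr_built"])
    return rows[rows["price"] > 0].reset_index(drop=True)

def concatenate(df, city_a, city_b):
    """Stack two cleaned cities into one frame of city, price, and year built."""
    return pd.concat(
        [citydatacollect(df, city_a), citydatacollect(df, city_b)],
        ignore_index=True,
    )

sf_fresno = concatenate(listings, "San Francisco", "Fresno")
```

Keeping the city name as a column in the combined frame makes it straightforward to color or group points by city when both are drawn on the same graph.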