Goal
This notebook walks through an end-to-end machine learning project: building a model that predicts the median housing price in any district in the state of California. The primary dataset is the California Housing Prices dataset from the StatLib repository, which is based on data from the 1990 California census. Aurélien Géron modified the dataset for his book Hands-On Machine Learning, and it is available from his GitHub repository. This notebook is inspired by that book and uses many of the techniques Géron describes in it.
Machine Learning Checklist
Before starting any ML project, the first step is to draw up a checklist that helps scope the project and set achievable milestones. For this project, I work through the checklist below, and the notebook is divided into subsections that follow it.
Frame the Problem
Framing the problem means identifying the business objective behind the ML model. This involves analyzing how the company plans to use the model and benefit from it, which in turn helps shortlist candidate models and select the performance metrics used to evaluate them.
The goal of this model is to help determine whether to invest your money into real estate in a particular district of California.
The problem can be narrowed down further along the three categories below, which help pick the most suitable algorithm.
Get the Data
The required data is stored in a CSV file compressed into a TGZ archive. The function below downloads the TGZ file from a GitHub repository and extracts the data from it.
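As a minimal sketch of such a download helper, the code below fetches a TGZ archive and unpacks it into a local folder using only the standard library. The URL, the local folder, and the function name `fetch_housing_data` are assumptions made for illustration; adjust them to wherever the archive actually lives.

```python
import tarfile
import urllib.request
from pathlib import Path

# Assumed values for illustration: the archive URL and the local target folder.
HOUSING_URL = (
    "https://raw.githubusercontent.com/ageron/handson-ml/master/"
    "datasets/housing/housing.tgz"
)
HOUSING_PATH = Path("datasets") / "housing"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download a TGZ archive and extract its contents into housing_path."""
    housing_path = Path(housing_path)
    # Create the target folder if it does not exist yet.
    housing_path.mkdir(parents=True, exist_ok=True)

    # Save the archive next to the files that will be extracted from it.
    tgz_path = housing_path / "housing.tgz"
    urllib.request.urlretrieve(housing_url, tgz_path)

    # Unpack every member of the archive into the target folder.
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

    return housing_path
```

Calling `fetch_housing_data()` with no arguments would download to `datasets/housing` under the current working directory. Keeping the URL and path as parameters makes it easy to point the helper at a different copy of the archive.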