The big picture
Use California census data to build a model of housing prices in the state.
The model should be able to predict median housing price in any district using other metrics. This is a supervised learning task, because the given dataset is labeled, and regression is an appropriate choice because the goal is to predict a continuous value.
Select a performance measure
Root Mean Square Error (RMSE) is a typical performance measure for regression problems.
RMSE corresponds to the L2 norm of the error vector, whereas MAE corresponds to the L1 norm. The higher the norm index, the more it focuses on large values and neglects small ones, which is why RMSE is more sensitive to outliers than MAE. But when outliers are rare (e.g. a bell-shaped distribution), RMSE performs very well and is generally preferred.
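A minimal sketch of the two measures, using made-up labels and predictions just to show how the single large error dominates RMSE but not MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy labels and predictions (illustrative values only)
y_true = np.array([210_000, 320_000, 150_000, 500_000])
y_pred = np.array([200_000, 310_000, 180_000, 350_000])

mae = mean_absolute_error(y_true, y_pred)            # L1-style: average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # L2-style: the $150K miss weighs much more
print(mae, rmse)
```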
Get the data
Download
Examine the data
Observations on the data:
Create a test set
We need to create a test set, put it aside, and never look at it.
You can use the train_test_split on multiple datasets with the same number of rows, and it will split them on the same indices. For example, you might have a separate dataframe with the target values.
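A quick sketch of that usage, with two hypothetical DataFrames that share the same rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example: features and targets kept in separate DataFrames with aligned rows
features = pd.DataFrame({"median_income": [2.5, 3.1, 4.8, 6.0, 1.9]})
targets = pd.DataFrame({"median_house_value": [150_000, 200_000, 310_000, 450_000, 120_000]})

# Passing both to train_test_split splits them on the same shuffled indices
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=42)
```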
Stratified sampling
Let's say you speak w/ experts who say that median income is a very important attribute to predict median housing prices. You want to ensure that the test set is representative of various categories of incomes in the whole dataset.
Since median income is a continuous variable, we need to assign categories. Looking at the data above, we see most income values fall between 1.5 and 6 ($15K to $60K), but some go far beyond that. We also need to make sure that we have a sufficient number of examples in each category (not too many categories).
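A sketch of how this can be done, assuming the data is in a DataFrame named `housing`: bucket `median_income` with `pd.cut()`, then use `StratifiedShuffleSplit` to sample each stratum proportionally (the bin edges below follow the ranges noted above).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Temporary income category used only for stratification
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_idx]
    strat_test_set = housing.iloc[test_idx]

# The helper column is no longer needed once the split is done
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
```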
Discover and visualize the data to gain insights
We should start with setting aside the test set to make sure we're only exploring the training data.
Looking for correlations
We can calculate the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method, especially since the dataset is not too large.
Remember that a correlation coefficient close to 1 means a strong positive correlation, and a coefficient close to -1 means there is a strong negative correlation.
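A minimal sketch, assuming the training data is in a DataFrame named `housing`:

```python
# numeric_only=True skips the text column (ocean_proximity) on recent pandas versions
corr_matrix = housing.corr(numeric_only=True)

# Correlation of every numerical attribute with the target, strongest first
corr_matrix["median_house_value"].sort_values(ascending=False)
```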
We can use scatter_matrix() to visualize correlation between attributes. Since there are 11 numerical columns, this would result in 11^2 = 121 plots, so let's just focus on the most promising ones.
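For example, restricting the plot to a handful of attributes (the selection below is illustrative):

```python
from pandas.plotting import scatter_matrix

# Focus on a few promising attributes instead of all 11 numerical columns
attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
```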
Notice how there seem to be horizontal lines around $450K, another around $350K, and a few others. We might need to remove these districts to prevent the algorithm from learning to reproduce these data quirks.
Experimenting w/ attribute combinations
Notice how rooms_per_household is much more correlated w/ median_house_value than the total number of rooms or bedrooms.
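A sketch of how these combined attributes can be created and checked, again assuming the `housing` DataFrame:

```python
# Derived ratio features, as described above
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Re-check the correlations including the new attributes
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
```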
This exploration step does not have to be absolutely thorough. Rather, it's an iterative process. Once you have a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step.
Prepare the data for ML algorithms
It's helpful to build functions to automate these steps instead of doing them manually because:
First, we want to create a clean training set by separating the targets. We don't apply transformations to the target.
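A minimal sketch of this step, assuming `strat_train_set` from the stratified split above:

```python
# Revert to a clean copy of the training set, keeping the labels separate
housing = strat_train_set.drop("median_house_value", axis=1)  # drop() returns a copy
housing_labels = strat_train_set["median_house_value"].copy()
```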
Data cleaning
SimpleImputer can be used to replace missing values in a dataset. We have missing values for total_bedrooms.
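A sketch of using it with the median strategy (the median only works on numerical attributes, so the text column is set aside first):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")              # median is robust to outliers
housing_num = housing.drop("ocean_proximity", axis=1)   # numerical columns only

imputer.fit(housing_num)            # learns each column's median (stored in imputer.statistics_)
X = imputer.transform(housing_num)  # fills in the missing total_bedrooms values
```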
Handling text and categorical attributes
ML algorithms prefer to work with numbers, so we can use OrdinalEncoder to convert categories to numbers.
The problem with using ordinal encoding in this situation is that the ML algorithm will think numbers closer together are similar, whereas that's not necessarily the case here. We can use one-hot encoding to create dummy variables for each category.
The output of the OneHotEncoder is a SciPy sparse matrix, which saves space by storing only the locations of the nonzero elements. If you want to convert it to a regular (dense) NumPy array, you can use the toarray() method.
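A short sketch of the encoding step on the `ocean_proximity` column:

```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])

housing_cat_1hot            # SciPy sparse matrix
housing_cat_1hot.toarray()  # dense NumPy array, one dummy column per category
cat_encoder.categories_     # the learned list of categories
```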
Custom transformers
Often you will need to write your own transformers for tasks such as custom cleanup operations or combining specific attributes. All you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform() (which you get for free by adding TransformerMixin as a base class).
You can also add BaseEstimator as a base class to get two extra methods that are useful for automatic hyperparameter tuning: get_params() and set_params().
The transformer we created has a single hyperparameter, add_bedrooms_per_room. You can add hyperparameters to gate any data preparation step you're not sure about. The more you automate here, the more combinations you can automatically try out.
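A sketch of such a transformer, along the lines of the one these notes refer to. The column indices are assumptions that depend on the column order of the numerical array, so adjust them to your own data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of the relevant columns in the numerical array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs, so get_params() works
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```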
Feature scaling
Generally, ML algorithms do not perform well when numerical attributes have very different scales, which is the case for our housing data.
Two common ways to scale features: min-max scaling (normalization), which rescales values to the 0-1 range, and standardization, which subtracts the mean and divides by the standard deviation (it does not bound values to a specific range, but it is much less affected by outliers).
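A quick sketch of both scalers, assuming the numerical training data is in `housing_num` as above:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()    # rescales each attribute to the 0-1 range
housing_num_minmax = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()      # zero mean, unit variance; less affected by outliers
housing_num_std = std_scaler.fit_transform(housing_num)
```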
Transformation pipelines
Use Pipeline to create and execute a sequence of transformations.
We can use ColumnTransformer to create a single transformer that can handle both numerical and categorical columns together.
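A sketch of the full preprocessing pipeline, assuming `housing_num` and the custom transformer sketched earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical steps run in sequence: impute, add combined attributes, scale
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

num_attribs = list(housing_num)      # numerical column names
cat_attribs = ["ocean_proximity"]

# Route each column list to the right sub-pipeline
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
```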
Select and train a model
Train and evaluate on the training set
We got it to work, but the predictions are not very accurate. We can measure the RMSE on the whole training set to see how accurate we were.
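A sketch of training a Linear Regression model and measuring the training-set RMSE, assuming `housing_prepared` and `housing_labels` from the previous section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
lin_rmse  # roughly the $68K typical error discussed below
```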
Most districts' median house values range from $120K to $265K, so a typical prediction error of $68K is not very good. We're definitely underfitting the training data. To fix this, we can select a more powerful model, feed the algorithm better features, or reduce the constraints on the model (not applicable here, since this model isn't regularized).
Let's move on to using DecisionTreeRegressor, which is capable of finding complex nonlinear relationships.
It's unlikely that the model is actually perfect. Rather, we've probably badly overfit the data. How can we be sure? To examine this, we don't want to touch the test set. Rather, we will use part of the training set for training and part of it for model validation.
Better evaluation using cross-validation
We can use train_test_split() to further split our training data into training and validation sets. However, it's easier to use K-fold cross validation. The model is trained K times, each time on K-1 of the folds and evaluated on the single remaining fold.
The cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is the negative of the MSE: a higher MSE corresponds to a lower (more negative) score.
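For example, a sketch of 10-fold cross-validation for the Decision Tree, again assuming `housing_prepared` and `housing_labels`:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)   # flip the sign back before taking the square root
tree_rmse_scores.mean(), tree_rmse_scores.std()
```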
We can see that the Decision Tree model is so badly overfitting that it actually performs worse than the Linear Regression model. We can try a RandomForestRegressor, which works by training many Decision Trees on random subsets of the features and averaging out their predictions (this is an example of ensemble learning).
Random Forests look very promising. Note that the scores on the validation sets are still much worse than on the training set, meaning that we are probably overfitting the training set. We can simplify the model, constrain (i.e. regularize) it, or get a lot more training data.
Before diving too deep into a given model and tweaking hyperparameters, it's a good idea to try out many other models. The goal is to shortlist a few (two to five) promising models.
Also, consider using the joblib library to save every model you experiment with so that you can come back easily to any model you want. Make sure to save hyperparameters, trained parameters, as well as the cross-validation scores and perhaps actual predictions. This will allow you to compare scores across model types and compare the types of errors they make.
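A minimal sketch with joblib (`my_model` is just a placeholder name for whatever fitted estimator you want to keep):

```python
import joblib

joblib.dump(my_model, "my_model.pkl")          # persist the fitted model to disk
my_model_loaded = joblib.load("my_model.pkl")  # reload it later for comparison
```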
Fine-tune your model
Grid search
Instead of fiddling with hyperparameters manually, use GridSearchCV, which will take given hyperparameter values and use cross-validation to evaluate all the possible combinations.
Below, we have 3x4 = 12 models to train from the first line, and 2x3 = 6 models to train from the second. This makes 18 models total, and each is trained 5 times because we're using K-fold cross validation. This means we're doing 90 rounds of training (might take a while).
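A sketch of a grid search matching that description (the specific hyperparameter values are illustrative, and the prepared data from earlier is assumed):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # first grid: 3 x 4 = 12 combinations
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # second grid: 2 x 3 = 6 combinations, with bootstrapping turned off
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
```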
Some useful things to know about GridSearchCV: the best combination of hyperparameters is available in best_params_, the corresponding refitted model in best_estimator_ (by default refit=True, so it is retrained on the full training set), and the evaluation scores of every combination in cv_results_.
Randomized search
RandomizedSearchCV can be used instead, and it is preferable when the hyperparameter search space is large. Instead of trying out all possible combinations, it evaluates a given number of random combinations (a random value for each hyperparameter is selected at each iteration).
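A sketch of a randomized search over the Random Forest, sampling each hyperparameter from a distribution instead of a fixed list (the ranges below are illustrative):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
rnd_search.best_params_
```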
Ensemble methods
Another way to fine-tune your system is by combining the models that perform the best. The group/ensemble often performs better than the best individual model (e.g. our Random Forest performed better than individual Decision Trees).
Analyze the best models and their errors
We can drop some of the less useful features (e.g. only one ocean_proximity category is really useful, so you could try dropping the others).
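A sketch of inspecting the feature importances of the best model found by the grid search; the attribute name list assumes the ordering produced by the `full_pipeline` and custom transformer sketched earlier:

```python
# Relative importance of each feature according to the best Random Forest
feature_importances = grid_search.best_estimator_.feature_importances_

extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs

sorted(zip(feature_importances, attributes), reverse=True)
```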
Evaluate your system on the test set
All we have to do is get the predictors and labels from the test set, run full_pipeline to transform the data, and evaluate the final model on the test set.
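A sketch of that final step, assuming the stratified test set, the fitted full_pipeline, and the grid search from earlier:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)   # transform(), NOT fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
```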
Exercises
Here are solutions to these exercises.
This is much worse than RandomForestRegressor.
Notice how the best value of C is the maximum tested value. In a case like this, you want to launch the grid search again with higher values of C (removing the smallest values), because it is likely that higher values of C will improve performance.
This is much closer to RandomForestRegressor, but not quite there yet.
This time we find a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time.
Let's look at the exponential distribution we used with scale=1.0. Some samples are much larger or smaller than 1.0, but when you look at the log of the distribution, you can see that most values are actually concentrated roughly in the same range of exp(-2) to exp(+2), which is about 0.1 to 7.4.
The distribution used for C looks different: the scale (order of magnitude) of the samples is picked uniformly within a given range, which is why the right graph, which shows the log of the samples, looks roughly flat. This distribution is useful when you don't have a clue what the target scale is.
The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be.
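A small sketch comparing samples from the two distributions (the reciprocal range below is illustrative, not necessarily the exact one used for C):

```python
import numpy as np
from scipy.stats import expon, reciprocal

# Draw samples from both distributions (seeded for reproducibility)
expon_samples = expon(scale=1.0).rvs(size=10_000, random_state=42)
recip_samples = reciprocal(20, 200_000).rvs(size=10_000, random_state=42)

# expon: most values fall within roughly exp(-2)..exp(+2) times the scale
# reciprocal: the log of the samples is uniform, so every order of magnitude is equally likely
np.percentile(expon_samples, [5, 95]), np.percentile(recip_samples, [5, 95])
```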
Note: this feature selector assumes that you have already computed the feature importances somehow (for example using a RandomForestRegressor). You may be tempted to compute them directly in the TopFeatureSelector's fit() method, however this would likely slow down grid/randomized search since the feature importances would have to be computed for every hyperparameter combination (unless you implement some sort of cache).
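A sketch of what such a selector might look like; `indices_of_top_k` is a small assumed helper, and the feature importances are passed in precomputed, as the note above explains:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    # indices of the k largest values, in ascending order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances  # precomputed elsewhere
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]
```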
Let's create a new pipeline that runs the previously defined prep pipeline and adds top k feature selection.
The pipeline works, but the predictions are not fantastic. They would probably be better if we used the best RandomForestRegressor that we found earlier, rather than the best SVR.
Note: In the code below, I've set the OneHotEncoder's handle_unknown hyperparameter to 'ignore', to avoid warnings during training. Without this, the OneHotEncoder would default to handle_unknown='error', meaning that it would raise an error when transforming any data containing a category it didn't see during training. If we kept the default, then the GridSearchCV would run into errors during training when evaluating the folds in which not all the categories are in the training set. This is likely to happen since there's only one sample in the 'ISLAND' category, and it may end up in the test set in some of the folds. So some folds would just be dropped by the GridSearchCV, and it's best to avoid that.