The big picture
Use California census data to build a model of housing prices in the state.
The model should be able to predict median housing price in any district using other metrics. This is a supervised learning task, because the given dataset is labeled, and regression is an appropriate choice because the goal is to predict a continuous value.
Select a performance measure
Root Mean Square Error (RMSE) is a typical performance measure for regression problems.
RMSE corresponds to the L2 norm of the error vector, whereas MAE corresponds to the L1 norm. The higher the norm index, the more it focuses on large values and neglects small ones, which is why RMSE is more sensitive to outliers than MAE. But when outliers are rare (e.g. a bell-shaped distribution), RMSE performs very well and is generally preferred.
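A minimal sketch of the two measures, using made-up labels and predictions just to show how the single large error dominates RMSE but not MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy labels and predictions (illustrative values only)
y_true = np.array([210_000, 320_000, 150_000, 500_000])
y_pred = np.array([200_000, 310_000, 180_000, 350_000])

mae = mean_absolute_error(y_true, y_pred)            # L1-style: average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # L2-style: the $150K miss weighs much more
print(mae, rmse)
```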
Get the data
Download
Examine the data
Observations on the data:
Create a test set
We need to create a test set, put it aside, and never look at it.
You can use the train_test_split on multiple datasets with the same number of rows, and it will split them on the same indices. For example, you might have a separate dataframe with the target values.
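A quick sketch of that usage, with two hypothetical DataFrames that share the same rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical example: features and targets kept in separate DataFrames with aligned rows
features = pd.DataFrame({"median_income": [2.5, 3.1, 4.8, 6.0, 1.9]})
targets = pd.DataFrame({"median_house_value": [150_000, 200_000, 310_000, 450_000, 120_000]})

# Passing both to train_test_split splits them on the same shuffled indices
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=42)
```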
Stratified sampling
Let's say you speak w/ experts who say that median income is a very important attribute to predict median housing prices. You want to ensure that the test set is representative of various categories of incomes in the whole dataset.
Since median income is a continuous variable, we need to assign categories. Looking at the data above, we see most income values fall between 1.5 and 6 ($15K to $60K), but some go far beyond that. We also need to make sure that we have a sufficient number of examples in each category (not too many categories).
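A sketch of how this can be done, assuming the data is in a DataFrame named `housing`: bucket `median_income` with `pd.cut()`, then use `StratifiedShuffleSplit` to sample each stratum proportionally (the bin edges below follow the ranges noted above).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Temporary income category used only for stratification
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_idx, test_idx in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.iloc[train_idx]
    strat_test_set = housing.iloc[test_idx]

# The helper column is no longer needed once the split is done
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
```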
Discover and visualize the data to gain insights
We should start with setting aside the test set to make sure we're only exploring the training data.
Looking for correlations
We can calculate the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method, especially since the dataset is not too large.
Remember that a correlation coefficient close to 1 means a strong positive correlation, and a coefficient close to -1 means there is a strong negative correlation.
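A minimal sketch, assuming the training data is in a DataFrame named `housing`:

```python
# numeric_only=True skips the text column (ocean_proximity) on recent pandas versions
corr_matrix = housing.corr(numeric_only=True)

# Correlation of every numerical attribute with the target, strongest first
corr_matrix["median_house_value"].sort_values(ascending=False)
```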
We can use scatter_matrix() to visualize correlation between attributes. Since there are 11 numerical columns, this would result in 11^2 = 121 plots, so let's just focus on the most promising ones.
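For example, restricting the plot to a handful of attributes (the selection below is illustrative):

```python
from pandas.plotting import scatter_matrix

# Focus on a few promising attributes instead of all 11 numerical columns
attributes = ["median_house_value", "median_income",
              "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
```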
Notice how there seem to be horizontal lines around $450K, another around $350K, and a few others. We might need to remove these districts to prevent the algorithm from learning to reproduce these data quirks.
Experimenting w/ attribute combinations
Notice how rooms_per_household is much more correlated w/ median_house_value than the total number of rooms or bedrooms.
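A sketch of how these combined attributes can be created and checked, again assuming the `housing` DataFrame:

```python
# Derived ratio features, as described above
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

# Re-check the correlations including the new attributes
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
```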
This exploration step does not have to be absolutely thorough. Rather, it's an iterative process. Once you have a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step.
Prepare the data for ML algorithms
It's helpful to build functions to automate these steps instead of doing them manually because:
First, we want to create a clean training set by separating the targets. We don't apply transformations to the target.
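A minimal sketch of this step, assuming `strat_train_set` from the stratified split above:

```python
# Revert to a clean copy of the training set, keeping the labels separate
housing = strat_train_set.drop("median_house_value", axis=1)  # drop() returns a copy
housing_labels = strat_train_set["median_house_value"].copy()
```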
Data cleaning
SimpleImputer can be used to replace missing values in a dataset. We have missing values for total_bedrooms.
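A sketch of using it with the median strategy (the median only works on numerical attributes, so the text column is set aside first):

```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")              # median is robust to outliers
housing_num = housing.drop("ocean_proximity", axis=1)   # numerical columns only

imputer.fit(housing_num)            # learns each column's median (stored in imputer.statistics_)
X = imputer.transform(housing_num)  # fills in the missing total_bedrooms values
```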
Handling text and categorical attributes
ML algorithms prefer to work with numbers, so we can use OrdinalEncoder to convert categories to numbers.
The problem with using ordinal encoding in this situation is that the ML algorithm will think numbers closer together are similar, whereas that's not necessarily the case here. We can use one-hot encoding to create dummy variables for each category.
The output of the OneHotEncoder is a SciPy sparse matrix, which saves space by storing only the locations of the nonzero elements. If you want to convert it to a regular (dense) NumPy array, you can use the toarray() method.
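A short sketch of the encoding step on the `ocean_proximity` column:

```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing[["ocean_proximity"]])

housing_cat_1hot            # SciPy sparse matrix
housing_cat_1hot.toarray()  # dense NumPy array, one dummy column per category
cat_encoder.categories_     # the learned list of categories
```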
Custom transformers
Often you will need to write your own transformers for tasks such as custom cleanup operations or combining specific attributes. All you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform() (which you get for free by adding TransformerMixin as a base class).
You can also add BaseEstimator as a base class to get two extra methods that are useful for automatic hyperparameter tuning: get_params() and set_params().
The transformer we created has a single hyperparameter, add_bedrooms_per_room. You can add hyperparameters to gate any data preparation step you're not sure about. The more you automate here, the more combinations you can automatically try out.
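A sketch of such a transformer, along the lines of the one these notes refer to. The column indices are assumptions that depend on the column order of the numerical array, so adjust them to your own data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Assumed positions of the relevant columns in the numerical array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs, so get_params() works
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        return np.c_[X, rooms_per_household, population_per_household]
```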
Feature scaling
Generally, ML algorithms do not perform well when numerical attributes have very different scales, which is the case for our housing data.
Two common ways to scale features: min-max scaling (normalization), which rescales values to the 0-1 range, and standardization, which subtracts the mean and divides by the standard deviation (it does not bound values to a specific range, but it is much less affected by outliers).
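A quick sketch of both scalers, assuming the numerical training data is in `housing_num` as above:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()    # rescales each attribute to the 0-1 range
housing_num_minmax = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()      # zero mean, unit variance; less affected by outliers
housing_num_std = std_scaler.fit_transform(housing_num)
```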
Transformation pipelines
Use Pipeline to create and execute a sequence of transformations.
We can use ColumnTransformer to create a single transformer that can handle both numerical and categorical columns together.
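A sketch of the full preprocessing pipeline, assuming `housing_num` and the custom transformer sketched earlier:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical steps run in sequence: impute, add combined attributes, scale
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("attribs_adder", CombinedAttributesAdder()),
    ("std_scaler", StandardScaler()),
])

num_attribs = list(housing_num)      # numerical column names
cat_attribs = ["ocean_proximity"]

# Route each column list to the right sub-pipeline
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
```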
Select and train a model
Train and evaluate on the training set
We got it to work, but the predictions are not very accurate. We can measure the RMSE on the whole training set to see how accurate we were.
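A sketch of training a Linear Regression model and measuring the training-set RMSE, assuming `housing_prepared` and `housing_labels` from the previous section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

housing_predictions = lin_reg.predict(housing_prepared)
lin_rmse = np.sqrt(mean_squared_error(housing_labels, housing_predictions))
lin_rmse  # roughly the $68K typical error discussed below
```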
Most districts' median house values range from $120K to $265K, so a typical prediction error of $68K is not very good. We're definitely underfitting the training data. To fix this, we can select a more powerful model, feed the algorithm better features, or reduce the constraints on the model (not applicable here, since this model isn't regularized).
Let's move on to using DecisionTreeRegressor, which is capable of finding complex nonlinear relationships.
It's unlikely that the model is actually perfect. Rather, we've probably badly overfit the data. How can we be sure? To examine this, we don't want to touch the test set. Rather, we will use part of the training set for training and part of it for model validation.
Better evaluation using cross-validation
We can use train_test_split() to further split our training data into training and validation sets. However, it's easier to use K-fold cross validation. The model is trained K times, each time on K-1 of the folds and evaluated on the single remaining fold.
The cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is the negative of the MSE: a higher MSE corresponds to a lower (more negative) score.
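For example, a sketch of 10-fold cross-validation for the Decision Tree, again assuming `housing_prepared` and `housing_labels`:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)   # flip the sign back before taking the square root
tree_rmse_scores.mean(), tree_rmse_scores.std()
```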
We can see that the Decision Tree model is so badly overfitting that it actually performs worse than the Linear Regression model. We can try a RandomForestRegressor, which works by training many Decision Trees on random subsets of the features and averaging out their predictions (this is an example of ensemble learning).
Random Forests look very promising. Note that the scores on the validation sets are still much worse than on the training set, meaning that we are probably overfitting the training set. We can simplify the model, constrain (i.e. regularize) it, or get a lot more training data.
Before diving too deep into a given model and tweaking hyperparameters, it's a good idea to try out many other models. The goal is to shortlist a few (two to five) promising models.
Also, consider using the joblib library to save every model you experiment with so that you can come back easily to any model you want. Make sure to save hyperparameters, trained parameters, as well as the cross-validation scores and perhaps actual predictions. This will allow you to compare scores across model types and compare the types of errors they make.
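A minimal sketch with joblib (`my_model` is just a placeholder name for whatever fitted estimator you want to keep):

```python
import joblib

joblib.dump(my_model, "my_model.pkl")          # persist the fitted model to disk
my_model_loaded = joblib.load("my_model.pkl")  # reload it later for comparison
```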
Fine-tune your model
Grid search
Instead of fiddling with hyperparameters manually, use GridSearchCV, which will take given hyperparameter values and use cross-validation to evaluate all the possible combinations.
Below, we have 3x4 = 12 models to train from the first line, and 2x3 = 6 models to train from the second. This makes 18 models total, and each is trained 5 times because we're using K-fold cross validation. This means we're doing 90 rounds of training (might take a while).
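A sketch of a grid search matching that description (the specific hyperparameter values are illustrative, and the prepared data from earlier is assumed):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [
    # first grid: 3 x 4 = 12 combinations
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # second grid: 2 x 3 = 6 combinations, with bootstrapping turned off
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring="neg_mean_squared_error",
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
```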
Some useful things to know about GridSearchCV: the best combination of hyperparameters is available in best_params_, the corresponding refitted model in best_estimator_ (by default refit=True, so it is retrained on the full training set), and the evaluation scores of every combination in cv_results_.
Randomized search
RandomizedSearchCV can be used instead, and it is preferable when the hyperparameter search space is large. Instead of trying out all possible combinations, it evaluates a given number of random combinations (a random value for each hyperparameter is selected at each iteration).
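A sketch of a randomized search over the Random Forest, sampling each hyperparameter from a distribution instead of a fixed list (the ranges below are illustrative):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(low=1, high=200),
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
rnd_search.best_params_
```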
Ensemble methods
Another way to fine-tune your system is by combining the models that perform the best. The group/ensemble often performs better than the best individual model (e.g. our Random Forest performed better than individual Decision Trees).
Analyze the best models and their errors
We can drop some of the less useful features (e.g. only one ocean_proximity category is really useful, so you could try dropping the others).
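A sketch of inspecting the feature importances of the best model found by the grid search; the attribute name list assumes the ordering produced by the `full_pipeline` and custom transformer sketched earlier:

```python
# Relative importance of each feature according to the best Random Forest
feature_importances = grid_search.best_estimator_.feature_importances_

extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(full_pipeline.named_transformers_["cat"].categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs

sorted(zip(feature_importances, attributes), reverse=True)
```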
Evaluate your system on the test set
All we have to do is get the predictors and labels from the test set, run full_pipeline to transform the data, and evaluate the final model on the test set.
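A sketch of that final step, assuming the stratified test set, the fitted full_pipeline, and the grid search from earlier:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)   # transform(), NOT fit_transform()
final_predictions = final_model.predict(X_test_prepared)
final_rmse = np.sqrt(mean_squared_error(y_test, final_predictions))
```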
Exercises
Here are solutions to these exercises.
This is much worse than RandomForestRegressor.
Notice how the best value of C is the maximum tested value. In a case like this, you want to launch the grid search again with higher values of C (removing the smallest values), because it is likely that higher values of C will improve performance.
This is much closer to RandomForestRegressor, but not quite there yet.
This time we find a good set of hyperparameters for the RBF kernel. Randomized search tends to find better hyperparameters than grid search in the same amount of time.
Let's look at the exponential distribution we used with scale=1.0. Some samples are much larger or smaller than 1.0, but when you look at the log of the distribution, you can see that most values are actually concentrated roughly in the same range of exp(-2) to exp(+2), which is about 0.1 to 7.4.
The distribution used for C looks different: the scale (order of magnitude) of the samples is picked uniformly within a given range, which is why the right graph, which shows the log of the samples, looks roughly flat. This distribution is useful when you don't have a clue what the target scale is.
The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be.
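A small sketch comparing samples from the two distributions (the reciprocal range below is illustrative, not necessarily the exact one used for C):

```python
import numpy as np
from scipy.stats import expon, reciprocal

# Draw samples from both distributions (seeded for reproducibility)
expon_samples = expon(scale=1.0).rvs(size=10_000, random_state=42)
recip_samples = reciprocal(20, 200_000).rvs(size=10_000, random_state=42)

# expon: most values fall within roughly exp(-2)..exp(+2) times the scale
# reciprocal: the log of the samples is uniform, so every order of magnitude is equally likely
np.percentile(expon_samples, [5, 95]), np.percentile(recip_samples, [5, 95])
```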
Note: this feature selector assumes that you have already computed the feature importances somehow (for example using a RandomForestRegressor). You may be tempted to compute them directly in the TopFeatureSelector's fit() method, however this would likely slow down grid/randomized search since the feature importances would have to be computed for every hyperparameter combination (unless you implement some sort of cache).
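A sketch of what such a selector might look like; `indices_of_top_k` is a small assumed helper, and the feature importances are passed in precomputed, as the note above explains:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    # indices of the k largest values, in ascending order
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances  # precomputed elsewhere
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]
```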
Let's create a new pipeline that runs the previously defined prep pipeline and adds top k feature selection.
The pipeline works, but the predictions are not fantastic. They would probably be better if we used the best RandomForestRegressor that we found earlier, rather than the best SVR.
Note: In the code below, I've set the OneHotEncoder's handle_unknown hyperparameter to 'ignore', to avoid warnings during training. Without this, the OneHotEncoder would default to handle_unknown='error', meaning that it would raise an error when transforming any data containing a category it didn't see during training. If we kept the default, then the GridSearchCV would run into errors during training when evaluating the folds in which not all the categories are in the training set. This is likely to happen since there's only one sample in the 'ISLAND' category, and it may end up in the test set in some of the folds. So some folds would just be dropped by the GridSearchCV, and it's best to avoid that.