# Machine Learning in Python - Workshop 4

## 1. Setup

### 1.1 Packages

In the cell below we load the core libraries we will be using for this workshop and set some sensible defaults for our plot size and resolution.
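A minimal version of such a setup cell might look like the following; the specific default values are illustrative choices, not requirements:

```python
# Core libraries used throughout this workshop
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

# Sensible defaults for plot size and resolution (illustrative values)
plt.rcParams["figure.figsize"] = (8, 5)
plt.rcParams["figure.dpi"] = 100
```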

### 1.2 Data

To start we will again be using the same data set from Workshop 3, which was generated via a random draw from a Gaussian process model. It represents an unknown smooth function $f(x)$ that is observed with noise, such that $y_i = f(x_i) + \epsilon_i$. The data have been randomly thinned to include only 100 observations.

We can read the data in from `gp.csv` and plot it to see the overall trend in the data.
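A sketch of the read-and-plot step; the column names `x` and `y` are an assumption about `gp.csv`, and a synthetic stand-in with the same general shape is generated when the file is not present:

```python
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

if Path("gp.csv").exists():
    d = pd.read_csv("gp.csv")
else:
    # Stand-in data resembling the workshop file (an assumption)
    rng = np.random.default_rng(1234)
    x = np.sort(rng.uniform(0, 1, 100))
    d = pd.DataFrame({"x": x, "y": np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 100)})

# Scatter plot of the raw observations
plt.scatter(d["x"], d["y"], s=12)
plt.xlabel("x")
plt.ylabel("y")
```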

## 2. Cross validation

In this section we will explore some of the tools that `sklearn` provides for cross validation, for the purpose of model evaluation and selection. The most basic form of CV is to split the data into a training set and a testing set, which can be achieved using `train_test_split` from the `model_selection` submodule. Here we provide the function with our model matrix $X$ and outcome vector $y$ to obtain a test and train split of both.

The additional argument `test_size` determines the proportion of data to include in the test set, and `random_state` is the seed used when determining the partition assignments (keeping the seed the same will result in the same partition(s) each time this cell is rerun).

We can check the dimensions of the original and new objects using their shape attributes,
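As a sketch, using synthetic stand-in data (the workshop's own `X` and `y` objects are not reproduced here), the split and shape check might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in model matrix and response matching the workshop's 100 observations
rng = np.random.default_rng(1234)
X = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

# The split partitions the 100 rows into 75 train / 25 test
print(X.shape, X_train.shape, X_test.shape)
```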

With these new objects we can try several polynomial regression models, with different values of `M`, and compare their performance. Our goal is to fit the models using the training data and then evaluate their performance using the test data, so that we can avoid potential overfitting.

We will assess the models' performance using root mean squared error,

$$ \text{rmse} = \left( \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \right)^{1/2} $$

with sklearn this is calculated using the `mean_squared_error` function with the argument `squared=False` (newer versions of `sklearn` instead provide a dedicated `root_mean_squared_error` function). This metric is entirely equivalent to mean squared error for the purposes of ranking / ordering models (as the square root is a monotonic transformation), but the rmse is often preferred as it is more interpretable: it has the same units as $y$.

The following code uses a `for` loop to fit 30 polynomial regression models with $M = 1,2,\ldots,30$ and calculates the rmse of the training data and the testing data for each model. Note that we fit the model *only* using the `train` split, and predict using both the `train` and the `test` splits to get the train and test rmses respectively.
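One way to sketch this loop, with stand-in data; `np.sqrt(mean_squared_error(...))` is used here as a version-robust equivalent of `squared=False`:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Stand-in data (an assumption; the workshop uses gp.csv)
rng = np.random.default_rng(1234)
X = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234
)

res = []
for degree in range(1, 31):
    # Fit on the train split only
    m = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    m.fit(X_train, y_train)
    # Predict on both splits to get the train and test rmses
    res.append({
        "degree": degree,
        "train_rmse": np.sqrt(mean_squared_error(y_train, m.predict(X_train))),
        "test_rmse": np.sqrt(mean_squared_error(y_test, m.predict(X_test))),
    })
res = pd.DataFrame(res)
```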

We can then plot `degree` vs `rmse` for these splits and observe the resulting pattern.

### ♦ Exercise 1

Based on these results, what value of $M$ produces the best model? Justify your answer.

*It looks like the test rmse reaches a minimum around a degree of 17. Other large degree values produce
rmses that are roughly equivalent, so all else being equal we should select the simpler model, i.e. the model
with fewer parameters.*

### ♦ Exercise 2

Try adjusting the proportion of the data in the test vs training sets. How does this change the training and testing rmse curves?

*Changing the proportion does lead to the curves changing, particularly the test rmse, but generally the pattern seems consistent,
with the "best" model having a degree of around 17.*

### 2.1 k-fold cross validation

The basic implementation of k-fold cross validation is provided by `KFold` in the `model_selection` submodule of `sklearn`. This function / object behaves in a similar way to the other `sklearn` tools we've seen before: we use the function to construct an object with some basic options, and the resulting object can then be used to interact with our data / model matrix.

Here we have set up the object `kf` to implement 5-fold cross validation for our data, which we can then apply using the `split` method on our model matrix `X` and response vector `y`.

This returns a generator which is used to produce the indices of the training data and test data for each fold. We can use a list comprehension with this generator to view these values.
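A sketch with a stand-in `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data: 100 rows, matching the workshop's sample size
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype=float)

kf = KFold(n_splits=5)
# Materialize the generator: one (train_idx, test_idx) pair per fold
folds = [(train_idx, test_idx) for train_idx, test_idx in kf.split(X, y)]

print(folds[0][1])  # test indices for the first fold
```

Note that with the default `shuffle=False`, the folds are contiguous blocks of rows, which is exactly the behaviour Exercise 3 asks you to notice.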

### ♦ Exercise 3

Examine the index values that make up the test and training sets. Do you notice anything interesting about how `KFold` has divided up the data?

*The fold partitions are sequential, e.g. fold 1 is 0,1,...,19; fold 2 is 20,...,39; and so on.*

### ♦ Exercise 4

Does the structure in Exercise 3 have any implications for model evaluation? Specifically, are the data in the different folds independent of each other? Explain why this is important.

*This is problematic for our model as our data are sorted over the $x$ values; as such the folds represent
the first 20 observations, then the next 20, and so on. Given we are modeling a smooth function, these observations
and folds will not be independent of each other, and this will significantly affect our ability to accurately fit
a model.*

Using a `for` loop with the `KFold` generator it is possible to create test and train subsets, fit the model of interest, and calculate the training and testing metrics, but this requires a fair bit of bookkeeping code to implement; see an example of this here.
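A sketch of that bookkeeping, with stand-in data; the fixed degree of 5 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1234)
X = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=1234)
fold_rmses = []
for train_idx, test_idx in kf.split(X):
    # Subset, fit, predict, and score each fold by hand
    m = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
    m.fit(X[train_idx], y[train_idx])
    resid = y[test_idx] - m.predict(X[test_idx])
    fold_rmses.append(np.sqrt(np.mean(resid**2)))

print(np.mean(fold_rmses))
```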

However, `sklearn` provides a more convenient way of doing all of this using the `cross_val_score` function from the `model_selection` submodule. This function takes as arguments our model or pipeline (any object implementing `fit`) along with our full model matrix $X$ and response $y$. The argument `cv` is a cross-validation generator that determines the splitting strategy (e.g. a `KFold` object).

Here we have used `"neg_root_mean_squared_error"` as our scoring metric, which, as the name implies, returns the negative of the root mean squared error. This is because sklearn expects to always optimize for the maximum of a score, so the model with the largest negative rmse will be the "best". To get a list of all available scoring metrics for `cross_val_score` you can run `sorted(sklearn.metrics.SCORERS.keys())`.

To obtain these 5-fold CV estimates of rmse for our models we slightly modify our original code as follows,
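A sketch of the modified loop using `cross_val_score`, again with stand-in data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1234)
X = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 100)

kf = KFold(n_splits=5, shuffle=True, random_state=1234)
rows = []
for degree in range(1, 31):
    m = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # Negate the scores to recover ordinary (positive) rmses
    fold_rmses = -cross_val_score(m, X, y, cv=kf, scoring="neg_root_mean_squared_error")
    rows.append({"degree": degree, "mean_rmse": fold_rmses.mean()})
cv = pd.DataFrame(rows)
```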

This fits $5 \times 30$ polynomial regression models and stores the results in the data frame `cv`. The `mean_rmse` column contains the average rmse across all 5 folds.

We can now plot these data to assess the different models and their performance in fitting these data.

The plot above is a bit hard to read and compare, particularly for the lower degree polynomial models. We can address this by using a log scale for the y-axis.

### ♦ Exercise 5

Do these CV rmses agree with the results we obtained when using `train_test_split`? How do they differ: is it only a single fold that differs, or several?

*Some of the individual folds look similar to the previous results, e.g. fold 1. However, there is a
fair bit of variability between the folds. Overall, the mean rmse seems to indicate models with slightly
fewer degrees perform better vs. our previous results.*

### ♦ Exercise 6

Based on these results, what value of $M$ do you think produces the best model, justify your answer.

*Based on these plots it appears that $M=13$ produces the lowest average rmse; however, $M=11$ is very close.*

### 2.2 CV Grid Search

We can further reduce the amount of code needed if we wish to test over a specific set of parameter values using cross validation. This is accomplished using the `GridSearchCV` function from the `model_selection` submodule.

This function works similarly to `cross_val_score`, but with the addition of the `param_grid` argument. This argument is a dictionary containing parameter names as keys and lists of parameter settings to try as values. Since we are using a pipeline, our parameter name will be the name of the pipeline step, `polynomialfeatures`, followed by `__`, and then the parameter name, `degree`. So for our pipeline the parameter is named `polynomialfeatures__degree`. If you want to list any model's available parameters you can call the `get_params()` method on the model object, e.g. `m.get_params()` here.
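A sketch of the grid search over the polynomial degree, with stand-in data; the fold setup and degree range are assumptions matching the earlier examples:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1234)
X = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 100)

m = make_pipeline(PolynomialFeatures(), LinearRegression())
grid_search = GridSearchCV(
    m,
    param_grid={"polynomialfeatures__degree": list(range(1, 31))},
    cv=KFold(n_splits=5, shuffle=True, random_state=1234),
    scoring="neg_root_mean_squared_error",
)
grid_search.fit(X, y)

# Negate best_score_ to recover the ordinary rmse of the best model
print(grid_search.best_params_, -grid_search.best_score_)
```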

The above code goes through the process of fitting all $5 \times 30$ models as well as storing and ranking the results for the requested scoring metric(s).

Once all of the submodels are fit, we can determine the optimal hyperparameter value by accessing the object's `best_*` attributes,

Here we see that this procedure has selected the model with polynomial degree 13 as having the best fit; this model achieved an average rmse of $0.196$.

Additional useful details from the CV process are available in the `cv_results_` attribute, which provides CV and scoring details,

Finally, the `best_estimator_` attribute gives direct access to the "best" model (pipeline) object, which allows us to directly inspect model coefficients, make predictions, etc.

For example, if we wanted to plot this "best" model's fit to the data, we can use the object's `predict` method directly.

### ♦ Exercise 7

Do these results appear to be "robust"? More specifically, do you think that the degree 13 polynomial model is actually the best? If you were to change the `random_state` used for the shuffling of the folds, would this result change?

*The $M=13$ model only seems marginally better than other models with slightly fewer degrees, it seems
quite likely that different folds might produce a different but similar answer for the optimal model.*

## 3. Additional dimensions and CV

Let us now consider an additive regression model of the following form,

$$ y = f(x_1) + g(x_2) + h(x_3) + \epsilon $$

where $f()$, $g()$, and $h()$ are polynomials with fixed degrees; we will assume linear, quadratic, and cubic respectively, with the following coefficients:

$$
\begin{aligned}
f(x) &= 1.2 x + 1.1 \\
g(x) &= 2.5 x^2 - 0.9 x - 3.2 \\
h(x) &= 2 x^3 + 0.4 x^2 - 5.2 x + 2.7
\end{aligned}
$$

We generate values for $x_1$, $x_2$, $x_3$, and $\epsilon$ and then use these values to calculate observations of $y$ using the following code.
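A sketch of such a generating step; the sample size of 500, the uniform range for the $x$s, and the noise scale are assumptions, while the polynomial coefficients come from the text:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
n = 500

# Independent inputs (distribution and range are assumptions)
x1, x2, x3 = rng.uniform(-2, 2, (3, n))
eps = rng.normal(0, 1, n)

# The true functions f, g, and h from the text
f = 1.2 * x1 + 1.1
g = 2.5 * x2**2 - 0.9 * x2 - 3.2
h = 2 * x3**3 + 0.4 * x3**2 - 5.2 * x3 + 2.7

ex2 = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": f + g + h + eps})
```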

### ♦ Exercise 8

Create a pairs plot of these data. From this alone, is it possible to identify the polynomial relationships between $y$ and the $x$s?

*There are no obvious relationships between $x_1$, $x_2$, or $x_3$ but it does appear
that there are relationships between $y$ and $x_1$, $x_2$, and $x_3$ - although they all look
somewhat weak and linear*.

We will assume that we know each of the functions $f()$, $g()$, and $h()$ is at most of degree 3; we can try to fit a polynomial model to these data using the tools we've seen thus far.

The following line prints out the model coefficients of the fitted model. While we may have expected only 10 (or 9) parameters in this model, we instead get 20 coefficients.

The reason for this becomes clearer if we examine the `powers_` attribute, which returns the powers of $x_1$, $x_2$, and $x_3$ for each of the model terms. So, for example, the coefficient $0.282$ belongs to the $x_1^0 \, x_2^0 \, x_3^0$ term (i.e. the model intercept).
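The 20-term count can be checked directly; `PolynomialFeatures` of degree 3 on 3 features produces every combination of powers summing to at most 3, including the cross terms:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (10, 3))  # any 3-column matrix will do

pf = PolynomialFeatures(degree=3)
pf.fit(X)

print(pf.powers_.shape)  # 20 terms, one row of (p1, p2, p3) powers per term
print(pf.powers_[0])     # the all-zero powers row is the intercept term
```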

### ♦ Exercise 10

Compare the fitted coefficients with the "true" values for our data, how close are they?

*There is almost no correspondence between the true model and our model.*

*The true model is*
$$
\begin{aligned}
f(x) &= 1.2 x + 1.1 \\
g(x) &= 2.5 x^2 - 0.9 x - 3.2 \\
h(x) &= 2 x^3 + 0.4 x^2 - 5.2 x + 2.7
\end{aligned}
$$

*The fitted model is*
$$
\begin{aligned}
\hat{y} ={}& 0.282 \\
&+ 2.327 x_1 - 1.108 x_1^2 + 0.323 x_1^3 \\
&+ 0.201 x_2 + 0.835 x_2^2 + 0.666 x_2^3 \\
&- 4.559 x_3 - 0.326 x_3^2 + 2.207 x_3^3 \\
&- 1.161 x_1 x_2 - 0.984 x_1 x_3 - 0.482 x_2 x_3 \\
&+ 0.463 x_1^2 x_2 + 0.556 x_1^2 x_3 + 0.531 x_1 x_2^2 + 0.348 x_1 x_3^2 + 0.525 x_2^2 x_3 + 0.037 x_2 x_3^2 \\
&+ 0.129 x_1 x_2 x_3
\end{aligned}
$$

*Note that our model has $f(x_1)+ g(x_2)+h(x_3)$, therefore the overall intercept of the model should be:
$1.1-3.2+2.7 = 0.6$*

### ♦ Exercise 11

Calculate the 5-fold cross validation rmse for this model.

## 3.2 Column Transformers

For this particular model we do not want these interaction terms between our features, but this is not something that `sklearn` allows us to disable for `PolynomialFeatures`. This is actually a specific example of a more general issue where we do not want to apply a transformer to all of the features of a model at the same time. For this particular example, we would like to apply an individual polynomial transformation to each of the three $x$s. To do this we will use sklearn's `ColumnTransformer` and the `make_column_transformer` helper function from the `compose` submodule.

The arguments for `make_column_transformer` are tuples of the desired transformer object and the names of the columns that will have that transformer applied. Once constructed, the column transformer works like any other transformer object and can be applied via the `fit` and `transform` or `fit_transform` methods.
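A sketch, assuming a data frame `ex2` with columns `x1`, `x2`, `x3` (a synthetic stand-in is used here):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1234)
ex2 = pd.DataFrame(rng.uniform(-2, 2, (500, 3)), columns=["x1", "x2", "x3"])  # stand-in

# One degree-3 polynomial expansion per column, no intercept columns
ct = make_column_transformer(
    (PolynomialFeatures(degree=3, include_bias=False), ["x1"]),
    (PolynomialFeatures(degree=3, include_bias=False), ["x2"]),
    (PolynomialFeatures(degree=3, include_bias=False), ["x3"]),
)
Xt = ct.fit_transform(ex2)

print(Xt.shape)  # 9 columns: x, x^2, x^3 for each of the three inputs
```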

`ColumnTransformer`s are like `pipeline`s, but they additionally specify the column or columns each transformer is applied to. Using this transformer we take each feature and apply a single polynomial feature transformer of degree 3 (excluding the intercept / bias column), resulting in 9 total features as output (3 for each input feature). We can check that these values make sense by examining them along with the original values of the $x$s. Here we are using `include_bias=False` to avoid creating a rank deficient model matrix, which would result if all three polynomial feature transforms included the same intercept column.

A `ColumnTransformer` is like any other transformer and can therefore be included in a pipeline; this enables us to create a pipeline for fitting our desired polynomial regression model (with no interaction terms). Since the polynomial features no longer include an intercept, we can add this back to the model with `fit_intercept=True` in the linear regression step.
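A sketch of the full pipeline, using synthetic data generated from the true model above (the input distribution and noise scale are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1234)
n = 500
X = pd.DataFrame(rng.uniform(-2, 2, (n, 3)), columns=["x1", "x2", "x3"])
y = (
    (1.2 * X["x1"] + 1.1)
    + (2.5 * X["x2"] ** 2 - 0.9 * X["x2"] - 3.2)
    + (2 * X["x3"] ** 3 + 0.4 * X["x3"] ** 2 - 5.2 * X["x3"] + 2.7)
    + rng.normal(0, 1, n)
)

m = make_pipeline(
    make_column_transformer(
        (PolynomialFeatures(degree=3, include_bias=False), ["x1"]),
        (PolynomialFeatures(degree=3, include_bias=False), ["x2"]),
        (PolynomialFeatures(degree=3, include_bias=False), ["x3"]),
    ),
    LinearRegression(fit_intercept=True),
)
m.fit(X, y)

lr = m.named_steps["linearregression"]
print(lr.intercept_, lr.coef_)  # one intercept plus 9 coefficients (3 per feature)
```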

We can examine the fitted values of the coefficients by accessing the `linearregression` step and its `coef_` and `intercept_` attributes.

Instead of fitting directly, we can also use this pipeline with cross validation functions like `cross_val_score` to obtain a more reliable estimate of our model's rmse.

### ♦ Exercise 12

Is this rmse better or worse than the rmse calculated for the original model that included interactions? Explain why you think this is.

*The average rmse is very slightly better; the extra parameters from the interactions do not appear
to have improved the overall model fit. This makes sense, as $x_1$, $x_2$, and $x_3$ are independent.*

### 3.3 Column Transformers & CV Grid Search

Finally, we will see if we can come close to recovering the original forms of $f()$, $g()$, and $h()$ using `GridSearchCV`. This builds on our previous use of this function, but now we need to optimize over the degree parameter of all three polynomial feature transformers. We can see the names of these transformers by examining the `named_transformers_` attribute associated with the `columntransformer`,

This gives us the transformer names `polynomialfeatures-1`, `polynomialfeatures-2`, and `polynomialfeatures-3`, which are referenced by combining the step name with the transformer name and then the parameter name, separated by `__`. As such, the degree parameter for the first transformer will be `columntransformer__polynomialfeatures-1__degree`. It is also possible to view all of the parameters for a model or pipeline by looking at the keys returned by the `get_params` method.

To keep the space of parameters being explored reasonable we will restrict the possible values of the degree parameters to be in $1,\ldots,5$. This may take a little while, as we are now fitting a decently large number of models.
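A sketch of this three-way grid search, again on synthetic stand-in data generated from the true model (input distribution and noise scale are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(1234)
n = 500
X = pd.DataFrame(rng.uniform(-2, 2, (n, 3)), columns=["x1", "x2", "x3"])
y = (
    (1.2 * X["x1"] + 1.1)
    + (2.5 * X["x2"] ** 2 - 0.9 * X["x2"] - 3.2)
    + (2 * X["x3"] ** 3 + 0.4 * X["x3"] ** 2 - 5.2 * X["x3"] + 2.7)
    + rng.normal(0, 1, n)
)

m = make_pipeline(
    make_column_transformer(
        (PolynomialFeatures(include_bias=False), ["x1"]),
        (PolynomialFeatures(include_bias=False), ["x2"]),
        (PolynomialFeatures(include_bias=False), ["x3"]),
    ),
    LinearRegression(),
)

degrees = list(range(1, 6))
grid_search = GridSearchCV(
    m,
    param_grid={
        "columntransformer__polynomialfeatures-1__degree": degrees,
        "columntransformer__polynomialfeatures-2__degree": degrees,
        "columntransformer__polynomialfeatures-3__degree": degrees,
    },
    cv=KFold(n_splits=5, shuffle=True, random_state=1234),
    scoring="neg_root_mean_squared_error",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
```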

### ♦ Exercise 13

How many models have been fit and scored by `GridSearchCV`?

*The code as provided actually only fits models with degrees between 1 and 4, so in that case
we are testing $4 \times 4 \times 4$ different degree combinations, and for each there are 5 folds, so
there are $4^3 \times 5 = 320$ models being fit.*

*If we had correctly set the parameter ranges to $1,2,3,4,5$ then there would have been $5^3 \times 5 = 625$ models fit.*

Once fit, we can determine the optimal parameter values by accessing `grid_search`'s attributes,

### ♦ Exercise 14

Based on these results, have we done a good job of recovering the general structure of the functions $f()$, $g()$, and $h()$? E.g. have we correctly recovered the degrees of these functions?

*The recovered model is reasonably close to the true model - $f(x)$ is linear, $g(x)$ and $h(x)$ are cubic.*

We can directly access the properties of the "best" model, according to our scoring method, using the `best_estimator_` attribute. From this we can access the `linearregression` step of the pipeline to recover the model coefficients.

### ♦ Exercise 15

Compare the coefficient values we obtained via `GridSearchCV` to the true values used to generate the $y$ observations. How well have we recovered the true values of the coefficients?

*While not exactly correct, these values are much closer to the true model than our previous attempt.*

*The true model is:*
$$
\begin{aligned}
f(x) &= 1.2 x + 1.1 \\
g(x) &= 2.5 x^2 - 0.9 x - 3.2 \\
h(x) &= 2 x^3 + 0.4 x^2 - 5.2 x + 2.7
\end{aligned}
$$

*The predicted model is:*
$$
\begin{aligned}
\hat{y} ={}& 0.547 \\
&+ 1.269 x_1 \\
&- 0.480 x_2 + 1.479 x_2^2 + 0.593 x_2^3 \\
&- 4.954 x_3 + 0.002 x_3^2 + 2.112 x_3^3
\end{aligned}
$$

### ♦ Bonus Exercise

Repeat the analysis above but use only 200 observations of `ex2` instead of all 500. How does your resulting "best" model change? What about 100 or 50 observations? How dependent are the results on the original sample size?

## 4. Completing the worksheet

At this point you have hopefully been able to complete all of the preceding exercises. Now is a good time to check the reproducibility of this document by restarting the notebook's kernel and rerunning all cells in order.

Once that is done and you are happy with everything, you can then run the following cell to generate your PDF and turn it in on Gradescope under the `mlp-week04` assignment.