# Background to osteoarthritis case study

This is day 3 of the 5-day JADS NHS PROMs data science case study. To recap from the previous lectures: we have looked at defining the outcome of knee replacement.

Lecture 1:

- A good outcome for knee replacement Y is measured using the difference in Oxford Knee Score (OKS).
- Research has shown that an improvement in OKS score of approx. 30% is relevant (van der Wees 2017). Hence an increase of +14 points is considered a 'good' outcome.
- To account for the ceiling effect, a high final `t1_oks_score` is also considered a good outcome (even if `delta_oks_score` is smaller than 14).
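The outcome definition above can be sketched in code. Only the +14 threshold comes from the lectures; the ceiling cut-off used here (`ceiling_threshold=41`) is a hypothetical value for illustration:

```python
import pandas as pd

def good_outcome(df, delta_threshold=14, ceiling_threshold=41):
    # 'Good' if the OKS improved by at least 14 points, OR the final score is
    # already so high that further improvement is limited (ceiling effect).
    # NOTE: ceiling_threshold=41 is a made-up cut-off for illustration only.
    return (df["delta_oks_score"] >= delta_threshold) | (
        df["t1_oks_score"] >= ceiling_threshold
    )

patients = pd.DataFrame({
    "t1_oks_score": [45, 30, 20],
    "delta_oks_score": [5, 16, 8],
})
patients["good_outcome"] = good_outcome(patients)
# row 1: high final score (ceiling), row 2: large improvement, row 3: neither
```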

Lecture 2:

- We have constructed a combined outcome parameter using cut-off points for pain and physical functioning.

# Modeling: regression and linear modeling

## Learning objectives

### Modeling: regression and linear modeling

For this lecture we are going to build various predictors for `t1_eq_vas`, i.e. the reported quality of life after the operation. `t1_eq_vas` is measured on a scale from 0 to 100.

We are going to use stratified sampling to ensure we don't introduce sampling bias, using `StratifiedShuffleSplit`. This is different from the simpler function `train_test_split`, which is a random sampling method. Random sampling is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing significant sampling bias.
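A minimal sketch of a stratified split on synthetic data (the `age_band` column and its proportions are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: stratify on a categorical attribute (a hypothetical age band)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age_band": rng.choice(["50-59", "60-69", "70-79"], size=1000, p=[0.2, 0.5, 0.3]),
    "t1_eq_vas": rng.integers(0, 101, size=1000),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["age_band"]))
train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# The proportions of each age band are (approximately) preserved in the test set,
# unlike with a purely random train_test_split
test_props = test_set["age_band"].value_counts(normalize=True)
```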

By the end of this lecture you should:

- Know how to fit different regression models in Python using scikit-learn
- Know how to interpret and assess regression models, including the bias-variance trade-off

### Python: Hands-on Machine Learning (2nd edition)

### scikit-learn

# Data preparation in a scikit-learn Pipeline

Previously we have discussed the various steps in data preparation using pandas. As explained in the scikit-learn documentation, doing this preprocessing outside of scikit-learn may be problematic for one of the following reasons:

- Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
- You may want to include the parameters of the preprocessors in a parameter search.

To this purpose, the `ColumnTransformer` class has recently been added to scikit-learn. The documentation gives an example of how to use it for pre-processing mixed types. Historically, `sklearn` transformers are designed to work with NumPy arrays, not with pandas DataFrames. You can use `sklearn-pandas` to bridge this gap, or use `ColumnTransformer` directly on pandas DataFrames. We will use the latter.
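A minimal sketch of `ColumnTransformer` selecting columns by name directly from a pandas DataFrame (toy data and column names for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "t0_eq_vas": [70, 55, 80, 60],   # numeric feature
    "gender": ["M", "F", "F", "M"],  # categorical feature
})

# Each (name, transformer, columns) triple applies a transformer
# to a group of columns, selected by name from the DataFrame
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["t0_eq_vas"]),
    ("cat", OneHotEncoder(), ["gender"]),
])

X = preprocess.fit_transform(df)  # 1 scaled column + 2 one-hot columns
```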

## Using ColumnTransformer and Pipelines

Recalling from the second lecture, we want to perform the following preprocessing per (group of) columns. In case a feature requires more than one preprocessing step, the use of `Pipeline` is recommended.
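A sketch of chaining two preprocessing steps for the same group of columns by nesting a `Pipeline` inside a `ColumnTransformer` (the impute-then-scale combination is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"t0_eq_vas": [70.0, np.nan, 80.0, 60.0]})

# Two steps for the same columns: impute missing values first, then scale
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([("num", numeric_pipeline, ["t0_eq_vas"])])
X = preprocess.fit_transform(df)  # imputed and standardized (mean 0)
```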

### Passing 1D or 2D arrays in your `Pipeline`

It is important to remember that `scikit-learn` can be quite fussy about the difference between passing 1D arrays/Series and 2D arrays/DataFrames.

For example, the following code will result in an error because `categories` needs to be a list of lists:

```
enc = OrdinalEncoder(categories=age_band_categories)
enc.fit(df[age_band])
```

The correct code is (brackets!):

```
enc = OrdinalEncoder(categories=[age_band_categories])
enc.fit(df[age_band])
```

### Beware: difference between `OrdinalEncoder` and `OneHotEncoder`

The integer representation generated by `OrdinalEncoder` for a categorical variable cannot be used directly with all scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.

Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.
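For example (toy data for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"gender": ["M", "F", "F"]})

# OneHotEncoder turns each categorical feature with n_categories values
# into n_categories binary columns (output is sparse by default)
enc = OneHotEncoder()
X = enc.fit_transform(df[["gender"]]).toarray()
# Two categories ('F', 'M') -> two columns, exactly one 1 per row
```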

### Writing custom transformers (advanced, see Géron chapter 2)

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform().

When writing transformers for data preparation, you only need to define `transform()`. Basically, `ColumnTransformer` passes only the subset of columns from the original DataFrame to the transformer. So when writing your own transformer you don't need to do any subsetting: you can write the `transform()` method as if it applies to the whole DataFrame it receives.
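A minimal sketch of such a custom transformer; the `ClipOutliers` cleanup step is made up for illustration, and inheriting from `TransformerMixin` provides `fit_transform()` for free:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical cleanup step: clip values to a given range."""

    def __init__(self, lower=0, upper=100):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # ColumnTransformer hands us only the selected columns,
        # so no subsetting is needed here
        return np.clip(np.asarray(X, dtype=float), self.lower, self.upper)

df = pd.DataFrame({"t0_eq_vas": [-5, 50, 120], "other": [1, 2, 3]})
ct = ColumnTransformer([("clip", ClipOutliers(), ["t0_eq_vas"])])
X = ct.fit_transform(df)  # only 't0_eq_vas' is passed through and clipped
```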

## Training and assessing linear models

### Simple regression

Regression of `t1_eq_vas` ~ `t0_eq_vas`. We don't get our hopes up, since the scatterplot is all over the place.

So this very first, basic model yields an $R^2$ of 0.12, which is not very exciting. Let's do a more robust cross-validation using the Mean Squared Error (MSE) as our metric.

This confirms that a simple linear model has little flexibility (and high bias): the scores for the five CV folds are very similar. Now that we have seen the simplest setup for a univariate linear regression, let's try to find out which features are the best predictors.
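The setup above can be sketched as follows, on synthetic data standing in for the PROMs dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the PROMs data: weak linear relation plus noise
rng = np.random.default_rng(0)
t0_eq_vas = rng.uniform(0, 100, size=500)
t1_eq_vas = 0.3 * t0_eq_vas + 50 + rng.normal(0, 20, size=500)

X = t0_eq_vas.reshape(-1, 1)  # scikit-learn expects a 2D feature matrix
model = LinearRegression().fit(X, t1_eq_vas)
r2 = model.score(X, t1_eq_vas)  # R^2 of the fitted model

# 5-fold cross-validation with (negative) MSE as the scoring metric
scores = cross_val_score(model, X, t1_eq_vas, cv=5,
                         scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)  # similar values across folds -> high bias, low variance
```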

### SelectKBest

For regression tasks, you often want to get a first idea which features contain the most information i.e. are the best predictors. There are various techniques to answer this question, such as stepwise selection. Scikit-learn has various univariate feature selection methods for this purpose. We will use SelectKBest.
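A minimal `SelectKBest` sketch on synthetic data (using the univariate `f_regression` score, as is common for regression tasks):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 20 features, of which only 5 are informative
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Keep the 10 features with the highest univariate F-score
selector = SelectKBest(score_func=f_regression, k=10)
X_best = selector.fit_transform(X, y)

# Indices of the selected features in the original matrix
selected = selector.get_support(indices=True)
```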

Using the 10 best features according to `SelectKBest`, the model performs slightly better, with an RMSE of 16.1 +/- 0.08.

### Regularized linear models: Lasso regression

Construct a Lasso regression with cross-validation, following the example from the scikit-learn documentation. Recall that for regularized linear regression a penalty term is added to the cost function. In the case of Lasso this is

$$ J(\Theta) = \mathrm{MSE}(\Theta) + \alpha\sum\limits_{i=1}^{n} \mid \Theta_{i} \mid $$

The larger $\alpha$, the larger the penalty, and hence more coefficients will be set to zero. By default `LassoCV` tries 100 different values for $\alpha$, so let's plot the MSE against $\alpha$:

The Lasso model performs best for the lowest $\alpha$.

Inspect the relative feature importance by looking at the (absolute) coefficients:
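Putting the pieces together, a sketch of `LassoCV` on synthetic data (attribute names such as `alpha_`, `mse_path_` and `coef_` follow the scikit-learn API):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# LassoCV tries 100 alphas by default and picks the one with the lowest CV MSE
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

best_alpha = lasso.alpha_
mse_path = lasso.mse_path_        # shape (n_alphas, n_folds): plot its mean vs lasso.alphas_
importance = np.abs(lasso.coef_)  # absolute coefficients as feature importance
n_zero = np.sum(lasso.coef_ == 0) # larger alpha -> more coefficients exactly zero
```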

### KNN
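A sketch of a k-nearest-neighbours regressor with the same cross-validation setup, on synthetic data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 3))
y = 0.5 * X[:, 0] + rng.normal(0, 10, size=500)

# KNN predicts by averaging the targets of the k nearest training points
knn = KNeighborsRegressor(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)
```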

### RandomForest regression

*consider this as a sneak preview for the next lecture, where we will use RandomForest for classification*

That looks promising: out-of-the-box RandomForest performs better than KNN (but slightly worse than Lasso). Like Lasso, we can inspect feature importance:
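A sketch on synthetic stand-in data; note that, unlike the Lasso coefficients, `feature_importances_` are impurity-based importances that sum to 1:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

# Out-of-the-box random forest: sensible defaults, little tuning needed
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature, normalized to sum to 1
importance = forest.feature_importances_
```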

## Discussion & summary

### For discussion

- Which model would you choose to predict `t1_eq_vas`, and why?
- Reflect on the differences in feature importance between SelectKBest, the Lasso coefficients, and RandomForest
- See, for example, this discussion

### Summary

We have considered three basic machine learning algorithms:

- (Simple) linear regression
- LASSO
- KNN

A simple regression is often used as a baseline model, whereas LASSO and KNN are expected to have better performance. An obvious starting strategy would be to:

- Run a regression to get a baseline performance
- Run LASSO and compare performance
- With the variables selected by LASSO, run KNN and compare performance