Background to osteoarthritis case study
This is day 3 from the 5-day JADS NHS PROMs data science case study. To recap from the previous lectures, we have looked at defining the outcome of knee replacement.
Lecture 1:
- Good outcome for knee replacement Y is measured using the difference in Oxford Knee Score (OKS)
- Research has shown that an improvement in OKS score of approx. 30% is relevant (van der Wees 2017). Hence an increase of +14 points is considered a 'good' outcome.
- To account for the ceiling effect, a high final `t1_oks_score` is also considered a good outcome (even if `delta_oks_score` is smaller than 14)
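The outcome rule above can be sketched in pandas. The +14 cut-off is from the lectures; the cut-off for a "high" final `t1_oks_score` was defined in lecture 1 and is not restated here, so the value below is a hypothetical placeholder:

```python
import pandas as pd

# Toy data; column names follow the case study (t0/t1 = pre/post-operation)
df = pd.DataFrame({
    "t0_oks_score": [20, 35, 10],
    "t1_oks_score": [36, 44, 20],
})
df["delta_oks_score"] = df["t1_oks_score"] - df["t0_oks_score"]

HIGH_FINAL_OKS = 41  # hypothetical ceiling-effect cut-off, not from the lectures

# Good outcome: improvement of at least 14 points, OR a high final score
df["good_outcome"] = (df["delta_oks_score"] >= 14) | (df["t1_oks_score"] >= HIGH_FINAL_OKS)
```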
Lecture 2:
- We have constructed a combined outcome parameter using cut-off points for pain and physical functioning.
Modeling: regression and linear modeling
Learning objectives
For this lecture we are going to build various predictors for `t1_eq_vas`, i.e. the reported quality of life after the operation. `t1_eq_vas` is measured on a scale from 0 to 100.
We are going to use stratified sampling to ensure we don't introduce sampling bias, using `StratifiedShuffleSplit`. This is different from the simpler `train_test_split` function, which is a random sampling method. Random sampling is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing significant sampling bias.
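As a minimal sketch of the difference, the split below stratifies on a categorical attribute so that its proportions are preserved in train and test sets (the column names are illustrative, not the actual PROMs columns):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy dataframe; in practice you would stratify on an attribute such as an age band
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "t0_eq_vas": rng.integers(0, 101, size=200),
    "age_band": rng.choice(["50-59", "60-69", "70-79"], size=200, p=[0.2, 0.5, 0.3]),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(split.split(df, df["age_band"]))
train_set, test_set = df.iloc[train_idx], df.iloc[test_idx]

# The strata proportions are (approximately) preserved in both sets
print(train_set["age_band"].value_counts(normalize=True).round(2))
```

With plain `train_test_split` the same proportions are only matched in expectation, which can matter for small datasets.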
By the end of this lecture you should know:
- Know how to fit different regression models in Python using scikit-learn
- Know how to interpret and assess regression models, including the bias-variance trade-off
Python: Hands-on Machine Learning (2nd edition)
scikit-learn
Data preparation in a scikit-learn Pipeline
Previously we have already discussed the various steps in data preparation using pandas. As explained in the documentation of scikit-learn, this may be problematic for one of the following reasons:
- Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.
- You may want to include the parameters of the preprocessors in a parameter search.
To this purpose, the `ColumnTransformer` class has been added to scikit-learn. The documentation gives an example of how to use it for pre-processing mixed types. Historically, `sklearn` transformers were designed to work with numpy arrays, not with pandas dataframes. You can use `sklearn-pandas` to bridge this gap, or use `ColumnTransformer` directly on pandas DataFrames. We will use the latter.
Using ColumnTransformer and Pipelines
Recalling from the second lecture, we want to perform the following preprocessing per (group of) columns. If a feature requires more than one preprocessing step, the use of a `Pipeline` is recommended.
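The pattern can be sketched as follows. The column names and preprocessing choices are illustrative stand-ins for the actual PROMs columns; the point is that a multi-step feature gets its own `Pipeline`, which is then wired into a `ColumnTransformer`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column groups; the actual PROMs column names may differ
num_cols = ["t0_eq_vas", "t0_oks_score"]
cat_cols = ["gender"]

# Numeric features need two steps, so they get their own Pipeline
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

df = pd.DataFrame({
    "t0_eq_vas": [70, None, 55],
    "t0_oks_score": [20, 35, 10],
    "gender": ["M", "F", "F"],
})
X = preprocess.fit_transform(df)  # 2 scaled numeric + 2 one-hot columns
```

Because all statistics (medians, means, category lists) are learned in `fit`, the same `preprocess` object can later be applied to test data without leakage.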
Passing 1D or 2D arrays in your Pipeline
It is important to remember that scikit-learn can be quite fussy about the difference between passing 1D arrays/series and 2D arrays/dataframes. For example, the following code will result in an error because `categories` needs to be a list of lists:
```python
enc = OrdinalEncoder(categories=age_band_categories)
enc.fit(df[age_band])
```
The correct code is (note the brackets!):

```python
enc = OrdinalEncoder(categories=[age_band_categories])
enc.fit(df[age_band])
```
Beware: difference between `OrdinalEncoder` and `OneHotEncoder`
An integer representation generated with `OrdinalEncoder` cannot be used directly with all scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.
Another possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K, also known as one-hot or dummy encoding. This type of encoding can be obtained with the OneHotEncoder, which transforms each categorical feature with n_categories possible values into n_categories binary features, with one of them 1, and all others 0.
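A minimal side-by-side sketch of the two encoders (on a toy column, not the case-study data):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({"gender": ["M", "F", "F", "M"]})

# OrdinalEncoder: one integer column; F -> 0, M -> 1 (categories sorted
# alphabetically), which implies an order F < M that does not really exist
ord_enc = OrdinalEncoder()
ord_vals = ord_enc.fit_transform(df[["gender"]]).ravel()

# OneHotEncoder: one binary column per category, no implied order
oh_enc = OneHotEncoder()
oh_vals = oh_enc.fit_transform(df[["gender"]]).toarray()
```

Here `ord_vals` is a single column of 0/1 codes, while `oh_vals` has `n_categories` (here 2) binary columns with exactly one 1 per row.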
Writing custom transformers (advanced, see Géron chapter 2)
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need to do is create a class and implement three methods: fit() (returning self), transform(), and fit_transform().
When writing transformers for data preparation, you only need to define `transform()`. Basically, `ColumnTransformer` passes only the selected subset of columns from the original dataframe to the transformer. So when writing your own transformer you don't need to do any subsetting: you can assume that `transform()` is applied to the whole dataframe it receives.
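As a sketch, a custom transformer might look like this (the outlier-clipping task is an invented example; inheriting from `BaseEstimator` and `TransformerMixin` gets you `fit_transform()` and parameter handling for free):

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ClipOutliers(BaseEstimator, TransformerMixin):
    """Illustrative transformer: clip values to percentiles learned in fit().

    ColumnTransformer hands us only the selected columns, so transform()
    simply operates on everything it receives -- no subsetting needed.
    """

    def __init__(self, lower=1, upper=99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column percentile bounds from the training data
        self.bounds_ = np.percentile(X, [self.lower, self.upper], axis=0)
        return self  # fit() must return self

    def transform(self, X):
        return np.clip(np.asarray(X), self.bounds_[0], self.bounds_[1])

df = pd.DataFrame({"t0_eq_vas": [0, 50, 55, 60, 100]})
clipped = ClipOutliers(lower=10, upper=90).fit_transform(df)
```

Because the bounds are learned in `fit()` only, applying the fitted transformer to test data does not leak test statistics.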
Training and assessing linear models
Simple regression
Regression of `t1_eq_vas` ~ `t0_eq_vas`. We don't get our hopes up, since the scatterplot is all over the place:
So this very first, basic model yields an $R^2$ of 0.12, which is not very exciting. Let's do a more robust cross-validation using the Mean Squared Error (MSE) as our metric.
This confirms that a simple linear model has little flexibility (and high bias): the scores for the five CV folds are very similar. Now that we have seen the simplest setup for a univariate linear regression, let's try to find out which features are the best predictors.
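The setup above can be sketched as follows, on synthetic stand-in data with a weak linear signal (the real notebook uses the PROMs columns; note that `cross_val_score` maximises, hence the `neg_` scorer and sign flip):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: weak linear signal plus a lot of noise
rng = np.random.default_rng(0)
X = rng.integers(0, 101, size=(500, 1)).astype(float)     # t0_eq_vas-like
y = 50 + 0.3 * X.ravel() + rng.normal(0, 20, size=500)    # t1_eq_vas-like

lin_reg = LinearRegression().fit(X, y)
r2 = lin_reg.score(X, y)
print("R^2:", round(r2, 2))

# 5-fold cross-validation on MSE; take the root for an interpretable scale
scores = cross_val_score(lin_reg, X, y, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)
print("RMSE per fold:", rmse.round(1))
```

Similar fold-to-fold RMSE values are the signature of a high-bias, low-variance model.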
SelectKBest
For regression tasks, you often want to get a first idea of which features contain the most information, i.e. are the best predictors. There are various techniques to answer this question, such as stepwise selection. Scikit-learn has various univariate feature selection methods for this purpose. We will use SelectKBest.
Using 10 KBest features, the model performs slightly better with RMSE of 16.1 +/- 0.08
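A minimal sketch of `SelectKBest` with the `f_regression` score function, on synthetic data where only two of twenty features carry signal:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))
y = 3 * X[:, 0] - 2 * X[:, 5] + rng.normal(size=300)  # only features 0 and 5 matter

# Keep the 10 features with the highest univariate F-score against y
selector = SelectKBest(score_func=f_regression, k=10)
X_best = selector.fit_transform(X, y)
chosen = selector.get_support(indices=True)
print("selected feature indices:", chosen)
```

Being univariate, this scores each feature in isolation, so it can miss features that are only informative in combination.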
Regularized linear models: Lasso regression
Construct a Lasso regression with cross-validation, following the example from the scikit-learn documentation. Recall that for regularized linear regression a penalty term is added to the cost function. In the case of Lasso this is

$$ J(\Theta) = \mathrm{MSE}(\Theta) + \alpha\sum\limits_{i=1}^n \mid \Theta_{i} \mid $$

The larger $\alpha$, the larger the penalty and hence the more coefficients will be set to zero. By default `LassoCV` tries 100 different values for $\alpha$, so let's plot MSE against $\alpha$:
The Lasso model performs best for the lowest $\alpha$.
Inspect the relative feature importance by looking at the (absolute) coefficients:
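The `LassoCV` workflow can be sketched like this (synthetic stand-in data; scaling matters because the L1 penalty is applied to raw coefficient sizes):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=300)  # features 0 and 1 matter

# Scale features so the L1 penalty treats them comparably
X_scaled = StandardScaler().fit_transform(X)

# LassoCV tries (by default) 100 alphas and picks the best by CV MSE;
# lasso.mse_path_ holds the per-fold MSE for the alpha-vs-MSE plot
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
print("best alpha:", lasso.alpha_)

# Relative feature importance: absolute coefficients, largest first
importance = pd.Series(np.abs(lasso.coef_)).sort_values(ascending=False)
print(importance.head())
```

Unlike `SelectKBest`, the Lasso coefficients are fitted jointly, so correlated features share (or lose) their weight.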
KNN
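As a minimal sketch (again on synthetic stand-in data), a K-nearest-neighbours regressor is cross-validated in exactly the same way as the linear model:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=300)  # non-linear signal

# Predict each point as the mean of its 10 nearest neighbours
knn = KNeighborsRegressor(n_neighbors=10)
scores = cross_val_score(knn, X, y, cv=5, scoring="neg_mean_squared_error")
rmse = np.sqrt(-scores)
print("RMSE per fold:", rmse.round(2))
```

`n_neighbors` directly controls the bias-variance trade-off: small values give a flexible, high-variance model, large values a smoother, higher-bias one.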
RandomForest regression
Consider this a sneak preview of the next lecture, where we will use RandomForest for classification.
That looks promising: out-of-the-box RandomForest performs better than KNN (but slightly worse than Lasso). Like Lasso, we can inspect feature importance:
Discussion & summary
For discussion
- Which model would you choose to predict `t1_eq_vas` and why?
- Reflect on the differences in feature importance between SelectKBest, Lasso coefficients and RandomForest
- See, for example, this discussion
Summary
We have considered three basic machine learning algorithms:
- (Simple) linear regression
- LASSO
- KNN
A simple regression is often used as a baseline model, whereas LASSO and KNN are expected to have better performance. An obvious starting strategy would be to:
- Run a regression to get a baseline performance
- Run LASSO and compare performance
- With the selected variables in your LASSO, run KNN and compare performance