# TODO

## From feedback

- ~~In the ridge section, you say that lambda is chosen by considering R^2, but in the code it seems like it might be RMSE which is used. Could you perhaps be more clear here?~~
- For ridge, the plot of the mean RMSE score as a function of λ is very flat - could that indicate a mistake?
- Not sure what to do here

- The standard errors used in the plot should be the standard error of the mean, that has to be calculated.
- ~~The text and numbers on the plots are a bit small, and titles etc. would be nice.~~
- It was a bit hard to follow the results section, would be nice if you elaborated a bit more there
- Added a bit more

- It is hard to follow this part of the introduction: "Our goal is to look at the performance of different linear models on this dataset, but because of the size of the dataset shrinkage methods will not yield any benefits. We have therefore opted to split the dataset 5%/95% into a training and testing set." (Why would not shrinkage yield benefits? What is the connection to the choice of train/test-split?)
- Have added a bit more detail here

- The train/test split of 95/5 gives a very small test set, perhaps consider using a larger test set (or explain why 95/5 was chosen)
- Similar to above

## Other

- ~~Resolve font config issues:~~ `findfont: Font family ['normal'] not found. Falling back to DejaVu Sans`

- Verify lambda in plots not disappearing in exported pdf
- Make sure to rerun everything before exporting, current cell ordering is non-linear

# Insurance price analysis

## Introduction

This is an analysis of medical insurance cost in the United States.
We are using a dataset from *Machine Learning with R* by Brett Lantz.
The dataset contains 1338 rows and 6 covariates, with no missing data.
The dataset is available at https://github.com/stedy/Machine-Learning-with-R-datasets

- **age:** The age of the primary beneficiary, 18+
- **sex:** The gender of the insurance contractor, male/female
- **bmi:** Body mass index of the primary beneficiary
- **children:** Number of children covered by the insurance
- **smoker:** Is the primary beneficiary a smoker? Yes/no
- **region:** Beneficiary's residential location in the US: northeast/northwest/southeast/southwest
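Before fitting, the three categorical covariates have to be turned into numeric columns. A minimal sketch, using a hypothetical two-row slice of the data (the rows are made up for illustration; `pd.get_dummies` with `drop_first=True` is one common way to do the encoding):

```python
import pandas as pd

# Hypothetical two-row slice of the dataset, for illustration only.
df = pd.DataFrame({
    "age": [19, 33],
    "sex": ["female", "male"],
    "bmi": [27.9, 22.7],
    "children": [0, 1],
    "smoker": ["yes", "no"],
    "region": ["southwest", "northwest"],
})

# One-hot encode the categorical covariates; drop_first avoids a
# redundant (perfectly collinear) dummy column per category.
X = pd.get_dummies(df, columns=["sex", "smoker", "region"], drop_first=True)
```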

Using the covariates above we will aim to predict the yearly **charges** by the insurance company in USD.
Our goal is to compare the performance of different linear models on this dataset.
Because the dataset is large relative to its 6 covariates, we do not expect shrinkage methods to outperform plain least squares: with this many observations the best linear fit is already well determined, so there is little estimation variance for shrinkage to reduce.
We have therefore opted to split the dataset 5%/95% into a training and a testing set.
By using less training data, we increase the probability of overfitting for the model without shrinkage, which should give an advantage to the models with shrinkage. In preliminary experiments we found no benefit from shrinkage when we used a larger portion of the dataset for training, so we chose this split to achieve more instructive results.
We will only be using the testing set at the very end to see how the different models perform.
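The split itself is a one-liner. The sketch below uses a synthetic stand-in for the 1338-row design matrix, since only the shapes matter here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the 1338-row design matrix; only shapes matter here.
X = rng.normal(size=(1338, 6))
y = rng.normal(size=1338)

# 5% of the data for training, 95% held out for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.05, random_state=42
)
```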

We can see that there is relatively little pairwise correlation between the continuous covariates. Note that this plot would not reveal correlation between categorical and continuous covariates.
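The matrix behind such a plot can be computed directly; here is a sketch using synthetic stand-ins for the three continuous covariates (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for the continuous covariates, for illustration only.
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=1338),
    "bmi": rng.normal(30, 6, size=1338),
    "children": rng.integers(0, 6, size=1338),
})

# Pearson correlation between each pair of continuous covariates;
# this is the matrix a pairwise-correlation heatmap visualizes.
corr = df.corr()
```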

## Analysis

We will be using 4 different linear models to predict insurance charges.

**Multiple Linear Regression:** First we will set a baseline performance with plain MLR. $\hat{\beta} = \arg\min_{\beta} \frac{1}{N}\lVert\mathbf{y} - X\beta\rVert_2^2$

**Ridge Regression:** $\hat{\beta} = \arg\min_{\beta} \frac{1}{N}\lVert\mathbf{y} - X\beta\rVert_2^2 + \lambda\lVert \beta \rVert_2^2$

**Lasso Regression:** $\hat{\beta} = \arg\min_{\beta} \frac{1}{N}\lVert\mathbf{y} - X\beta\rVert_2^2 + \lambda\lVert \beta \rVert_1$

**Group Lasso Regression:** $\hat{\beta} = \arg\min_{\beta} \frac{1}{N}\lVert\mathbf{y} - \sum_{j=1}^J X_j\beta_j\rVert_2^2 + \lambda \sum_{j=1}^J\lVert \beta_j \rVert_{K_j}, \qquad \lVert z\rVert_{K_j} = (z^T K_j z)^{1/2}$

We will follow the same procedure for all the shrinkage methods with respect to finding optimal parameters and coefficient uncertainty. We use cross validation to find the optimal regularization parameter for each method, and then rerun this optimization on multiple bootstrap samples to estimate the uncertainty of the coefficients.
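A sketch of this procedure, using Ridge as the example estimator on synthetic data (the grid, fold count, and bootstrap count are all illustrative, not the values used in the analysis):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic data, for illustration only.
X = rng.normal(size=(200, 6))
y = X @ np.array([1.0, 0.0, 2.0, 0.0, 0.0, -1.0]) + rng.normal(size=200)
lambdas = np.logspace(-3, 3, 13)

def fit_optimal(X, y):
    """Cross-validate the regularization strength, return refit coefficients."""
    gs = GridSearchCV(Ridge(), {"alpha": lambdas}, cv=5,
                      scoring="neg_root_mean_squared_error")
    gs.fit(X, y)
    return gs.best_estimator_.coef_

# Rerun the whole optimization on bootstrap resamples to gauge
# the uncertainty of each coefficient.
n = len(y)
boot_coefs = np.array([fit_optimal(X[idx], y[idx])
                       for idx in (rng.integers(0, n, size=n) for _ in range(20))])
coef_sd = boot_coefs.std(axis=0, ddof=1)  # per-coefficient spread
```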

## Multiple linear regression

We start by fitting a baseline multiple linear regression model, and plotting the residuals.

From the plot, the residuals do not look normally distributed around 0, so the assumptions of the linear model do not seem to hold. We nevertheless fit linear models to the dataset and see what performance we can achieve.
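A sketch of the residual computation on synthetic data (with deliberately non-constant noise, mimicking the kind of pattern our plot shows):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Synthetic data with deliberately heteroscedastic noise, for illustration.
X = rng.normal(size=(66, 6))
y = X @ rng.normal(size=6) + rng.normal(size=66) * (1 + np.abs(X[:, 0]))

ols = LinearRegression().fit(X, y)
fitted = ols.predict(X)
residuals = y - fitted
# Plotting `residuals` against `fitted` is how we check the
# normality/constant-variance assumption visually.
```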

## Ridge

Now let's try Ridge regression.

We find the optimal value of $\lambda$ by grid search in log space. We use cross validation with 5 folds, and evaluate the performance by using the RMSE score.
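A sketch of this grid search on synthetic data of roughly our training-set size (grid bounds and resolution are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
# Synthetic data of roughly our training-set size, for illustration.
X = rng.normal(size=(66, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.5, size=66)

lambdas = np.logspace(-3, 3, 25)  # grid in log space
mean_rmse = []
for lam in lambdas:
    scores = cross_val_score(Ridge(alpha=lam), X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    mean_rmse.append(-scores.mean())  # flip the sign back to an RMSE

best_lambda = lambdas[int(np.argmin(mean_rmse))]
```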

We see that the optimal value of $\lambda$ is $4.7$, giving an RMSE score of $8192$.

Let's plot the mean RMSE score as a function of $\lambda$, to see how it varies.

The RMSE score remains almost constant regardless of the value of $\lambda$, except for very high values. This suggests that we won't benefit much from Ridge regression on this dataset.

Let's see how the value of the coefficients decreases as $\lambda$ increases. We also plot a vertical black line on the optimal $\lambda$ value we found earlier.
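The coefficient path can be computed by simply refitting Ridge across the grid; a sketch on synthetic data (the dominant first coefficient is an assumption made for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
# Synthetic data with one dominant coefficient, for illustration.
X = rng.normal(size=(66, 6))
y = X @ np.array([5.0, 0.5, 0.0, 1.0, 0.0, -0.5]) + rng.normal(size=66)

lambdas = np.logspace(-2, 4, 30)
# Row i holds the fitted coefficients at lambdas[i]; each column is the
# shrinkage path of one covariate (what the plot draws against lambda).
path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
```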

The `is_smoker` variable appears to be very important for the model: it has a large coefficient and is the last to shrink to zero as $\lambda$ increases.

We can also use bootstrapping to get an estimate of the range of the coefficients.

## Lasso

Let's now try Lasso regression instead.

We find the optimal value of $\lambda$ by grid search in log space. We use cross validation with 5 folds, and evaluate the performance by using the RMSE score.
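For Lasso, the same search can be delegated to scikit-learn's `LassoCV`; a sketch on a synthetic sparse signal (data and grid are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
# Synthetic sparse signal, for illustration only.
X = rng.normal(size=(66, 6))
y = X @ np.array([5.0, 0.0, 0.0, 1.0, 0.0, 0.0]) + rng.normal(size=66)

# LassoCV runs the 5-fold grid search over the supplied log-space grid.
lasso = LassoCV(alphas=np.logspace(-3, 1, 25), cv=5).fit(X, y)
best_lambda = lasso.alpha_
```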

We see that the optimal value of $\lambda$ is $0.062$, giving an RMSE score of $7650$.

Let's plot the mean RMSE score as a function of $\lambda$, to see how it varies.

We still see a lot of uncertainty in the RMSE, no matter the value of $\lambda$.

Let's see how the value of the coefficients decreases as $\lambda$ increases. We also plot a vertical black line on the optimal $\lambda$ value we found earlier.

We again see that the `is_smoker` variable is the most important one for the model.

We will use bootstrapping again to get an estimate of the range of the coefficients.

## Group Lasso

Let's now try Group Lasso regression.

Group Lasso has two separate $\lambda$ values: one for the "regular" covariates and one for the grouped covariates. We find the optimal pair by grid search in log space over both. Because Group Lasso has two hyperparameters, we reduce the computational burden by using a smaller search space. We use cross validation with 5 folds, and evaluate the performance by using the RMSE score.
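The two-dimensional grid can be built with `itertools.product`. The sketch below only constructs the (deliberately small) search space; the estimator and its parameter names mentioned in the comment (e.g. `GroupLasso(group_reg=lam_group, l1_reg=lam_plain)` from the `group-lasso` package) are an assumption, not necessarily what the analysis used:

```python
import numpy as np
from itertools import product

# Deliberately small log-space grids for the two penalties,
# to keep the 2-D search affordable.
group_lambdas = np.logspace(-2, 2, 5)  # penalty on the grouped covariates
plain_lambdas = np.logspace(-2, 2, 5)  # penalty on the "regular" covariates

# Every (group, plain) pair gets its own 5-fold CV run; the pair with the
# lowest mean RMSE wins.
search_space = list(product(group_lambdas, plain_lambdas))
```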

## Results

We look at the performance of the different models on the test set. Firstly, it is clear that all shrinkage methods generalize much better than MLR, as their test RMSE is much lower than that of MLR. Secondly, the Ridge model performs well, even though its cross-validation RMSE was much worse than that of either lasso method. Thirdly, although our model selection suggests that Group Lasso should be the model that generalizes best, it underperforms on the test set compared to regular Lasso.

Our hypothesis is that Group Lasso is more prone to overfitting because it has more hyperparameters to optimize: it can fit the noise in the dataset more closely without being penalized as much as regular Lasso. Cross validation may aggravate this, since it reduces the effective size of an already small training set, making overfitting even more of an issue.