Computing Bias and Variance
Tushar Choudhary - 2019111019
Suyash Vardhan Mathur - 2019114006
We were given two datasets: the training set contained 8000 pairs (xᵢ, yᵢ) and the testing set contained 80. The data was loaded using the pickle.load() function and shuffled using np.random.shuffle(). Next, we iterated through polynomials of degree 1 to 20, splitting the training data into 10 parts to create 10 different models for each degree. The models were generated using PolynomialFeatures and LinearRegression, and the mean prediction of all 10 models was taken as the expected prediction on the test data. The different errors were then calculated and tabulated, and a graph of error vs. degree was drawn for bias², variance, and total error.
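The pipeline above can be sketched as follows. This is a minimal stand-in, not the report's actual code: np.polyfit replaces the PolynomialFeatures + LinearRegression pair, and the data is synthetic since the pickled files are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the pickled data (the report loads 8000 training
# pairs and 80 test pairs with pickle.load and shuffles with np.random.shuffle).
x_train = rng.uniform(-3, 3, 8000)
y_train = x_train**3 - 2 * x_train + rng.normal(0, 1, 8000)
x_test = rng.uniform(-3, 3, 80)

degree = 3
# Split the shuffled training data into 10 parts -> 10 models per degree.
preds = []
for xs, ys in zip(np.array_split(x_train, 10), np.array_split(y_train, 10)):
    coeffs = np.polyfit(xs, ys, degree)       # least-squares polynomial fit
    preds.append(np.polyval(coeffs, x_test))  # predictions on the test inputs
preds = np.array(preds)                       # shape (10, 80)

# The mean over the 10 models is the expected prediction E[f'(x)].
expected_pred = preds.mean(axis=0)
```

The same loop is repeated for each degree from 1 to 20 to produce the tabulated errors.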
Linear regression for a polynomial of degree n tries to form a relationship between the independent variable x and the dependent variable y, modelled as an nth-degree polynomial. LinearRegression().fit() fits a linear model of a polynomial of degree p with coefficients w = (w₁, w₂, ..., wₚ) and constant w₀ so as to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation. The number of coefficients is equal to the degree p of the polynomial, along with a constant term w₀.
Thus, the function fits a linear model with coefficients w₁, w₂, ..., wₚ such that the residual sum of squares between the predicted targets and the observed targets in the dataset is minimum, i.e. it minimizes
RSS = Σᵢ ( yᵢ − (w₀ + w₁xᵢ + w₂xᵢ² + ... + wₚxᵢᵖ) )² over w₀, w₁, ..., wₚ.
Code for LinearRegression().fit() -
model = LinearRegression() initializes the LinearRegression object for use in our code. model.fit(x_poly, current_y) fits a model with training data x_poly, the polynomial-transformed input of the given degree, and current_y as the target values for that training data.
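A minimal sketch of these calls, using toy data (the array names follow the report, but the inputs and targets here are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degree = 3
x = np.linspace(-1, 1, 100).reshape(-1, 1)  # toy inputs (assumption)
current_y = 2 * x.ravel() ** 2 + 1          # toy quadratic targets (assumption)

# Transform x into the feature columns [1, x, x^2, ..., x^degree].
x_poly = PolynomialFeatures(degree).fit_transform(x)

model = LinearRegression()    # initialize the estimator
model.fit(x_poly, current_y)  # fit w0..wp by minimizing the RSS

y_pred = model.predict(x_poly)
```

Since the toy targets are exactly quadratic and the degree-3 feature set contains the quadratic term, the fit recovers the targets up to floating-point precision.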
Bias is the difference between the average prediction of our model and the actual value we are trying to predict. It arises from erroneous assumptions in the learning algorithm. An underfit model (too low complexity) typically has high bias, whereas an overfit model (too high complexity) typically has low bias. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
The formula for the bias of a model:
Bias = abs( E[f'(x)] - f(x) )
Where f(x) is the actual data and f'(x) is the approximated value of f(x) from our model. E[f'(x)] is obtained by taking the mean of f'(x) over all models.
Variance is the variability of the model prediction for a given data point. It is how much the predictions for a given point vary between different realizations of the model, that is, the amount that the estimate of the target function will change if different training data was used.
The formula for the variance of a model:
Variance = E[ ( f'(x) - E[f'(x)] )² ]
Code for bias and variance
The values for bias and variance for each degree have been tabulated in Task 3.
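Given the matrix of per-model predictions, the bias and variance formulas above translate directly into numpy. The arrays below are toy placeholders standing in for the real test targets and model outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy stand-ins: y_true is f(x) on 80 test points, preds holds f'(x)
# from each of the 10 models (both are assumptions for illustration).
y_true = np.sin(np.linspace(0, 3, 80))
preds = y_true + rng.normal(0, 0.1, (10, 80))

mean_pred = preds.mean(axis=0)                      # E[f'(x)] over the models
bias = np.abs(mean_pred - y_true)                   # |E[f'(x)] - f(x)| per point
variance = ((preds - mean_pred) ** 2).mean(axis=0)  # E[(f'(x) - E[f'(x)])^2]

# One number per degree for the table: average over the test points.
avg_bias_sq = float((bias ** 2).mean())
avg_variance = float(variance.mean())
```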
Irreducible error is the error that cannot be reduced by creating better models; it is the error due to noise in the data. Since this error arises from the noise in our data, and the data remains constant across the different complexities (degrees of polynomial) that we are using, it remains constant. As can be seen in the table below, the irreducible error is of the same order (nearly 0) for all the different complexities used, and thus remains unchanged. This error is always non-negative in value. [There is a deviation of 10⁻¹¹ from 0 in the values of irreducible error, due to the limits of floating-point precision in Python computations.]
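The irreducible error falls out of the standard decomposition MSE = bias² + variance + irreducible error. A sketch with noise-free toy targets shows why solving for the noise term leaves only floating-point residue, mirroring the near-zero values in the table:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy per-model predictions around a noise-free f(x) (assumed values).
y_true = np.sin(np.linspace(0, 3, 80))
preds = y_true + rng.normal(0, 0.1, (10, 80))

mean_pred = preds.mean(axis=0)
mse = ((preds - y_true) ** 2).mean(axis=0)          # E[(f'(x) - f(x))^2]
bias_sq = (mean_pred - y_true) ** 2
variance = ((preds - mean_pred) ** 2).mean(axis=0)

# MSE = bias^2 + variance + irreducible error; with noise-free targets the
# remainder is pure floating-point residue, not genuine noise.
irreducible = mse - bias_sq - variance
```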
The Error vs Degree graph for bias², variance, and total error has been plotted below -
From the above graph and tabulated values, we observe:
- With an increase in the complexity of the model (degree of the polynomial), the bias decreases and the variance increases.
- The minimum of the total error occurs at degree 3; thus, the most appropriate model for the given dataset is of degree 3.
- The model has the best fit at complexity 3.
- For complexities lower than 3, the model is underfit, as it oversimplifies the features during training and thus has a high bias.
- For complexities higher than 3, we can see that the variance and the total error increase. Thus, we can say that the model is overfit for complexities greater than three. This is because the model tries to memorize features too specific to the training data, which aren't generalizable, and thus performs poorly on the testing dataset, increasing the variance in predictions.
- In an ideal case, the bias should have decreased monotonically with the increase in complexity. However, due to overfitting and the model learning the noise in the data, we can see a slight increase in bias at higher complexities.
- For complexities < 3, bias² >> variance, so the total error curve is closer to the bias² curve, whereas after complexity 3, bias² < variance, so the total error curve lies much closer to the variance curve.
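The claim that the total error bottoms out at degree 3 can be checked with a loop like the following. This is a sketch on synthetic cubic data, with np.polyfit standing in for the sklearn pipeline; warnings from the poorly conditioned high-degree fits are silenced.

```python
import warnings
import numpy as np

rng = np.random.default_rng(3)
# Synthetic cubic dataset standing in for the report's data (assumption).
x_train = rng.uniform(-3, 3, 8000)
y_train = x_train**3 - 2 * x_train + rng.normal(0, 2, 8000)
x_test = np.linspace(-3, 3, 80)
y_test = x_test**3 - 2 * x_test

total_error = []
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # high-degree fits are ill-conditioned
    for degree in range(1, 21):
        # 10 models per degree, one per training split.
        preds = np.array([
            np.polyval(np.polyfit(xs, ys, degree), x_test)
            for xs, ys in zip(np.array_split(x_train, 10),
                              np.array_split(y_train, 10))
        ])
        mean_pred = preds.mean(axis=0)
        bias_sq = float(((mean_pred - y_test) ** 2).mean())
        variance = float(((preds - mean_pred) ** 2).mean())
        total_error.append(bias_sq + variance)

best_degree = int(np.argmin(total_error)) + 1  # degree with the minimum error
```

On this toy cubic data, degrees 1 and 2 carry a large bias² term, so the minimum lands at degree 3 or just above it, matching the shape of the plotted curves.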
Lower-degree polynomials were underfitting in nature, and underfit models perform well on neither the training nor the testing set. Such models fail to extract the features of the dataset and cannot capture the underlying trend of the data. For lower-degree polynomials, we observe underfit models with high bias, as they oversimplify the features of the data. In the graph, models at complexities less than 3 are underfit, with high bias and low variance.
Overfitting occurs as a result of memorization: with increased model complexity (essentially the degree of the polynomial), the model extracts too much information from the training set and works well on it. The model learns patterns too specific to the training data that do not hold on the testing data, i.e. it isn't generalizable. Thus, due to extracting too much from the training set, the model may perform poorly on data it has not seen before. This is reflected in the increase in variance with an increase in complexity, and the model is said to be overfitting. Higher-degree polynomials are usually prone to overfitting. Since increased model complexity here means more maxima and minima in the fitted polynomial, the goodness of fit is disturbed and the bias also increases towards the end. Thus, memorization leads to incorrect assumptions in the learning algorithm, which increases the error.
- The data corresponds to a degree-3 polynomial, as we can see that complexity 3 is the minimum of the total error.
- There is minimal noise in the data, as the value of irreducible error is of the order of 10⁻¹¹ in the results.
Given below is the graph for the datasets