Assignment 2 - DAT405
Emil Josefsson & Tuyen Ngo
Time spent per person:
Emil Josefsson: 13 h
Tuyen Ngo: 13 h
1a) Linear Regression Model: Living area to selling price
To fit a linear regression model that relates living area to selling price, we use the sklearn.linear_model package. The package uses the ordinary least squares method, which is sensitive to outliers and can therefore affect accuracy negatively on this dataset.
1b) Values of slope and intercept of the regression line
Slope: retrieved using the coef_ attribute of the model =
Intercept: retrieved using the intercept_ attribute of the model =
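A minimal sketch of the fitting step and how the slope and intercept are read off. The (area, price) numbers below are made up for illustration; the real dataset is loaded elsewhere in the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical (living area in m2, selling price) pairs -- illustration only.
areas = np.array([[50.0], [75.0], [100.0], [120.0], [150.0], [200.0]])
prices = np.array([2.0e6, 2.8e6, 3.5e6, 4.0e6, 4.9e6, 6.2e6])

model = LinearRegression().fit(areas, prices)

slope = model.coef_[0]        # slope of the regression line
intercept = model.intercept_  # intercept of the regression line
```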
1c) Predicting selling prices
Predicting the selling prices can be done using the model's .predict() method.
100 m2: predicted to cost
150 m2: predicted to cost
200 m2: predicted to cost
1d) Residual plot
Residuals are calculated by subtracting the predicted value from the actual value: ri = yi - f(xi)
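The prediction and residual steps above can be sketched together. Again, the training data is made up for illustration, so the predicted prices are not the notebook's actual numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data -- illustration only.
areas = np.array([[50.0], [75.0], [100.0], [120.0], [150.0], [200.0]])
prices = np.array([2.0e6, 2.8e6, 3.5e6, 4.0e6, 4.9e6, 6.2e6])
model = LinearRegression().fit(areas, prices)

# 1c) predict selling prices for 100, 150 and 200 m2
query = np.array([[100.0], [150.0], [200.0]])
predicted = model.predict(query)

# 1d) residuals on the training data: r_i = y_i - f(x_i)
residuals = prices - model.predict(areas)
# plotting residuals against areas would give the residual plot
```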
1e) Results discussion
- We see from the residual plot that many data points cluster around the middle and above 0. The pattern does not look uniformly random, but somewhat resembles an inverted U-shape.
- The data points are also very spread out in the lower-right part of the plot, which indicates that the residuals there have high variance.
Due to the two points made above, we are not sure whether linear regression is the best model. However, these residual points are probably affected by outliers in the given dataset. The linear regression might have had a tighter fit if we excluded outliers such as villas with large living areas sold at a much lower price than predicted. We are not sure whether this type of sale is considered "normal" – maybe the owners chose to sell at a much lower price to someone they knew. On the other hand, it could be a completely normal sale, and the villas in question may simply have had a worse location and/or been in bad condition due to age and lack of care. We do not take these variables into account, so we cannot be confident whether these data points should be treated as outliers (with respect to selling price) or not.
Thus, we would not be confident using this model to predict future selling prices. Even if a linear regression might fit this situation, there is still high variance, as seen in the residual plot above, so the model's predictions carry large uncertainty. It is also reasonable that we do not get a perfect correlation between selling price and living area, since more factors than living area alone affect the selling price.
If we had to improve the model, we could remove the outliers, even though this is questionable as discussed above. A much preferable alternative would be to add more features to the data, such as distance from the city centre or time since the last major renovation. That could shed light on why some properties sell at a much lower price despite a larger living area.
2a) Confusion matrix to evaluate the use of logistic regressions for classifications
The use of logistic regression to classify the iris data set yielded an accuracy score of roughly 87%. The confusion matrix shows that everything was predicted correctly except for Iris virginica, which was often misclassified as Iris versicolor.
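A sketch of how the classifier and confusion matrix can be produced. The random_state and max_iter values are our own assumptions, so the accuracy will not necessarily match the 87% from the notebook's split.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# 75/25 split as in the notebook; the seed here is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
acc = accuracy_score(y_test, y_pred)
```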
2b) K-nearest neighbours to classify iris data
We test different values of K, alternating between uniform weights and distance-based weights. The K-values are chosen somewhat arbitrarily, ranging from small to large, because we want to see whether scaling K has any extreme effects. We plot the confusion matrices below where:
- Every row has the same K-value,
- The first column uses uniform weights, and
- The second column uses distance-based weights.
We also summarise the different accuracy scores in a table, below the confusion matrices.
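The experiment grid described above can be sketched as a double loop over K-values and weight schemes. The K-values are the ones from the report; the split seed is an assumption, so the scores may differ from the table.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Accuracy for every (K, weights) combination in the experiment.
scores = {}
for k in (3, 6, 25, 50, 100):
    for weights in ("uniform", "distance"):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
        knn.fit(X_train, y_train)
        scores[(k, weights)] = knn.score(X_test, y_test)
```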
We tried two types of weights (uniform and distance-based) with 5 different Ks: 3, 6, 25, 50, and 100. There are two notable findings from the results:
- Uniform weights: Larger K -> lower accuracy
- Distance-based weights: The accuracy stays at the highest observed level, no matter which K we choose.
For the model with uniform weights, we believe this behaviour is a result of underfitting; as K approaches the size of the training set, the prediction tends towards the overall majority class regardless of the query point.
For the model with distance-based weights, the accuracies are always the highest observed, because data points closer to the test point "have a greater influence" (according to the docs) than those farther away. So even if we increase K substantially, only the closest points matter much, which yields better classification accuracy.
The results we obtain are reasonable, and we can make sense of them by looking at the scatter plots for three of the data set's dimensions below (sepal length, sepal width, and petal length). The data points for Iris virginica and Iris versicolor are interspersed in both scatter plots, and increasing K makes it harder for the model to classify a virginica or a versicolor correctly. Furthermore, virginica seems even more spread out among the versicolor points in the second scatter plot (sepal width vs sepal length), which may explain why virginica is the most misclassified class, most often as versicolor. With a large K the model underfits, since it takes too many data points into account and simply counts the most frequent class in the neighbourhood. By weighting the data points inversely by their distance from the query point, the distance-based model can classify the test point better.
2c) Comparing the different classification models
Note: we will refer back to the confusion matrices and the table from 2b), since we already calculated those.
Results-wise, K-nearest neighbours with a small K had the best performance (97%) when the weights were uniform. With distance-based weights, any K could have worked, but a smaller K is probably cheaper to compute since fewer points need to be considered. Using logistic regression, we achieved an accuracy score of 87%, as the model made 5 incorrect classifications. K-nearest neighbours wins this round with an accuracy score of 97%, since it made only one classification error when K was small or when it weighted the data points by distance.
It is worth mentioning that we only split the data once at the start of this notebook (75% for the model to train on, 25% to test on), and then used the same training and test sets throughout all of our models. We are not rotating which part of the data serves as the test set, as one does when cross-validating. If we performed cross-validation as demonstrated in StatQuest, we could obtain an average score for each model and be more confident about which model is the better one.
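A sketch of what that cross-validation could look like with sklearn's cross_val_score; the fold count and the K=3 choice are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: each model is trained and scored five times,
# with a different fifth of the data held out each time.
log_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)

# Comparing the mean scores gives a fairer model comparison than one split.
```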
3) Importance of separate test set (and sometimes validation set)
If one were to use the same training data as test data, the model can achieve a deceptively high (even perfect, 100%) accuracy score. This is simply because the model is fed the exact same data it was trained on, and it has already seen those values. When using a separate test dataset, we obtain an honest estimate of the error (the accuracy score is lower), since the model knows nothing about the test data. Thus, the model can be evaluated fairly.
The same reasoning applies to a validation set: information about the test data can "leak into the model" through our choices of hyperparameters, so we need more unseen data to properly evaluate the model. For instance, we might now choose K-nearest neighbours with a small K and uniform weights based on our results from 2b) – our choice has been influenced by the test-data results obtained while trying out different hyperparameters. Thus, hyperparameters should be tuned on a separate validation set, so that the test set remains untouched for the final evaluation.
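The three-way split can be sketched with two successive train_test_split calls; the exact proportions (roughly 60/15/25) are our own illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out a held-out test set (25% of the data), reserved for the
# final evaluation only.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Then split the remainder into training and validation sets; the
# validation set is used for hyperparameter tuning (e.g. choosing K).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=0)
```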