# Assignment 2 - DAT405

Emil Josefsson & Tuyen Ngo

## Time spent per person:

Emil Josefsson: 13 h

Tuyen Ngo: 13 h

### 1a) Linear Regression Model: Living area to selling price

To find a linear regression model that relates living area to selling price, we use the `sklearn.linear_model` package. The package fits the model with the least squares method, which is sensitive to outliers and can therefore affect accuracy negatively.
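A minimal sketch of fitting such a model, assuming hypothetical `area` and `price` arrays in place of the actual dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: living area (m^2) and selling price (SEK);
# the assignment uses the provided housing dataset instead.
area = np.array([[60.0], [85.0], [110.0], [140.0], [200.0]])
price = np.array([3.4e6, 3.9e6, 4.3e6, 5.0e6, 6.1e6])

# Ordinary least squares fit: minimizes the sum of squared residuals.
model = LinearRegression()
model.fit(area, price)
```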

### 1b) Values of slope and intercept of the regression line

**Slope**: retrieved using the `coef_` getter on the model = `19370.13854733` (SEK/m²)

**Intercept**: retrieved using the `intercept_` getter on the model = `2220603.24335587` (SEK)
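With these values, the fitted line is `price = slope * area + intercept`; evaluating it by hand is a quick sanity check of the predictions reported below:

```python
slope = 19370.13854733        # SEK per m^2, from `coef_`
intercept = 2220603.24335587  # SEK, from `intercept_`

# Evaluate the fitted line price = slope * area + intercept at 100 m^2.
price_100 = slope * 100 + intercept
print(round(price_100, 2))  # -> 4157617.1
```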

### 1c) Predicting selling prices

Predicting the selling prices can be done using the `.predict()` function of the model.

- **100 m²**: predicted to cost `4157617.09808903` SEK
- **150 m²**: predicted to cost `5126124.02545561` SEK
- **200 m²**: predicted to cost `6094630.95282218` SEK

### 1d) Residual plot

Residuals are calculated as the observed value minus the predicted value: r_{i} = y_{i} - f(x_{i})
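A minimal sketch of how the residuals and the residual plot can be produced, with hypothetical stand-in data (the notebook uses the actual housing dataset):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in data for illustration.
area = np.array([[60.0], [85.0], [110.0], [140.0], [200.0]])
price = np.array([3.4e6, 3.9e6, 4.3e6, 5.0e6, 6.1e6])

model = LinearRegression().fit(area, price)

# r_i = y_i - f(x_i): observed price minus predicted price.
residuals = price - model.predict(area)

plt.scatter(area, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Living area (m$^2$)")
plt.ylabel("Residual (SEK)")
plt.show()
```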

### 1e) Results discussion

- We see from the residual plot that there are many data points around the middle and above 0. The pattern does not look uniformly random, but rather resembles an inverted U-shape.
- The data points are also very spread out in the lower-right part, which indicates that the model has high variance.

Due to the two points made above, we are not sure that linear regression is the best model. However, these residual points are probably affected by outliers in the given dataset. The linear regression might have had a tighter fit if we excluded outliers such as villas with large living areas sold at a much lower price than predicted. We are not sure whether this type of sale is considered "normal" – maybe the owners chose to sell at a much lower price to someone they knew. On the other hand, it could be a completely normal sale where the villas in question had a worse location and/or were in bad condition due to age and lack of care. We do not take these variables into account, so we can't be confident whether we should consider these data points as outliers (with respect to selling price) or not.

Thus, we would hesitate to use this as a model to predict future selling prices.
Even if a linear regression *might* fit for this situation, there's still high
variance as seen from the residual plot above—we lack statistical significance with
this model. And it is reasonable that we don't get a perfect correlation between
selling price and living area, since there are more factors affecting the selling
price than just the living area.

However, if we *had to* improve the model, we would remove the outliers, even if that is
questionable as discussed above. Otherwise (and much preferably), we would include
more dimensions in the data, such as distance from the city center, or the time since
the last big renovation. That could shed light on why some properties are sold at a
much lower price despite a larger living area.
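As a sketch of what that extension could look like in `sklearn`, with entirely made-up values for the hypothetical extra features:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix: [living area (m^2), distance to city
# center (km), years since last big renovation] -- made-up numbers.
X = np.array([
    [60.0,   2.0,  5.0],
    [85.0,   8.0, 20.0],
    [110.0,  1.5,  2.0],
    [140.0, 12.0, 35.0],
    [200.0, 15.0, 40.0],
])
y = np.array([3.4e6, 3.9e6, 5.1e6, 4.6e6, 5.5e6])

model = LinearRegression().fit(X, y)
# One coefficient per feature instead of a single slope.
print(model.coef_)
```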

### 2a) Confusion matrix to evaluate the use of logistic regression for classification

The use of logistic regression to classify the iris data set yielded an accuracy score of **roughly 87%**. We can see in the confusion matrix that
everything was predicted correctly except for *Iris virginica*, which was instead commonly misclassified as *Iris versicolor*.
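A sketch of how the score and confusion matrix can be produced; the split parameters and `random_state` here are assumptions, so the exact numbers may differ from ours:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# Fit a multinomial logistic regression classifier on the training split.
clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows = true class, columns = predicted class.
print(confusion_matrix(y_test, pred))
print(accuracy_score(y_test, pred))
```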

### 2b) K-nearest neighbours to classify iris data

We test different values of K, alternating between *uniform* weights and *distance-based* weights. The chosen K-values range from small to large, because we want to see whether scaling K has any extreme effects. We plot the confusion matrices below, where:

- Every row has the same K-value,
- The first column uses uniform weights, and
- The second column uses distance-based weights.

We also summarise the different accuracy scores in a table, below the confusion matrices.
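The sweep described above can be sketched as follows; the split parameters (`test_size`, `random_state`) are assumptions and may differ from our notebook:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# One accuracy score per (K, weighting scheme) combination.
scores = {}
for k in [3, 6, 25, 50, 100]:
    for weights in ["uniform", "distance"]:
        clf = KNeighborsClassifier(n_neighbors=k, weights=weights)
        clf.fit(X_train, y_train)
        scores[(k, weights)] = clf.score(X_test, y_test)
print(scores)
```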

#### Results discussion

We tried two types of weights (uniform and distance-based) with 5 different Ks: 3, 6, 25, 50, and 100. There are two notable findings from the results:

- **Uniform weights**: larger K -> lower accuracy.
- **Distance-based weights**: the accuracy is always the highest yielded, no matter which K we choose.

For the model with uniform weights, we expect this behavior to be a result of underfitting: as K approaches the size of the training set, predictions are increasingly dominated by the overall class frequencies rather than by the local neighbourhood.

For the model with distance-based weights, the accuracies are always the highest possible, because data points closer to the test point "have a higher influence" (according to the docs) than those farther away. So even if we increase K by a lot, only the closest points carry significant weight, which yields better classification accuracy.

The results we obtain are reasonable, and we can make sense of them by looking at the scatter plots for the data set's three
dimensions below (sepal length, sepal width, and petal length). The data points for *Iris virginica* and *Iris versicolor*
intersperse with each other in both scatter plots, and increasing K makes it more difficult for the model to classify
a *virginica* or a *versicolor* correctly. Furthermore, *virginica* is even more spread out among the *versicolor* data points
in the second scatter plot (sepal width vs sepal length), which can explain why *virginica* is the most misclassified class, and
most often misclassified as *versicolor*. With a large K we get underfitting, since the model takes too many data points into
account and simply picks the most frequent class in the neighbourhood. However, by weighting the data points inversely by their
distance from the query point, the distance-based model classifies the test data points better.

### 2c) Comparing the different classification models

*Note: we will refer backwards to the confusion matrices, and the table from 2b),
since we already calculated those.*

Results-wise, K-nearest neighbors with a small K had the best performance (97%) when the weights were uniform. With distance-based weights, any K would have worked, but computationally a smaller K is probably preferable, since the calculations then take fewer unnecessary points into account. Using logistic regression we achieved an accuracy score of 87%, as the model made 5 incorrect classifications. K-nearest neighbors wins this round with an accuracy score of 97%, since it only made one classification error when K was small, or when it weighted the data points by distance.

It's worth mentioning that we only split the data once at the start of this notebook (75% for the model to train on, 25% to test on), and then used the same training set and test set throughout all of our models. We do not rotate which part of the data serves as the test set, as one does in cross-validation. If we could do cross-validation, as demonstrated in StatQuest, we could obtain a score for each model and be more confident about which model is the better one.
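As a sketch, `cross_val_score` from `sklearn.model_selection` performs exactly this rotation; the models and fold count here are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# 5-fold cross-validation: every sample is used for testing exactly once,
# giving a mean score per model instead of one lucky/unlucky split.
for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=200)),
    ("3-NN (uniform)", KNeighborsClassifier(n_neighbors=3)),
]:
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    print(name, scores.mean())
```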

### 3) Importance of separate test set (and sometimes validation set)

If one were to use the same training data as test data, it is often possible to get a perfect accuracy score (100%) on the model. This is simply because the model is evaluated on the exact same data it was trained on, and it has already seen those values. When using a separate test dataset, we can find the true error (the accuracy score is lower), since the model knows nothing about the test data. Thus, the model can be evaluated correctly.
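This can be demonstrated with a 1-nearest-neighbor classifier, which memorises its training data (a sketch, not taken from our notebook):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

# 1-NN memorises the training set: each training point is its own
# nearest neighbour, so training accuracy is 100% by construction.
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print(clf.score(X_train, y_train))  # 1.0: "evaluating" on seen data
print(clf.score(X_test, y_test))    # honest estimate on unseen data
```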

The same can be said for a validation set: information about the test data can "leak into the model" through our choice of hyperparameters, so we need more unseen data to properly evaluate the model. For instance, we might now choose K-nearest neighbors with a small K and uniform weights due to our results from 2b) – a choice influenced by the results on the test data while trying out different hyperparameters. Thus, hyperparameters should be tuned on a separate validation set, so that one final set of unseen data – the test set – remains for evaluating the model.