1. Importing libraries and downloading the dataset
2. Transformation of features
Defining variable types
First, let's separate the categorical and numerical features in our dataset.
Our separation seems to work fine. Let's now handle each type of feature we have detected.
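One way to do this is by dtype (a minimal sketch; the DataFrame name `df` is an assumption for the data loaded in step 1):

```python
import pandas as pd

# df is assumed to be the housing DataFrame loaded in step 1
numerical_features = df.select_dtypes(include=["number"]).columns.tolist()
categorical_features = df.select_dtypes(include=["object"]).columns.tolist()

print(len(numerical_features), "numerical features,",
      len(categorical_features), "categorical features")
```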
Handling numerical features
Missing values
We can see that LotFrontage, MasVnrArea, and GarageYrBlt have some missing values. Let's fill the null values with the mean of the corresponding feature.
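A sketch of the imputation (assuming the `df` and `numerical_features` names from the earlier sketch):

```python
# Fill each feature's missing entries with that feature's own mean
for col in ["LotFrontage", "MasVnrArea", "GarageYrBlt"]:
    df[col] = df[col].fillna(df[col].mean())

# Confirm that no numerical feature has nulls left
print(df[numerical_features].isnull().sum().sum())  # expect 0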
Now none of the numerical features has null values.
Checking for multicollinearity
Multicollinearity occurs when independent variables are correlated with each other, which lowers the quality of our model. It can obscure the importance of individual features and distort the coefficients of the regression model we are going to use later, so let's plot all of these features and remove the multicollinearity.
Since we want to measure how the independent variables influence each other, let's drop the dependent variable 'SalePrice' before building the correlation matrix.
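One possible way to build the matrix and list the highly correlated pairs (a sketch; the 0.8 cutoff is an assumption, not necessarily the notebook's exact threshold):

```python
# Correlation matrix of the independent numerical variables
corr = df[numerical_features].drop(columns=["SalePrice"]).corr()

# Report pairs whose absolute correlation exceeds the (assumed) 0.8 cutoff
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(a, b, round(corr.loc[a, b], 2))
```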
Thus, the highly inter-correlated variables are:
This means these variables carry essentially the same information as other features, so they can be deleted.
One Hot Encoding
One-hot encoding is the process of converting categorical variables into binary indicator columns so they can be provided to machine learning algorithms. Let's apply it to our categorical features to improve predictions.
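For instance, with pandas (a sketch; `get_dummies` is one common way to do this, assuming the `categorical_features` list from earlier):

```python
# Each categorical column becomes a set of 0/1 indicator columns
df = pd.get_dummies(df, columns=categorical_features)
print(df.shape)
```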
Outliers
As our aim is to optimize RMSLE, we can ignore outliers.
In the case of RMSE, the presence of outliers can explode the error term to a very high value. But in the case of RMSLE, the logarithm drastically scales outliers down, largely nullifying their effect.
Creating the dataset to work with
Split into train and test data
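A minimal sketch of the split (the 80/20 ratio and the random_state are assumptions, not necessarily the notebook's exact settings):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```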
Feature normalization
Bringing features onto the same scale
Rescaling brings features to comparable ranges, and normalizing some features can improve model performance.
As we now have only numeric values (int64, uint8 and float64), we can apply rescaling to all of the data.
Let's apply min-max normalization to our dataset.
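A sketch using scikit-learn's MinMaxScaler (fitting on the training data only and reusing the fitted scaler for the test data is an assumed implementation choice):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training data only, then reuse it for the test data
X_train_mm = scaler.fit_transform(X_train)
X_test_mm = scaler.transform(X_test)
```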
Our data is now prepared (X_train_mm, X_test_mm), so let's use it for choosing feature subsets and for further analysis.
RMSLE
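A standard implementation sketch of the metric (the clipping of negative predictions is an added safety assumption, since a linear model can occasionally predict below zero):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error."""
    # Clip negative predictions to 0 so the logarithm stays defined
    # (an assumption: house prices are positive, so this only guards
    # against degenerate model outputs)
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```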
3. Finding a suitable subset of features
Modeling
Let's see how our model performs without feature normalization. We drop 'Id' because it is only a record identifier and won't give us any useful insights.
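A sketch of the baseline fit and evaluation (it assumes the X_train/X_test split and the rmsle helper defined above):

```python
from sklearn.linear_model import LinearRegression

features = X_train.columns.drop("Id")  # 'Id' carries no predictive signal

model = LinearRegression()
model.fit(X_train[features], y_train)

print("Train RMSLE:", rmsle(y_train, model.predict(X_train[features])))
print("Test RMSLE:", rmsle(y_test, model.predict(X_test[features])))
```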
RMSLE is usually used when you don't want to heavily penalize huge differences between predicted and true values when both are large numbers. In these cases only the relative (percentage) differences matter, since the difference of logarithms can be rewritten as the logarithm of a ratio: log(p + 1) - log(a + 1) = log((p + 1)/(a + 1)).
On our train dataset the RMSLE is 0.177437, which is not a bad result.
On our test dataset the RMSLE is higher, which means the model performs better on the training data.
Model after feature normalization
Lasso
Lasso regression is a model that uses an L1 penalty, which promotes sparsity. It can be used for feature selection because it shrinks the coefficients of uninformative features to 0.
Let's choose alpha = 10. When alpha is 0, Lasso regression produces the same coefficients as ordinary linear regression; when alpha is very large, all coefficients become zero.
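A sketch of the fit (X_train_mm is the min-max-scaled matrix from earlier; reading the feature names back from X_train assumes the scaler preserved column order, which it does):

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=10)
lasso.fit(X_train_mm, y_train)

# Features whose coefficients were shrunk exactly to 0 are candidates to drop
zeroed = X_train.columns[lasso.coef_ == 0].tolist()
print("Zeroed features:", zeroed)
```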
Let's check how the performance of our model changes when we drop these two columns.
K Best
SelectKBest (ideally k should be selected using cross-validation, but for now let's say k = 30).
Let's randomly take 10 of the features we just selected by K score.
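One way this could look (using f_regression as the score function is an assumption; the notebook may use a different one):

```python
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=30)
X_train_k = selector.fit_transform(X_train_mm, y_train)
X_test_k = selector.transform(X_test_mm)

# Names of the 30 selected features, recovered from the original columns
selected = X_train.columns[selector.get_support()].tolist()
print(selected)
```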
Let's check the performance of our model using linear regression
4. PCA
PCA is used to reduce the number of features in the dataset and thereby simplify the learning model.
Let's create a loop that tries different numbers of components and find the best-performing model by checking RMSLE on the train and test data.
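A sketch of such a loop (the search range of 1 to 40 components is an assumption; it covers the 33 reported below):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

best_n, best_score = None, float("inf")
for n in range(1, 41):  # assumed search range
    pca = PCA(n_components=n)
    X_tr = pca.fit_transform(X_train_mm)
    X_te = pca.transform(X_test_mm)

    model = LinearRegression().fit(X_tr, y_train)
    score = rmsle(y_test, model.predict(X_te))
    if score < best_score:
        best_n, best_score = n, score

print("Best number of components:", best_n, "Test RMSLE:", best_score)
```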
Here we can see that the more components PCA keeps, the lower the RMSLE, which means the model performs better.
The best performance of our model was with 33 components:
Train RMSLE: 0.1922395958900697
Test RMSLE: 0.20160758496416994
Discussion
Here we can see all of our results.
The best result of applying PCA was with 33 components:
Train RMSLE: 0.1922395958900697
Test RMSLE: 0.20160758496416994
We can note that on the train dataset the PCA result (0.1922395958900697) is greater than nearly every other value (0.17743789941261642, 0.17743789941194163, 0.1768844465501454), except the K-score run (0.19385109462355005). That is not good behavior for the model, although the difference is not very noticeable.
It's also interesting that on the test data PCA has one of the smallest RMSLE values (after the K-score subset).
We get the best results on the test dataset using the features chosen by K score, namely: 'TotalBsmtSF', 'YearBuilt', 'Fireplaces', 'RoofStyle_Gable', 'BedroomAbvGr', 'MasVnrArea', 'Foundation_PConc', 'FullBath', '2ndFlrSF', 'GrLivArea'.
We cannot unambiguously determine which of the methods worked best for us. The RMSLE was fairly low for every method, which means that none of the models performed badly.