1. Importing libraries and downloading the dataset
2. Transformation of features
Defining variable types
First, let's separate the categorical and numerical features in our dataset.
Our separation seems to work fine. Let's now handle each type of feature we have detected.
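One way to do this is by dtype (a minimal sketch; the DataFrame name `df` is an assumption for the data loaded in step 1):

```python
import pandas as pd

# df is assumed to be the housing DataFrame loaded in step 1
numerical_features = df.select_dtypes(include=["number"]).columns.tolist()
categorical_features = df.select_dtypes(include=["object"]).columns.tolist()

print(len(numerical_features), "numerical features,",
      len(categorical_features), "categorical features")
```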
Handling numerical features
Missing values
We can see that LotFrontage, MasVnrArea, and GarageYrBlt have some missing values. Let's fill the null values with the mean of the corresponding feature.
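A sketch of the imputation (assuming the `df` and `numerical_features` names from the earlier sketch):

```python
# Fill each feature's missing entries with that feature's own mean
for col in ["LotFrontage", "MasVnrArea", "GarageYrBlt"]:
    df[col] = df[col].fillna(df[col].mean())

# Confirm that no numerical feature has nulls left
print(df[numerical_features].isnull().sum().sum())  # expect 0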
Now none of the numerical features has null values.
Checking for multicollinearity
Multicollinearity occurs when independent variables are correlated with each other, which lowers the quality of our model. It can obscure the importance of individual features and distort the coefficients of the regression model we are going to use later, so let's plot all of these features and remove the multicollinearity.
Since we want to measure how the independent variables influence each other, let's drop the dependent variable 'SalePrice' before building the correlation matrix.
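One possible way to build the matrix and list the highly correlated pairs (a sketch; the 0.8 cutoff is an assumption, not necessarily the notebook's exact threshold):

```python
# Correlation matrix of the independent numerical variables
corr = df[numerical_features].drop(columns=["SalePrice"]).corr()

# Report pairs whose absolute correlation exceeds the (assumed) 0.8 cutoff
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            print(a, b, round(corr.loc[a, b], 2))
```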
Thus, the highly inter-correlated variables are:
This means these variables carry essentially the same information as other features, so they can be deleted.
One Hot Encoding
One-hot encoding is the process of converting categorical variables into binary indicator columns so they can be provided to machine learning algorithms. Let's apply it to our categorical features to improve predictions.
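For instance, with pandas (a sketch; `get_dummies` is one common way to do this, assuming the `categorical_features` list from earlier):

```python
# Each categorical column becomes a set of 0/1 indicator columns
df = pd.get_dummies(df, columns=categorical_features)
print(df.shape)
```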
Outliers
As our aim is to optimize RMSLE, we can ignore outliers.
In the case of RMSE, the presence of outliers can explode the error term to a very high value. But in the case of RMSLE, the logarithm drastically scales outliers down, largely nullifying their effect.
Creating the dataset to work with
Split into train and test data
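A minimal sketch of the split (the 80/20 ratio and the random_state are assumptions, not necessarily the notebook's exact settings):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```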
Feature normalization
Bringing features onto the same scale
Rescaling brings features to comparable ranges, and normalizing some features can improve model performance.
As we now have only numeric values (int64, uint8 and float64), we can apply rescaling to all of the data.
Let's apply min-max normalization to our dataset.
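A sketch using scikit-learn's MinMaxScaler (fitting on the training data only and reusing the fitted scaler for the test data is an assumed implementation choice):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit the scaler on the training data only, then reuse it for the test data
X_train_mm = scaler.fit_transform(X_train)
X_test_mm = scaler.transform(X_test)
```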
Our data is now prepared (X_train_mm, X_test_mm), so let's use it for choosing feature subsets and for further analysis.
RMSLE
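A standard implementation sketch of the metric (the clipping of negative predictions is an added safety assumption, since a linear model can occasionally predict below zero):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error."""
    # Clip negative predictions to 0 so the logarithm stays defined
    # (an assumption: house prices are positive, so this only guards
    # against degenerate model outputs)
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```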
3. Finding a suitable subset of features
Modeling
Let's see how our model performs without feature normalization. We drop 'Id' because it is only a record identifier and won't give us any useful insights.
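A sketch of the baseline fit and evaluation (it assumes the X_train/X_test split and the rmsle helper defined above):

```python
from sklearn.linear_model import LinearRegression

features = X_train.columns.drop("Id")  # 'Id' carries no predictive signal

model = LinearRegression()
model.fit(X_train[features], y_train)

print("Train RMSLE:", rmsle(y_train, model.predict(X_train[features])))
print("Test RMSLE:", rmsle(y_test, model.predict(X_test[features])))
```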
RMSLE is usually used when you don't want to heavily penalize huge differences between predicted and true values when both are large numbers. In these cases only the relative (percentage) differences matter, since the difference of logarithms can be rewritten as the logarithm of a ratio: log(p + 1) - log(a + 1) = log((p + 1)/(a + 1)).
On our train dataset the RMSLE is 0.177437, which is not a bad result.
On our test dataset the RMSLE is higher, which means the model performs better on the training data.
Model after feature normalization
Lasso
Lasso regression is a model that uses an L1 penalty, which promotes sparsity. It can be used for feature selection because it shrinks the coefficients of uninformative features to 0.
Let's choose alpha = 10. When alpha is 0, Lasso regression produces the same coefficients as ordinary linear regression; when alpha is very large, all coefficients become zero.
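A sketch of the fit (X_train_mm is the min-max-scaled matrix from earlier; reading the feature names back from X_train assumes the scaler preserved column order, which it does):

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=10)
lasso.fit(X_train_mm, y_train)

# Features whose coefficients were shrunk exactly to 0 are candidates to drop
zeroed = X_train.columns[lasso.coef_ == 0].tolist()
print("Zeroed features:", zeroed)
```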
Let's check how the performance of our model changes when we drop these two columns.
K Best
SelectKBest (ideally k should be selected using cross-validation, but for now let's say k = 30).
Let's randomly take 10 of the features we just selected by K score.
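One way this could look (using f_regression as the score function is an assumption; the notebook may use a different one):

```python
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=30)
X_train_k = selector.fit_transform(X_train_mm, y_train)
X_test_k = selector.transform(X_test_mm)

# Names of the 30 selected features, recovered from the original columns
selected = X_train.columns[selector.get_support()].tolist()
print(selected)
```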
Let's check the performance of our model using linear regression
4. PCA
PCA is used to reduce the number of features in the dataset and thereby simplify the learning model.
Let's create a loop that tries different numbers of components and find the best-performing model by checking RMSLE on the train and test data.
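A sketch of such a loop (the search range of 1 to 40 components is an assumption; it covers the 33 reported below):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

best_n, best_score = None, float("inf")
for n in range(1, 41):  # assumed search range
    pca = PCA(n_components=n)
    X_tr = pca.fit_transform(X_train_mm)
    X_te = pca.transform(X_test_mm)

    model = LinearRegression().fit(X_tr, y_train)
    score = rmsle(y_test, model.predict(X_te))
    if score < best_score:
        best_n, best_score = n, score

print("Best number of components:", best_n, "Test RMSLE:", best_score)
```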
Here we can see that the more components PCA keeps, the lower the RMSLE, which means the model performs better.
The best performance of our model was with 33 components:
Train RMSLE: 0.1922395958900697
Test RMSLE: 0.20160758496416994
Discussion
Here we can see all of our results.
The best result of applying PCA was with 33 components:
Train RMSLE: 0.1922395958900697
Test RMSLE: 0.20160758496416994
We can note that on the train dataset the PCA result (0.1922395958900697) is greater than nearly every other value (0.17743789941261642, 0.17743789941194163, 0.1768844465501454), except the K-score run (0.19385109462355005). That is not good behavior for the model, although the difference is not very noticeable.
It's also interesting that on the test data PCA has one of the smallest RMSLE values (after the K-score subset).
We get the best results on the test dataset using the features chosen by K score, namely: 'TotalBsmtSF', 'YearBuilt', 'Fireplaces', 'RoofStyle_Gable', 'BedroomAbvGr', 'MasVnrArea', 'Foundation_PConc', 'FullBath', '2ndFlrSF', 'GrLivArea'.
We cannot unambiguously determine which of the methods worked best for us. The RMSLE was fairly low for every method, which means that none of the models performed badly.