1. Importing libraries and downloading the dataset
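The outputs below came from a setup cell along these lines. This is a minimal sketch, assuming the Kaggle "House Prices: Advanced Regression Techniques" train.csv is available locally; the file path and the preview call are assumptions, and the frame summarised below appears to already have a couple of raw columns removed.

```python
# Minimal setup sketch; the file path is an assumption.
import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')

# Quick preview of the first rows (Id and MSSubClass), then the full column summary.
print(df[['Id', 'MSSubClass']].head(10))
df.info()
```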
   Id  MSSubClass
0   1          60
1   2          20
2   3          60
3   4          70
4   5          60
5   6          50
6   7          20
7   8          60
8   9          50
9  10         190
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 79 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MoSold 1460 non-null int64
75 YrSold 1460 non-null int64
76 SaleType 1460 non-null object
77 SaleCondition 1460 non-null object
78 SalePrice 1460 non-null int64
dtypes: float64(3), int64(34), object(42)
memory usage: 901.2+ KB
2. Transformation of features
Defining variable types
First, let's separate the categorical and numerical features in our data.
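A minimal sketch of this split, assuming the raw frame is called df; select_dtypes separates the columns by dtype.

```python
# Separate numerical and categorical columns by dtype (sketch; df is assumed from above).
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()

print('Numerical Features are:', numerical_features)
print('Categorical Features are:', categorical_features)
```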
Numerical Features are: ['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MoSold', 'YrSold', 'SalePrice']
Categorical Features are: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'SaleType', 'SaleCondition']
Our separation seems to work fine. Let's now handle each feature type we have detected in turn.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 42 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSZoning 1460 non-null object
1 Street 1460 non-null object
2 Alley 91 non-null object
3 LotShape 1460 non-null object
4 LandContour 1460 non-null object
5 Utilities 1460 non-null object
6 LotConfig 1460 non-null object
7 LandSlope 1460 non-null object
8 Neighborhood 1460 non-null object
9 Condition1 1460 non-null object
10 Condition2 1460 non-null object
11 BldgType 1460 non-null object
12 HouseStyle 1460 non-null object
13 RoofStyle 1460 non-null object
14 RoofMatl 1460 non-null object
15 Exterior1st 1460 non-null object
16 Exterior2nd 1460 non-null object
17 MasVnrType 1452 non-null object
18 ExterQual 1460 non-null object
19 ExterCond 1460 non-null object
20 Foundation 1460 non-null object
21 BsmtQual 1423 non-null object
22 BsmtCond 1423 non-null object
23 BsmtExposure 1422 non-null object
24 BsmtFinType1 1423 non-null object
25 BsmtFinType2 1422 non-null object
26 Heating 1460 non-null object
27 HeatingQC 1460 non-null object
28 CentralAir 1460 non-null object
29 Electrical 1459 non-null object
30 KitchenQual 1460 non-null object
31 Functional 1460 non-null object
32 FireplaceQu 770 non-null object
33 GarageType 1379 non-null object
34 GarageFinish 1379 non-null object
35 GarageQual 1379 non-null object
36 GarageCond 1379 non-null object
37 PavedDrive 1460 non-null object
38 PoolQC 7 non-null object
39 Fence 281 non-null object
40 SaleType 1460 non-null object
41 SaleCondition 1460 non-null object
dtypes: object(42)
memory usage: 479.2+ KB
Handling numerical features
Missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotFrontage 1201 non-null float64
3 LotArea 1460 non-null int64
4 OverallQual 1460 non-null int64
5 OverallCond 1460 non-null int64
6 YearBuilt 1460 non-null int64
7 YearRemodAdd 1460 non-null int64
8 MasVnrArea 1452 non-null float64
9 BsmtFinSF1 1460 non-null int64
10 BsmtFinSF2 1460 non-null int64
11 BsmtUnfSF 1460 non-null int64
12 TotalBsmtSF 1460 non-null int64
13 1stFlrSF 1460 non-null int64
14 2ndFlrSF 1460 non-null int64
15 LowQualFinSF 1460 non-null int64
16 GrLivArea 1460 non-null int64
17 BsmtFullBath 1460 non-null int64
18 BsmtHalfBath 1460 non-null int64
19 FullBath 1460 non-null int64
20 HalfBath 1460 non-null int64
21 BedroomAbvGr 1460 non-null int64
22 KitchenAbvGr 1460 non-null int64
23 TotRmsAbvGrd 1460 non-null int64
24 Fireplaces 1460 non-null int64
25 GarageYrBlt 1379 non-null float64
26 GarageCars 1460 non-null int64
27 GarageArea 1460 non-null int64
28 WoodDeckSF 1460 non-null int64
29 OpenPorchSF 1460 non-null int64
30 EnclosedPorch 1460 non-null int64
31 3SsnPorch 1460 non-null int64
32 ScreenPorch 1460 non-null int64
33 PoolArea 1460 non-null int64
34 MoSold 1460 non-null int64
35 YrSold 1460 non-null int64
36 SalePrice 1460 non-null int64
dtypes: float64(3), int64(34)
memory usage: 422.2 KB
We can see that LotFrontage, MasVnrArea, and GarageYrBlt have some missing values. Let's fill the null values with the mean of the corresponding feature.
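A minimal sketch of the mean imputation, assuming the numerical subset is stored in a DataFrame called num_df.

```python
# Fill missing values with each column's mean (sketch; num_df is assumed from above).
for col in ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']:
    num_df[col] = num_df[col].fillna(num_df[col].mean())

num_df.info()
```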
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotFrontage 1460 non-null float64
3 LotArea 1460 non-null int64
4 OverallQual 1460 non-null int64
5 OverallCond 1460 non-null int64
6 YearBuilt 1460 non-null int64
7 YearRemodAdd 1460 non-null int64
8 MasVnrArea 1460 non-null float64
9 BsmtFinSF1 1460 non-null int64
10 BsmtFinSF2 1460 non-null int64
11 BsmtUnfSF 1460 non-null int64
12 TotalBsmtSF 1460 non-null int64
13 1stFlrSF 1460 non-null int64
14 2ndFlrSF 1460 non-null int64
15 LowQualFinSF 1460 non-null int64
16 GrLivArea 1460 non-null int64
17 BsmtFullBath 1460 non-null int64
18 BsmtHalfBath 1460 non-null int64
19 FullBath 1460 non-null int64
20 HalfBath 1460 non-null int64
21 BedroomAbvGr 1460 non-null int64
22 KitchenAbvGr 1460 non-null int64
23 TotRmsAbvGrd 1460 non-null int64
24 Fireplaces 1460 non-null int64
25 GarageYrBlt 1460 non-null float64
26 GarageCars 1460 non-null int64
27 GarageArea 1460 non-null int64
28 WoodDeckSF 1460 non-null int64
29 OpenPorchSF 1460 non-null int64
30 EnclosedPorch 1460 non-null int64
31 3SsnPorch 1460 non-null int64
32 ScreenPorch 1460 non-null int64
33 PoolArea 1460 non-null int64
34 MoSold 1460 non-null int64
35 YrSold 1460 non-null int64
36 SalePrice 1460 non-null int64
dtypes: float64(3), int64(34)
memory usage: 422.2 KB
Now none of the numerical features has null values.
Checking for multi-collinearity
Multicollinearity occurs when independent variables are strongly correlated with each other, which lowers the quality of our model. We can therefore plot the correlations between these features and remove the redundant ones. Multicollinearity can obscure the importance of a feature and distort the coefficients of the regression model we are going to use later.
We want to detect how the independent variables influence each other, so let's drop the dependent variable 'SalePrice' before creating the correlation matrix.
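A sketch of the correlation check, assuming the imputed numerical subset num_df from above; the seaborn heatmap and the 0.8 threshold are assumptions used here for illustration.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the independent numerical variables (SalePrice excluded).
corr = num_df.drop(columns=['SalePrice']).corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap='coolwarm', center=0)
plt.show()

# List feature pairs whose absolute correlation exceeds 0.8 (threshold is an assumption).
pairs = [(a, b) for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:] if abs(corr.loc[a, b]) > 0.8]
print(pairs)
```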
Thus, highly inter-correlated variables are:
GarageYrBlt and YearBuilt
TotRmsAbvGrd and GrLivArea
1stFlrSF and TotalBsmtSF
GarageArea and GarageCars
That means these variables give us essentially the same information as other features, so they can be dropped.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 42 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSZoning 1460 non-null object
1 Street 1460 non-null object
2 Alley 91 non-null object
3 LotShape 1460 non-null object
4 LandContour 1460 non-null object
5 Utilities 1460 non-null object
6 LotConfig 1460 non-null object
7 LandSlope 1460 non-null object
8 Neighborhood 1460 non-null object
9 Condition1 1460 non-null object
10 Condition2 1460 non-null object
11 BldgType 1460 non-null object
12 HouseStyle 1460 non-null object
13 RoofStyle 1460 non-null object
14 RoofMatl 1460 non-null object
15 Exterior1st 1460 non-null object
16 Exterior2nd 1460 non-null object
17 MasVnrType 1452 non-null object
18 ExterQual 1460 non-null object
19 ExterCond 1460 non-null object
20 Foundation 1460 non-null object
21 BsmtQual 1423 non-null object
22 BsmtCond 1423 non-null object
23 BsmtExposure 1422 non-null object
24 BsmtFinType1 1423 non-null object
25 BsmtFinType2 1422 non-null object
26 Heating 1460 non-null object
27 HeatingQC 1460 non-null object
28 CentralAir 1460 non-null object
29 Electrical 1459 non-null object
30 KitchenQual 1460 non-null object
31 Functional 1460 non-null object
32 FireplaceQu 770 non-null object
33 GarageType 1379 non-null object
34 GarageFinish 1379 non-null object
35 GarageQual 1379 non-null object
36 GarageCond 1379 non-null object
37 PavedDrive 1460 non-null object
38 PoolQC 7 non-null object
39 Fence 281 non-null object
40 SaleType 1460 non-null object
41 SaleCondition 1460 non-null object
dtypes: object(42)
memory usage: 479.2+ KB
One Hot Encoding
Let's take some categorical variables and apply one-hot encoding so they can be fed to machine learning algorithms and improve the predictions. One-hot encoding converts each categorical variable into a set of binary indicator columns, one per category.
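A minimal sketch of this step with pandas.get_dummies, assuming the categorical subset is in cat_df; the column list mirrors the dummy columns in the output below.

```python
import pandas as pd

# One-hot encode a handful of categorical columns (sketch; cat_df is assumed from above).
encoded = pd.get_dummies(
    cat_df[['LotShape', 'LotConfig', 'LandSlope', 'BldgType',
            'RoofStyle', 'RoofMatl', 'Foundation']]
)
encoded.info()
```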
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 37 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LotShape_IR1 1460 non-null uint8
1 LotShape_IR2 1460 non-null uint8
2 LotShape_IR3 1460 non-null uint8
3 LotShape_Reg 1460 non-null uint8
4 LotConfig_Corner 1460 non-null uint8
5 LotConfig_CulDSac 1460 non-null uint8
6 LotConfig_FR2 1460 non-null uint8
7 LotConfig_FR3 1460 non-null uint8
8 LotConfig_Inside 1460 non-null uint8
9 LandSlope_Gtl 1460 non-null uint8
10 LandSlope_Mod 1460 non-null uint8
11 LandSlope_Sev 1460 non-null uint8
12 BldgType_1Fam 1460 non-null uint8
13 BldgType_2fmCon 1460 non-null uint8
14 BldgType_Duplex 1460 non-null uint8
15 BldgType_Twnhs 1460 non-null uint8
16 BldgType_TwnhsE 1460 non-null uint8
17 RoofStyle_Flat 1460 non-null uint8
18 RoofStyle_Gable 1460 non-null uint8
19 RoofStyle_Gambrel 1460 non-null uint8
20 RoofStyle_Hip 1460 non-null uint8
21 RoofStyle_Mansard 1460 non-null uint8
22 RoofStyle_Shed 1460 non-null uint8
23 RoofMatl_ClyTile 1460 non-null uint8
24 RoofMatl_CompShg 1460 non-null uint8
25 RoofMatl_Membran 1460 non-null uint8
26 RoofMatl_Metal 1460 non-null uint8
27 RoofMatl_Roll 1460 non-null uint8
28 RoofMatl_Tar&Grv 1460 non-null uint8
29 RoofMatl_WdShake 1460 non-null uint8
30 RoofMatl_WdShngl 1460 non-null uint8
31 Foundation_BrkTil 1460 non-null uint8
32 Foundation_CBlock 1460 non-null uint8
33 Foundation_PConc 1460 non-null uint8
34 Foundation_Slab 1460 non-null uint8
35 Foundation_Stone 1460 non-null uint8
36 Foundation_Wood 1460 non-null uint8
dtypes: uint8(37)
memory usage: 52.9 KB
Outliers
As our aim is to optimize RMSLE, we can ignore outliers.
In the case of RMSE, the presence of outliers can inflate the error term to a very high value, but with RMSLE the outliers are drastically scaled down by the logarithm, which largely nullifies their effect.
Creating dataset to work with
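A sketch of how the pieces are put together, assuming num_df (with the multicollinear columns already dropped) and encoded from the previous steps; SalePrice is split off as the target.

```python
import pandas as pd

# Build the working feature matrix and target (sketch).
y = num_df['SalePrice']
X = pd.concat([num_df.drop(columns=['SalePrice']), encoded], axis=1)
X.info()
```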
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotFrontage 1460 non-null float64
3 LotArea 1460 non-null int64
4 OverallQual 1460 non-null int64
5 OverallCond 1460 non-null int64
6 YearBuilt 1460 non-null int64
7 YearRemodAdd 1460 non-null int64
8 MasVnrArea 1460 non-null float64
9 BsmtFinSF1 1460 non-null int64
10 BsmtFinSF2 1460 non-null int64
11 BsmtUnfSF 1460 non-null int64
12 TotalBsmtSF 1460 non-null int64
13 2ndFlrSF 1460 non-null int64
14 LowQualFinSF 1460 non-null int64
15 GrLivArea 1460 non-null int64
16 BsmtFullBath 1460 non-null int64
17 BsmtHalfBath 1460 non-null int64
18 FullBath 1460 non-null int64
19 HalfBath 1460 non-null int64
20 BedroomAbvGr 1460 non-null int64
21 KitchenAbvGr 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 WoodDeckSF 1460 non-null int64
25 OpenPorchSF 1460 non-null int64
26 EnclosedPorch 1460 non-null int64
27 3SsnPorch 1460 non-null int64
28 ScreenPorch 1460 non-null int64
29 PoolArea 1460 non-null int64
30 MoSold 1460 non-null int64
31 YrSold 1460 non-null int64
dtypes: float64(2), int64(30)
memory usage: 365.1 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 LotFrontage 1460 non-null float64
3 LotArea 1460 non-null int64
4 OverallQual 1460 non-null int64
5 OverallCond 1460 non-null int64
6 YearBuilt 1460 non-null int64
7 YearRemodAdd 1460 non-null int64
8 MasVnrArea 1460 non-null float64
9 BsmtFinSF1 1460 non-null int64
10 BsmtFinSF2 1460 non-null int64
11 BsmtUnfSF 1460 non-null int64
12 TotalBsmtSF 1460 non-null int64
13 2ndFlrSF 1460 non-null int64
14 LowQualFinSF 1460 non-null int64
15 GrLivArea 1460 non-null int64
16 BsmtFullBath 1460 non-null int64
17 BsmtHalfBath 1460 non-null int64
18 FullBath 1460 non-null int64
19 HalfBath 1460 non-null int64
20 BedroomAbvGr 1460 non-null int64
21 KitchenAbvGr 1460 non-null int64
22 Fireplaces 1460 non-null int64
23 GarageCars 1460 non-null int64
24 WoodDeckSF 1460 non-null int64
25 OpenPorchSF 1460 non-null int64
26 EnclosedPorch 1460 non-null int64
27 3SsnPorch 1460 non-null int64
28 ScreenPorch 1460 non-null int64
29 PoolArea 1460 non-null int64
30 MoSold 1460 non-null int64
31 YrSold 1460 non-null int64
32 LotShape_IR1 1460 non-null uint8
33 LotShape_IR2 1460 non-null uint8
34 LotShape_IR3 1460 non-null uint8
35 LotShape_Reg 1460 non-null uint8
36 LotConfig_Corner 1460 non-null uint8
37 LotConfig_CulDSac 1460 non-null uint8
38 LotConfig_FR2 1460 non-null uint8
39 LotConfig_FR3 1460 non-null uint8
40 LotConfig_Inside 1460 non-null uint8
41 LandSlope_Gtl 1460 non-null uint8
42 LandSlope_Mod 1460 non-null uint8
43 LandSlope_Sev 1460 non-null uint8
44 BldgType_1Fam 1460 non-null uint8
45 BldgType_2fmCon 1460 non-null uint8
46 BldgType_Duplex 1460 non-null uint8
47 BldgType_Twnhs 1460 non-null uint8
48 BldgType_TwnhsE 1460 non-null uint8
49 RoofStyle_Flat 1460 non-null uint8
50 RoofStyle_Gable 1460 non-null uint8
51 RoofStyle_Gambrel 1460 non-null uint8
52 RoofStyle_Hip 1460 non-null uint8
53 RoofStyle_Mansard 1460 non-null uint8
54 RoofStyle_Shed 1460 non-null uint8
55 RoofMatl_ClyTile 1460 non-null uint8
56 RoofMatl_CompShg 1460 non-null uint8
57 RoofMatl_Membran 1460 non-null uint8
58 RoofMatl_Metal 1460 non-null uint8
59 RoofMatl_Roll 1460 non-null uint8
60 RoofMatl_Tar&Grv 1460 non-null uint8
61 RoofMatl_WdShake 1460 non-null uint8
62 RoofMatl_WdShngl 1460 non-null uint8
63 Foundation_BrkTil 1460 non-null uint8
64 Foundation_CBlock 1460 non-null uint8
65 Foundation_PConc 1460 non-null uint8
66 Foundation_Slab 1460 non-null uint8
67 Foundation_Stone 1460 non-null uint8
68 Foundation_Wood 1460 non-null uint8
dtypes: float64(2), int64(30), uint8(37)
memory usage: 417.9 KB
Split into train and test data
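A sketch of the split; a 75/25 split matches the 1095 training observations reported in the OLS summaries below, and the random_state is an assumption.

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing (sketch; X and y are assumed from above).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
```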
Feature normalization
Bringing features onto the same scale
Rescaling brings the features into comparable ranges, and normalizing some features can improve performance.
As we only have numeric values (int64, uint8, and float64), we can apply rescaling to all of the data.
Let's apply min-max normalization to our dataset.
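A sketch of the min-max scaling, fitted on the training split only; the names X_train_mm and X_test_mm follow the text.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every column to the [0, 1] range (sketch; the split from above is assumed).
scaler = MinMaxScaler()
X_train_mm = pd.DataFrame(scaler.fit_transform(X_train),
                          columns=X_train.columns, index=X_train.index)
X_test_mm = pd.DataFrame(scaler.transform(X_test),
                         columns=X_test.columns, index=X_test.index)
```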
Our data is now prepared (X_train_mm, X_test_mm), so let's use it for choosing feature subsets and further analysis.
RMSLE
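The metric used throughout can be defined as below; this is a sketch built on scikit-learn's mean_squared_log_error.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

def rmsle(y_true, y_pred):
    # Root Mean Squared Logarithmic Error (raises if any value is negative).
    return np.sqrt(mean_squared_log_error(y_true, y_pred))
```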
3. Finding a suitable subset of features
Modeling
Let's first look at the performance of our model without feature normalization. We drop 'Id' because it is only an identifier and gives no useful insight.
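A sketch of the baseline fit with statsmodels, assuming the unscaled training split from above.

```python
import statsmodels.api as sm

# Ordinary least squares on the raw (unnormalized) features, Id dropped.
X_ols = sm.add_constant(X_train.drop(columns=['Id']))
ols_model = sm.OLS(y_train, X_ols).fit()
print(ols_model.summary())
```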
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.854
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 102.7
Date: Sun, 18 Dec 2022 Prob (F-statistic): 0.00
Time: 22:46:11 Log-Likelihood: -12833.
No. Observations: 1095 AIC: 2.579e+04
Df Residuals: 1035 BIC: 2.609e+04
Df Model: 59
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const -4.557e+05 5.97e+05 -0.763 0.445 -1.63e+06 7.16e+05
MSSubClass -97.6418 66.895 -1.460 0.145 -228.908 33.624
LotFrontage 112.4813 58.732 1.915 0.056 -2.767 227.729
LotArea 0.3248 0.143 2.275 0.023 0.045 0.605
OverallQual 1.499e+04 1294.291 11.581 0.000 1.24e+04 1.75e+04
OverallCond 5123.5577 1090.803 4.697 0.000 2983.120 7263.996
YearBuilt 309.4841 75.635 4.092 0.000 161.069 457.899
YearRemodAdd 176.8744 70.172 2.521 0.012 39.179 314.570
MasVnrArea 23.1137 6.737 3.431 0.001 9.895 36.333
BsmtFinSF1 23.0943 2.856 8.086 0.000 17.490 28.699
BsmtFinSF2 0.7477 5.060 0.148 0.883 -9.182 10.678
BsmtUnfSF 2.8944 2.742 1.056 0.291 -2.486 8.274
TotalBsmtSF 26.7365 4.239 6.307 0.000 18.419 35.054
2ndFlrSF 12.8594 6.268 2.051 0.040 0.559 25.159
LowQualFinSF -19.2821 20.991 -0.919 0.359 -60.471 21.907
GrLivArea 49.3375 6.002 8.221 0.000 37.561 61.114
BsmtFullBath 3020.6268 2749.007 1.099 0.272 -2373.636 8414.889
BsmtHalfBath -5367.2767 4323.250 -1.241 0.215 -1.39e+04 3116.058
FullBath -572.0154 2974.897 -0.192 0.848 -6409.532 5265.501
HalfBath -746.1533 2799.726 -0.267 0.790 -6239.939 4747.633
BedroomAbvGr -7412.1109 1678.142 -4.417 0.000 -1.07e+04 -4119.162
KitchenAbvGr -1.955e+04 6995.571 -2.794 0.005 -3.33e+04 -5820.694
Fireplaces 3918.2004 1837.355 2.133 0.033 312.834 7523.567
GarageCars 7514.3419 1800.351 4.174 0.000 3981.588 1.1e+04
WoodDeckSF 10.3048 8.134 1.267 0.205 -5.656 26.265
OpenPorchSF -10.9214 14.996 -0.728 0.467 -40.347 18.504
EnclosedPorch -10.0752 17.382 -0.580 0.562 -44.182 24.032
3SsnPorch 44.3336 29.950 1.480 0.139 -14.436 103.103
ScreenPorch 40.2129 17.684 2.274 0.023 5.512 74.914
PoolArea 83.2613 24.208 3.439 0.001 35.759 130.763
MoSold -300.9249 357.831 -0.841 0.401 -1003.082 401.232
YrSold 12.3983 727.636 0.017 0.986 -1415.413 1440.209
LotShape_IR1 -1.156e+05 1.49e+05 -0.774 0.439 -4.09e+05 1.77e+05
LotShape_IR2 -1.062e+05 1.49e+05 -0.711 0.477 -3.99e+05 1.87e+05
LotShape_IR3 -1.175e+05 1.49e+05 -0.786 0.432 -4.11e+05 1.76e+05
LotShape_Reg -1.165e+05 1.49e+05 -0.780 0.436 -4.1e+05 1.77e+05
LotConfig_Corner -9.142e+04 1.19e+05 -0.765 0.444 -3.26e+05 1.43e+05
LotConfig_CulDSac -8.063e+04 1.2e+05 -0.673 0.501 -3.16e+05 1.54e+05
LotConfig_FR2 -9.457e+04 1.2e+05 -0.790 0.430 -3.29e+05 1.4e+05
LotConfig_FR3 -1.004e+05 1.2e+05 -0.837 0.403 -3.36e+05 1.35e+05
LotConfig_Inside -8.875e+04 1.19e+05 -0.743 0.458 -3.23e+05 1.46e+05
LandSlope_Gtl -1.581e+05 1.99e+05 -0.795 0.427 -5.48e+05 2.32e+05
LandSlope_Mod -1.442e+05 1.99e+05 -0.726 0.468 -5.34e+05 2.46e+05
LandSlope_Sev -1.534e+05 2e+05 -0.767 0.443 -5.46e+05 2.39e+05
BldgType_1Fam -9.158e+04 1.2e+05 -0.765 0.444 -3.26e+05 1.43e+05
BldgType_2fmCon -7.853e+04 1.19e+05 -0.657 0.511 -3.13e+05 1.56e+05
BldgType_Duplex -8.666e+04 1.19e+05 -0.727 0.467 -3.21e+05 1.47e+05
BldgType_Twnhs -1.017e+05 1.2e+05 -0.849 0.396 -3.37e+05 1.33e+05
BldgType_TwnhsE -9.728e+04 1.2e+05 -0.813 0.416 -3.32e+05 1.38e+05
RoofStyle_Flat -4.76e+04 1.02e+05 -0.466 0.642 -2.48e+05 1.53e+05
RoofStyle_Gable -8.709e+04 9.98e+04 -0.873 0.383 -2.83e+05 1.09e+05
RoofStyle_Gambrel -7.576e+04 1e+05 -0.757 0.449 -2.72e+05 1.21e+05
RoofStyle_Hip -8.149e+04 9.98e+04 -0.816 0.414 -2.77e+05 1.14e+05
RoofStyle_Mansard -7.769e+04 1.01e+05 -0.772 0.441 -2.75e+05 1.2e+05
RoofStyle_Shed -8.609e+04 1.03e+05 -0.840 0.401 -2.87e+05 1.15e+05
RoofMatl_ClyTile -6.344e+05 9.46e+04 -6.703 0.000 -8.2e+05 -4.49e+05
RoofMatl_CompShg 4.428e+04 8.65e+04 0.512 0.609 -1.25e+05 2.14e+05
RoofMatl_Membran 4.783e-12 5.12e-12 0.934 0.351 -5.27e-12 1.48e-11
RoofMatl_Metal 5633.3380 9.27e+04 0.061 0.952 -1.76e+05 1.87e+05
RoofMatl_Roll 3.94e+04 9.01e+04 0.437 0.662 -1.37e+05 2.16e+05
RoofMatl_Tar&Grv -9691.2910 9.01e+04 -0.108 0.914 -1.86e+05 1.67e+05
RoofMatl_WdShake 2.267e+04 8.72e+04 0.260 0.795 -1.48e+05 1.94e+05
RoofMatl_WdShngl 7.633e+04 8.71e+04 0.876 0.381 -9.46e+04 2.47e+05
Foundation_BrkTil -7.564e+04 9.92e+04 -0.763 0.446 -2.7e+05 1.19e+05
Foundation_CBlock -8.132e+04 9.94e+04 -0.818 0.414 -2.76e+05 1.14e+05
Foundation_PConc -6.882e+04 9.95e+04 -0.692 0.489 -2.64e+05 1.26e+05
Foundation_Slab -4.353e+04 1e+05 -0.435 0.663 -2.4e+05 1.53e+05
Foundation_Stone -6.328e+04 1e+05 -0.630 0.529 -2.6e+05 1.34e+05
Foundation_Wood -1.231e+05 1.01e+05 -1.217 0.224 -3.22e+05 7.54e+04
==============================================================================
Omnibus: 444.157 Durbin-Watson: 2.068
Prob(Omnibus): 0.000 Jarque-Bera (JB): 45580.718
Skew: -0.876 Prob(JB): 0.00
Kurtosis: 34.559 Cond. No. 1.37e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.44e-21. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
RMSLE is usually used when you don't want to penalize huge differences between predicted and true values when both are large numbers. In such cases only the relative (percentage) differences matter, since the logarithm turns ratios of values into differences of logs.
On our train dataset the RMSLE is 0.177437, which is not a bad result.
On the test dataset the RMSLE is higher, which means the model performs better on the training data.
Model after feature normalization
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.854
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 102.7
Date: Sun, 18 Dec 2022 Prob (F-statistic): 0.00
Time: 22:46:11 Log-Likelihood: -12833.
No. Observations: 1095 AIC: 2.579e+04
Df Residuals: 1035 BIC: 2.609e+04
Df Model: 59
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const -5.474e+04 8025.930 -6.821 0.000 -7.05e+04 -3.9e+04
MSSubClass -1.66e+04 1.14e+04 -1.460 0.145 -3.89e+04 5716.151
LotFrontage 3.284e+04 1.71e+04 1.915 0.056 -807.934 6.65e+04
LotArea 6.949e+04 3.05e+04 2.275 0.023 9552.338 1.29e+05
OverallQual 1.349e+05 1.16e+04 11.581 0.000 1.12e+05 1.58e+05
OverallCond 4.099e+04 8726.426 4.697 0.000 2.39e+04 5.81e+04
YearBuilt 4.271e+04 1.04e+04 4.092 0.000 2.22e+04 6.32e+04
YearRemodAdd 1.061e+04 4210.318 2.521 0.012 2350.731 1.89e+04
MasVnrArea 3.185e+04 9283.134 3.431 0.001 1.36e+04 5.01e+04
BsmtFinSF1 1.486e+05 1.69e+04 8.820 0.000 1.16e+05 1.82e+05
BsmtFinSF2 4490.5900 7718.923 0.582 0.561 -1.07e+04 1.96e+04
BsmtUnfSF 1.432e+04 7783.924 1.840 0.066 -951.552 2.96e+04
TotalBsmtSF 1.436e+05 1.74e+04 8.273 0.000 1.1e+05 1.78e+05
2ndFlrSF 2.655e+04 1.29e+04 2.051 0.040 1155.090 5.2e+04
LowQualFinSF -1.103e+04 1.2e+04 -0.919 0.359 -3.46e+04 1.25e+04
GrLivArea 2.619e+05 3.19e+04 8.221 0.000 1.99e+05 3.24e+05
BsmtFullBath 9061.8803 8247.021 1.099 0.272 -7120.907 2.52e+04
BsmtHalfBath -1.073e+04 8646.500 -1.241 0.215 -2.77e+04 6232.117
FullBath -1716.0463 8924.690 -0.192 0.848 -1.92e+04 1.58e+04
HalfBath -1492.3066 5599.451 -0.267 0.790 -1.25e+04 9495.265
BedroomAbvGr -5.93e+04 1.34e+04 -4.417 0.000 -8.56e+04 -3.3e+04
KitchenAbvGr -5.864e+04 2.1e+04 -2.794 0.005 -9.98e+04 -1.75e+04
Fireplaces 1.175e+04 5512.066 2.133 0.033 938.501 2.26e+04
GarageCars 3.006e+04 7201.402 4.174 0.000 1.59e+04 4.42e+04
WoodDeckSF 8831.2375 6970.586 1.267 0.205 -4846.855 2.25e+04
OpenPorchSF -5973.9899 8202.585 -0.728 0.467 -2.21e+04 1.01e+04
EnclosedPorch -5561.5233 9594.657 -0.580 0.562 -2.44e+04 1.33e+04
3SsnPorch 2.252e+04 1.52e+04 1.480 0.139 -7333.378 5.24e+04
ScreenPorch 1.93e+04 8488.473 2.274 0.023 2645.596 3.6e+04
PoolArea 6.145e+04 1.79e+04 3.439 0.001 2.64e+04 9.65e+04
MoSold -3310.1737 3936.141 -0.841 0.401 -1.1e+04 4413.553
YrSold 49.5934 2910.546 0.017 0.986 -5661.650 5760.837
LotShape_IR1 -1.535e+04 4024.924 -3.813 0.000 -2.32e+04 -7450.817
LotShape_IR2 -5934.8318 5407.570 -1.098 0.273 -1.65e+04 4676.220
LotShape_IR3 -1.724e+04 9676.619 -1.781 0.075 -3.62e+04 1751.084
LotShape_Reg -1.622e+04 4020.877 -4.034 0.000 -2.41e+04 -8330.461
LotConfig_Corner -1.122e+04 4242.135 -2.645 0.008 -1.95e+04 -2894.732
LotConfig_CulDSac -430.3910 4986.364 -0.086 0.931 -1.02e+04 9354.145
LotConfig_FR2 -1.437e+04 5598.170 -2.567 0.010 -2.54e+04 -3382.943
LotConfig_FR3 -2.018e+04 1.48e+04 -1.367 0.172 -4.91e+04 8779.512
LotConfig_Inside -8547.8477 3946.265 -2.166 0.031 -1.63e+04 -804.254
LandSlope_Gtl -2.441e+04 6409.552 -3.809 0.000 -3.7e+04 -1.18e+04
LandSlope_Mod -1.056e+04 6515.661 -1.620 0.105 -2.33e+04 2228.356
LandSlope_Sev -1.977e+04 1.18e+04 -1.679 0.093 -4.29e+04 3330.700
BldgType_1Fam -1.138e+04 5599.283 -2.033 0.042 -2.24e+04 -396.184
BldgType_2fmCon 1671.2738 7486.530 0.223 0.823 -1.3e+04 1.64e+04
BldgType_Duplex -6462.8056 6609.474 -0.978 0.328 -1.94e+04 6506.691
BldgType_Twnhs -2.149e+04 5928.854 -3.624 0.000 -3.31e+04 -9854.174
BldgType_TwnhsE -1.708e+04 4388.431 -3.892 0.000 -2.57e+04 -8466.684
RoofStyle_Flat 1.923e+04 2.67e+04 0.720 0.472 -3.32e+04 7.17e+04
RoofStyle_Gable -2.026e+04 7804.234 -2.596 0.010 -3.56e+04 -4947.399
RoofStyle_Gambrel -8929.7531 1.17e+04 -0.761 0.447 -3.2e+04 1.41e+04
RoofStyle_Hip -1.466e+04 8001.473 -1.832 0.067 -3.04e+04 1040.733
RoofStyle_Mansard -1.086e+04 1.4e+04 -0.774 0.439 -3.84e+04 1.67e+04
RoofStyle_Shed -1.926e+04 2.2e+04 -0.876 0.381 -6.24e+04 2.39e+04
RoofMatl_ClyTile -5.771e+05 3.75e+04 -15.392 0.000 -6.51e+05 -5.04e+05
RoofMatl_CompShg 1.016e+05 1.3e+04 7.836 0.000 7.61e+04 1.27e+05
RoofMatl_Membran -8.077e-12 2.1e-12 -3.840 0.000 -1.22e-11 -3.95e-12
RoofMatl_Metal 6.292e+04 3.91e+04 1.610 0.108 -1.38e+04 1.4e+05
RoofMatl_Roll 9.668e+04 3e+04 3.219 0.001 3.77e+04 1.56e+05
RoofMatl_Tar&Grv 4.759e+04 2.63e+04 1.809 0.071 -4044.170 9.92e+04
RoofMatl_WdShake 7.996e+04 2.17e+04 3.679 0.000 3.73e+04 1.23e+05
RoofMatl_WdShngl 1.336e+05 1.93e+04 6.940 0.000 9.58e+04 1.71e+05
Foundation_BrkTil -8805.0575 5096.186 -1.728 0.084 -1.88e+04 1194.979
Foundation_CBlock -1.449e+04 4485.663 -3.230 0.001 -2.33e+04 -5687.809
Foundation_PConc -1988.5070 4717.973 -0.421 0.673 -1.12e+04 7269.376
Foundation_Slab 2.33e+04 8480.980 2.747 0.006 6657.588 3.99e+04
Foundation_Stone 3556.9364 1.28e+04 0.278 0.781 -2.16e+04 2.87e+04
Foundation_Wood -5.631e+04 1.6e+04 -3.530 0.000 -8.76e+04 -2.5e+04
==============================================================================
Omnibus: 444.157 Durbin-Watson: 2.068
Prob(Omnibus): 0.000 Jarque-Bera (JB): 45580.718
Skew: -0.876 Prob(JB): 0.00
Kurtosis: 34.559 Cond. No. 1.07e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.41e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Lasso
Lasso regression uses an L1 penalty, which promotes sparsity of the features. It can be used for feature selection because it shrinks the coefficients of uninformative features to exactly 0.
Let's choose alpha = 10. When alpha is 0, Lasso regression produces the same coefficients as ordinary linear regression; when alpha is very large, all coefficients become zero.
Let's see how our model works when we keep only the columns whose Lasso coefficients are non-zero.
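A sketch of the selection step, assuming the normalized training data from above (Id excluded) and alpha = 10 as discussed.

```python
from sklearn.linear_model import Lasso

# Fit Lasso and keep only the features with non-zero coefficients (sketch).
lasso = Lasso(alpha=10)
lasso.fit(X_train_mm.drop(columns=['Id']), y_train)

selected = X_train_mm.drop(columns=['Id']).columns[lasso.coef_ != 0]
print(len(selected))
print(selected)
```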
61
Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
'TotalBsmtSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath',
'BsmtHalfBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
'Fireplaces', 'GarageCars', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MoSold',
'YrSold', 'LotShape_IR1', 'LotShape_IR2', 'LotShape_IR3',
'LotShape_Reg', 'LotConfig_CulDSac', 'LotConfig_FR2', 'LotConfig_FR3',
'LotConfig_Inside', 'LandSlope_Gtl', 'LandSlope_Mod', 'BldgType_2fmCon',
'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
'RoofStyle_Flat', 'RoofStyle_Gable', 'RoofStyle_Gambrel',
'RoofStyle_Hip', 'RoofStyle_Mansard', 'RoofStyle_Shed',
'RoofMatl_ClyTile', 'RoofMatl_CompShg', 'RoofMatl_Roll',
'RoofMatl_Tar&Grv', 'RoofMatl_WdShake', 'RoofMatl_WdShngl',
'Foundation_BrkTil', 'Foundation_CBlock', 'Foundation_PConc',
'Foundation_Slab', 'Foundation_Stone', 'Foundation_Wood'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1095 entries, 1023 to 1126
Data columns (total 68 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 1095 non-null float64
1 LotFrontage 1095 non-null float64
2 LotArea 1095 non-null float64
3 OverallQual 1095 non-null float64
4 OverallCond 1095 non-null float64
5 YearBuilt 1095 non-null float64
6 YearRemodAdd 1095 non-null float64
7 MasVnrArea 1095 non-null float64
8 BsmtFinSF1 1095 non-null float64
9 BsmtFinSF2 1095 non-null float64
10 BsmtUnfSF 1095 non-null float64
11 TotalBsmtSF 1095 non-null float64
12 2ndFlrSF 1095 non-null float64
13 LowQualFinSF 1095 non-null float64
14 GrLivArea 1095 non-null float64
15 BsmtFullBath 1095 non-null float64
16 BsmtHalfBath 1095 non-null float64
17 FullBath 1095 non-null float64
18 HalfBath 1095 non-null float64
19 BedroomAbvGr 1095 non-null float64
20 KitchenAbvGr 1095 non-null float64
21 Fireplaces 1095 non-null float64
22 GarageCars 1095 non-null float64
23 WoodDeckSF 1095 non-null float64
24 OpenPorchSF 1095 non-null float64
25 EnclosedPorch 1095 non-null float64
26 3SsnPorch 1095 non-null float64
27 ScreenPorch 1095 non-null float64
28 PoolArea 1095 non-null float64
29 MoSold 1095 non-null float64
30 YrSold 1095 non-null float64
31 LotShape_IR1 1095 non-null float64
32 LotShape_IR2 1095 non-null float64
33 LotShape_IR3 1095 non-null float64
34 LotShape_Reg 1095 non-null float64
35 LotConfig_Corner 1095 non-null float64
36 LotConfig_CulDSac 1095 non-null float64
37 LotConfig_FR2 1095 non-null float64
38 LotConfig_FR3 1095 non-null float64
39 LotConfig_Inside 1095 non-null float64
40 LandSlope_Gtl 1095 non-null float64
41 LandSlope_Mod 1095 non-null float64
42 LandSlope_Sev 1095 non-null float64
43 BldgType_1Fam 1095 non-null float64
44 BldgType_2fmCon 1095 non-null float64
45 BldgType_Duplex 1095 non-null float64
46 BldgType_Twnhs 1095 non-null float64
47 BldgType_TwnhsE 1095 non-null float64
48 RoofStyle_Flat 1095 non-null float64
49 RoofStyle_Gable 1095 non-null float64
50 RoofStyle_Gambrel 1095 non-null float64
51 RoofStyle_Hip 1095 non-null float64
52 RoofStyle_Mansard 1095 non-null float64
53 RoofStyle_Shed 1095 non-null float64
54 RoofMatl_ClyTile 1095 non-null float64
55 RoofMatl_CompShg 1095 non-null float64
56 RoofMatl_Membran 1095 non-null float64
57 RoofMatl_Metal 1095 non-null float64
58 RoofMatl_Roll 1095 non-null float64
59 RoofMatl_Tar&Grv 1095 non-null float64
60 RoofMatl_WdShake 1095 non-null float64
61 RoofMatl_WdShngl 1095 non-null float64
62 Foundation_BrkTil 1095 non-null float64
63 Foundation_CBlock 1095 non-null float64
64 Foundation_PConc 1095 non-null float64
65 Foundation_Slab 1095 non-null float64
66 Foundation_Stone 1095 non-null float64
67 Foundation_Wood 1095 non-null float64
dtypes: float64(68)
memory usage: 590.3 KB
Let's check how the performance of our model changes when we drop those columns.
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.854
Model: OLS Adj. R-squared: 0.846
Method: Least Squares F-statistic: 104.5
Date: Sun, 18 Dec 2022 Prob (F-statistic): 0.00
Time: 22:46:12 Log-Likelihood: -12834.
No. Observations: 1095 AIC: 2.579e+04
Df Residuals: 1036 BIC: 2.608e+04
Df Model: 58
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const -4.153e+04 2.75e+04 -1.509 0.132 -9.55e+04 1.25e+04
MSSubClass -1.667e+04 1.14e+04 -1.468 0.143 -3.9e+04 5620.477
LotFrontage 3.289e+04 1.71e+04 1.919 0.055 -747.096 6.65e+04
LotArea 6.913e+04 3.05e+04 2.269 0.023 9335.701 1.29e+05
OverallQual 1.349e+05 1.16e+04 11.585 0.000 1.12e+05 1.58e+05
OverallCond 4.097e+04 8722.006 4.698 0.000 2.39e+04 5.81e+04
YearBuilt 4.205e+04 9862.884 4.264 0.000 2.27e+04 6.14e+04
YearRemodAdd 1.053e+04 4186.932 2.515 0.012 2315.037 1.87e+04
MasVnrArea 3.192e+04 9271.814 3.443 0.001 1.37e+04 5.01e+04
BsmtFinSF1 1.14e+05 1.99e+04 5.730 0.000 7.5e+04 1.53e+05
BsmtFinSF2 -2392.2901 7774.032 -0.308 0.758 -1.76e+04 1.29e+04
TotalBsmtSF 1.812e+05 3.3e+04 5.487 0.000 1.16e+05 2.46e+05
2ndFlrSF 2.634e+04 1.29e+04 2.043 0.041 1046.825 5.16e+04
LowQualFinSF -1.093e+04 1.2e+04 -0.912 0.362 -3.45e+04 1.26e+04
GrLivArea 2.604e+05 3.09e+04 8.431 0.000 2e+05 3.21e+05
BsmtFullBath 9282.8981 8162.727 1.137 0.256 -6734.465 2.53e+04
BsmtHalfBath -1.07e+04 8640.148 -1.238 0.216 -2.77e+04 6258.258
HalfBath -1134.2216 5278.235 -0.215 0.830 -1.15e+04 9223.028
BedroomAbvGr -5.958e+04 1.33e+04 -4.467 0.000 -8.58e+04 -3.34e+04
KitchenAbvGr -5.916e+04 2.08e+04 -2.843 0.005 -1e+05 -1.83e+04
Fireplaces 1.175e+04 5509.324 2.132 0.033 935.334 2.26e+04
GarageCars 3.001e+04 7193.740 4.172 0.000 1.59e+04 4.41e+04
WoodDeckSF 8799.7839 6965.427 1.263 0.207 -4868.169 2.25e+04
OpenPorchSF -6078.9512 8180.597 -0.743 0.458 -2.21e+04 9973.477
EnclosedPorch -5619.4930 9585.461 -0.586 0.558 -2.44e+04 1.32e+04
3SsnPorch 2.239e+04 1.52e+04 1.474 0.141 -7419.336 5.22e+04
ScreenPorch 1.93e+04 8484.521 2.275 0.023 2651.516 3.59e+04
PoolArea 6.147e+04 1.79e+04 3.442 0.001 2.64e+04 9.65e+04
MoSold -3295.2923 3933.550 -0.838 0.402 -1.1e+04 4423.342
YrSold 20.4672 2905.250 0.007 0.994 -5680.378 5721.313
LotShape_IR1 -1.204e+04 7494.152 -1.607 0.108 -2.67e+04 2664.705
LotShape_IR2 -2644.3012 8482.533 -0.312 0.755 -1.93e+04 1.4e+04
LotShape_IR3 -1.394e+04 1.2e+04 -1.163 0.245 -3.75e+04 9580.677
LotShape_Reg -1.29e+04 7521.267 -1.716 0.087 -2.77e+04 1855.688
LotConfig_CulDSac 1.083e+04 4542.717 2.383 0.017 1911.864 1.97e+04
LotConfig_FR2 -3200.2269 5699.725 -0.561 0.575 -1.44e+04 7984.094
LotConfig_FR3 -9111.8866 1.8e+04 -0.507 0.612 -4.44e+04 2.61e+04
LotConfig_Inside 2646.5586 2546.418 1.039 0.299 -2350.166 7643.283
LandSlope_Gtl -4752.4873 1.7e+04 -0.280 0.779 -3.81e+04 2.85e+04
LandSlope_Mod 9117.1510 1.7e+04 0.535 0.593 -2.43e+04 4.26e+04
BldgType_2fmCon 1.309e+04 1.15e+04 1.135 0.257 -9530.708 3.57e+04
BldgType_Duplex 4850.8069 8517.549 0.570 0.569 -1.19e+04 2.16e+04
BldgType_Twnhs -1.013e+04 9589.865 -1.056 0.291 -2.89e+04 8690.196
BldgType_TwnhsE -5706.7212 7826.375 -0.729 0.466 -2.11e+04 9650.634
RoofStyle_Flat 2.14e+04 2.44e+04 0.877 0.381 -2.65e+04 6.93e+04
RoofStyle_Gable -1.816e+04 1.02e+04 -1.783 0.075 -3.81e+04 1821.111
RoofStyle_Gambrel -6800.5824 1.35e+04 -0.505 0.614 -3.33e+04 1.96e+04
RoofStyle_Hip -1.253e+04 1.03e+04 -1.214 0.225 -3.28e+04 7725.924
RoofStyle_Mansard -8738.4561 1.56e+04 -0.561 0.575 -3.93e+04 2.18e+04
RoofStyle_Shed -1.671e+04 2.31e+04 -0.722 0.470 -6.21e+04 2.87e+04
RoofMatl_ClyTile -6.39e+05 6.32e+04 -10.108 0.000 -7.63e+05 -5.15e+05
RoofMatl_CompShg 3.875e+04 4.77e+04 0.813 0.416 -5.48e+04 1.32e+05
RoofMatl_Roll 3.396e+04 5.72e+04 0.593 0.553 -7.84e+04 1.46e+05
RoofMatl_Tar&Grv -1.534e+04 3.66e+04 -0.420 0.675 -8.71e+04 5.64e+04
RoofMatl_WdShake 1.699e+04 5.1e+04 0.333 0.739 -8.31e+04 1.17e+05
RoofMatl_WdShngl 7.1e+04 5.09e+04 1.396 0.163 -2.88e+04 1.71e+05
Foundation_BrkTil -6705.8100 6526.219 -1.028 0.304 -1.95e+04 6100.305
Foundation_CBlock -1.232e+04 6122.167 -2.012 0.045 -2.43e+04 -302.160
Foundation_PConc 169.3451 6318.902 0.027 0.979 -1.22e+04 1.26e+04
Foundation_Slab 2.546e+04 9632.586 2.643 0.008 6553.475 4.44e+04
Foundation_Stone 5822.2428 1.35e+04 0.430 0.667 -2.07e+04 3.24e+04
Foundation_Wood -5.396e+04 1.66e+04 -3.244 0.001 -8.66e+04 -2.13e+04
==============================================================================
Omnibus: 443.217 Durbin-Watson: 2.067
Prob(Omnibus): 0.000 Jarque-Bera (JB): 45337.097
Skew: -0.873 Prob(JB): 0.00
Kurtosis: 34.474 Cond. No. 1.08e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 7.24e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
RMSLE for train data 0.17743789941261642
RMSLE for test data 0.22305345925639333
RMSLE for train data after normalization 0.17743789941194163
RMSLE for test data after normalization 0.22397431875031626
RMSLE for train data after dropping columns using Lasso 0.1768844465501454
RMSLE for test data after dropping columns using Lasso 0.22312391591555877
K Best
SelectKBest (usually k should be selected using cross-validation, but for now let's say k = 30)
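A sketch of the scoring step, assuming the normalized training data from above; the use of f_regression as the score function is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Score each feature against the target and keep the 30 best (sketch).
features = X_train_mm.drop(columns=['Id'])
selector = SelectKBest(score_func=f_regression, k=30)
selector.fit(features, np.ravel(y_train))

scores = pd.DataFrame({'Features': features.columns, 'Score': selector.scores_})
print(scores.nlargest(30, 'Score'))
```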
Features Score
3 OverallQual 1755.206996
14 GrLivArea 1009.771985
22 GarageCars 772.808931
11 TotalBsmtSF 595.077864
17 FullBath 476.288890
5 YearBuilt 403.162109
6 YearRemodAdd 390.077352
64 Foundation_PConc 340.673496
21 Fireplaces 293.341920
7 MasVnrArea 292.344268
8 BsmtFinSF1 162.638484
63 Foundation_CBlock 143.295069
23 WoodDeckSF 134.075140
12 2ndFlrSF 122.624957
1 LotFrontage 116.512764
24 OpenPorchSF 99.045405
18 HalfBath 93.664177
2 LotArea 82.940544
34 LotShape_Reg 75.175687
51 RoofStyle_Hip 63.443358
15 BsmtFullBath 62.040588
49 RoofStyle_Gable 57.456927
10 BsmtUnfSF 56.965483
31 LotShape_IR1 45.102068
62 Foundation_BrkTil 42.315967
36 LotConfig_CulDSac 30.091492
19 BedroomAbvGr 26.868815
20 KitchenAbvGr 23.724887
25 EnclosedPorch 23.256924
43 BldgType_1Fam 22.072005
Let's take 10 of the features we just identified using the K-Best scores.
Let's check the performance of our model using linear regression
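A sketch of the fit on those 10 features, assuming the normalized training split from above; the column list matches the summary below.

```python
import statsmodels.api as sm

kbest_cols = ['TotalBsmtSF', 'YearBuilt', 'Fireplaces', 'RoofStyle_Gable',
              'BedroomAbvGr', 'MasVnrArea', 'Foundation_PConc',
              'FullBath', '2ndFlrSF', 'GrLivArea']

# OLS on the 10 selected features only (sketch).
ols_kbest = sm.OLS(y_train, sm.add_constant(X_train_mm[kbest_cols])).fit()
print(ols_kbest.summary())
```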
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.717
Model: OLS Adj. R-squared: 0.714
Method: Least Squares F-statistic: 274.3
Date: Sun, 18 Dec 2022 Prob (F-statistic): 1.08e-288
Time: 22:46:12 Log-Likelihood: -13196.
No. Observations: 1095 AIC: 2.641e+04
Df Residuals: 1084 BIC: 2.647e+04
Df Model: 10
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 1.758e+04 7706.381 2.281 0.023 2458.234 3.27e+04
TotalBsmtSF 1.588e+05 3.24e+04 4.902 0.000 9.52e+04 2.22e+05
YearBuilt 8.474e+04 8634.489 9.815 0.000 6.78e+04 1.02e+05
Fireplaces 4.338e+04 6853.914 6.329 0.000 2.99e+04 5.68e+04
RoofStyle_Gable -1.257e+04 3212.021 -3.913 0.000 -1.89e+04 -6266.082
BedroomAbvGr -7.796e+04 1.58e+04 -4.942 0.000 -1.09e+05 -4.7e+04
MasVnrArea 4.632e+04 1.18e+04 3.909 0.000 2.31e+04 6.96e+04
Foundation_PConc 1.866e+04 3521.505 5.298 0.000 1.17e+04 2.56e+04
FullBath 7075.5025 1.04e+04 0.684 0.494 -1.32e+04 2.74e+04
2ndFlrSF -2639.4034 1.31e+04 -0.201 0.841 -2.84e+04 2.31e+04
GrLivArea 3.969e+05 3.48e+04 11.398 0.000 3.29e+05 4.65e+05
==============================================================================
Omnibus: 644.154 Durbin-Watson: 1.993
Prob(Omnibus): 0.000 Jarque-Bera (JB): 68210.313
Skew: -1.768 Prob(JB): 0.00
Kurtosis: 41.503 Cond. No. 62.3
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
4. PCA
PCA is used to reduce the number of features in the dataset and thereby simplify the learning model.
Let's create a loop that tries different numbers of components and find the best-performing configuration by checking the RMSLE on the train and test data.
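A sketch of the loop, assuming the scaled splits and the rmsle helper defined earlier; the component counts mirror the results below, and the plain linear regression on the components is an assumption.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

for n in range(5, 38, 4):
    # Project the scaled features onto the first n principal components.
    pca = PCA(n_components=n)
    Z_train = pca.fit_transform(X_train_mm)
    Z_test = pca.transform(X_test_mm)

    reg = LinearRegression().fit(Z_train, y_train)
    print('n_components:', n)
    print('Train RMSLE:', rmsle(y_train, reg.predict(Z_train)))
    print('Test RMSLE:', rmsle(y_test, reg.predict(Z_test)))
```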
n_components: 5
Train RMSLE: 0.29631596479097455
Test RMSLE: 0.3004673098749561
n_components: 9
Train RMSLE: 0.23983815962817726
Test RMSLE: 0.24709854140354934
n_components: 13
Train RMSLE: 0.21287268411407295
Test RMSLE: 0.24357608835562417
n_components: 17
Train RMSLE: 0.19362782729935957
Test RMSLE: 0.21442733058963706
n_components: 21
Train RMSLE: 0.20017939073884564
Test RMSLE: 0.2171358030159394
n_components: 25
Train RMSLE: 0.20517864516204262
Test RMSLE: 0.21183841332895878
n_components: 29
Train RMSLE: 0.21773920527660026
Test RMSLE: 0.21495508373697164
n_components: 33
Train RMSLE: 0.1922395958900697
Test RMSLE: 0.20160758496416994
n_components: 37
Train RMSLE: 0.19792667472296274
Test RMSLE: 0.21068714996246649
Here we can see that, broadly, the more components PCA keeps, the lower the RMSLE, i.e. the better our model performs, although the improvement is not strictly monotonic.
The best performance was obtained with 33 components: Train RMSLE 0.1922395958900697, Test RMSLE 0.20160758496416994.
Discussion
Here we can see all our results
RMSLE for train data 0.17743789941261642
RMSLE for test data 0.22305345925639333
RMSLE for train data after normalization 0.17743789941194163
RMSLE for test data after normalization 0.22397431875031626
RMSLE for train data after dropping columns using Lasso 0.1768844465501454
RMSLE for test data after dropping columns using Lasso 0.22312391591555877
RMSLE for train data after using K Score 0.19385109462355005
RMSLE for test data after using K Score 0.19949123573153912
And the best result of PCA applying was with 33 components:
Train RMSLE: 0.1922395958900697
Test RMSLE: 0.20160758496416994
We can note that on the train dataset the PCA result (0.19224) is higher than almost every other value (0.177438, 0.177438, 0.176884), the exception being the K-Score result (0.193851); that is not ideal behaviour for the model. The difference is small, but it is there.
Interestingly, on the test data PCA gives one of the smallest RMSLE values, second only to the K-Score subset.
We get the best results on the test dataset using the features chosen with the K-Score, namely: 'TotalBsmtSF', 'YearBuilt', 'Fireplaces', 'RoofStyle_Gable', 'BedroomAbvGr', 'MasVnrArea', 'Foundation_PConc', 'FullBath', '2ndFlrSF', 'GrLivArea'.
We cannot unambiguously determine which of the methods worked best for us. The RMSLE was not very high for any of the methods, which means the model's performance was reasonable throughout.
Comments
I tried different types of data normalization. Unfortunately, StandardScaler did not work for optimizing RMSLE, raising 'Mean Squared Logarithmic Error cannot be used when targets contain negative values'.
I also tried different alpha values for the Lasso regression when choosing the model, but alpha = 10 worked best for me.
Regarding the K-Best method, I wanted to take the top 20 features, but some of them also caused the same RMSLE error, which is why I carefully added variables by hand instead.
While using PCA I also got an error when calculating the RMSLE, so the parameters for the loop were likewise chosen carefully.