Data Analysis Exercise #1
OLS Regression Results
==============================================================================
Dep. Variable: Strength R-squared: 0.603
Model: OLS Adj. R-squared: 0.598
Method: Least Squares F-statistic: 119.9
Date: Thu, 21 Oct 2021 Prob (F-statistic): 3.55e-136
Time: 04:52:11 Log-Likelihood: -2705.3
No. Observations: 721 AIC: 5431.
Df Residuals: 711 BIC: 5476.
Df Model: 9
Covariance Type: nonrobust
======================================================================================
                           coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------
Intercept              -31.5803     32.679     -0.966      0.334    -95.740     32.579
Cement                   0.1143      0.013      8.791      0.000      0.089      0.140
Blast_Furnace_Slag       0.0976      0.012      8.014      0.000      0.074      0.122
Fly_Ash                  0.0908      0.015      6.042      0.000      0.061      0.120
Water                   -0.1075      0.051     -2.120      0.034     -0.207     -0.008
Superplasticizer         0.3458      0.109      3.175      0.002      0.132      0.560
Coarse_Aggregate         0.0207      0.011      1.811      0.071     -0.002      0.043
Fine_Aggregate           0.0227      0.013      1.754      0.080     -0.003      0.048
Age                      0.1082      0.006     17.218      0.000      0.096      0.121
Water_Cement_Ratio      -2.8690      3.208     -0.894      0.372     -9.168      3.430
==============================================================================
Omnibus: 2.607 Durbin-Watson: 2.020
Prob(Omnibus): 0.272 Jarque-Bera (JB): 2.675
Skew: -0.142 Prob(JB): 0.263
Kurtosis: 2.907 Cond. No. 1.09e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.09e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
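With all variables included (model 1), R^2 is 0.603. The highest p-value in this model is 0.372, for Water_Cement_Ratio.
We want to remove variables with very high VIFs, since a high VIF signals multicollinearity (which the large condition number above also hints at). In this case, Cement has the highest VIF, at around 13.9, so the next model drops Cement.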
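For reference, the VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors. The notebook's own VIF code is not shown, so the following is only a minimal sketch of that calculation in plain NumPy on synthetic data. The function name `vif` and the columns `a` to `d` are made up for illustration. Including an intercept in each auxiliary regression is also an assumption; VIFs computed without one can come out different.

```python
import numpy as np


def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features).

    Each column is regressed on all the other columns plus an intercept,
    and VIF_j = 1 / (1 - R_j^2) for that auxiliary regression.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other columns
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        result.append(1.0 / (1.0 - r2))
    return np.array(result)


# Synthetic data (not the concrete dataset): c is almost a linear
# combination of a and b, while d is unrelated to the rest.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)
d = rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c, d]))
for name, value in zip(["a", "b", "c", "d"], vifs):
    print(f"{name}: {value:.2f}")
```

In this toy data the entangled columns get large VIFs while the unrelated column stays close to 1, which is the pattern the rule of thumb (worry above roughly 5 to 10) is meant to catch.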
OLS Regression Results
==============================================================================
Dep. Variable: Strength R-squared: 0.560
Model: OLS Adj. R-squared: 0.555
Method: Least Squares F-statistic: 113.1
Date: Thu, 21 Oct 2021 Prob (F-statistic): 2.12e-121
Time: 04:52:11 Log-Likelihood: -2742.5
No. Observations: 721 AIC: 5503.
Df Residuals: 712 BIC: 5544.
Df Model: 8
Covariance Type: nonrobust
======================================================================================
                           coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------
Intercept              184.1306     22.709      8.108      0.000    139.547    228.714
Blast_Furnace_Slag       0.0278      0.010      2.860      0.004      0.009      0.047
Fly_Ash                 -0.0016      0.011     -0.140      0.889     -0.024      0.021
Water                   -0.3006      0.048     -6.253      0.000     -0.395     -0.206
Superplasticizer         0.2791      0.114      2.441      0.015      0.055      0.504
Coarse_Aggregate        -0.0452      0.009     -4.997      0.000     -0.063     -0.027
Fine_Aggregate          -0.0556      0.010     -5.609      0.000     -0.075     -0.036
Age                      0.1039      0.007     15.761      0.000      0.091      0.117
Water_Cement_Ratio     -20.3232      2.652     -7.664      0.000    -25.530    -15.117
==============================================================================
Omnibus: 1.722 Durbin-Watson: 2.006
Prob(Omnibus): 0.423 Jarque-Bera (JB): 1.793
Skew: -0.113 Prob(JB): 0.408
Kurtosis: 2.908 Cond. No. 7.06e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.06e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Blast_Furnace_Slag 4.299342
Fly_Ash 3.111559
Water 6.169571
Superplasticizer 2.923183
Coarse_Aggregate 2.962581
Fine_Aggregate 3.858057
Age 1.103416
Water_Cement_Ratio 4.115372
dtype: float64
With Cement removed (model 2), the VIFs have dropped below 5 for every variable except Water (6.17). However, R^2 has fallen from 0.603 to 0.560, which suggests that removing Cement costs real explanatory power. The p-value of Fly_Ash has also become extremely high (0.889). Let's check whether removing a different variable would lower the VIFs instead; the next model keeps Cement and drops Water.
OLS Regression Results
==============================================================================
Dep. Variable: Strength R-squared: 0.600
Model: OLS Adj. R-squared: 0.596
Method: Least Squares F-statistic: 133.7
Date: Thu, 21 Oct 2021 Prob (F-statistic): 2.75e-136
Time: 04:52:11 Log-Likelihood: -2707.6
No. Observations: 721 AIC: 5433.
Df Residuals: 712 BIC: 5474.
Df Model: 8
Covariance Type: nonrobust
======================================================================================
                           coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------
Intercept              -92.2626     15.809     -5.836      0.000   -123.301    -61.224
Cement                   0.1262      0.012     10.747      0.000      0.103      0.149
Blast_Furnace_Slag       0.1153      0.009     12.985      0.000      0.098      0.133
Fly_Ash                  0.1104      0.012      9.316      0.000      0.087      0.134
Superplasticizer         0.4698      0.092      5.099      0.000      0.289      0.651
Coarse_Aggregate         0.0399      0.007      5.761      0.000      0.026      0.054
Fine_Aggregate           0.0447      0.008      5.721      0.000      0.029      0.060
Age                      0.1069      0.006     17.053      0.000      0.095      0.119
Water_Cement_Ratio      -4.1994      3.154     -1.331      0.184    -10.392      1.993
==============================================================================
Omnibus: 1.661 Durbin-Watson: 2.023
Prob(Omnibus): 0.436 Jarque-Bera (JB): 1.719
Skew: -0.114 Prob(JB): 0.423
Kurtosis: 2.925 Cond. No. 5.26e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.26e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Cement 10.229122
Blast_Furnace_Slag 3.958604
Fly_Ash 3.771598
Superplasticizer 2.091367
Coarse_Aggregate 1.912545
Fine_Aggregate 2.642846
Age 1.100498
Water_Cement_Ratio 6.414560
dtype: float64
For model 3 (Water removed), R^2 is 0.600, up from model 2 (0.560) and only 0.003 below model 1 (0.603). The p-values are the lowest overall so far: only Water_Cement_Ratio (0.184) is above 0.05. However, the VIF for Cement is again high, at 10.2. So far I would choose model 3 over models 1 and 2. What if we take out both Cement and Water?
OLS Regression Results
==============================================================================
Dep. Variable: Strength R-squared: 0.535
Model: OLS Adj. R-squared: 0.531
Method: Least Squares F-statistic: 117.4
Date: Thu, 21 Oct 2021 Prob (F-statistic): 3.05e-114
Time: 04:52:11 Log-Likelihood: -2761.8
No. Observations: 721 AIC: 5540.
Df Residuals: 713 BIC: 5576.
Df Model: 7
Covariance Type: nonrobust
======================================================================================
                           coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------
Intercept               52.8025      8.867      5.955      0.000     35.393     70.212
Blast_Furnace_Slag       0.0637      0.008      7.913      0.000      0.048      0.079
Fly_Ash                  0.0329      0.010      3.243      0.001      0.013      0.053
Superplasticizer         0.6818      0.097      7.033      0.000      0.491      0.872
Coarse_Aggregate        -0.0027      0.006     -0.435      0.664     -0.015      0.009
Fine_Aggregate          -0.0081      0.007     -1.244      0.214     -0.021      0.005
Age                      0.0981      0.007     14.642      0.000      0.085      0.111
Water_Cement_Ratio     -31.1876      2.056    -15.167      0.000    -35.225    -27.150
==============================================================================
Omnibus: 2.033 Durbin-Watson: 2.019
Prob(Omnibus): 0.362 Jarque-Bera (JB): 1.989
Skew: 0.129 Prob(JB): 0.370
Kurtosis: 3.001 Cond. No. 2.65e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.65e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Blast_Furnace_Slag 2.799356
Fly_Ash 2.372857
Superplasticizer 1.995456
Coarse_Aggregate 1.287210
Fine_Aggregate 1.596583
Age 1.081394
Water_Cement_Ratio 2.349018
dtype: float64
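This model (model 4, without Cement and Water) has the lowest R^2 so far, 0.535. Model 3 also looks more significant than model 4, both in its overall F-test (Prob (F-statistic) of 2.75e-136 versus 3.05e-114) and in its individual p-values (model 4 has Coarse_Aggregate at 0.664 and Fine_Aggregate at 0.214). We should therefore probably keep Cement, and model 3 is still the best so far. Going back to model 1 (all variables present), what if we remove Blast_Furnace_Slag, since it has the second-highest VIF?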
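Because the models compared so far have different numbers of predictors, it can help to look at adjusted R^2 and AIC alongside plain R^2. The sketch below recomputes both from numbers printed in the summaries above, using the standard formulas: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), and AIC = 2k - 2 log L with k counting the predictors plus the intercept. The helper names `adjusted_r2` and `aic` are illustrative and are not taken from the notebook.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors regressors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic(log_likelihood, n_predictors):
    """AIC = 2k - 2*logL, where k counts the intercept plus the regressors."""
    return 2 * (n_predictors + 1) - 2 * log_likelihood


N_OBS = 721  # "No. Observations" in every summary

# model number -> (R-squared, Log-Likelihood, Df Model), copied from the summaries
models = {
    1: (0.603, -2705.3, 9),
    2: (0.560, -2742.5, 8),
    3: (0.600, -2707.6, 8),
    4: (0.535, -2761.8, 7),
}

for number, (r2, loglik, p) in models.items():
    print(
        f"model {number}: adj. R^2 = {adjusted_r2(r2, N_OBS, p):.3f}, "
        f"AIC = {aic(loglik, p):.1f}"
    )
```

Since the inputs are already rounded to three decimals, the recomputed adjusted R^2 can differ from the printed one in the last digit. Lower AIC is better, so models 1 and 3 remain well ahead of models 2 and 4 on this criterion too.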
OLS Regression Results
==============================================================================
Dep. Variable: Strength R-squared: 0.567
Model: OLS Adj. R-squared: 0.562
Method: Least Squares F-statistic: 116.5
Date: Thu, 21 Oct 2021 Prob (F-statistic): 5.77e-124
Time: 04:52:11 Log-Likelihood: -2736.5
No. Observations: 721 AIC: 5491.
Df Residuals: 712 BIC: 5532.
Df Model: 8
Covariance Type: nonrobust
======================================================================================
                           coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------
Intercept              189.3912     18.302     10.348      0.000    153.458    225.324
Cement                   0.0463      0.010      4.506      0.000      0.026      0.067
Water                   -0.3862      0.038    -10.038      0.000     -0.462     -0.311
Fly_Ash                 -0.0127      0.008     -1.585      0.113     -0.028      0.003
Superplasticizer         0.3000      0.113      2.643      0.008      0.077      0.523
Coarse_Aggregate        -0.0519      0.007     -7.160      0.000     -0.066     -0.038
Fine_Aggregate          -0.0664      0.007     -9.562      0.000     -0.080     -0.053
Age                      0.1044      0.007     15.969      0.000      0.092      0.117
Water_Cement_Ratio      -0.8828      3.338     -0.264      0.791     -7.436      5.671
==============================================================================
Omnibus: 1.762 Durbin-Watson: 2.034
Prob(Omnibus): 0.414 Jarque-Bera (JB): 1.837
Skew: -0.101 Prob(JB): 0.399
Kurtosis: 2.856 Cond. No. 5.86e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.86e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Cement 7.238329
Water 4.019712
Fly_Ash 1.591572
Superplasticizer 2.929362
Coarse_Aggregate 1.931027
Fine_Aggregate 1.927440
Age 1.103862
Water_Cement_Ratio 6.629878
dtype: float64
For model 5 (Blast_Furnace_Slag removed), R^2 is 0.567, and the p-value of Water_Cement_Ratio is quite high at 0.791. No variable has a VIF above 10, unlike in model 3. However, Cement (7.24) and Water_Cement_Ratio (6.63) still have moderately high VIFs. For the last model, try removing Water_Cement_Ratio instead, as it may be largely redundant with (close to collinear with) the Water and Cement columns.
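To illustrate why a ratio column can be close to redundant, here is a small sketch on synthetic data. It assumes the column is defined as Water / Cement, which the notebook does not state, and the value ranges below are invented. A ratio is not an exact linear combination of its parts, but a linear fit on the two parts can still explain most of its variance, and that is what inflates the VIF.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical ranges, chosen only for illustration
cement = rng.uniform(100, 540, size=n)
water = rng.uniform(120, 250, size=n)
ratio = water / cement  # assumed definition of Water_Cement_Ratio

# Regress the ratio on an intercept, water and cement
A = np.column_stack([np.ones(n), water, cement])
beta, *_ = np.linalg.lstsq(A, ratio, rcond=None)
resid = ratio - A @ beta

r2 = 1.0 - (resid @ resid) / np.sum((ratio - ratio.mean()) ** 2)
vif_ratio = 1.0 / (1.0 - r2)
print(f"R^2 of ratio on (water, cement): {r2:.3f}")
print(f"implied VIF of the ratio column: {vif_ratio:.1f}")
```

The exact figures depend entirely on the made-up ranges, so they should not be compared with the VIFs reported for the real data.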
OLS Regression Results
==============================================================================
Dep. Variable: Strength R-squared: 0.602
Model: OLS Adj. R-squared: 0.598
Method: Least Squares F-statistic: 134.8
Date: Thu, 21 Oct 2021 Prob (F-statistic): 4.40e-137
Time: 05:14:04 Log-Likelihood: -2705.7
No. Observations: 721 AIC: 5429.
Df Residuals: 712 BIC: 5471.
Df Model: 8
Covariance Type: nonrobust
======================================================================================
                           coef    std err          t      P>|t|     [0.025     0.975]
--------------------------------------------------------------------------------------
Intercept              -34.8920     32.464     -1.075      0.283    -98.629     28.845
Cement                   0.1215      0.010     11.897      0.000      0.101      0.142
Blast_Furnace_Slag       0.0968      0.012      7.970      0.000      0.073      0.121
Fly_Ash                  0.0910      0.015      6.057      0.000      0.061      0.120
Water                   -0.1163      0.050     -2.341      0.020     -0.214     -0.019
Superplasticizer         0.3489      0.109      3.206      0.001      0.135      0.563
Coarse_Aggregate         0.0213      0.011      1.876      0.061     -0.001      0.044
Fine_Aggregate           0.0229      0.013      1.764      0.078     -0.003      0.048
Age                      0.1086      0.006     17.320      0.000      0.096      0.121
==============================================================================
Omnibus: 2.336 Durbin-Watson: 2.023
Prob(Omnibus): 0.311 Jarque-Bera (JB): 2.409
Skew: -0.132 Prob(JB): 0.300
Kurtosis: 2.898 Cond. No. 1.09e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.09e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Cement 7.771301
Blast_Furnace_Slag 7.436248
Water 7.305806
Fly_Ash 6.088034
Superplasticizer 2.934533
Coarse_Aggregate 5.189088
Fine_Aggregate 7.309983
Age 1.105113
dtype: float64
For model 6 (Water_Cement_Ratio removed), R^2 is 0.602, slightly lower than model 1 (0.603) but higher than model 3 (0.600). Most p-values are below 0.05; the exceptions are the intercept (0.283) and the marginal Coarse_Aggregate (0.061) and Fine_Aggregate (0.078). Although the VIFs are moderately high for the majority of the variables, Cement is no longer above 10 in this model. With that in mind, I will set model 6 (all variables except Water_Cement_Ratio) as my final linear regression model. According to the summary, Superplasticizer (0.349), Cement (0.122), and Age (0.109) have the largest positive coefficients, so predicted strength rises with each of them when the other variables are held fixed. Coefficient sizes depend on each variable's units, so this is not a strict ranking of importance. Water, on the other hand, has a negative coefficient (-0.116), meaning that the more water there is in the mix, the lower the predicted strength of the concrete.
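As a worked example of reading these coefficients, the sketch below plugs the model 6 estimates into the fitted linear equation and compares two mixes that differ only in Water. The coefficients are copied from the summary above. The mix values are hypothetical and chosen only to make the arithmetic concrete, and `predict` is an illustrative helper rather than anything from the notebook.

```python
# Coefficients copied from the model 6 summary
COEF = {
    "Intercept": -34.8920,
    "Cement": 0.1215,
    "Blast_Furnace_Slag": 0.0968,
    "Fly_Ash": 0.0910,
    "Water": -0.1163,
    "Superplasticizer": 0.3489,
    "Coarse_Aggregate": 0.0213,
    "Fine_Aggregate": 0.0229,
    "Age": 0.1086,
}


def predict(mix):
    """Predicted Strength: intercept plus coefficient * value for each variable."""
    return COEF["Intercept"] + sum(COEF[name] * value for name, value in mix.items())


# Hypothetical mix, in whatever units the dataset columns use
base = {
    "Cement": 300.0,
    "Blast_Furnace_Slag": 100.0,
    "Fly_Ash": 50.0,
    "Water": 180.0,
    "Superplasticizer": 6.0,
    "Coarse_Aggregate": 950.0,
    "Fine_Aggregate": 780.0,
    "Age": 28.0,
}
wetter = dict(base, Water=base["Water"] + 10.0)  # same mix, 10 more units of Water

change = predict(wetter) - predict(base)
print(f"change in predicted Strength: {change:.3f}")  # 10 * (-0.1163) = -1.163
```

Because the model is linear, the change depends only on the Water coefficient and not on the rest of the mix: 10 extra units of Water lower the prediction by about 1.16.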
The OSR^2 on the test set is 0.633, higher than the training R^2 of 0.602. This is a good sign that the model is not overfitting.
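The notebook does not show how OSR^2 was computed. A common definition, assumed here, is 1 - SSE/SST on the test set, where SST is measured against the mean of the training targets rather than the test mean. A minimal sketch with made-up numbers (not values from the concrete dataset):

```python
def osr2(y_train, y_test, y_pred):
    """Out-of-sample R^2: 1 - SSE/SST, with the training mean as the baseline."""
    baseline = sum(y_train) / len(y_train)
    sse = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    sst = sum((y - baseline) ** 2 for y in y_test)
    return 1.0 - sse / sst


# Made-up values, for illustration only
y_train = [30.0, 42.0, 25.0, 51.0]  # training targets (mean = 37)
y_test = [28.0, 47.0, 35.0]         # held-out targets
y_pred = [30.0, 44.0, 36.0]         # model predictions for the held-out rows

# SSE = 4 + 9 + 1 = 14, SST = 81 + 100 + 4 = 185
print(f"OSR^2 = {osr2(y_train, y_test, y_pred):.3f}")
```

Unlike in-sample R^2, this quantity can be negative when the model predicts the test set worse than the training mean would.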