Data Analysis Exercise #1
With all variables included (model #1), R^2 is 0.603. The highest p-value in this model is 0.372, for Water_Cement_Ratio.
Variables with very high VIFs are candidates for removal; in this case, Cement has the highest VIF, at about 13.9.
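For reference, the VIF of a predictor is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the other predictors. The sketch below computes it with plain NumPy on synthetic data. The function name `vif` and the toy columns are illustrative assumptions, not the code or data behind the numbers reported here.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (shape: n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        out.append(1.0 / (1.0 - r2))
    return out

# Synthetic data for illustration only (not the concrete dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = rng.normal(size=200)                  # unrelated to the other columns
d = a + b + 0.1 * rng.normal(size=200)    # nearly a linear combination of a and b
vifs = vif(np.column_stack([a, b, c, d]))
```

In this toy setup the nearly redundant column `d` gets a VIF well above 10, while the unrelated column `c` stays close to 1.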
After removing Cement, the VIFs have dropped below 5 for all variables except Water. R^2 has fallen to 0.560, which suggests that removing Cement may not be a good idea. The p-value of Fly_Ash has also become extremely high. Let's check whether removing other variables would lower the VIFs instead.
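Because each reduced model is nested inside the full one, in-sample R^2 can only stay the same or fall when a column is dropped, so the size of the fall (0.603 to 0.560 here) is what matters. A minimal sketch of that comparison, using NumPy and synthetic data; `r_squared`, `drop_one_r2`, and the column names are hypothetical helpers for illustration, not the code used for this exercise.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on X with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())

def drop_one_r2(X, y, names):
    """Refit with each column left out in turn and report the resulting R^2."""
    X = np.asarray(X, dtype=float)
    return {name: r_squared(np.delete(X, j, axis=1), y)
            for j, name in enumerate(names)}

# Synthetic data for illustration only: x0 matters a lot, x1 a little, x2 not at all.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=300)

full = r_squared(X, y)
dropped = drop_one_r2(X, y, ["x0", "x1", "x2"])
```

Dropping the important column `x0` costs far more R^2 than dropping the irrelevant `x2`, which is the same kind of signal as the fall seen after removing Cement.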
R^2 (0.600) is higher than in model #2 and only 0.003 lower than in model #1. The p-values appear to be the lowest overall so far. However, the VIF for Cement is again extremely high. I would still choose model 3 over models 1 and 2. What if we take out both Cement and Water?
The R^2 of this model is the lowest so far. Since the p-values in model #3 are much more significant than those in #4, we should probably keep Cement; model 3 therefore remains the best so far. Going back to model #1 (all variables present), what if we remove Blast_Furnace_Slag, which has the second-highest VIF?
R^2 is 0.567, and the p-value of Water_Cement_Ratio is quite high at 0.791. Unlike in model 3, no variable has a VIF above 10. However, Cement and Water_Cement_Ratio still have moderately high VIFs. For the last model, try removing Water_Cement_Ratio, since it may be strongly collinear with the Water and Cement columns.
R^2 is 0.602, lower than model 1 but higher than model 3. The p-values all seem fairly low and significant, and although the VIFs are moderately high for the majority of the variables, the VIF for Cement is no longer above 10 in this model. With that in mind, I will take model 6 (all variables except Water_Cement_Ratio) as my final linear regression model. According to the summary, Superplasticizer, Cement, and Age have the strongest positive relationship with the strength of the concrete. On the other hand, Water has a negative relationship with the dependent variable: holding the other variables fixed, the more water there is, the lower the strength of the concrete.
The OSR^2 on the test set is 0.633, higher than the training R^2 of 0.602, which is a good indicator that the model is unlikely to be overfitting.
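As a reminder of what OSR^2 measures, here is a minimal sketch assuming the common definition OSR^2 = 1 - SSE / SST, where SSE is the squared prediction error on the test set and SST is the squared error of a baseline that always predicts the training-set mean. The function name `osr2` and the toy numbers are illustrative assumptions, not values from this exercise.

```python
def osr2(y_train, y_test, y_pred):
    """Out-of-sample R^2: 1 - SSE / SST, with SST measured against the training mean."""
    baseline = sum(y_train) / len(y_train)
    sse = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    sst = sum((y - baseline) ** 2 for y in y_test)
    return 1.0 - sse / sst

# Toy numbers for illustration only.
y_train = [10.0, 20.0, 30.0]   # training mean is 20
y_test = [10.0, 30.0]

perfect = osr2(y_train, y_test, [10.0, 30.0])        # predictions match the test values
baseline_only = osr2(y_train, y_test, [20.0, 20.0])  # always predict the training mean
```

Under this definition, perfect predictions give an OSR^2 of 1 and predicting the training mean gives 0, so 0.633 means the model removes roughly 63% of the baseline's squared error on unseen data.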