Get the data
Many of the histograms are tail-heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some ML algorithms to detect patterns. See https://www.statisticshowto.com/heavy-tailed-distribution/
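A quick way to check for a heavy right tail without a plot is to compare how far the 5th and 95th percentiles sit from the median. The sketch below uses synthetic log-normal data as a stand-in; the data and the variable names are assumptions for illustration, not the housing attributes themselves.

```python
import numpy as np

# Synthetic right-skewed sample (assumption: stands in for a tail-heavy attribute).
rng = np.random.default_rng(42)
values = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

median = np.median(values)
q05, q95 = np.quantile(values, [0.05, 0.95])

# In a right-tailed distribution the upper tail reaches farther from the median.
left_reach = median - q05
right_reach = q95 - median

print(f"median={median:.2f}  left reach={left_reach:.2f}  right reach={right_reach:.2f}")
```

A right reach much larger than the left reach is the numeric counterpart of a histogram that stretches to the right.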
Stratified sampling is important because it keeps the sample representative of the population: each stratum appears in the sample in roughly the same proportion as in the full dataset. Researchers rely on stratified sampling when a population's characteristics are diverse and they want to ensure that every characteristic is properly represented in the sample. This helps with the generalizability and validity of the study and reduces the risk of research biases such as undercoverage bias.
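As a minimal sketch of a stratified split, scikit-learn's `train_test_split` accepts a `stratify` argument. The strata below are randomly generated labels (a hypothetical five-level category), not the notebook's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stratum labels with uneven proportions (assumption for illustration).
strata = rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.05, 0.30, 0.35, 0.20, 0.10])
X = np.arange(1000).reshape(-1, 1)

# stratify=strata keeps each stratum's share the same in both splits.
X_train, X_test, s_train, s_test = train_test_split(
    X, strata, test_size=0.2, stratify=strata, random_state=42
)

def proportions(labels):
    """Return the share of each category as a dict {category: fraction}."""
    cats, counts = np.unique(labels, return_counts=True)
    return dict(zip(cats.tolist(), (counts / len(labels)).round(3).tolist()))

print("full dataset:", proportions(strata))
print("test set:    ", proportions(s_test))
```

The two printed dictionaries should be nearly identical, which is exactly what a purely random split does not guarantee for small strata.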
Data Visualization
The graph above shows that housing prices depend on location and on population density. Districts near the ocean tend to have higher prices.
The correlation coefficient measures only linear correlation. It may completely miss nonlinear relationships.
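A small numeric illustration of this limitation: a perfectly deterministic quadratic relationship can have a correlation coefficient of essentially zero. The data here is synthetic and chosen to be symmetric around zero.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)

y_linear = 3.0 * x + 2.0   # exact linear dependence on x
y_quadratic = x ** 2       # exact, but nonlinear, dependence on x

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is the x-y entry.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

Both `y` series are fully determined by `x`, yet only the linear one shows up in the coefficient.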
In the correlation values above for median house value, median income has a higher positive correlation coefficient than any other attribute.
As you can see, median house value is almost linearly related to median income.
The plot of median house value against median income shows a roughly linear relationship between the two, but there is a price cap near 500K.
Experimenting with attribute combinations
Compared to total bedrooms, the new bedrooms-per-room attribute is more strongly correlated with median house value.
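A sketch of how such a combined attribute could be built and compared with pandas. The column names (`total_rooms`, `total_bedrooms`, `median_house_value`, `bedrooms_per_room`) are assumed, and the rows are made-up toy values, so the printed correlations only demonstrate the mechanics.

```python
import pandas as pd

# Made-up toy rows (assumption); they are not taken from the real dataset.
housing = pd.DataFrame({
    "total_rooms": [1000, 2000, 1500, 3000, 2500],
    "total_bedrooms": [200, 500, 300, 900, 600],
    "median_house_value": [400000, 250000, 350000, 150000, 220000],
})

# New combined attribute: share of rooms that are bedrooms.
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]

# Correlation of every attribute with the target, strongest positive first.
corr = housing.corr()["median_house_value"].sort_values(ascending=False)
print(corr)
```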
Data Cleaning
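Use one-hot encoding for the categorical attribute. With plain integer codes, an ML algorithm will assume that nearby values such as 0 and 1 are more similar than distant values such as 0 and 4, which does not hold for unordered categories.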
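The difference can be seen by measuring distances under each encoding. This is a minimal NumPy sketch with five hypothetical categories coded 0 to 4; in practice a library encoder such as scikit-learn's `OneHotEncoder` would typically do this step.

```python
import numpy as np

# Three samples from five hypothetical categories, stored as integer codes.
codes = np.array([0, 1, 4])

# Integer codes: the gap depends on the arbitrary number assigned to each category.
ordinal_gap_0_1 = abs(codes[0] - codes[1])
ordinal_gap_0_4 = abs(codes[0] - codes[2])

# One-hot: one binary column per category, exactly one 1 per row.
one_hot = np.eye(5, dtype=int)[codes]

# Euclidean distance between one-hot rows is the same for any two distinct categories.
dist_0_1 = np.linalg.norm(one_hot[0] - one_hot[1])
dist_0_4 = np.linalg.norm(one_hot[0] - one_hot[2])

print("integer gaps:     ", ordinal_gap_0_1, ordinal_gap_0_4)
print("one-hot distances:", dist_0_1, dist_0_4)
```

With one-hot vectors no category is artificially closer to another, so the model cannot read an ordering into the codes.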
Feature Scaling
Machine learning algorithms generally don't perform well when the numerical input attributes have very different scales. Min-max scaling and standardization are two common ways to get all attributes onto the same scale.
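Both techniques are short formulas applied per column: min-max scaling maps each attribute to the range 0 to 1, and standardization subtracts the mean and divides by the standard deviation. The sketch below applies them with NumPy to a made-up two-column array; scikit-learn offers `MinMaxScaler` and `StandardScaler` for the same purpose.

```python
import numpy as np

# Two made-up attributes on very different scales (assumption for illustration).
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 5000.0],
    [4.0, 9000.0],
])

# Min-max scaling: (x - min) / (max - min), computed column by column.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: (x - mean) / std, giving zero mean and unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```

Standardization does not bound values to a fixed range, but it is much less affected by outliers than min-max scaling, where a single extreme value squeezes everything else toward zero.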
Model Training
This can't be right. The model has probably overfit the training data. So let's hold out part of the training data as a validation set.
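A minimal sketch of that check, assuming an unconstrained `DecisionTreeRegressor` as the overfitting model and synthetic noisy data (the notes do not name the model, so both are assumptions). The idea is to compare the error on the data the model saw with the error on a held-out slice.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a linear trend plus noise (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 2.0, size=500)

# Hold out 20% of the training data as a validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# An unconstrained tree can memorize the training set.
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

train_rmse = rmse(y_train, model.predict(X_train))
val_rmse = rmse(y_val, model.predict(X_val))

print(f"training RMSE:   {train_rmse:.3f}")
print(f"validation RMSE: {val_rmse:.3f}")
```

A training error far below the validation error is the signature of overfitting; the validation figure is the more honest estimate of how the model will do on new data.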