Basics of Machine Learning
Julián Cárdenas
In this example we build a machine learning model that predicts house prices from a few features ('SqFt', 'Bedrooms', 'Bathrooms', 'Offers').
Basic Data Exploration
   Home   Price
0     1  114300
1     2  114200
2     3  114800
3     4   94700
4     5  119800
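A minimal sketch of the loading-and-peek step. In the notebook the frame would come from a CSV (the file name house_prices.csv is an assumption); here a small in-memory stand-in holds the five rows shown above:

```python
import pandas as pd

# In the real notebook:
#   df = pd.read_csv("house_prices.csv")   # file name is an assumption
# In-memory stand-in with the first five rows from the table above:
df = pd.DataFrame({
    "Home":  [1, 2, 3, 4, 5],
    "Price": [114300, 114200, 114800, 94700, 119800],
})

# Peek at the first rows, as in the table above.
print(df.head())
```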
Neighborhood   count
East             45
North            44
West             39
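The per-neighborhood counts above can be produced with a value count; a sketch using a hypothetical stand-in frame with the same distribution:

```python
import pandas as pd

# Stand-in frame reproducing the neighborhood distribution above
# (45 East, 44 North, 39 West -- 128 homes in total).
df = pd.DataFrame({"Neighborhood": ["East"] * 45 + ["North"] * 44 + ["West"] * 39})

# How many homes fall in each neighborhood.
counts = df["Neighborhood"].value_counts()
print(counts)
```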
             Home          Price
count  128.000000     128.000000
mean    64.500000  130427.343750
std     37.094474   26868.770371
min      1.000000   69100.000000
25%     32.750000  111325.000000
50%     64.500000  125950.000000
75%     96.250000  148250.000000
max    128.000000  211200.000000
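Summary tables like the one above come from pandas' describe method; a sketch on a small stand-in frame (on the full 128-row data set the same call yields the numbers above):

```python
import pandas as pd

# Small stand-in; on the full data set df.describe() gives the
# count/mean/std/quartile table shown above.
df = pd.DataFrame({
    "Home":  [1, 2, 3, 4, 5],
    "Price": [114300, 114200, 114800, 94700, 119800],
})

# Count, mean, std, min, quartiles, max for each numeric column.
summary = df.describe()
print(summary)
```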
First Machine Learning Model
              SqFt    Bedrooms
count   128.000000  128.000000
mean   2000.937500    3.023438
std     211.572431    0.725951
min    1450.000000    2.000000
25%    1880.000000    3.000000
50%    2000.000000    3.000000
75%    2140.000000    3.000000
max    2590.000000    5.000000
   SqFt  Bedrooms
0  1790         2
1  2030         4
2  1740         3
3  1980         3
4  2130         3
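Before fitting a model, the notebook separates the prediction target from the features named in the introduction. A sketch using the five houses shown above as stand-in data:

```python
import pandas as pd

# Stand-in rows copied from the five houses shown in this section.
df = pd.DataFrame({
    "Price":     [114300, 114200, 114800, 94700, 119800],
    "SqFt":      [1790, 2030, 1740, 1980, 2130],
    "Bedrooms":  [2, 4, 3, 3, 3],
    "Bathrooms": [2, 2, 2, 2, 3],
    "Offers":    [2, 3, 1, 3, 3],
})

# Target and features, following the columns named in the introduction.
y = df["Price"]
features = ["SqFt", "Bedrooms", "Bathrooms", "Offers"]
X = df[features]
print(X.head())
```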
Making predictions for the following 5 houses:
SqFt Bedrooms Bathrooms Offers
0 1790 2 2 2
1 2030 4 2 3
2 1740 3 2 1
3 1980 3 2 3
4 2130 3 3 3
The predictions are:
[114300. 114200. 114800. 94700. 119800.]
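The predictions reproduce the first five observed prices exactly, which is what an unconstrained decision tree does when scored on its own training rows. A sketch of the fit-and-predict step with scikit-learn (the exact model class, DecisionTreeRegressor, is an assumption based on this behavior):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The five houses from the table above, used here as both training
# and prediction data (an "in-sample" setup).
X = pd.DataFrame({
    "SqFt":      [1790, 2030, 1740, 1980, 2130],
    "Bedrooms":  [2, 4, 3, 3, 3],
    "Bathrooms": [2, 2, 2, 2, 3],
    "Offers":    [2, 3, 1, 3, 3],
})
y = [114300, 114200, 114800, 94700, 119800]

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

# Predicting on the training rows reproduces the prices exactly,
# just like the output above.
print(model.predict(X))
```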
Train test split
The Problem with "In-Sample" Scores
The measure we just computed is called an "in-sample" score: we used a single "sample" of houses both to build the model and to evaluate it. Here is why that is a problem.
Imagine that, in the large real estate market, door color is unrelated to home price.
However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.
Since this pattern was derived from the training data, the model will appear accurate in the training data.
But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.
Explanation from: https://www.kaggle.com/code/dansbecker/model-validation
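To get an out-of-sample score, the data is split into training and validation sets before fitting. A sketch of the split-and-score pattern with scikit-learn, using synthetic stand-in data (the real notebook would use the loaded house data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the 128-house frame; the split-and-score
# pattern, not the numbers, is what this sketch illustrates.
rng = np.random.default_rng(0)
n = 128
X = pd.DataFrame({
    "SqFt":      rng.integers(1450, 2590, n),
    "Bedrooms":  rng.integers(2, 6, n),
    "Bathrooms": rng.integers(2, 4, n),
    "Offers":    rng.integers(1, 6, n),
})
y = X["SqFt"] * 60 + rng.normal(0, 20000, n)

# Hold out a validation set so the score is "out-of-sample".
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on homes the model has never seen.
val_predictions = model.predict(val_X)
mae = mean_absolute_error(val_y, val_predictions)
print(mae)
```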
Error on the validation data: 20118.75