Basics of Machine Learning
Julián Cárdenas
In this example, we build a machine learning model that predicts the value of a house from four features ('SqFt', 'Bedrooms', 'Bathrooms', 'Offers').
Basic Data Exploration
First Machine Learning Model
Train-Test Split
The Problem with "In-Sample" Scores
The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.
Imagine that, in the large real estate market, door color is unrelated to home price.
However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.
Since this pattern was derived from the training data, the model will appear accurate in the training data.
But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.
Explanation from: https://www.kaggle.com/code/dansbecker/model-validation
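The green-door scenario can be reproduced with a small, self-contained sketch. Everything in it is an assumption made for illustration, not part of the notebook's dataset or model: the prices are synthetic, the "model" is just the mean training price per door color, and the names `make_prices`, `fit`, and `mae` are hypothetical helpers. The training sample is deliberately built so that green doors sit on the most expensive homes, while in the new data door color is unrelated to price.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

def make_prices(n):
    """Synthetic prices driven by size plus noise; door color plays no role."""
    return [
        100 * random.uniform(800, 3000) + random.gauss(0, 20_000)
        for _ in range(n)
    ]

# Training sample: by coincidence, green doors sit on the priciest 20% of homes.
train_prices = sorted(make_prices(200))
cutoff = int(0.8 * len(train_prices))
train = [
    ("green" if i >= cutoff else "other", price)
    for i, price in enumerate(train_prices)
]

# New data: door color is assigned independently of price (every fifth home).
new = [
    ("green" if i % 5 == 0 else "other", price)
    for i, price in enumerate(make_prices(200))
]

def fit(rows):
    """Toy 'model': the mean training price for each door color."""
    by_color = {}
    for color, price in rows:
        by_color.setdefault(color, []).append(price)
    return {color: mean(prices) for color, prices in by_color.items()}

def mae(model, rows):
    """Mean absolute error of the model's predictions on the given rows."""
    return mean(abs(model[color] - price) for color, price in rows)

model = fit(train)
in_sample_mae = mae(model, train)      # evaluated on the data used for fitting
out_of_sample_mae = mae(model, new)    # evaluated on data the model never saw

print(f"In-sample MAE:     {in_sample_mae:,.0f}")
print(f"Out-of-sample MAE: {out_of_sample_mae:,.0f}")
```

The in-sample error looks small because the model is scored on the same coincidence it learned from. The error on new data is larger, which is the reason for holding out validation data that the model does not see during training.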