Basics of Machine Learning
Julián Cárdenas
In this example we build a machine learning model that predicts house prices from a few features ('SqFt', 'Bedrooms', 'Bathrooms', 'Offers').
Basic Data Exploration
   Home   Price
0     1  114300
1     2  114200
2     3  114800
3     4   94700
4     5  119800
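A minimal sketch of the loading-and-peek step. In the notebook the frame would come from a CSV (the file name house_prices.csv is an assumption); here a small in-memory stand-in holds the five rows shown above:

```python
import pandas as pd

# In the real notebook:
#   df = pd.read_csv("house_prices.csv")   # file name is an assumption
# In-memory stand-in with the first five rows from the table above:
df = pd.DataFrame({
    "Home":  [1, 2, 3, 4, 5],
    "Price": [114300, 114200, 114800, 94700, 119800],
})

# Peek at the first rows, as in the table above.
print(df.head())
```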
Neighborhood   count
East             45
North            44
West             39
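The per-neighborhood counts above can be produced with a value count; a sketch using a hypothetical stand-in frame with the same distribution:

```python
import pandas as pd

# Stand-in frame reproducing the neighborhood distribution above
# (45 East, 44 North, 39 West -- 128 homes in total).
df = pd.DataFrame({"Neighborhood": ["East"] * 45 + ["North"] * 44 + ["West"] * 39})

# How many homes fall in each neighborhood.
counts = df["Neighborhood"].value_counts()
print(counts)
```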
             Home          Price
count  128.000000     128.000000
mean    64.500000  130427.343750
std     37.094474   26868.770371
min      1.000000   69100.000000
25%     32.750000  111325.000000
50%     64.500000  125950.000000
75%     96.250000  148250.000000
max    128.000000  211200.000000
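Summary tables like the one above come from pandas' describe method; a sketch on a small stand-in frame (on the full 128-row data set the same call yields the numbers above):

```python
import pandas as pd

# Small stand-in; on the full data set df.describe() gives the
# count/mean/std/quartile table shown above.
df = pd.DataFrame({
    "Home":  [1, 2, 3, 4, 5],
    "Price": [114300, 114200, 114800, 94700, 119800],
})

# Count, mean, std, min, quartiles, max for each numeric column.
summary = df.describe()
print(summary)
```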
First Machine Learning Model
              SqFt    Bedrooms
count   128.000000  128.000000
mean   2000.937500    3.023438
std     211.572431    0.725951
min    1450.000000    2.000000
25%    1880.000000    3.000000
50%    2000.000000    3.000000
75%    2140.000000    3.000000
max    2590.000000    5.000000
   SqFt  Bedrooms
0  1790         2
1  2030         4
2  1740         3
3  1980         3
4  2130         3
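Before fitting a model, the notebook separates the prediction target from the features named in the introduction. A sketch using the five houses shown above as stand-in data:

```python
import pandas as pd

# Stand-in rows copied from the five houses shown in this section.
df = pd.DataFrame({
    "Price":     [114300, 114200, 114800, 94700, 119800],
    "SqFt":      [1790, 2030, 1740, 1980, 2130],
    "Bedrooms":  [2, 4, 3, 3, 3],
    "Bathrooms": [2, 2, 2, 2, 3],
    "Offers":    [2, 3, 1, 3, 3],
})

# Target and features, following the columns named in the introduction.
y = df["Price"]
features = ["SqFt", "Bedrooms", "Bathrooms", "Offers"]
X = df[features]
print(X.head())
```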
Making predictions for the following 5 houses:
SqFt Bedrooms Bathrooms Offers
0 1790 2 2 2
1 2030 4 2 3
2 1740 3 2 1
3 1980 3 2 3
4 2130 3 3 3
The predictions are:
[114300. 114200. 114800. 94700. 119800.]
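The predictions reproduce the first five observed prices exactly, which is what an unconstrained decision tree does when scored on its own training rows. A sketch of the fit-and-predict step with scikit-learn (the exact model class, DecisionTreeRegressor, is an assumption based on this behavior):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# The five houses from the table above, used here as both training
# and prediction data (an "in-sample" setup).
X = pd.DataFrame({
    "SqFt":      [1790, 2030, 1740, 1980, 2130],
    "Bedrooms":  [2, 4, 3, 3, 3],
    "Bathrooms": [2, 2, 2, 2, 3],
    "Offers":    [2, 3, 1, 3, 3],
})
y = [114300, 114200, 114800, 94700, 119800]

model = DecisionTreeRegressor(random_state=1)
model.fit(X, y)

# Predicting on the training rows reproduces the prices exactly,
# just like the output above.
print(model.predict(X))
```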
Train test split
The Problem with "In-Sample" Scores
The measure we just computed is called an "in-sample" score: we used a single "sample" of houses both to build the model and to evaluate it. Here is why that is a problem.
Imagine that, in the large real estate market, door color is unrelated to home price.
However, in the sample of data you used to build the model, all homes with green doors were very expensive. The model's job is to find patterns that predict home prices, so it will see this pattern, and it will always predict high prices for homes with green doors.
Since this pattern was derived from the training data, the model will appear accurate in the training data.
But if this pattern doesn't hold when the model sees new data, the model would be very inaccurate when used in practice.
Explanation from: https://www.kaggle.com/code/dansbecker/model-validation
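To get an out-of-sample score, the data is split into training and validation sets before fitting. A sketch of the split-and-score pattern with scikit-learn, using synthetic stand-in data (the real notebook would use the loaded house data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the 128-house frame; the split-and-score
# pattern, not the numbers, is what this sketch illustrates.
rng = np.random.default_rng(0)
n = 128
X = pd.DataFrame({
    "SqFt":      rng.integers(1450, 2590, n),
    "Bedrooms":  rng.integers(2, 6, n),
    "Bathrooms": rng.integers(2, 4, n),
    "Offers":    rng.integers(1, 6, n),
})
y = X["SqFt"] * 60 + rng.normal(0, 20000, n)

# Hold out a validation set so the score is "out-of-sample".
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on homes the model has never seen.
val_predictions = model.predict(val_X)
mae = mean_absolute_error(val_y, val_predictions)
print(mae)
```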
Error on the validation data: 20118.75