Apartment Prices in Bogotá (Price - Size, Location, and Neighborhood)

In this project I build a model to predict apartment prices in Bogotá from the size, the location and the neighborhood of the apartment.

First I am going to explore the data.

Import the libraries

import sys
!{sys.executable} -m pip install category_encoders | grep -v 'Requirement already satisfied'
from category_encoders import OneHotEncoder

import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import train_test_split

Read the data.

The data was obtained from the Properati web page.


df = pd.read_csv("/work/co_properties.csv")

We keep only the properties that are apartments located in Bogotá, with a covered surface greater than 0, priced in COP, with operation type "Venta" (sale) and a price different from zero. To remove outliers, we keep only the apartments whose covered surface lies between the 0.1 and 0.9 quantiles. Duplicates are dropped.

dfs = df[df["surface_covered"] > 0]
dfs = dfs[dfs["l3"] == "Bogotá D.C"]
apartamento = ["Apartamento"]
dfs = dfs[dfs["property_type"].isin(apartamento)]
dfs = dfs[dfs["currency"] == "COP"]
dfs = dfs[dfs["operation_type"] == "Venta"]
dfs = dfs[["lat", "lon", "l5", "price", "surface_covered"]]
dfs = dfs.rename(index=str, columns={"l5": "neighborhood"})
dfs = dfs[dfs["price"] != 0]
q10 = dfs["surface_covered"].quantile(0.1)
q90 = dfs["surface_covered"].quantile(0.9)
mask = (dfs["surface_covered"] < q90) & (dfs["surface_covered"] > q10)
dfs = dfs[mask]
dfs.drop_duplicates(inplace=True)
#dfs.dropna(inplace=True)
dfs.reset_index(drop=True, inplace=True)
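To make the quantile trimming concrete, here is a minimal sketch on a small, hypothetical series of covered surfaces (the values are invented for illustration and are not taken from the Properati file). It uses the same strict inequalities as the filter above, so values exactly equal to a cut-off are dropped.

```python
import pandas as pd

# Hypothetical covered surfaces in square meters (illustration only).
surface = pd.Series(
    [20, 35, 40, 45, 50, 55, 60, 65, 70, 400], name="surface_covered"
)

# Cut-offs at the 0.1 and 0.9 quantiles (pandas interpolates linearly by default).
q10 = surface.quantile(0.1)
q90 = surface.quantile(0.9)

# Keep only the values strictly between the two cut-offs.
trimmed = surface[(surface > q10) & (surface < q90)]

print("q10:", q10, "q90:", q90)
print(trimmed.tolist())
```

In this toy series the very small and the very large apartment fall outside the cut-offs, so the remaining values are less affected by extreme sizes.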

Explore

We explore the information of the dataframe.

dfs.info()

dfs.select_dtypes("object").nunique()

We draw a scatter mapbox plot to see where the apartments are located. The map shows that there are zones where prices are higher.

fig = px.scatter_mapbox(
    dfs,
    lat="lat",
    lon="lon",
    width=500,
    height=500,
    color="price",
    hover_data=["price"]
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()

We calculate the correlation matrix of the numeric features (excluding the target, price), and then we make a heatmap to see which variables are correlated.

corr=dfs.select_dtypes("number").drop(columns="price").corr()

sns.heatmap(corr)

There is no strong correlation between the variables.

Model

First the data is separated into the features X (surface covered, lat, lon and neighborhood) and the target y (price).

features = ["surface_covered", "lat", "lon", "neighborhood"]
X = dfs[features]
X.shape

target = "price"
y = dfs[target]
y.shape

Then X and y are split into training and test sets (X train, y train, X test and y test), with 20% of the data held out for testing.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Build the model

First the baseline is calculated: a naive model that always predicts the mean price of the training set.

y_mean=y_train.mean()

y_pred_baseline=[y_mean]*len(y_train)

Now the mean absolute error of the baseline is calculated.

# scikit-learn expects the true values first, then the predictions
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Baseline MAE:", round(mae_baseline, 2))
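As a worked example of what the baseline MAE measures, the sketch below computes it by hand on four hypothetical prices (invented values, in millions of COP). The baseline predicts the mean for every apartment, so the MAE is simply the average absolute distance of each price from that mean.

```python
# Hypothetical training prices in millions of COP (illustration only).
y_train_example = [200.0, 300.0, 400.0, 500.0]

# The baseline prediction is the mean of the training prices.
y_mean_example = sum(y_train_example) / len(y_train_example)

# Mean absolute error: average of |true price - predicted price|.
absolute_errors = [abs(price - y_mean_example) for price in y_train_example]
mae_example = sum(absolute_errors) / len(absolute_errors)

print("Mean price:", y_mean_example)
print("Baseline MAE:", mae_example)
```

Here the mean is 350 and the absolute errors are 150, 50, 50 and 150, so the baseline MAE is 100. Any useful model should reach a lower MAE than this naive prediction.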

Iterate

We build a pipeline with a one-hot encoder, a simple imputer and a Ridge regressor.

model = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    SimpleImputer(),
    Ridge()
)

We fit the model to the training data.

model.fit(X_train, y_train)
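For readers who do not have category_encoders installed, a comparable pipeline can be sketched with scikit-learn alone. This is an alternative, not the notebook's method: it swaps in scikit-learn's own OneHotEncoder wrapped in a ColumnTransformer, and it runs on hypothetical toy data (the names X_toy, y_toy, toy_model and all values are assumptions made for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not taken from the Properati file).
X_toy = pd.DataFrame({
    "surface_covered": [50.0, 60.0, np.nan, 80.0, 55.0, 75.0],
    "lat": [4.70, 4.65, 4.71, 4.64, 4.74, 4.75],
    "lon": [-74.03, -74.06, -74.04, -74.05, -74.09, -74.08],
    "neighborhood": ["Usaquén", "Chapinero", "Usaquén", "Chapinero", "Suba", "Suba"],
})
y_toy = pd.Series([350e6, 420e6, 380e6, 560e6, 250e6, 330e6])

# One-hot encode only the categorical column; pass numeric columns through.
# sparse_threshold=0 forces a dense array for the following steps.
encoder = ColumnTransformer(
    [("neighborhood", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"])],
    remainder="passthrough",
    sparse_threshold=0,
)

# Encode, fill missing numeric values with the column mean, then fit Ridge.
toy_model = make_pipeline(encoder, SimpleImputer(), Ridge())
toy_model.fit(X_toy, y_toy)

# A neighborhood unseen during training is encoded as all zeros.
new_row = pd.DataFrame({
    "surface_covered": [57.0],
    "lat": [4.69],
    "lon": [-74.03],
    "neighborhood": ["Kennedy"],
})
toy_prediction = toy_model.predict(new_row)
print(toy_prediction)
```

The main design difference is that scikit-learn's encoder has to be told which column is categorical through the ColumnTransformer, whereas the category_encoders encoder used in the notebook can be placed directly in the pipeline and, by default, encodes the string columns it finds.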

Evaluate

First we predict on the training data.

y_pred_training=model.predict(X_train)

The mean absolute error on the training data is calculated.

mae_training = mean_absolute_error(y_train, y_pred_training)
print("Training MAE:", round(mae_training, 2))

mae_baseline-mae_training

We can see that the model beats the baseline: its training MAE is lower by 183441221 (in COP).
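An absolute difference in MAE is easier to interpret next to the baseline it is measured against. The sketch below shows the arithmetic with hypothetical MAE values (they are invented and are not the results of this notebook): the gain is the baseline MAE minus the model MAE, and the relative gain divides that by the baseline MAE.

```python
# Hypothetical MAE values in COP (illustration only, not the notebook's results).
mae_baseline_example = 400_000_000
mae_training_example = 250_000_000

# Absolute improvement of the model over the baseline.
absolute_gain = mae_baseline_example - mae_training_example

# Improvement as a fraction of the baseline error.
relative_gain = absolute_gain / mae_baseline_example

print(f"MAE reduced by {absolute_gain:,} COP ({relative_gain:.1%} of the baseline)")
```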

Finally, we evaluate the model with the test data.

y_pred_test = pd.Series(model.predict(X_test))
y_pred_test.head()

The mean absolute error on the test data is calculated.

mae_test = mean_absolute_error(y_test, y_pred_test)
print("Test MAE:", round(mae_test, 2))

Communicate Results

Finally, the results are communicated. We write a function that makes a prediction from the data the user provides.

def make_prediction(area, lat, lon, neighborhood):
    data = {
        "surface_covered": area,
        "lat": lat,
        "lon": lon,
        "neighborhood": neighborhood
    }
    df = pd.DataFrame(data, index=[0])
    prediction = model.predict(df).round(2)[0]
    return f"Predicted apartment price: ${prediction}"

make_prediction(57, 4.69, -74.03, "Usaquén")
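Prices in COP run to hundreds of millions, so the returned string can be hard to read without digit grouping. A minimal sketch of one way to format such a value follows; the helper name format_price is hypothetical and not part of the notebook, and the input value is invented.

```python
def format_price(value):
    """Format a number with thousands separators and two decimals."""
    # The ',' option groups digits in threes; '.2f' keeps two decimal places.
    return f"${value:,.2f}"


# Hypothetical predicted price in COP (illustration only).
formatted = format_price(1234567.891)
print(formatted)
```

The same format specification could be used inside make_prediction's f-string in place of the plain `{prediction}`.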