Apartment Prices in Bogotá (Price - Size, Location, and Neighborhood)
In this project I build a model to predict apartment prices in Bogotá from the size, location, and neighborhood of each apartment.
First I am going to explore the data.
Import the libraries
Successfully installed category_encoders-2.6.0 patsy-0.5.3 statsmodels-0.13.5
Read the data.
The data was obtained from the Properati website.
Only a subset of the listings is used: the property must be an apartment with a covered surface greater than 0, located in Bogotá, priced in COP, with operation type Venta (sale) and a nonzero price. To remove outliers, only the properties between the 0.1 and 0.9 quantiles are kept. Duplicates are dropped.
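The filtering rules above can be sketched as a small pandas function. This is only an illustration: the column names `property_type`, `city`, `currency` and `operation_type`, the values `"Apartamento"` and `"Bogotá D.C"`, and the choice to trim quantiles on `surface_covered` are assumptions, since the text does not state them. The toy listings are made up.

```python
import pandas as pd


def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the listings described in the text (column names are hypothetical)."""
    mask = (
        (df["property_type"] == "Apartamento")
        & (df["surface_covered"] > 0)
        & (df["city"] == "Bogotá D.C")
        & (df["currency"] == "COP")
        & (df["operation_type"] == "Venta")
        & (df["price"] != 0)
    )
    df = df[mask]

    # Assumption: the 0.1-0.9 quantile trim is applied to surface_covered.
    low, high = df["surface_covered"].quantile([0.1, 0.9])
    df = df[df["surface_covered"].between(low, high)]

    return df.drop_duplicates().reset_index(drop=True)


def listing(surface, **overrides):
    """Build one made-up listing; keyword arguments override the defaults."""
    row = {
        "property_type": "Apartamento",
        "city": "Bogotá D.C",
        "currency": "COP",
        "operation_type": "Venta",
        "surface_covered": float(surface),
        "price": surface * 5_000_000.0,
    }
    row.update(overrides)
    return row


raw = pd.DataFrame(
    [listing(s) for s in range(10, 120, 10)]  # valid apartments, 10 to 110 m2
    + [listing(50)]                           # exact duplicate
    + [
        listing(60, property_type="Casa"),
        listing(60, currency="USD"),
        listing(60, price=0.0),
        listing(60, operation_type="Arriendo"),
        listing(60, city="Medellín"),
    ]
)

clean = wrangle(raw)
print(clean[["surface_covered", "price"]])
```

Applying the quantile trim after the other filters means the cut-offs are computed only from the apartments that are actually candidates for the model.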
We explore the information of the dataframe.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4032 entries, 0 to 4031
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   lat              3838 non-null   float64
 1   lon              3838 non-null   float64
 2   neighborhood     3681 non-null   object
 3   price            4032 non-null   float64
 4   surface_covered  4032 non-null   float64
dtypes: float64(4), object(1)
memory usage: 157.6+ KB
We draw a scatter mapbox plot to see where the apartments are located. The map shows that prices are higher in some zones.
We calculate the correlation matrix and then draw a heatmap to see which variables are correlated.
There is no strong correlation between the variables.
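A minimal sketch of the correlation step, assuming the matrix is computed over all numeric columns with pandas' default Pearson method. The rows below are made up and do not reproduce the real correlations.

```python
import pandas as pd

# Made-up rows with the same columns as the dataframe described above.
df = pd.DataFrame(
    {
        "lat": [4.60, 4.65, 4.70, 4.62, 4.68],
        "lon": [-74.08, -74.05, -74.03, -74.10, -74.06],
        "neighborhood": ["Chapinero", "Usaquén", "Suba", "Chapinero", "Suba"],
        "price": [300e6, 450e6, 380e6, 520e6, 410e6],
        "surface_covered": [55.0, 80.0, 62.0, 95.0, 70.0],
    }
)

# Only numeric columns can be correlated, so neighborhood is left out.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

The resulting `corr` dataframe is what a heatmap function such as `seaborn.heatmap` would take as input.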
First the data is separated into the features X (surface_covered, lat and lon) and the target y (price).
Then X and y are split into training sets (X train, y train) and test sets (X test, y test).
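The split can be sketched with scikit-learn's `train_test_split`. The 80/20 ratio and `random_state=42` are assumptions, as the text does not give them. The text lists surface_covered, lat and lon as features; `neighborhood` is added here on the assumption that the one-hot encoder in the pipeline needs a categorical column to work on. The rows are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up apartments.
df = pd.DataFrame(
    {
        "surface_covered": [50.0 + 5 * i for i in range(10)],
        "lat": [4.60 + 0.01 * i for i in range(10)],
        "lon": [-74.10 + 0.01 * i for i in range(10)],
        "neighborhood": ["Chapinero", "Usaquén"] * 5,
        "price": [250e6 + 20e6 * i for i in range(10)],
    }
)

target = "price"
features = ["surface_covered", "lat", "lon", "neighborhood"]  # neighborhood is an assumption
X = df[features]
y = df[target]

# Assumed split: 80% training, 20% test, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```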
Build the model
First the baseline is calculated.
Now the mean absolute error (MAE) of the baseline is calculated.
Baseline MAE: 329417994.38
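A baseline of this kind can be sketched as follows, assuming it is the usual naive model that predicts the mean training price for every apartment (the text does not say which baseline was used). The prices are made up, so the number printed here differs from the one reported above.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Made-up training prices in COP.
y_train = pd.Series([300e6, 450e6, 250e6, 600e6])

# Assumed baseline: always predict the mean of the training prices.
y_mean = y_train.mean()
y_pred_baseline = [y_mean] * len(y_train)

mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print(f"Mean apartment price: {y_mean:.2f}")
print(f"Baseline MAE: {mae_baseline:.2f}")
```

Any model worth keeping should have a lower MAE than this number on the same data.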
We make a pipeline with a OneHotEncoder, a SimpleImputer and a Ridge regressor.
We fit the model.
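A pipeline with these three steps can be sketched as below. The install log mentions category_encoders, so the original encoder probably comes from that package; this sketch swaps in scikit-learn's own `OneHotEncoder` wrapped in a `ColumnTransformer`, which is an assumption and not necessarily the author's setup. The training rows are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with one missing location, as in the real dataframe.
X_train = pd.DataFrame(
    {
        "surface_covered": [50.0, 80.0, 65.0, 120.0, 95.0, 70.0],
        "lat": [4.60, 4.65, np.nan, 4.70, 4.68, 4.62],
        "lon": [-74.08, -74.05, np.nan, -74.03, -74.04, -74.07],
        "neighborhood": ["Chapinero", "Usaquén", "Chapinero", "Usaquén", "Suba", "Suba"],
    }
)
y_train = pd.Series([280e6, 450e6, 340e6, 650e6, 480e6, 360e6])

# Step 1: one-hot encode neighborhood and pass the numeric columns through.
# sparse_threshold=0 forces a dense array so the next steps see plain NumPy data.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"])],
    remainder="passthrough",
    sparse_threshold=0,
)

# Step 2 fills missing values with the column mean; step 3 is the Ridge regressor.
model = make_pipeline(encoder, SimpleImputer(), Ridge())
model.fit(X_train, y_train)

print(model.named_steps["ridge"].coef_.shape)
```

Setting `handle_unknown="ignore"` lets the fitted model accept a neighborhood it never saw during training instead of raising an error.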
First we predict with X train.
The mean absolute error on the training data is calculated.
Training MAE: 145976773.11
The model beats the baseline by about 183,441,221 COP of mean absolute error on the training data.
Finally, we evaluate the model with the test data.
The mean absolute error on the test data is calculated.
Test MAE: 155533714.88
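The three reported errors can be compared with a few lines of arithmetic. The sketch below uses only the numbers printed above; the helper name `improvement` is hypothetical.

```python
# MAE values reported in this notebook, in COP.
baseline_mae = 329_417_994.38
training_mae = 145_976_773.11
test_mae = 155_533_714.88


def improvement(baseline, model_mae):
    """Return the absolute and relative reduction in MAE versus the baseline."""
    absolute = baseline - model_mae
    relative = absolute / baseline
    return absolute, relative


for name, mae in [("Training", training_mae), ("Test", test_mae)]:
    absolute, relative = improvement(baseline_mae, mae)
    print(f"{name}: {absolute:,.2f} COP below the baseline ({relative:.1%})")

# Gap between test and training error: a rough check for overfitting.
gap = test_mae - training_mae
print(f"Test MAE exceeds training MAE by {gap:,.2f} COP")
```

The test error is only slightly higher than the training error, which suggests the model generalizes about as well as it fits the training data.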
Last, the results are communicated. We make a function that returns a price prediction from the data the user provides.
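Such a function might look like the sketch below. The name `make_prediction`, its signature and the message format are assumptions. To keep the example self-contained, a `DummyRegressor` stands in for the fitted pipeline; any fitted estimator with a `predict` method that accepts the same columns could be passed instead.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor


def make_prediction(model, surface_covered, lat, lon, neighborhood):
    """Predict the price of one apartment with an already fitted model."""
    data = {
        "surface_covered": surface_covered,
        "lat": lat,
        "lon": lon,
        "neighborhood": neighborhood,
    }
    X_new = pd.DataFrame(data, index=[0])  # one-row dataframe
    prediction = model.predict(X_new)[0]
    return f"Predicted apartment price: COP {prediction:,.2f}"


# Stand-in model fitted on two made-up apartments; it always predicts the mean price.
X_toy = pd.DataFrame(
    {
        "surface_covered": [60.0, 90.0],
        "lat": [4.62, 4.70],
        "lon": [-74.07, -74.04],
        "neighborhood": ["Chapinero", "Usaquén"],
    }
)
y_toy = [300e6, 500e6]
stand_in = DummyRegressor(strategy="mean").fit(X_toy, y_toy)

message = make_prediction(stand_in, 75.0, 4.65, -74.05, "Chapinero")
print(message)
```

Passing the model as an argument keeps the function independent of any global variable, so it can be reused with a retrained pipeline.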