Apartment Prices in Bogotá (Price - Size, Location, and Neighborhood)
In this project I build a model to predict apartment prices in Bogotá from the size, location, and neighborhood of each apartment.
First, I explore the data.
Import the libraries
Read the data.
The data was obtained from the Properati website.
Only some listings are kept: the property must be an apartment with a covered surface greater than 0, located in Bogotá, priced in COP, with operation type Venta (sale) and a nonzero price. To remove outliers, only the properties that fall between the 0.1 and 0.9 quantiles are kept. Duplicates are dropped.
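The filtering rules above can be sketched as a small pandas function. This is only an illustration under stated assumptions: the column names (`property_type`, `city`, `currency`, `operation_type`), the label `"Apartamento"`, the function name `wrangle`, and the toy data are hypothetical, and the quantile cut is assumed to apply to `surface_covered`, which the text does not specify.

```python
import pandas as pd


def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Filter raw listings down to the apartments used for modeling."""
    # Column names and category labels are assumptions about the raw file.
    mask = (
        (df["property_type"] == "Apartamento")
        & (df["city"] == "Bogotá")
        & (df["currency"] == "COP")
        & (df["operation_type"] == "Venta")
        & (df["price"] != 0)
        & (df["surface_covered"] > 0)
    )
    df = df[mask]
    # Assumption: the 0.1 and 0.9 quantile cut is applied to surface_covered.
    low, high = df["surface_covered"].quantile([0.1, 0.9])
    df = df[df["surface_covered"].between(low, high)]
    # Drop exact duplicate rows and renumber the index.
    return df.drop_duplicates().reset_index(drop=True)


# Toy data: 12 apartments (one listed twice) plus one house.
surfaces = [30, 45, 50, 60, 60, 70, 80, 90, 100, 120, 150, 400]
raw = pd.DataFrame({
    "property_type": ["Apartamento"] * 12 + ["Casa"],
    "city": ["Bogotá"] * 13,
    "currency": ["COP"] * 13,
    "operation_type": ["Venta"] * 13,
    "price": [s * 5_000_000.0 for s in surfaces] + [900_000_000.0],
    "surface_covered": [float(s) for s in surfaces] + [200.0],
})
clean = wrangle(raw)
print(clean.shape)
```

In this toy example the house, the smallest and largest apartments, and the repeated listing are removed, so only mid-sized unique apartments remain.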
Explore
We explore the information of the dataframe.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4032 entries, 0 to 4031
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 lat 3838 non-null float64
1 lon 3838 non-null float64
2 neighborhood 3681 non-null object
3 price 4032 non-null float64
4 surface_covered 4032 non-null float64
dtypes: float64(4), object(1)
memory usage: 157.6+ KB
We draw a scatter mapbox plot to see where the apartments are located. It shows that prices are higher in some zones.
We calculate the correlation matrix and then draw a heatmap to see which variables are correlated.
There is no strong correlation between the variables.
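As a rough sketch of the correlation step, the matrix can be computed with pandas on the numeric columns listed in the dataframe summary. The values below are made up for illustration, and selecting the numeric columns by hand (rather than relying on a pandas option) is a choice made here, not necessarily what the notebook did.

```python
import pandas as pd

# Toy rows with the same columns as the dataframe summary; values are invented.
df = pd.DataFrame({
    "lat": [4.60, 4.65, 4.70, 4.72, 4.68],
    "lon": [-74.08, -74.06, -74.05, -74.04, -74.10],
    "neighborhood": ["Chapinero", "Usaquén", None, "Suba", "Chapinero"],
    "price": [3.0e8, 4.5e8, 5.2e8, 6.1e8, 3.9e8],
    "surface_covered": [50.0, 70.0, 85.0, 100.0, 62.0],
})

# neighborhood is text, so only the numeric columns enter the matrix.
numeric_cols = ["lat", "lon", "price", "surface_covered"]
corr = df[numeric_cols].corr()  # Pearson correlation by default
print(corr.round(2))
```

Passing `corr` to a plotting function such as seaborn's `heatmap` would then give the figure described above.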
Model
First, the data is separated into the features X (surface_covered, lat, and lon) and the target y (price).
Then X and y are split into training sets (X_train, y_train) and test sets (X_test, y_test).
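A minimal sketch of this step, assuming scikit-learn's `train_test_split` is used. The 80/20 ratio, the `random_state` value, and the toy data are assumptions, since the text does not state them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with the columns named in the text; values are invented.
df = pd.DataFrame({
    "surface_covered": [40.0 + 10 * i for i in range(10)],
    "lat": [4.60 + 0.01 * i for i in range(10)],
    "lon": [-74.10 + 0.005 * i for i in range(10)],
    "price": [2.0e8 + 5.0e7 * i for i in range(10)],
})

features = ["surface_covered", "lat", "lon"]
target = "price"
X = df[features]
y = df[target]

# Assumed settings: hold out 20% of the rows, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```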
Build the model
First, the baseline is calculated.
Next, the mean absolute error (MAE) of the baseline is calculated.
Baseline MAE: 329417994.38
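A common way to build such a baseline is to predict the mean training price for every apartment and score that constant prediction with MAE. The sketch below assumes this is what was done; the prices are toy values, not the project's data.

```python
from sklearn.metrics import mean_absolute_error

# Toy training prices in COP (invented for illustration).
y_train = [100e6, 200e6, 300e6, 400e6]

# The baseline ignores all features and always predicts the mean price.
y_mean = sum(y_train) / len(y_train)
y_pred_baseline = [y_mean] * len(y_train)

baseline_mae = mean_absolute_error(y_train, y_pred_baseline)
print(f"Mean price: {y_mean:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
```

Any model worth keeping should reach a lower MAE than this constant prediction.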
Iterate
We make a pipeline with a OneHotEncoder, a SimpleImputer, and a Ridge regressor.
We fit the model to the training data.
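One way such a pipeline might look is sketched below. It uses scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` as a stand-in for whichever encoder the notebook used, and it assumes `neighborhood` is among the features so that there is a categorical column to encode. The data, imputation strategy, and default Ridge settings are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data (invented); one row has missing coordinates.
X_train = pd.DataFrame({
    "surface_covered": [40.0, 60.0, 80.0, 100.0, 120.0, 50.0],
    "lat": [4.60, 4.65, np.nan, 4.72, 4.68, 4.61],
    "lon": [-74.08, -74.06, np.nan, -74.04, -74.10, -74.07],
    "neighborhood": ["Chapinero", "Usaquén", "Suba", "Usaquén", "Chapinero", "Suba"],
})
y_train = pd.Series([2.1e8, 3.0e8, 3.9e8, 5.1e8, 6.0e8, 2.4e8])

preprocess = ColumnTransformer(
    [
        # Fill missing numeric values with the column mean.
        ("num", SimpleImputer(strategy="mean"), ["surface_covered", "lat", "lon"]),
        # One indicator column per neighborhood; unseen ones are ignored at predict time.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
    ],
    sparse_threshold=0,  # always return a dense array
)
model = make_pipeline(preprocess, Ridge())
model.fit(X_train, y_train)

# Compare the fitted model with a mean-price baseline on the training data.
y_pred_training = model.predict(X_train)
training_mae = mean_absolute_error(y_train, y_pred_training)
baseline_mae = mean_absolute_error(y_train, [y_train.mean()] * len(y_train))
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"Training MAE: {training_mae:.2f}")
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding are applied automatically when the model later predicts on test data.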
Evaluate
First, we predict with X_train.
The mean absolute error on the training data is calculated.
Training MAE: 145976773.11
The model beats the baseline MAE by about 183,441,221 COP.
Finally, we evaluate the model with the test data.
The mean absolute error on the test data is calculated.
Test MAE: 155533714.88
Communicate Results
Finally, the results are communicated. We write a function that makes predictions from data supplied by the user.
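A prediction helper of this kind could look like the sketch below. The function name `make_prediction`, its parameters, and the output wording are hypothetical. To keep the example self-contained, a `DummyRegressor` that always predicts the mean price stands in for the fitted model; any fitted estimator with a `predict` method could be passed instead.

```python
import pandas as pd
from sklearn.dummy import DummyRegressor


def make_prediction(model, area, lat, lon, neighborhood):
    """Return a formatted price prediction for a single apartment."""
    # Build a one-row dataframe with the column names used in training.
    data = {
        "surface_covered": area,
        "lat": lat,
        "lon": lon,
        "neighborhood": neighborhood,
    }
    df = pd.DataFrame(data, index=[0])
    prediction = float(model.predict(df)[0])
    return f"Predicted apartment price: {prediction:,.2f} COP"


# Stand-in model (assumption): always predicts the mean of the training prices.
X_train = pd.DataFrame({
    "surface_covered": [60.0, 90.0],
    "lat": [4.65, 4.70],
    "lon": [-74.06, -74.05],
    "neighborhood": ["Chapinero", "Usaquén"],
})
y_train = [300e6, 500e6]
model = DummyRegressor(strategy="mean").fit(X_train, y_train)

message = make_prediction(model, 75, 4.65, -74.06, "Chapinero")
print(message)
```

Passing the model as an argument keeps the helper independent of any particular pipeline, so it can be reused after retraining.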