INTRODUCTION
In this project, we use exploratory data analysis (EDA) to look for correlations between variables and to find out which of them matter most for pricing.
We also build price prediction models with linear regression from the most important variables found in the EDA, using one, two and multiple variables.
INSTALL LIBRARIES
!pip install --upgrade pip setuptools==57.5.0
!pip install numpy pandas matplotlib seaborn statsmodels scikit-learn regressors==0.0.3
#First, we must install all necessary libraries for this project.
IMPORT LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics as metrics
import scipy.stats
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from regressors import stats
from mpl_toolkits.mplot3d import Axes3D
#Next, we import all libraries we will use.
SET APPEARANCE FOR CHARTS
%matplotlib inline
sns.set_style(style='whitegrid')
sns.set_context(context='notebook')
plt.rcParams['figure.figsize'] = (11, 9.4)
#We set a standard style for all the charts we will create.
UPLOAD DATA SET
sports_cars = pd.read_csv('sports_cars.csv')
sports_cars
#We load our data set "sports cars", downloaded from Kaggle.
KNOWING OUR DATA
sports_cars.shape
#Our data set has 1007 rows and 8 columns in total.
sports_cars.dtypes
#At first glance, we can see that some object-type columns should be converted to numeric for better analysis.
sports_cars.isna().sum()
#There are 10 null values in the 'Engine Size (L)' column and 3 in 'Torque (lb-ft)'. Since each
#column has 1007 values, they represent only about 1% of the data, which is almost insignificant.
#Removing these null values would not affect our data set much.
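#The share of missing values per column can also be computed directly. The sketch below uses a
#small hypothetical frame ('toy') in place of the real data set, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the raw data set.
toy = pd.DataFrame({
    'Engine Size (L)': [4.0, np.nan, 3.5, 5.2],
    'Torque (lb-ft)': [400, 450, np.nan, 500],
    'Horsepower': [500, 520, 480, 610],
})

# isna().mean() is the fraction of missing values in each column.
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```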
(sports_cars.isna().melt().pipe(lambda df:(sns.displot(data=df, y='variable',
hue='value', multiple='fill', aspect=2))))
#In the graph above, we can visualize the proportion of null values in each column of our data set.
(sports_cars.isna().transpose().pipe(lambda df:(sns.heatmap(data=df))))
#We also check whether those null values belong to the same rows. In this case, they do not.
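#The same check can be done numerically by counting nulls per row. This is a minimal sketch on a
#hypothetical toy frame; 'nulls_per_row' and the other names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with nulls in different rows.
toy = pd.DataFrame({
    'Engine Size (L)': [4.0, np.nan, 3.5, 5.2],
    'Torque (lb-ft)': [400, 450, np.nan, 500],
})

nulls_per_row = toy.isna().sum(axis=1)
rows_with_any = int((nulls_per_row >= 1).sum())      # rows that dropna() would remove
rows_with_several = int((nulls_per_row >= 2).sum())  # rows where the nulls overlap
print(rows_with_any, rows_with_several)
```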
PREPARING DATA
sports_cars = sports_cars.dropna()
sports_cars.isna().sum()
#We remove the rows with null values from our data set.
sports_cars = sports_cars.rename(columns={'Car Make':'Maker','Car Model':'Model','Engine Size (L)':
'Engine_Size','Torque (lb-ft)':'Torque','0-60 MPH Time(seconds)':
'MPH_0_60', 'Price (in USD)': 'Price_USD'})
#We rename columns where the name can be shorter or we can eliminate unnecessary characters.
sports_cars['Horsepower'] = sports_cars['Horsepower'].apply(lambda x:str(x).replace('+','').replace(',','')).astype(int)
sports_cars['Torque'] = sports_cars['Torque'].apply(lambda x:str(x).replace('+','').replace(',','').replace('-','0')).astype(int)
sports_cars['MPH_0_60'] = sports_cars['MPH_0_60'].apply(lambda x:str(x).replace('< 1.9','1.5')).astype(float)
sports_cars['Price_USD'] = sports_cars['Price_USD'].apply(lambda x:str(x).replace(',','')).astype(int)
sports_cars['Engine_Size'] = pd.to_numeric(sports_cars['Engine_Size'], errors='coerce')
sports_cars['Engine_Size'] = sports_cars['Engine_Size'].fillna(6.1)
sports_cars['Age'] = 2023 - sports_cars['Year']
sports_cars.drop('Year',axis=1,inplace=True)
#We convert object-type columns to int or float depending on the data, also removing unnecessary
#characters.
#In the case of 'Engine_Size', since we have numerical and categorical data (like 'Electric' and
#'Hybrid'), we replace all categorical values with an average engine size of '6.1' for analysis
#purposes. However, there is no proper way to compare gas cars to electric or hybrid cars by
#engine size; we just use an average from our data set.
#Finally, we create a new column 'Age' with the age of every car and remove the
#column 'Year'.
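#An alternative to filling labels such as 'Electric' with an average is to keep a separate flag.
#The sketch below is only an illustration on toy data: the column names 'Is_Gas' and
#'Engine_Size_L' are hypothetical, and it assumes every non-numeric entry is a non-gas label.

```python
import pandas as pd

# Hypothetical toy column mixing numeric sizes and category labels.
toy = pd.DataFrame({'Engine_Size': ['4.0', 'Electric', '3.5', 'Hybrid', '6.2']})

# Non-numeric labels become NaN instead of being replaced by an average.
numeric_size = pd.to_numeric(toy['Engine_Size'], errors='coerce')
toy['Is_Gas'] = numeric_size.notna()   # assumption: numeric entry means a gas engine
toy['Engine_Size_L'] = numeric_size    # stays NaN for 'Electric' and 'Hybrid'
print(toy)
```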
sns.pairplot(data=sports_cars, vars=['Engine_Size','Horsepower','Torque','MPH_0_60','Price_USD', 'Age'],
hue='Maker', kind='scatter', diag_kind='kde', corner=True)
#Checking the relationships between our columns, we find some outliers in our data set, specifically
#in the columns 'Horsepower', 'Torque' and 'Age'.
sports_cars = sports_cars[(sports_cars['Horsepower'] < 10000) & (sports_cars['Torque'] < 7000)
& (sports_cars['Age'] < 20)]
#We filter our data to remove outliers.
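#The thresholds above were chosen by eye from the pairplot. A common rule of thumb, shown here
#only as a possible alternative and not as the method used in this notebook, flags values
#outside 1.5 times the interquartile range. The helper 'iqr_outlier_mask' is hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Hypothetical horsepower values with one extreme entry.
toy_hp = pd.Series([450, 500, 520, 610, 700, 10000])
mask = iqr_outlier_mask(toy_hp)
print(toy_hp[~mask].tolist())
```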
sports_cars.info()
#In total, we removed only 13 rows with null values and 2 outliers. Since our raw data had 1007
#rows, these 15 removed rows represent only about 1.5% of the total, so our data set has not
#changed much. Now our data is ready for further analysis.
EXPLORATORY DATA ANALYSIS
sports_cars_numeric = sports_cars.select_dtypes(include=['float64', 'int64'])
sports_cars_numeric
cor = sports_cars_numeric.corr()
cor
sns.clustermap(data=cor,cmap=sns.diverging_palette(20, 230, as_cmap=True),
center=0, vmin=-1, vmax=1, linewidths=0.1, annot=True)
#Looking again for correlations between our variables, we find several strong positive and
#negative correlations:
#'Horsepower'-'Torque' = (0.93)
#'Price_USD'-'Horsepower' = (0.79)
#'Price_USD'-'Torque' = (0.73)
#'MPH_0_60'-'Torque' = (-0.68)
#'MPH_0_60'-'Horsepower' = (-0.73)
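#The strongest pairs can also be read off the correlation matrix programmatically instead of
#from the plot. This sketch uses synthetic data, so its values are not those of our data set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the relationships are made up for illustration.
rng = np.random.default_rng(0)
hp = rng.normal(600, 150, 200)
toy = pd.DataFrame({
    'Horsepower': hp,
    'Torque': hp * 0.9 + rng.normal(0, 40, 200),
    'MPH_0_60': 6 - hp / 200 + rng.normal(0, 0.4, 200),
})

cor = toy.corr()
# Keep only the upper triangle so that each pair appears once.
upper = cor.where(np.triu(np.ones(cor.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```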
sns.scatterplot(data=sports_cars,x='Horsepower',y='Torque',hue='Engine_Size')
res_hor_tor = scipy.stats.linregress(x=sports_cars.Horsepower,y=sports_cars.Torque)
print(res_hor_tor)
fx_1 = np.array([sports_cars.Horsepower.min(),sports_cars.Horsepower.max()])
fy_1 = res_hor_tor.intercept + res_hor_tor.slope * fx_1
plt.plot(fx_1,fy_1)
#Analyzing the correlation between 'Horsepower' and 'Torque', we find a linear relationship with
#a positive slope: as 'Horsepower' increases, 'Torque' also increases. Furthermore, we can
#verify that, most of the time, cars with a bigger 'Engine_Size' have higher levels of
#'Horsepower' and 'Torque'.
sns.scatterplot(data=sports_cars,x='Price_USD',y='Horsepower',hue='Torque')
res_pri_hor = scipy.stats.linregress(x=sports_cars.Price_USD,y=sports_cars.Horsepower)
print(res_pri_hor)
fx_2 = np.array([sports_cars.Price_USD.min(),sports_cars.Price_USD.max()])
fy_2 = res_pri_hor.intercept + res_pri_hor.slope * fx_2
plt.plot(fx_2,fy_2)
#In this case the slope is almost 0, which means a small change in 'Horsepower' for each
#additional dollar of 'Price_USD'. This is largely a matter of scale: prices are several orders of
#magnitude larger than horsepower values, and most of the cars in our data set are below
#1,000,000 USD (1e6). However, we can still see that as 'Price_USD' increases, 'Horsepower' and
#'Torque' also increase, supporting the results in our clustermap above (positive correlation).
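#A near-zero slope does not by itself mean a weak relationship: the slope depends on the units
#of x, while the correlation does not. The sketch below shows this on synthetic data (all
#values are made up) by fitting the same relationship in USD and in thousands of USD.

```python
import numpy as np
import scipy.stats

# Synthetic data: horsepower rises by about 1 per 1,000 USD.
rng = np.random.default_rng(1)
price_usd = rng.uniform(50_000, 500_000, 300)
horsepower = 300 + price_usd / 1_000 + rng.normal(0, 40, 300)

res_usd = scipy.stats.linregress(x=price_usd, y=horsepower)
res_kusd = scipy.stats.linregress(x=price_usd / 1_000, y=horsepower)

# The slope changes with the unit of x; the r value stays the same.
print(res_usd.slope, res_kusd.slope)
print(res_usd.rvalue, res_kusd.rvalue)
```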
sns.scatterplot(data=sports_cars,x='MPH_0_60',y='Horsepower',hue='Torque')
res_mph_hor = scipy.stats.linregress(x=sports_cars.MPH_0_60,y=sports_cars.Horsepower)
print(res_mph_hor)
fx_3 = np.array([sports_cars.MPH_0_60.min(),sports_cars.MPH_0_60.max()])
fy_3 = res_mph_hor.intercept + res_mph_hor.slope * fx_3
plt.plot(fx_3,fy_3)
#In our last graph, analyzing the correlation of 'MPH_0_60' with 'Horsepower' and 'Torque',
#we find a linear relationship with negative slope and correlation. The lower the level of
#'Horsepower' and 'Torque', the longer it takes to accelerate from 0 to 60 miles per hour. So, cars
#with high levels of 'Horsepower' and 'Torque' can get to 60 MPH in around 2 seconds, while cars
#with low levels take more than 5 seconds.
PREDICTIVE MODEL
np.random.seed(1)
x_cols = ['Horsepower','Engine_Size','MPH_0_60','Torque','Age']
y_col = ['Price_USD']
x=sports_cars[x_cols].values
y=sports_cars[y_col].values
x_train, x_test, y_train, y_test = train_test_split(x,y)
sc_x = StandardScaler().fit(x)
sc_y = StandardScaler().fit(y)
x_train = sc_x.transform(x_train)
x_test = sc_x.transform(x_test)
y_train = sc_y.transform(y_train)
y_test = sc_y.transform(y_test)
model = LinearRegression()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
y_pred.shape
#In this stage, we create a linear regression model with all our numeric variables.
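#The scalers above are fitted on the full data set before transforming the train and test
#splits. A possible variant, sketched here on synthetic data and not the code used in this
#notebook, fits the scaler on the training split only by wrapping it in a scikit-learn pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target standing in for the car data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# The pipeline fits StandardScaler on X_train only, then reuses it on X_test.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
r2_test = pipe.score(X_test, y_test)
print(round(r2_test, 4))
```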
mse = metrics.mean_squared_error(y_test,y_pred)
r2 = metrics.r2_score(y_test,y_pred)
print('r2', round(r2, 4))
print('mse', round(mse, 4))
#Our 'r squared' is low for a predictive model; a proper one would be at least '0.75'.
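#For reference, both metrics follow from the residuals: MSE is the mean squared residual, and
#'r squared' is 1 minus the ratio of the residual to the total sum of squares. A small worked
#example with made-up numbers:

```python
import numpy as np
import sklearn.metrics as metrics

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

mse_manual = np.mean((y_true - y_hat) ** 2)
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, metrics.mean_squared_error(y_true, y_hat))
print(r2_manual, metrics.r2_score(y_true, y_hat))
```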
model.intercept_ = model.intercept_[0]
model.coef_ = model.coef_.reshape(-1)
#Now we reshape our intercept into a scalar and our coefficients into a one-dimensional array
#before building the summary.
y_test = y_test.reshape(-1)
print("==============SUMMARY==============")
stats.summary(model, x_test, y_test, x_cols)
#Looking at the p-values of our variables, we find that 'Horsepower' and 'MPH_0_60' have values
#below 0.05, which means they are the most statistically significant variables, so we will focus
#on them and drop the others in our next models.
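#To make the p-values less of a black box, the sketch below computes them by hand for an
#ordinary least squares fit with the usual t-test. It runs on synthetic data and is not the
#internal code of the 'regressors' package; all names are illustrative.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
# Only the first column truly drives y; the second is pure noise.
y = 3.0 * X[:, 0] + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])         # add an intercept column
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
resid = y - design @ beta
dof = n - design.shape[1]                         # residual degrees of freedom
sigma2 = resid @ resid / dof                      # residual variance estimate
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
t_stats = beta / std_err
p_values = 2 * scipy.stats.t.sf(np.abs(t_stats), dof)  # two-sided p-values
print(p_values)
```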
residuals = np.subtract(y_test,y_pred.reshape(-1))
plt.scatter(y_pred,residuals)
plt.show()
#In our residual plot, we can verify that our model is not good at predicting our target variable.
sports_cars2 = sports_cars.copy()
sports_cars2.shape
#We copy our data set to create our second model.
np.random.seed(5)
x2_cols = ['Horsepower','MPH_0_60']
y2_col = ['Price_USD']
x2=sports_cars2[x2_cols].values
y2=sports_cars2[y2_col].values
x_train2, x_test2, y_train2, y_test2 = train_test_split(x2,y2)
sc_x2 = StandardScaler().fit(x2)
sc_y2 = StandardScaler().fit(y2)
x_train2 = sc_x2.transform(x_train2)
x_test2 = sc_x2.transform(x_test2)
y_train2 = sc_y2.transform(y_train2)
y_test2 = sc_y2.transform(y_test2)
model = LinearRegression(fit_intercept=False)
model.fit(x_train2,y_train2)
y_pred2 = model.predict(x_test2)
y_pred2.shape
#In our second model we use only our most significant variables 'Horsepower' and 'MPH_0_60'.
#We also disable intercept.
mse = metrics.mean_squared_error(y_test2,y_pred2)
r2 = metrics.r2_score(y_test2,y_pred2)
print('r2', round(r2, 4))
print('mse', round(mse, 4))
#In our second model 'r squared' improved.
model.coef_ = model.coef_.reshape(-1)
#This time we reshape only our coefficients.
y_test2 = y_test2.reshape(-1)
print("==============SUMMARY==============")
stats.summary(model, x_test2, y_test2, x2_cols)
#Although our predictive model improved with fewer variables, it still does not qualify as an
#excellent model.
residuals = np.subtract(y_test2,y_pred2.reshape(-1))
plt.scatter(y_pred2,residuals)
plt.show()
#This confirms what we mentioned above: although our model improved, it is still limited for making predictions.
x1_range = np.arange(sports_cars2['Horsepower'].min(),sports_cars2['Horsepower'].max())
x2_range = np.arange(sports_cars2['MPH_0_60'].min(),sports_cars2['MPH_0_60'].max())
X1, X2 = np.meshgrid(x1_range,x2_range)
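plain = pd.DataFrame({'Horsepower':X1.ravel(),'MPH_0_60':X2.ravel()})
#The model was trained on standardized inputs, so we scale the grid before predicting.
pred = model.predict(sc_x2.transform(plain.values)).reshape(-1,1)
pred = sc_y2.inverse_transform(pred).reshape(X1.shape)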
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(X1,X2,pred,alpha=0.4)
ax.scatter3D(sports_cars2['Horsepower'],sports_cars2['MPH_0_60'],sports_cars2['Price_USD'],c=sports_cars2['Price_USD'],marker='.')
ax.view_init(elev=20,azim=250)
plt.show()
#We create a 3D graph to observe again the relationship between our most significant variables,
#'Horsepower' and 'MPH_0_60', and 'Price_USD'. We can confirm, at least, that the cheaper the car,
#the lower the horsepower and the longer the time to accelerate from 0 to 60 MPH. The opposite is
#also true most of the time.
np.random.seed(5)
x3_cols = ['Horsepower']
y3_col = ['Price_USD']
x3=sports_cars2[x3_cols].values
y3=sports_cars2[y3_col].values
x_train3, x_test3, y_train3, y_test3 = train_test_split(x3,y3)
sc_x3 = StandardScaler().fit(x3)
sc_y3 = StandardScaler().fit(y3)
x_train3 = sc_x3.transform(x_train3)
x_test3 = sc_x3.transform(x_test3)
y_train3 = sc_y3.transform(y_train3)
y_test3 = sc_y3.transform(y_test3)
model = LinearRegression(fit_intercept=False)
model.fit(x_train3,y_train3)
y_pred3 = model.predict(x_test3)
y_pred3.shape
#Finally, we create a third model using our most significant variable 'Horsepower' to predict prices.
mse = metrics.mean_squared_error(y_test3,y_pred3)
r2 = metrics.r2_score(y_test3,y_pred3)
print('r2', round(r2, 4))
print('mse', round(mse, 4))
#Although we use only one variable, our 'r squared' is acceptable.
for horsepower in [450, 500, 600, 700, 800, 900, 1000, 1500, 2000]:
    #Scale the input as in training, predict, then convert the result back to USD.
    hp_scaled = sc_x3.transform(np.array([[horsepower]]))
    price = sc_y3.inverse_transform(model.predict(hp_scaled).reshape(-1,1))[0, 0]
    print(f"The price for a car with {horsepower} horsepower would be about {price:,.0f} USD.")
#In our last model, we can observe that at low 'Horsepower' levels the predicted price is low, and
#as this variable increases, the price increases accordingly. However, since our 'r squared' is
#only '0.74', accuracy is not very high. For values below 450, our results are not good.
CONCLUSION
Assuming all cars are new in our data set, the variable 'Age' has almost no correlation to car prices.
The higher the 'Horsepower' and the shorter the time to accelerate from 0 to 60 MPH, the higher the car price.
Our variable 'Engine_Size' mixed categorical and numerical values. Although we used an average value of '6.1' for 'Electric' and 'Hybrid' cars, this is not a proper way to compare them to gas cars. So, we should treat this variable as categorical for better analysis or convert it to a boolean type (e.g. Gas and not Gas).
Since we are not considering the variable 'Maker' in our models, they are limited. Brand is an important variable for car prices, especially for sports cars. So, we should go beyond the purely numeric linear regressions used here and use models that can take this categorical variable into account to get better predictions.
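One possible way to bring 'Maker' into a model is one-hot encoding. The sketch below is only an illustration on synthetic data: the makers 'A', 'B' and 'C', the brand premiums and the prices are all made up, and it keeps a linear model for simplicity, although other model families could use the same encoded columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: price depends on horsepower plus a made-up brand premium.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Maker': rng.choice(['A', 'B', 'C'], size=120),
    'Horsepower': rng.uniform(300, 900, size=120),
})
brand_premium = toy['Maker'].map({'A': 0, 'B': 50_000, 'C': 150_000})
toy['Price_USD'] = 200 * toy['Horsepower'] + brand_premium + rng.normal(0, 5_000, 120)

# Numeric-only features versus numeric plus one-hot encoded maker.
X_num = toy[['Horsepower']]
X_brand = pd.get_dummies(toy[['Maker', 'Horsepower']], columns=['Maker'],
                         drop_first=True, dtype=float)

r2_num = LinearRegression().fit(X_num, toy['Price_USD']).score(X_num, toy['Price_USD'])
r2_brand = LinearRegression().fit(X_brand, toy['Price_USD']).score(X_brand, toy['Price_USD'])
print(round(r2_num, 3), round(r2_brand, 3))
```

On this toy data the encoded model fits much better because the price was generated with a brand effect; whether the same holds for the real data set would need to be checked.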