Optimization - Linear Regression from scratch pt.2
Import the dataset
https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
As we can see, we have 13 different columns describing car characteristics. Some are categorical variables and others are numerical.
Pre-processing operations
Missing values management
We delete the rows that contain null values.
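As a minimal sketch of this step, the snippet below counts the missing values and then drops the incomplete rows. The toy DataFrame and its column names (`selling_price`, `mileage`, `seats`) are assumptions standing in for the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car dataset (column names are assumptions).
df = pd.DataFrame({
    "selling_price": [450000, 370000, 158000, 225000],
    "mileage": ["23.4 kmpl", None, "17.7 kmpl", "23.0 kmpl"],
    "seats": [5.0, 5.0, np.nan, 5.0],
})

# Count the missing values per column before deleting anything.
missing_per_column = df.isnull().sum()

# Drop every row that contains at least one null value.
df_clean = df.dropna().reset_index(drop=True)
```

Checking `missing_per_column` first shows how many rows the deletion will cost.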
Removing units
Here, we remove the units so that these columns can be handled as float columns.
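One possible way to strip the units is to extract the leading number with a regular expression and cast it to float. The column names and the unit strings (`kmpl`, `CC`, `bhp`) below are assumptions about how the values look.

```python
import pandas as pd

# Toy rows; column names and unit suffixes are assumptions.
df = pd.DataFrame({
    "mileage": ["23.4 kmpl", "17.7 kmpl"],
    "engine": ["1248 CC", "1497 CC"],
    "max_power": ["74 bhp", "78 bhp"],
})

for column in ["mileage", "engine", "max_power"]:
    # Keep the first number found in the string and discard the unit.
    number = df[column].str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    df[column] = number.astype(float)
```

Extracting the number rather than removing a fixed suffix also works if a column mixes several units, though the values would then not be comparable without a conversion.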
Then we remove unusable variables
Keeping only the main fuel categories
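A hedged sketch of this filter: it keeps the two most frequent fuel types. The column name `fuel`, the category names, and the choice of two categories are assumptions for illustration.

```python
import pandas as pd

# Toy fuel column; the category names are assumptions.
df = pd.DataFrame(
    {"fuel": ["Diesel", "Petrol", "Diesel", "CNG", "Petrol", "LPG", "Diesel"]}
)

# Find the two most frequent categories.
main_fuels = df["fuel"].value_counts().nlargest(2).index

# Keep only the rows whose fuel type is one of them.
df_main = df[df["fuel"].isin(main_fuels)].reset_index(drop=True)
```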
Convert string columns to integers
We use get_dummies so that our dataframe contains only numerical values.
Setting drop_first=True drops one dummy column per categorical variable. We store the same information with 4 fewer columns, since the dropped category is implied when all the remaining dummies are 0.
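To illustrate the effect of `drop_first`, the sketch below encodes a toy DataFrame with and without it. The column names and values are assumptions; only `pd.get_dummies` and its `drop_first` parameter come from the text.

```python
import pandas as pd

# Toy data with two categorical columns (names are assumptions).
df = pd.DataFrame({
    "year": [2014, 2017, 2010],
    "fuel": ["Diesel", "Petrol", "Diesel"],
    "transmission": ["Manual", "Automatic", "Manual"],
})

# One dummy column per category.
full = pd.get_dummies(df)

# One dummy column fewer per categorical variable.
compact = pd.get_dummies(df, drop_first=True)
```

Here `full` has 5 columns and `compact` has 3: one column is saved for each of the two categorical variables.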
We split our dataframe in two, with:
x ==> data
y ==> label
We use .values, so only the values in the DataFrame are returned and the axis labels are removed.
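The split into x and y could look like the sketch below. It assumes that the label column is called `selling_price`; the other column names are placeholders as well.

```python
import numpy as np
import pandas as pd

# Toy dataframe; "selling_price" as the label is an assumption.
df = pd.DataFrame({
    "selling_price": [450000, 370000, 158000],
    "year": [2014, 2017, 2010],
    "km_driven": [145500, 40000, 120000],
})

# x holds every column except the label, y holds the label only.
# .values returns plain NumPy arrays without the axis labels.
x = df.drop(columns="selling_price").values
y = df["selling_price"].values
```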
Data standardization
We standardize x, in order to eliminate order of magnitude differences between the values of the different columns
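As a sketch, standardization can be written by hand with NumPy: each column is centered on its mean and divided by its standard deviation. The numbers are made up. scikit-learn's `StandardScaler` applies the same transformation, so this is a stand-in and not necessarily the call used in the notebook.

```python
import numpy as np

# Two columns with very different orders of magnitude (made-up values).
x = np.array([
    [2014.0, 145500.0],
    [2017.0, 40000.0],
    [2010.0, 120000.0],
])

# Column-wise mean and standard deviation.
mean = x.mean(axis=0)
std = x.std(axis=0)

# After this step every column has mean 0 and standard deviation 1.
x_scaled = (x - mean) / std
```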
Train Test split
Our test set represents 25% of the data.
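A minimal sketch of a 75/25 split using a random permutation of the row indices. The data is synthetic and the seed is arbitrary; a library helper such as scikit-learn's `train_test_split` with `test_size=0.25` would serve the same purpose.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Synthetic stand-ins for the standardized features and the label.
x = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Shuffle the row indices, then reserve 25% of them for the test set.
indices = rng.permutation(len(x))
n_test = int(round(0.25 * len(x)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

x_train, x_test = x[train_idx], x[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```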
Models creation
First, let's implement a few models in order to identify the most effective one.
1 - Linear Regression (67%)
We can see that the model is not overfitting
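One way to check for overfitting is to compare the score on the training set with the score on the test set. The sketch below assumes the score is the coefficient of determination R², fits a linear regression by least squares with NumPy, and uses synthetic data, so its numbers are unrelated to the 67% above. The helpers `r2_score` and `add_bias` are written here for illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def add_bias(features):
    """Prepend a column of ones so the model learns an intercept."""
    return np.hstack([np.ones((len(features), 1)), features])

# Synthetic linear data with noise (not the car dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.5, size=200)

x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Least-squares fit on the training set only.
theta, *_ = np.linalg.lstsq(add_bias(x_train), y_train, rcond=None)

train_r2 = r2_score(y_train, add_bias(x_train) @ theta)
test_r2 = r2_score(y_test, add_bias(x_test) @ theta)
```

If `train_r2` and `test_r2` are close, the model generalizes about as well as it fits; a training score far above the test score would indicate overfitting.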