Lab 7 - Machine Learning (Regression)
Part I: Answer the following questions:
1. How can we use regression analysis to identify the key factors contributing to the target variable?
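To identify the relationship between variables, we can use the correlation coefficient, which measures the strength and direction of the linear relationship between each independent variable and the target. In the fitted regression model, the sign and statistical significance of each coefficient then indicate which factors contribute to the target variable.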
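As a rough illustration of this answer, the sketch below computes the correlation of each predictor with the target and then fits an ordinary least squares model. The data is synthetic and the names x1, x2 and x3 are hypothetical; none of it comes from the lab dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical predictors: x1 drives y strongly, x2 weakly, x3 not at all.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y = 3.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2, x3])

# Pearson correlation of each predictor with the target.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("correlations:", corr.round(2))
print("coefficients (intercept first):", beta.round(2))
```

In this setup the predictor with the strongest correlation also receives the largest coefficient, but that only holds because all three predictors share the same scale.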
2. What are the assumptions that linear regression has on data?
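Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded as binary (dummy) variables or other types of contrast variables. Linear regression also assumes a linear relationship between the predictors and the target, independent errors, constant error variance (homoscedasticity), and approximately normally distributed residuals.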
3. What is collinearity? What will impact the regression model if the data contains collinear columns?
Collinearity occurs when two or more independent variables are highly correlated with each other.
If the data contains collinear columns, the coefficient estimates of the independent variables become very sensitive to small changes in the model or the data, and the variance and standard errors of those estimates are inflated.
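One common way to quantify this inflation is the variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing each column on the others. The sketch below is a minimal NumPy version on synthetic data; the helper name vif and the columns a, b and c are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on all other columns (with an intercept).
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)  # nearly a copy of a
c = rng.normal(size=500)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

Columns a and b should show very large VIF values while c stays close to 1. A frequently quoted rule of thumb flags values above roughly 5 to 10.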
4. What is the interaction effect in regression analysis? How can we test for the impact of the interaction terms in the regression model?
The interaction effect occurs when the effect of an independent variable on the dependent variable changes depending on the value(s) of one or more other independent variables.
We can test for the effect of interaction terms by drawing an interaction plot and comparing its lines: if the lines are parallel, there is no interaction; if they are not, there is an interaction.
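A numeric complement to the interaction plot is to add the product of the two variables as an extra regressor and look at the t-statistic of its coefficient. This is a different technique from the plot described above, sketched here on synthetic data with hypothetical variables x1 and x2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n).astype(float)  # hypothetical binary group

# The true model includes an interaction: the slope of x1 differs by group.
y = 1.0 + 2.0 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, main effects, and the product (interaction) term.
A = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# Standard errors from the usual OLS covariance estimate.
resid = y - A @ beta
dof = n - A.shape[1]
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(A.T @ A)
t_stats = beta / np.sqrt(np.diag(cov))

print("interaction coefficient:", round(beta[3], 2))
print("interaction t-statistic:", round(t_stats[3], 1))
```

A t-statistic far from zero (a common rule of thumb is an absolute value above about 2) suggests the interaction term matters, which corresponds to non-parallel lines in the plot.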
5. Regression Through Origin is a type of regression model where the intercept term is set to zero (or without the intercept term). Discuss the effect of using such a model compared to a model with an intercept term, and when it should be used.
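To make the comparison concrete, the sketch below fits both models to synthetic data whose true intercept is not zero. The data and the numbers 5.0 and 2.0 are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 5.0 + 2.0 * x + rng.normal(scale=1.0, size=200)  # true intercept is 5

# Regression through the origin: y = b * x, so b = sum(x*y) / sum(x*x).
slope_origin = (x @ y) / (x @ x)

# Regression with an intercept: y = a + b * x.
A = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(A, y, rcond=None)

print("slope without intercept:", round(slope_origin, 2))
print("intercept, slope with intercept:", round(intercept, 2), round(slope, 2))
```

When the true intercept is nonzero, forcing the line through the origin pushes the missing offset into the slope and biases it. The zero-intercept model is therefore usually reserved for cases where the target is known to be zero when all predictors are zero.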
6. How can regression analysis be used in the feature selection process?
Regression analysis can be used in the feature selection process to identify the variables most strongly related to the response variable, for example by using correlation analysis to remove unrelated variables.
8. What are two methods that can be used to encode a categorical variable into a numerical variable? When should each be used?
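Two commonly used encodings are one-hot (dummy) encoding and ordinal encoding. The sketch below shows both with pandas on a small made-up frame; the columns fuel and condition and their values are hypothetical and are not taken from the lab dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "fuel": ["petrol", "diesel", "petrol", "cng"],        # no natural order
    "condition": ["good", "fair", "excellent", "good"],   # has a natural order
})

# One-hot encoding: one 0/1 column per category. drop_first=True drops the
# first category (alphabetically "cng") to avoid perfectly collinear columns.
one_hot = pd.get_dummies(df["fuel"], prefix="fuel", drop_first=True, dtype=int)

# Ordinal encoding: map categories to integers that follow a stated order.
order = ["fair", "good", "excellent"]
condition_code = pd.Categorical(df["condition"], categories=order, ordered=True).codes

print(one_hot)
print(condition_code)
```

One-hot encoding generally suits categories without an order, while ordinal encoding is reasonable only when the categories have a meaningful ranking.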
9. Give an example of common metrics that are used to evaluate the performance of the regression model.
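Three metrics that are widely used for regression are mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). The sketch below computes them by hand on a tiny made-up example; the values in y_true and y_pred are hypothetical.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

errors = y_pred - y_true

mae = np.abs(errors).mean()            # average absolute error
rmse = np.sqrt((errors ** 2).mean())   # penalises large errors more heavily
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot             # share of variance explained

print(mae, round(rmse, 3), r2)
```

MAE and RMSE are expressed in the units of the target, whereas R² is unitless.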
A larger coefficient isn't necessarily more important than a smaller one, because each coefficient represents the change in the target per unit of a different variable, and those variables may be measured in different units.
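The point about units can be checked by refitting the model after standardizing the predictors, which puts the coefficients on a comparable scale. The sketch below uses synthetic data; mileage_km, age_years and the generating coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Hypothetical predictors measured on very different scales.
mileage_km = rng.normal(50_000, 15_000, size=n)
age_years = rng.normal(5, 2, size=n)
price = 20_000 - 0.05 * mileage_km - 1500 * age_years + rng.normal(scale=500, size=n)

def ols_slopes(X, y):
    """Fit OLS with an intercept and return only the slope coefficients."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]

X = np.column_stack([mileage_km, age_years])
raw = ols_slopes(X, price)

# Standardize each column to mean 0 and standard deviation 1, then refit.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
standardized = ols_slopes(Xz, price)

print("raw coefficients:", raw.round(3))
print("standardized coefficients:", standardized.round(0))
```

The raw coefficients differ by several orders of magnitude purely because of the units, while the standardized ones express the change in price per one standard deviation of each predictor.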
Part II: Use the used_car_price dataset from here. Create a regression model to predict vehicle sale price from given attributes (some feature engineering is required).
The target column is price_usd.
0  Honda          Amaze 1.2 VX i-VTEC
1  Maruti Suzuki  Swift DZire VDI
2  Hyundai        i10 Magna 1.2 Kappa2
3  Toyota         Glanza G
4  Toyota         Innova 2.4 VX 7 STR [2016-2020]
5  Maruti Suzuki  Ciaz ZXi
6  Mercedes-Benz  CLA 200 Petrol Sport
7  BMW            X1 xDrive20d M Sport
8  Skoda          Octavia 1.8 TSI Style Plus AT [2017]
9  Nissan         Terrano XL (D)
/tmp/ipykernel_498/1495187648.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
used_car_price['engine'] = used_car_price['engine'].str.extract('(\d+)').astype(int)
/tmp/ipykernel_498/1495187648.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
used_car_price['max_power'] = used_car_price['max_power'].str.extract('(\d+)').astype(int)
/tmp/ipykernel_498/1495187648.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
used_car_price['max_torque'] = used_car_price['max_torque'].str.extract('(\d+)').astype(int)
/shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:4509: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().replace(
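The warnings above typically appear when columns are assigned on a DataFrame that was itself produced by slicing or filtering another one. A minimal sketch of one way to avoid them is to take an explicit copy before assigning, and to use a raw string for the regular expression. The frame below is made up; its values and the filter condition are hypothetical and only mimic the column names seen in the warnings.

```python
import pandas as pd

# Hypothetical stand-in for the raw data; the value formats are assumptions.
raw = pd.DataFrame({
    "engine": ["1198 cc", "1248 cc", "1197 cc"],
    "max_power": ["87 bhp @ 6000 rpm", "74 bhp @ 4000 rpm", "79 bhp @ 6000 rpm"],
})

# Filtering returns a new object; .copy() makes it an independent DataFrame,
# so later column assignments are unambiguous and raise no warning.
used_car_price = raw[raw["engine"].notna()].copy()

for col in ["engine", "max_power"]:
    # expand=False returns a Series holding the first run of digits.
    used_car_price[col] = (
        used_car_price[col].str.extract(r"(\d+)", expand=False).astype(int)
    )

print(used_car_price)
```

Note that the pattern keeps only the first run of digits, so a value such as "87.5 bhp" would be truncated to 87.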
31  11100  2016
32   6420  2015
34   1740  2009
35   4908  2018
36   5880  2018
37  22800  2019
38  46200  2017
39  13500  2018
40  27600  2014
41  13140  2016