# Lab 7 - Machine Learning (Regression)

## Part I: Answer the following questions:

### 1. How can we use regression analysis to identify the key factors contributing to the target variable?

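To identify the relationships between variables, we can use the correlation coefficient, which measures the strength and direction of the linear relationship between each predictor and the target. Fitting the regression model then shows, through the sign, size, and statistical significance of each coefficient, which predictors contribute to the target once the others are accounted for.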
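As a rough illustration of this idea, the sketch below compares each feature's correlation with the target against its standardized regression coefficient. The data is synthetic and the coefficients 3.0 and 0.5 are assumptions chosen for the example; nothing here comes from the lab dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic data (assumption): feature 0 drives y strongly,
# feature 1 weakly, feature 2 not at all.
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

# Pearson correlation of each feature with the target.
correlations = np.array(
    [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
)

# Standardize features and target so the fitted coefficients are comparable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()
design = np.column_stack([np.ones(n), X_std])
coefficients, *_ = np.linalg.lstsq(design, y_std, rcond=None)
standardized_betas = coefficients[1:]  # drop the intercept

# Rank features from largest to smallest absolute standardized coefficient.
ranking = np.argsort(-np.abs(standardized_betas))
print("correlations:", correlations.round(2))
print("standardized coefficients:", standardized_betas.round(2))
print("features ranked by |coefficient|:", ranking)
```

Standardizing first is one possible design choice; it puts all coefficients on a common scale so their magnitudes can be compared.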

### 2. What are the assumptions that linear regression has on data?

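Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded as binary (dummy) variables or other types of contrast variables. Linear regression also assumes a linear relationship between the predictors and the target, independent errors, constant error variance (homoscedasticity), and approximately normally distributed residuals.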
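Recoding a categorical variable as dummy variables could look like the following minimal sketch. The `region` and `income` columns and their values are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical rows; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "region": ["north", "south", "east", "south"],
    "income": [42.0, 38.5, 51.2, 40.1],
})

# drop_first=True drops one level per categorical column so the dummies
# are not perfectly collinear with the intercept (the "dummy variable trap").
encoded = pd.get_dummies(df, columns=["region"], drop_first=True, dtype=int)
print(encoded)
```

Here `east` becomes the reference level, so the `region_north` and `region_south` coefficients would be read relative to it.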

### 3. What is collinearity? What will impact the regression model if the data contains collinear columns?

Collinearity is the case where two or more independent variables are highly correlated with each other.

If the data contains collinear columns, the coefficient estimates of the affected independent variables become very sensitive to small changes in the model or the data, and the variance and standard errors of those estimates are inflated.
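One common way to quantify this inflation is the variance inflation factor (VIF), computed by regressing each feature on the others. The sketch below is a minimal NumPy version on synthetic data; the helper name `variance_inflation_factors` is hypothetical, not a library function.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (rows are observations)."""
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)                  # independent of both
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs.round(1))
```

A VIF near 1 indicates little collinearity, while large values (a threshold of 5 or 10 is often used as a rule of thumb) flag columns worth dropping or combining.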

### 4. What is the interaction effect in regression analysis? How can we test for the impact of the interaction terms in the regression model?

The interaction effect is the case where the effect of an independent variable on the dependent variable changes depending on the value(s) of one or more other independent variables.

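We can test for the effect of interaction terms by using an interaction plot and comparing its lines: if the lines are parallel, there is no interaction; if they are not, there is an interaction. A formal test adds the product of the two variables to the model as an extra term and checks whether its coefficient is statistically significant.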
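A numerical check of an interaction can be sketched by adding a product term to the model and looking at its t-statistic. The data below is synthetic, and the true interaction strength of 1.5 is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Synthetic target (assumption): the effect of x1 depends on x2
# through the 1.5 * x1 * x2 term.
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 1.5 * x1 * x2 + rng.normal(size=n)

# Design matrix: intercept, main effects, and the product (interaction) term.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard errors from sigma^2 * (X'X)^-1.
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

print("interaction coefficient:", round(beta[3], 2))
print("interaction t-statistic:", round(t_stats[3], 1))
```

A t-statistic far from zero for the product term suggests the interaction is real; a value near zero suggests the simpler additive model may be enough.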

### 5. Regression Through Origin is a type of regression model where the intercept term is set to zero (or without the intercept term). Discuss the effect of using such a model compared to a model with an intercept term, and when it should be used.

### 6. How can regression analysis be used in the feature selection process?

Regression analysis can be used in the feature selection process to identify the variables most strongly related to the response variable, using correlation analysis to remove unrelated variables.
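A simple correlation filter along these lines might look like the sketch below. The column names, the synthetic data, and the 0.2 threshold are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 500
# Synthetic data (assumption): only "a" and "b" influence the target.
df = pd.DataFrame(rng.normal(size=(n, 4)), columns=["a", "b", "c", "d"])
df["target"] = 2.0 * df["a"] - 1.0 * df["b"] + rng.normal(size=n)

# Absolute Pearson correlation of each feature with the target.
relevance = df.corr()["target"].drop("target").abs()

threshold = 0.2  # illustrative cut-off, not a recommended default
selected = relevance[relevance >= threshold].index.tolist()
print(relevance.round(2))
print("selected:", selected)
```

This filter looks at each feature in isolation, so it can miss variables that only matter jointly with others.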

### 8. What are two methods that can be used to encode a categorical variable into a numerical variable? When should each be used?

### 9. Give an example of common metrics that are used to evaluate the performance of the regression model.

A larger coefficient isn't necessarily more important than a smaller one, because each coefficient represents the change in the target per unit change of a different variable, and those variables may be measured in different units.
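The unit dependence of coefficients can be seen in a small sketch: rescaling a feature changes its coefficient but not the model's predictions. The mileage and age features and the price formula below are hypothetical and are not taken from the lab dataset.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
# Hypothetical features: mileage in kilometres and age in years.
mileage_km = rng.uniform(5_000, 150_000, size=n)
age_years = rng.uniform(1, 15, size=n)
price = (20_000 - 0.05 * mileage_km - 500 * age_years
         + rng.normal(scale=300, size=n))

def fit(features):
    """Fit OLS with an intercept; return coefficients and fitted values."""
    design = np.column_stack([np.ones(n), features])
    beta, *_ = np.linalg.lstsq(design, price, rcond=None)
    return beta, design @ beta

beta_km, pred_km = fit(np.column_stack([mileage_km, age_years]))
# Same information, but mileage expressed in thousands of kilometres.
beta_thousand_km, pred_thousand_km = fit(
    np.column_stack([mileage_km / 1000, age_years])
)

print("mileage coefficient per km:     ", round(beta_km[1], 4))
print("mileage coefficient per 1000 km:", round(beta_thousand_km[1], 2))
```

The second coefficient is 1000 times the first even though both models describe the same relationship, which is why raw coefficient size alone does not indicate importance.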

## Part II: Use the used_car_price dataset from here. Create a regression model to predict vehicle sale price from given attributes (some feature engineering is required).

The target column is price_usd.

| | Make | Model |
|---|---|---|
| 0 | Honda | Amaze 1.2 VX i-VTEC |
| 1 | Maruti Suzuki | Swift DZire VDI |
| 2 | Hyundai | i10 Magna 1.2 Kappa2 |
| 3 | Toyota | Glanza G |
| 4 | Toyota | Innova 2.4 VX 7 STR [2016-2020] |
| 5 | Maruti Suzuki | Ciaz ZXi |
| 6 | Mercedes-Benz | CLA 200 Petrol Sport |
| 7 | BMW | X1 xDrive20d M Sport |
| 8 | Skoda | Octavia 1.8 TSI Style Plus AT [2017] |
| 9 | Nissan | Terrano XL (D) |

```
/tmp/ipykernel_498/1495187648.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
used_car_price['engine'] = used_car_price['engine'].str.extract('(\d+)').astype(int)
/tmp/ipykernel_498/1495187648.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
used_car_price['max_power'] = used_car_price['max_power'].str.extract('(\d+)').astype(int)
/tmp/ipykernel_498/1495187648.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
used_car_price['max_torque'] = used_car_price['max_torque'].str.extract('(\d+)').astype(int)
/shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/series.py:4509: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().replace(
```
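The `SettingWithCopyWarning` messages above typically appear when columns are assigned on a DataFrame that was itself produced by slicing or filtering another one. A minimal sketch of one way to avoid them is shown below, assuming the subset is created first and then modified. The sample rows, their units, and the `dropna()` filtering step are assumptions, since the original cell is not shown.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset's values and units are assumptions.
raw = pd.DataFrame({
    "engine": ["1198 cc", "1248 cc", "1197 cc"],
    "max_power": ["87 bhp @ 6000 rpm", "74 bhp @ 4000 rpm", "79 bhp @ 6000 rpm"],
    "max_torque": ["109 Nm @ 4500 rpm", "190 Nm @ 2000 rpm", "113 Nm @ 4000 rpm"],
})

# Take an explicit copy of the filtered subset so later assignments
# modify an independent DataFrame instead of a slice of `raw`.
used_car_price = raw.dropna().copy()

for column in ["engine", "max_power", "max_torque"]:
    # The raw string r"(\d+)" avoids Python's invalid-escape warning for "\d";
    # expand=False returns a Series rather than a one-column DataFrame.
    used_car_price[column] = (
        used_car_price[column].str.extract(r"(\d+)", expand=False).astype(int)
    )

print(used_car_price)
```

Note that `(\d+)` keeps only the first run of digits, so a value such as `112.7 Nm` would be truncated to `112`. A pattern that also matches decimals may be preferable if the columns contain them.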

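| | Price (USD, presumed) | Year |
|---|---|---|
| 31 | 11100 | 2016 |
| 32 | 6420 | 2015 |
| 34 | 1740 | 2009 |
| 35 | 4908 | 2018 |
| 36 | 5880 | 2018 |
| 37 | 22800 | 2019 |
| 38 | 46200 | 2017 |
| 39 | 13500 | 2018 |
| 40 | 27600 | 2014 |
| 41 | 13140 | 2016 |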