Lab 7 - Machine Learning (Regression)

Part I: Answer the following questions:

1. How can we use regression analysis to identify the key factors contributing to the target variable?

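To identify the key factors, we can fit a regression model and examine the coefficient of each independent variable: its sign gives the direction of the relationship, and its p-value (or confidence interval) indicates whether the effect is statistically distinguishable from zero. The correlation coefficient between each variable and the target can also be used to measure the strength and direction of their linear relationship.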

2. What are the assumptions that linear regression has on data?

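Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded as binary (dummy) variables or other types of contrast variables. Linear regression also assumes a linear relationship between the independent variables and the target, independent errors, constant error variance (homoscedasticity), and approximately normally distributed residuals.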

3. What is collinearity? What will impact the regression model if the data contains collinear columns?

Collinearity is a case when two or more independent variables are highly correlated with each other.

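If the data contains collinear columns, the coefficient estimates of the independent variables become very sensitive to small changes in the model or the data, and the variance and standard errors of those estimates are inflated. This makes individual coefficients hard to interpret, even when the model as a whole still predicts well.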
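One common way to quantify collinearity is the variance inflation factor (VIF). The sketch below is a minimal illustration that assumes a model with only two predictors, in which case the VIF reduces to 1 / (1 - r^2), where r is their Pearson correlation. The data values and the names `engine_cc`, `power_hp`, and `mileage` are made up for illustration and are not taken from the lab dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data: engine size and power rise almost in lockstep,
# while mileage is only loosely related to engine size.
engine_cc = [1000, 1200, 1500, 1800, 2000, 2500]
power_hp = [68, 82, 99, 121, 135, 170]
mileage = [30, 12, 25, 8, 20, 15]

vif_collinear = vif_two_predictors(engine_cc, power_hp)
vif_unrelated = vif_two_predictors(engine_cc, mileage)

print(f"VIF (engine vs power):   {vif_collinear:.1f}")
print(f"VIF (engine vs mileage): {vif_unrelated:.2f}")
```

A VIF close to 1 means the predictor shares little variance with the other one; a large VIF signals that the standard error of its coefficient is strongly inflated.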

4. What is the interaction effect in regression analysis? How can we test for the impact of the interaction terms in the regression model?

The interaction effect is a case when the effect of an independent variable on the dependent variable changes depending on the value(s) of one or more other independent variables.

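We can test for the effect of interaction terms by using an interaction plot and comparing its lines: if the lines are parallel, there is no interaction; if they are not, there is an interaction. A more formal test adds the product of the two variables to the model as an extra term and checks whether its coefficient is statistically significant.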
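The "parallel lines" check can be sketched numerically. With a binary second variable, the slope of the target against the first variable can be fitted separately in each group; the difference between the two slopes equals the coefficient that an interaction (product) term would receive. The data below is hypothetical and noise-free, generated from y = 1 + 2x + 3g + 1.5xg, so the expected difference is 1.5.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: g is a binary variable (0 or 1), x is numeric.
x = [0, 1, 2, 3, 4]
y_group0 = [1 + 2 * v for v in x]                # g = 0
y_group1 = [1 + 2 * v + 3 + 1.5 * v for v in x]  # g = 1

slope_group0 = ols_slope(x, y_group0)
slope_group1 = ols_slope(x, y_group1)

# Non-zero difference means the two lines are not parallel.
interaction_estimate = slope_group1 - slope_group0
print(slope_group0, slope_group1, interaction_estimate)
```

With real, noisy data the difference is rarely exactly zero, which is why the significance of the interaction term's coefficient is usually checked rather than the raw difference.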

5. Regression Through Origin is a type of regression model where the intercept term is set to zero (or without the intercept term). Discuss the effect of using such a model compared to a model with an intercept term, and when it should be used.

6. How can regression analysis be used in the feature selection process?

Regression analysis can be used in the feature selection process to identify the variables most strongly related to the response variable, for example by using correlation analysis to remove unrelated variables.
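A correlation-based filter of this kind might look like the sketch below. The function name `select_by_correlation`, the threshold of 0.5, and the feature names and values are all assumptions made for illustration; a real threshold would need to be chosen for the data at hand.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def select_by_correlation(features, target, threshold):
    """Keep the features whose |r| with the target is at least `threshold`."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, target)) >= threshold
    ]


# Hypothetical data: one feature tracks the target, the other does not.
target = [10, 12, 15, 19, 24]
features = {
    "year_index": [1, 2, 3, 4, 5],
    "random_id": [3, 5, 1, 4, 2],
}

selected = select_by_correlation(features, target, threshold=0.5)
print(selected)
```

This filter only looks at pairwise linear relationships with the target, so it can miss variables that matter only in combination with others.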

8. What are two methods that can be used to encode a categorical variable into a numerical variable? When should each be used?

9. Give an example of common metrics that are used to evaluate the performance of the regression model.

A larger coefficient isn't necessarily more important than a smaller one, because each coefficient represents the change associated with a different variable, and those variables may be measured in different units.
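The dependence on units can be seen in a small sketch: refitting the same relationship with the predictor expressed in a different unit rescales the coefficient without changing the fit. The mileage and price values below are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Hypothetical data: price falls as mileage rises.
mileage_km = [10000, 20000, 30000, 40000, 50000]
price_usd = [9000, 8100, 7400, 6300, 5500]

# Same predictor, two different units.
slope_per_km = ols_slope(mileage_km, price_usd)
mileage_thousand_km = [km / 1000 for km in mileage_km]
slope_per_thousand_km = ols_slope(mileage_thousand_km, price_usd)

print(slope_per_km, slope_per_thousand_km)
```

The second coefficient is 1000 times larger in magnitude even though both describe exactly the same relationship, which is why coefficients are usually compared only after the variables are put on a common scale.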

Part II: Use the used_car_price dataset from here. Create a regression model to predict vehicle sale price from given attributes (some feature engineering is required).

The target column is price_usd.

import pandas as pd

used_car_price = pd.read_csv('used_car_price.csv')
used_car_price

used_car_price = used_car_price.dropna()

# Columns stored as text: keep the first run of digits and convert it to int.
if used_car_price['engine'].dtype == 'O':
    used_car_price['engine'] = used_car_price['engine'].str.extract(r'(\d+)', expand=False).astype(int)
if used_car_price['max_power'].dtype == 'O':
    used_car_price['max_power'] = used_car_price['max_power'].str.extract(r'(\d+)', expand=False).astype(int)
if used_car_price['max_torque'].dtype == 'O':
    used_car_price['max_torque'] = used_car_price['max_torque'].str.extract(r'(\d+)', expand=False).astype(int)

# Encode ownership history as an ordinal number (assigned back instead of using inplace on a column).
used_car_price["owner"] = used_car_price["owner"].replace(
    ["UnRegistered Car", "First", "Second", "Third"], [0, 1, 2, 3]
)

used_car_price_prep = used_car_price.drop(columns=['model'])

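used_car_price_num = used_car_price_prep.select_dtypes("number")
used_car_price_object = used_car_price_prep.select_dtypes("object")
# dtype=int keeps the dummy columns numeric (recent pandas versions default to bool),
# so the combined frame can be passed to statsmodels.
used_car_price_prep = pd.concat(
    [used_car_price_num, pd.get_dummies(used_car_price_object, drop_first=True, dtype=int)],
    axis=1
)
used_car_price_prep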

used_car_price_x = used_car_price_prep.drop(columns=["price_usd"])
used_car_price_y = used_car_price_prep["price_usd"]

try:
    import statsmodels.api as sm
except ImportError as e:
    !pip install statsmodels
    import statsmodels.api as sm

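# Note: sm.OLS does not add an intercept on its own, so this fits the model without a constant term.
model = sm.OLS(used_car_price_y, used_car_price_x).fit()
model.summary()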
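Once a model is fitted, its predictions can be scored with common regression metrics such as MAE, RMSE, and R-squared. The sketch below computes them by hand on small hypothetical lists; in the notebook, the predicted values would presumably come from something like `model.predict(used_car_price_x)`. The function name `regression_metrics` is an assumption for illustration.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R-squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n

    ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot,
    }


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE and RMSE are expressed in the units of the target (here, that would be US dollars), while R-squared is unitless.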
