Lab 7 - Machine Learning (Regression)
Part I: Answer the following questions:
1. How can we use regression analysis to identify the key factors contributing to the target variable?
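To identify the relationship between each variable and the target, we can use the correlation coefficient to measure the strength and direction of their relationship. In a fitted regression model, the sign and statistical significance of each coefficient then indicate how each factor is associated with the target variable.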
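As a minimal sketch of the correlation coefficient mentioned above, the snippet below computes Pearson's r in plain Python. The function name pearson_r and the sample values are made up for illustration and are not taken from the lab dataset.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values: the second variable falls as the first rises.
mileage = [10, 20, 30, 40, 50]
price = [90, 80, 70, 60, 50]

r = pearson_r(mileage, price)
print(round(r, 3))
```

The sign of r gives the direction of the relationship and its absolute value (between 0 and 1) gives the strength of the linear association.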
2. What are the assumptions that linear regression has on data?
Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study, or region of residence, need to be recoded as binary (dummy) variables or other types of contrast variables.
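The recoding of a categorical variable into binary (dummy) variables could look roughly like the sketch below. The helper dummy_encode and the region labels are hypothetical, and dropping the first level as a reference category is one common convention rather than the only option.

```python
def dummy_encode(values, drop_first=True):
    """Turn a list of category labels into 0/1 indicator columns."""
    levels = sorted(set(values))
    if drop_first:
        # Drop one level as the reference category so the dummy columns
        # are not perfectly collinear with the model's intercept.
        levels = levels[1:]
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Made-up example: region of residence for four observations.
region = ["north", "south", "east", "south"]
encoded = dummy_encode(region)
print(encoded)
```

With pandas available, pandas.get_dummies with drop_first=True offers comparable behavior.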
3. What is collinearity? What will impact the regression model if the data contains collinear columns?
4. What is the interaction effect in regression analysis? How can we test for the impact of the interaction terms in the regression model?
5. Regression Through Origin is a type of regression model where the intercept term is set to zero (or without the intercept term). Discuss the effect of using such a model compared to a model with an intercept term, and when it should be used.
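To make the definition above concrete, the following sketch fits a simple least-squares line with and without an intercept term. The function names and the data are illustrative assumptions. The data follow y = 1 + 2x, so the true intercept is not zero.

```python
def fit_with_intercept(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_through_origin(x, y):
    """Least squares for y = b1 * x (intercept fixed at zero); returns b1."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Made-up data generated from y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

intercept, slope = fit_with_intercept(x, y)
slope_origin = fit_through_origin(x, y)
print(intercept, slope, slope_origin)
```

In this example the model with an intercept recovers the slope of 2, while the model forced through the origin reports a steeper slope because it has to absorb the missing intercept.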
6. How can regression analysis be used in the feature selection process?
Regression analysis can be used in the feature selection process to identify the variables most strongly related to the response variable, for example by using correlation analysis to remove unrelated variables.
8. What are two methods that can be used to encode a categorical variable into a numerical variable? When should each be used?
9. Give an example of common metrics that are used to evaluate the performance of the regression model.
A larger coefficient isn't necessarily more important than a smaller one, because each coefficient represents the change associated with a different variable, and those variables may be measured in different units.
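The dependence of a coefficient on measurement units can be illustrated with a small sketch. The data and variable names are made up. The same predictor is expressed once in kilometres and once in metres, and both versions are then standardized (z-scored) for comparison.

```python
import math

def slope(x, y):
    """Least-squares slope of y on x (model with an intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def standardize(values):
    """Rescale to mean 0 and sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# Made-up data: the same distances in two different units.
distance_km = [1, 2, 3, 4]
distance_m = [d * 1000 for d in distance_km]
cost = [12, 19, 32, 41]

slope_km = slope(distance_km, cost)
slope_m = slope(distance_m, cost)

# After standardizing, the unit of measurement no longer matters.
std_slope_km = slope(standardize(distance_km), standardize(cost))
std_slope_m = slope(standardize(distance_m), standardize(cost))
print(slope_km, slope_m, std_slope_km, std_slope_m)
```

The raw slopes differ by a factor of 1000 even though both models describe the same relationship, while the standardized slopes agree. Comparing standardized coefficients is one common way to put predictors on a comparable scale.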
Part II: Use the used_car_price dataset from here. Create a regression model to predict vehicle sale price from given attributes (some feature engineering is required).
The target column is price_usd.