Mortgage Data Exercise (6/22/21) | Mathaus Silva
In preparation for next class's introduction to machine learning, we'll be cleaning the mortgage testing dataset. First, we import the appropriate libraries, load the mortgage data, and check the data types of the columns.
To use this data in a future numeric model, we will replace the categorical columns seen above (those with dtype 'object') with boolean columns.
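As a rough sketch of this first step, the snippet below loads a small table and prints its column types. The inline sample, its values, and the 'loan_amount' column are assumptions made only so the example runs on its own; with the real file, the load would be a plain pd.read_csv call on the dataset's path.

```python
import io
import pandas as pd

# Tiny inline sample standing in for the real file (hypothetical values).
# With the actual dataset this would be pd.read_csv(<path to the mortgage CSV>).
sample_csv = io.StringIO(
    "loan_amount,conforming_loan_limit,derived_sex,action_taken,debt_to_income_ratio\n"
    "250000,C,Male,1,20%-<30%\n"
    "410000,NC,Female,2,36\n"
)
df = pd.read_csv(sample_csv)

# Text columns show up with a string/object dtype; numeric ones as int64 or float64.
print(df.dtypes)
```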
First, we replace the original 'conforming_loan_limit' column with two boolean columns, where 'conforming_loan_limit_c' means conforming loan limit is C (conforming) and 'conforming_loan_limit_nc' means conforming loan limit is NC (not conforming).
Since a loan can only be either conforming or non-conforming, 'conforming_loan_limit_c' and 'conforming_loan_limit_nc' will never store the same value. A similar process will be repeated for the 'derived_sex' and 'action_taken' columns.
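One way this replacement might look is sketched below. It assumes the raw column stores the literal strings "C" and "NC", and it uses a three-row stand-in DataFrame rather than the real data.

```python
import pandas as pd

# Stand-in data; the real column is assumed to hold the strings "C" and "NC".
df = pd.DataFrame({"conforming_loan_limit": ["C", "NC", "C"]})

# One 0/1 indicator column per category.
df["conforming_loan_limit_c"] = (df["conforming_loan_limit"] == "C").astype(int)
df["conforming_loan_limit_nc"] = (df["conforming_loan_limit"] == "NC").astype(int)

# Drop the original text column now that it is encoded.
df = df.drop(columns=["conforming_loan_limit"])
print(df)
```

Note that a row whose original value is neither "C" nor "NC" (a missing value, for instance) would get 0 in both new columns.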
We replace the original 'derived_sex' column with two boolean columns, where 'derived_sex_m' means derived sex is Male and 'derived_sex_f' means derived sex is Female.
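Because the same pattern repeats, it could be wrapped in a small helper. The function below, add_indicator_columns, is a hypothetical name introduced for illustration, and the values "Male" and "Female" are assumed to be how the column is spelled in the data.

```python
import pandas as pd

def add_indicator_columns(df, column, suffix_by_value):
    """Return a copy of df where `column` is replaced by one 0/1 column
    per entry in `suffix_by_value` (raw value -> new column suffix)."""
    out = df.copy()
    for value, suffix in suffix_by_value.items():
        out[f"{column}_{suffix}"] = (out[column] == value).astype(int)
    return out.drop(columns=[column])

# Stand-in data; the real column is assumed to hold "Male" and "Female".
df = pd.DataFrame({"derived_sex": ["Male", "Female", "Male"]})
df = add_indicator_columns(df, "derived_sex", {"Male": "m", "Female": "f"})
print(df)
```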
As for the 'action_taken' column, it only contains 1s and 2s. This is because the dataset was filtered to only include accepted or rejected mortgages (no withdrawals, pre-approvals, etc.). We replace this column with a single boolean column, 'application_accepted', which uses 1 for True and 0 for False.
The 'application_accepted' column now holds 0 for every denied application and 1 for every approved application.
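A minimal sketch of that conversion follows. Which of the two codes marks an accepted application is not stated above, so the constant ACCEPTED_CODE = 1 is an assumption; change it if the data uses the opposite convention.

```python
import pandas as pd

ACCEPTED_CODE = 1  # assumption: 1 = accepted, 2 = denied

# Stand-in data containing only the two codes described above.
df = pd.DataFrame({"action_taken": [1, 2, 1, 1]})

# 1 where the application was accepted, 0 otherwise.
df["application_accepted"] = (df["action_taken"] == ACCEPTED_CODE).astype(int)
df = df.drop(columns=["action_taken"])
print(df)
```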
The 'debt_to_income_ratio' column is categorical as opposed to numeric. We can make it numeric by replacing each category with a central value in that category. Here, we use a dictionary that assigns a central value to each percentage range and then apply it to the 'debt_to_income_ratio' column. Values that don't fall into the following ranges will remain untouched.
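The idea can be sketched as follows. The category labels and the central values in dti_midpoints are illustrative assumptions, not the values used in class; the open-ended buckets in particular have no true midpoint, so their numbers are a judgment call.

```python
import pandas as pd

# Assumed category labels with illustrative central values.
dti_midpoints = {
    "<20%": 10,
    "20%-<30%": 25,
    "30%-<36%": 33,
    "50%-60%": 55,
    ">60%": 70,
}

# Stand-in data: three range labels and one entry that is already a number.
df = pd.DataFrame({"debt_to_income_ratio": ["20%-<30%", "36", "50%-60%", "<20%"]})
col = "debt_to_income_ratio"

# Series.map with a dict yields NaN for any value that has no dictionary entry...
mapped = df[col].map(dti_midpoints)
# ...so we fall back to the original value, parsed as a number, for those rows.
as_number = pd.to_numeric(df[col], errors="coerce")
df[col] = mapped.fillna(as_number)

print(df[col].tolist())
```

The fallback step matters: mapping alone would turn an entry like "36" into a missing value instead of leaving it as it was.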
Now that we've replaced all our categorical columns, we use 'df.dtypes' once again to check that our columns have changed to numeric data types.
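Besides reading the df.dtypes output by eye, the check can be made explicit by listing any columns that are still non-numeric. The DataFrame here is a small stand-in for the cleaned data.

```python
import pandas as pd

# Stand-in for the cleaned DataFrame.
df = pd.DataFrame({
    "conforming_loan_limit_c": [1, 0],
    "application_accepted": [1, 0],
    "debt_to_income_ratio": [25.0, 36.0],
})
print(df.dtypes)

# Names of columns that are still non-numeric; an empty list means none remain.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```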
As they are all now numeric columns, our DataFrame is cleaned and ready to export as a CSV file.
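The export itself might look like the sketch below. The file name mortgage_data_cleaned.csv is hypothetical, and the file is written to the system's temporary directory only so the example has no side effects elsewhere.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the cleaned DataFrame.
df = pd.DataFrame({
    "application_accepted": [1, 0],
    "debt_to_income_ratio": [25.0, 36.0],
})

# Hypothetical output name, placed in the temp directory for this example.
out_path = os.path.join(tempfile.gettempdir(), "mortgage_data_cleaned.csv")

# index=False keeps the row index from being written as an extra column.
df.to_csv(out_path, index=False)

# Read the file back to confirm the data survived the round trip.
round_trip = pd.read_csv(out_path)
print(round_trip)
```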