Data Pre-Processing:
- Missing Values Treatment: Numerical (Mean/Median imputation) and Categorical (Separate Missing Category or Merging)
- Univariate Analysis: Outlier and Frequency Analysis
The dataset contains some missing values, and upon closer examination, it was discovered that the missing values in the "default" column are associated with a new group of customers.
-> To address outliers in the dataset, we applied the technique of Winsorization.
-> We examined the distribution of defaulting versus non-defaulting customers to determine whether the dataset is balanced or imbalanced. Both pre-processing steps are sketched below.
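A minimal sketch of these pre-processing steps, assuming the data sits in a single DataFrame with a `default` column; the file name and the 1% winsorization limits are illustrative assumptions, not the notebook's originals:

```python
import pandas as pd
from scipy.stats.mstats import winsorize

df = pd.read_csv("bankloans.csv")  # hypothetical file name

# The missing 'default' values mark the 150 new customers (no repayment
# history yet), so we separate them instead of imputing.
existing = df[df["default"].notna()].copy()
new_customers = df[df["default"].isna()].copy()

# Winsorization: cap each numeric feature at the 1st and 99th percentiles
# (the limits are an assumption).
for col in existing.columns.drop("default"):
    existing[col] = winsorize(existing[col], limits=[0.01, 0.01])

# Class balance among the existing customers.
print(existing["default"].value_counts(normalize=True))
```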
Exploratory Data Analysis:
- Bivariate Analysis - Numeric (T-test) / Categorical (Chi-square)
- Bivariate Analysis - Visualization
- Variable Reduction - Multicollinearity
Bivariate Analysis:
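A sketch of these tests, reusing the `existing` frame from the pre-processing sketch above:

```python
from scipy import stats

# Two-sample t-test per numeric feature: defaulters vs non-defaulters.
defaulters = existing[existing["default"] == 1]
non_defaulters = existing[existing["default"] == 0]

for col in existing.columns.drop("default"):
    t_stat, p_val = stats.ttest_ind(defaulters[col], non_defaulters[col],
                                    equal_var=False)
    print(f"{col}: t = {t_stat:.2f}, p = {p_val:.4f}")

# For a categorical predictor (this data set has none), a chi-square test
# of independence on the cross-tab with 'default' would be used instead.
```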
Multicollinearity Check
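A typical VIF computation with statsmodels, again reusing the `existing` frame:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF per predictor; values above roughly 5-10 flag problematic collinearity.
X_vif = sm.add_constant(existing.drop(columns="default"))
for i, col in enumerate(X_vif.columns):
    if col != "const":
        print(f"{col}: VIF = {variance_inflation_factor(X_vif.values, i):.2f}")
```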
Observations:
- There are 850 observations and 9 features in the data set.
- All 9 features are numerical in nature.
- Apart from the "default" label, which is blank for the new customers, there are no missing values in the data set.
- Of the 850 customers, 700 are existing customers and 150 are new customers.
- Among the 700 existing customers, 517 are tagged as non-defaulters and the remaining 183 as defaulters.
- The data is imbalanced: defaulters make up roughly 26% of existing customers.
- The VIF check shows that the correlation between the variables is within acceptable limits.
Model Building and Model Diagnostics
Logistic Regression & Decision Tree Classification:
Train/Test Split: Splitting data into training and testing sets to check the model's performance on unseen data.
Variable Significance: Assessing significance of each variable using statistical tests to determine their association with the outcome variable.
Gini and ROC/Concordance: Measures of the model's ranking performance - the Gini coefficient (Gini = 2*AUC - 1) summarizes how well the model separates defaulters from non-defaulters, the ROC curve shows the sensitivity vs specificity trade-off across cutoffs, and concordance measures how often an actual defaulter receives a higher predicted probability than a non-defaulter.
Classification Table Accuracy: Evaluate model performance using a classification table to compare predicted vs actual values and calculate accuracy.
Decision Tree Classifier: Same analysis as logistic regression, with measures such as the Gini index, Chi-square, or information gain used to select significant split variables. ROC curves and concordance analysis are less direct when the tree outputs hard class labels, but can still be computed from its predicted class probabilities. A confusion matrix can also be used to evaluate model performance. A combined code sketch of these diagnostics follows.
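A minimal sketch of these steps, reusing the `existing` frame from the pre-processing sketch; the 70/30 split, random seed, and use of statsmodels are illustrative choices:

```python
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score

# Hold out 30% of existing customers for testing (ratio and seed are assumptions).
X = existing.drop(columns="default")
y = existing["default"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Variable significance: statsmodels reports a p-value per coefficient.
logit = sm.Logit(y_train, sm.add_constant(X_train)).fit(disp=0)
print(logit.summary())

# Gini and ROC: Gini = 2*AUC - 1.
test_probs = logit.predict(sm.add_constant(X_test))
auc = roc_auc_score(y_test, test_probs)
print(f"AUC = {auc:.3f}, Gini = {2 * auc - 1:.3f}")

# Classification table at the default 0.5 cutoff.
preds = (test_probs >= 0.5).astype(int)
print(confusion_matrix(y_test, preds))
print(f"Accuracy = {accuracy_score(y_test, preds):.3f}")
```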
Logistic Regression
Model Performance
Test dataset:
Recall measures the ratio of correctly classified positive examples to all actual positive examples: Recall = TP / (TP + FN). A high recall indicates that the model is correctly recognizing the positive examples in the dataset.
Precision measures the ratio of correctly classified positive examples to all predicted positive examples: Precision = TP / (TP + FP). A high precision indicates that an example labeled as positive by the model is indeed positive.
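As a quick check, scikit-learn's classification_report prints both metrics per class; this reuses `y_test` and `preds` from the diagnostics sketch above:

```python
from sklearn.metrics import classification_report

# Precision and recall per class at the current cutoff.
print(classification_report(y_test, preds,
                            target_names=["no default", "default"]))
```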
Inference: The model's overall test accuracy is 80%, but accuracy alone is not sufficient to evaluate the model, because the primary objective is to identify customers who are likely to default. There are numerous cases where customers defaulted but the model predicted them as non-defaulters, i.e. a high rate of false negatives.
-> To improve the model's ability to identify customers who are likely to default, the classification threshold can be lowered from the default of 0.5. The bank can then intervene and take action based on this more sensitive prediction of default risk.
Find the optimum cutoff value
The optimal cutoff is the threshold at which sensitivity and specificity are jointly maximized - for example, where their sum peaks (equivalent to maximizing Youden's J statistic).
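One common way to locate this cutoff is to maximize Youden's J (TPR - FPR) along the ROC curve. This sketch reuses `y_test` and `test_probs` from the diagnostics sketch; the exact value returned depends on the fitted model (the text reports 0.224):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Youden's J = sensitivity + specificity - 1 = TPR - FPR;
# pick the threshold where it peaks.
fpr, tpr, thresholds = roc_curve(y_test, test_probs)
optimal_cutoff = thresholds[np.argmax(tpr - fpr)]
print(f"Optimal cutoff = {optimal_cutoff:.3f}")

# Re-classify the test set at the new, lower cutoff.
preds_opt = (test_probs >= optimal_cutoff).astype(int)
```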
Inference:
The model's overall accuracy drops from 80% to 75% when the cutoff is lowered to 0.224, but this change improves the recall score significantly, from 54% to 89%. Recall is an important metric here because it measures the model's ability to identify all positive samples, i.e. customers who are likely to default. The adjustment comes at a cost: the precision score drops from 67% to 52%, meaning more non-default customers are labeled as defaulters. The choice of cutoff ultimately depends on the business's priorities and the relative value placed on true positives versus false positives; in practice, it is usually set as a business decision.
Decision Tree Classifier
Effective machine learning models: -> While cross-validation is an important process in evaluating a model, it is not specifically focused on finding the best combination of parameters. Instead, cross-validation is a technique for assessing how well a model will generalize to new data.
-> Hyperparameter tuning, on the other hand, involves selecting the best hyperparameters for a model, such as the learning rate or number of layers in a neural network. This is typically done by training multiple models with different hyperparameters and evaluating their performance on a validation set. Cross-validation can be used as part of this process to estimate the generalization performance of each model.
Declare the hyperparameters used to fine-tune the Decision Tree Classifier:
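A hypothetical grid for the pruning-related hyperparameters; the notebook's actual values are not shown:

```python
# Candidate grid for the pruning-related hyperparameters
# (the values are illustrative, not the notebook's originals).
dt_hyperparameters = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}
```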
The decision tree algorithm can overfit the training data, resulting in poor performance on unseen data. Pruning is a process used to prevent this by stopping the tree from growing too complex.
-> Hyperparameters in a decision tree control pruning and include parameters like the maximum depth of the tree and the minimum number of samples required to split an internal node.
-> Tuning hyperparameters can optimize the model's performance by finding the best combination of parameters.
Cross-validation is used to evaluate the effectiveness of different hyperparameter settings by training the model on one subset of data and testing it on another, allowing for an understanding of how well the model generalizes to unseen data and which hyperparameters work best.
Decision Tree classifier with the Gini index -> Fitting and tuning the model with cross-validation
Now that we have our pipelines and hyperparameter dictionaries declared, we're ready to tune our models with cross-validation.
We use 5-fold cross-validation, as sketched below.
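A sketch of the tuning step, assuming the `dt_hyperparameters` grid declared above and the train split from the earlier diagnostics sketch; the recall scoring choice reflects the text's emphasis on catching defaulters and is an assumption:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validated grid search over the pruning hyperparameters.
dt = DecisionTreeClassifier(criterion="gini", random_state=42)
grid = GridSearchCV(dt, dt_hyperparameters, cv=5,
                    scoring="recall", n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
best_tree = grid.best_estimator_
```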
Model Performance Evaluation
Visualization of Decision Tree
Dependencies:
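The original dependency list is not shown; one common setup uses matplotlib together with scikit-learn's plot_tree (graphviz/pydotplus is an alternative). This sketch reuses `best_tree` and `X_train` from the tuning sketch:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Render the tuned tree with feature names from the training frame.
plt.figure(figsize=(20, 10))
plot_tree(best_tree, feature_names=X_train.columns.tolist(),
          class_names=["no default", "default"], filled=True)
plt.show()
```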
Model Selection and Business Insights
The logistic regression model has shown better performance than the decision tree model based on their respective F1-scores, with the logistic model having an F1-score of 0.66 for positive labels (default customers) compared to 0.44 for the decision tree model. Therefore, the logistic model will be used to predict the creditworthiness of the remaining 150 customers. A cutoff of 0.224 will be used to classify customers as either default or non-default.
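A sketch of the scoring step, reusing the fitted `logit` model and the `new_customers` frame from earlier sketches; the 0.224 cutoff comes from the analysis above:

```python
import statsmodels.api as sm

# Score the 150 new customers with the logistic model at the 0.224 cutoff.
new_X = sm.add_constant(new_customers.drop(columns="default"))
new_customers["predicted_default"] = (logit.predict(new_X) >= 0.224).astype(int)
print(new_customers["predicted_default"].value_counts())
```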
Insights
Applying the logistic regression model to the 150 new customers gives the following predictions:
-> 85 customers are predicted not to default on the bank loan
-> 65 customers are predicted to most likely default on the loan
Model Performance Validation:
- KS Chart
- Lift and Gain Chart
We will use the concept of decile analysis for these validations.
To prepare the training dataset and testing dataset for these charts:
We plot KS, Lift, and Gain charts for both the training and testing datasets to compare the model's performance on each. This helps determine whether the model is overfitting to the training dataset and also provides insight into its performance on unseen data.
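A sketch of the decile analysis behind these charts, producing the gain, lift, and KS columns; it reuses the fitted `logit` model and the train/test objects from earlier sketches:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def decile_analysis(y_true, y_prob, n_bins=10):
    # Rank customers by predicted default probability and cut into deciles.
    tbl = pd.DataFrame({"actual": np.asarray(y_true), "prob": np.asarray(y_prob)})
    tbl["decile"] = pd.qcut(tbl["prob"].rank(method="first"), n_bins, labels=False)
    g = (tbl.groupby("decile")
            .agg(total=("actual", "size"), events=("actual", "sum"))
            .sort_index(ascending=False))  # highest-risk decile first
    g["cum_pop_pct"] = g["total"].cumsum() / g["total"].sum()
    g["gain"] = g["events"].cumsum() / g["events"].sum()  # cumulative % of defaulters captured
    g["lift"] = g["gain"] / g["cum_pop_pct"]              # vs random selection
    non_events = g["total"] - g["events"]
    g["ks"] = (g["gain"] - non_events.cumsum() / non_events.sum()).abs()
    return g  # KS statistic = g["ks"].max()

train_probs = logit.predict(sm.add_constant(X_train))
print(decile_analysis(y_train, train_probs))  # training dataset
print(decile_analysis(y_test, test_probs))    # testing dataset
```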
Observations:
- The gain chart indicates that about 90% of the customers likely to default can be identified by targeting just the top 50% of customers ranked by predicted risk.
- The lift chart indicates that selecting the top 20% of records based on the model captures 2.7 times as many defaulters as randomly selecting 20% of the data without a model.
The winning model should be saved using Python's standard object-serialization mechanism, pickle, so it can be reused later for testing the model on new data, comparing multiple models, or other purposes.
Later you can load this file to deserialize your model and use it to make new predictions.
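A minimal sketch of the save/load round trip; the file name is illustrative:

```python
import pickle

# Serialize the winning model to disk (file name is illustrative).
with open("logistic_model.pkl", "wb") as f:
    pickle.dump(logit, f)

# Later: deserialize the model and reuse it on new data.
with open("logistic_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
```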