TERM PAPER
Factors that affect loan giving decision of banks
The two most pressing questions in the banking sector are: 1) How risky is the borrower? 2) Should we lend to the borrower given that risk? The answer to the first question determines the borrower's interest rate. The interest rate, among other things (such as the time value of money), reflects the riskiness of the borrower: the higher the interest rate, the riskier the borrower. Based on the interest rate, we then decide whether the applicant is suitable for the loan. Lenders (investors) make loans to borrowers in return for the promise of repayment with interest. That is, the lender only earns a return (interest) if the borrower repays the loan; if he or she does not, the lender loses money. Banks lend to customers in exchange for the promise of repayment, but some customers default, unable to repay for a number of reasons. The bank retains insurance to minimize its losses in the event of a default; the insured sum may cover the whole loan amount or just a portion of it.
Banks have traditionally used manual procedures to decide whether or not a borrower is suitable for a loan. Manual procedures were mostly effective, but they became insufficient when there were a large number of loan applications, as each decision took a long time to make. A loan prediction machine learning model can instead be used to assess a customer's loan status and build strategies. Such a model extracts the essential characteristics of a borrower that influence the loan decision and produces the predicted outcome (loan status). These predictions make a bank manager's job simpler and faster.
We start our literature review with more general systematic reviews that focus on the application of machine learning to banking risk management as a whole. Since the global financial crisis, risk management has taken a major role in shaping decision-making in banks, and a major portion of risk management is the approval of loans to promising candidates. However, the black-box nature of machine learning algorithms makes many loan providers wary of their results. The extensive report by Martin Leo, Suneel Sharma and K. Maddulety [1] explored where machine learning is being used in the fields of credit risk, market risk, operational risk, and liquidity risk, only to conclude that the existing research falls short and that more extensive research is required in the field.
We could not find a literature review recommending specific machine learning algorithms for loan prediction, which would have been a natural starting point for our paper. Instead, since loan prediction is a classification problem, we considered popular classification algorithms used for similar problems. Ashlesha Vaidya [2] used logistic regression as a probabilistic and predictive approach to loan approval prediction. The author pointed out that artificial neural networks and logistic regression are the most used methods for loan prediction, as they are comparatively easy to develop and provide accurate predictive analysis. One reason is that many other algorithms are generally bad at predicting from non-normalized data, whereas logistic regression easily handles nonlinear effects and power terms because the independent variables on which the prediction is based need not be normally distributed.
Logistic regression still has its limitations: it requires a large sample of data for parameter estimation, and it requires that the explanatory variables be independent of each other, otherwise the model tends to overweight the importance of some of them.
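As a minimal sketch of this approach (not Vaidya's exact setup), a logistic regression can be fit on loan data with scikit-learn. The file name and feature choice are illustrative assumptions, using the column names of the dataset introduced later in this paper:

# Minimal sketch: logistic regression for loan approval (file name and
# feature choice are illustrative assumptions, not the author's setup).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_data.csv").dropna()   # drop rows with missing values
X = pd.get_dummies(df[["Gender", "Married", "Education", "Credit_History"]])
y = (df["Loan_Status"] == "Y").astype(int)   # encode target as 0/1

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))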
A solution to this multicollinearity problem among categorical explanatory variables is categorical principal component analysis, used by Guilder and Ozlem [3] in a case study on housing loan approval data. The goal of principal component analysis (PCA) is to reduce a set of m variables, many of which may be highly correlated with each other, to a smaller set of n uncorrelated variables, called principal components, that account for most of the variance in the original m variables. Methods such as PCA are known as dimension reduction methods. PCA is suitable for scaled continuous variables, but it is not an appropriate method of dimension reduction for categorical variables. The authors therefore used a variant of PCA for categorical data called CATPCA, or categorical (nonlinear) principal component analysis, developed specifically for cases where the variables are a mix of nominal, ordinal, and numeric data that may not have linear relationships with each other. CATPCA works by using an optimized scaling process to convert the categorical variables into numeric variables.
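CATPCA itself ships with SPSS rather than with the common Python libraries, so the following is only a rough stand-in: one-hot encoding the categorical variables and applying ordinary PCA illustrates the dimension-reduction idea, not the optimal-scaling step used in [3]:

# Rough stand-in for CATPCA: one-hot encode categoricals, scale, run PCA.
# (True CATPCA uses optimal scaling; this only sketches the reduction idea.)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("loan_data.csv").dropna()   # hypothetical file name
X = pd.get_dummies(df[["Gender", "Married", "Education", "Property_Area"]])
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                    # keep two principal components
components = pca.fit_transform(X_scaled)
print("Explained variance ratio:", pca.explained_variance_ratio_)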
In a similar vein, Zaghdoudi, Djebali and Mezni [4] compared linear discriminant analysis with logistic regression for credit scoring and default risk prediction of small and medium enterprises. Linear discriminant analysis (LDA) performs dimensionality reduction like PCA, but instead of looking for the directions of most variation, LDA maximizes the separability among the known categories; the resulting subspace separates the classes well and is usually the one in which a linear classifier is learned. The difference between the two methods in classifying the enterprises correctly into their original groups was negligible, with logistic regression achieving a 0.3% better accuracy score than LDA.
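A comparison in the spirit of [4] can be sketched with scikit-learn, which provides both classifiers; the features and file name are assumptions, not the authors' data:

# Sketch: LDA vs. logistic regression on the same split, as in [4].
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_data.csv").dropna()
X = df[["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Credit_History"]]
y = (df["Loan_Status"] == "Y").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))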
Another approach, by T. Sunitha and colleagues [5], was to predict loan status using logistic regression and a binary decision tree. A decision tree is a predictive machine learning model. Classification and Regression Trees (CART), introduced by Leo Breiman, suit both predictive and decision modeling problems; this binary tree methodology uses a greedy method to select the best split at each node. Although the decision tree gave a similar accuracy to logistic regression, its benefit in this case was that it gives equal importance to both accuracy and prediction: the model succeeded in making a lower number of false predictions, which reduces the risk factor.
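A minimal CART-style tree of the kind used in [5] can be sketched with scikit-learn, whose DecisionTreeClassifier implements an optimized CART with greedy splitting; the feature choice is an assumption:

# Sketch: CART decision tree with greedy (Gini) splits, as in [5].
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("loan_data.csv").dropna()
X = pd.get_dummies(df[["Married", "Education", "Credit_History", "Property_Area"]])
y = (df["Loan_Status"] == "Y").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_tr, y_tr)
print("Test accuracy:", tree.score(X_te, y_te))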
Rajiv Kumar and Vinod Jain [6] proposed a model using machine learning algorithms to predict the loan approval of customers. They applied three machine learning algorithms, Logistic Regression (LR), Decision Tree (DT), and Random Forest (RF), using Python on a test data set. From the results they concluded that the Decision Tree algorithm performs better than the Logistic Regression and Random Forest approaches. The work also opens other areas to which the Decision Tree algorithm is applicable.
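The paper does not give the authors' exact preprocessing, so the following outline of the three-model comparison is only a sketch:

# Sketch of the LR / DT / RF comparison of [6] via 5-fold cross-validation;
# the data set and preprocessing are assumptions, not the authors' setup.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("loan_data.csv").dropna()
X = pd.get_dummies(df.drop(columns=["Loan_ID", "Loan_Status"]))
y = (df["Loan_Status"] == "Y").astype(int)

models = {"LR": LogisticRegression(max_iter=1000),
          "DT": DecisionTreeClassifier(random_state=42),
          "RF": RandomForestClassifier(n_estimators=100, random_state=42)}
for name, clf in models.items():
    print(name, "mean CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())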
Some machine learning models give different weights to each factor, but in practice a loan can sometimes be sanctioned on the basis of a single strong factor only. To eliminate this problem, J. Tejaswini and T. Mohana Kavya [7] built a loan prediction system that automatically calculates the weight of each feature taking part in loan processing, and on new test data the same features are processed according to their associated weights. They implemented six machine learning classification models in R for choosing the deserving loan applicants: Decision Trees, Random Forest, Support Vector Machine, Linear Models, Neural Network, and AdaBoost. The authors concluded that the Decision Tree achieves the highest accuracy of all the models and performs best on the loan prediction system.
Predicting loan defaulters is an important task for the banking system, as it directly affects profitability. However, the available loan default data sets are highly imbalanced, which results in poor performance of standard algorithms. Lifeng Zhou and Hong Wang [8] therefore made loan default predictions on imbalanced data sets using an improved random forest approach in which weights are applied during decision tree aggregation. The weights are calculated and assigned to each tree in the forest during forest construction using out-of-bag (OOB) errors. The experimental results show that the improved algorithm performs better and has higher accuracy than the original random forest and other popular classification algorithms such as SVM, KNN, and C4.5. The work leaves open efficiency improvements, for example through parallel random forests.
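The weighting scheme can be reconstructed by hand: fit each tree on a bootstrap sample, score it on its out-of-bag rows, and let it vote with that weight. This is a simplified sketch, not the implementation of [8]:

# Simplified reconstruction of the OOB-weighted forest idea of [8].
# X and y are expected to be NumPy arrays with binary labels {0, 1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def weighted_forest(X, y, n_trees=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees, weights = [], []
    for _ in range(n_trees):
        boot = rng.integers(0, n, n)               # bootstrap sample indices
        oob = np.setdiff1d(np.arange(n), boot)     # rows left out of the bag
        t = DecisionTreeClassifier(max_features="sqrt").fit(X[boot], y[boot])
        w = t.score(X[oob], y[oob]) if len(oob) else 0.0  # OOB accuracy = weight
        trees.append(t)
        weights.append(w)
    return trees, np.array(weights)

def predict(trees, weights, X):
    votes = np.array([t.predict(X) for t in trees])  # (n_trees, n_samples)
    # weighted majority vote: class 1 wins if it carries over half the weight
    return (weights @ votes > weights.sum() / 2).astype(int)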
Anchal Goyal and Ranpreet Kaur [9] discuss various ensemble algorithms. An ensemble algorithm is a supervised machine learning method that combines two or more algorithms to obtain better predictive performance. They carried out a systematic literature review comparing ensemble models with various stand-alone models such as neural networks, SVM, and regression, and concluded that ensemble models perform better than stand-alone models and that combining algorithms improves the accuracy of the model.
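As a concrete instance, scikit-learn's VotingClassifier combines heterogeneous base learners into one model; the particular base models here are illustrative:

# Sketch: a heterogeneous soft-voting ensemble of the kind reviewed in [9].
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=100)),
                ("svm", SVC(probability=True))],  # probabilities enable soft voting
    voting="soft")
# ensemble.fit(X_train, y_train); print(ensemble.score(X_test, y_test))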
Data mining is also becoming popular in the banking sector, as it extracts information from the tremendous amount of accumulated data. Aboobyda Jafar Hamid and Tarig Mohammed Ahmed [10] focused on implementing data mining techniques using three models, J48, BayesNet, and Naive Bayes, for classifying loan risk in the banking sector. They implemented and tested the models using the Weka application and compared the algorithms in terms of how accurately they classify the data. The data was split so that 80% formed the training set and 20% the testing set. After analyzing the results, the authors found that the best algorithm of the three is J48, in terms of high accuracy and low mean absolute error.
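The authors worked in Weka (a Java toolkit); a rough Python analogue of the 80/20 experiment, with a CART tree standing in for J48 and GaussianNB for the Naive Bayes model, looks as follows:

# Rough Python analogue of the Weka experiment in [10]: 80/20 split,
# decision tree (stand-in for J48) vs. Gaussian Naive Bayes.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("loan_data.csv").dropna()
X = pd.get_dummies(df.drop(columns=["Loan_ID", "Loan_Status"]))
y = (df["Loan_Status"] == "Y").astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in [("tree", DecisionTreeClassifier(random_state=42)),
                  ("naive_bayes", GaussianNB())]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))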
The dataset features are:
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
The data types of the features are:
Loan_ID object
Gender object
Married object
Dependents object
Education object
Self_Employed object
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area object
Loan_Status object
dtype: object
The dataset has 614 rows and 13 columns: 12 independent variables and the target variable.
(614, 13)
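The listings above can be produced with pandas along the following lines; the file name loan_data.csv is an assumption:

# Load the data and inspect features, dtypes, and shape (file name assumed).
import pandas as pd

df = pd.read_csv("loan_data.csv")
print(df.columns)  # the 13 feature names listed above
print(df.dtypes)   # data type of each feature
print(df.shape)    # (614, 13)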
Exploratory Data Analysis
In the bivariate analysis we saw that the conditional probability of getting a loan is equal for males and females. We now conduct a statistical test to confirm this observation. Null hypothesis: males and females have equal chances of getting a loan [m1 == m2]. Alternate hypothesis: not equal chances [m1 != m2]. Since the p-value > 0.05, we cannot reject the null hypothesis, and we conclude that males and females have an equal chance of getting a loan.
contingency_table :-
Loan_Status N Y
Gender
Female 37 75
Male 150 339
Observed Values :-
[[ 37 75]
[150 339]]
Expected Values :-
[[ 34.84858569 77.15141431]
[152.15141431 336.84858569]]
Significance level: 0.05
Degree of Freedom: 1
chi-square statistic: 0.23697508750826923
critical_value: 3.841458820694124
p-value: 0.6263994534115932
Retain H0: there is no relationship between the two categorical variables, since the chi-square statistic < critical value and the p-value > alpha.
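Each of these chi-square tests can be reproduced with scipy; a minimal sketch for the Gender test above, using the hypothetical file name from earlier (scipy derives the degrees of freedom from the table shape as (rows-1)*(cols-1)):

# Chi-square test of independence for Gender vs. Loan_Status, as above.
# correction=False matches the uncorrected statistic reported here.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("loan_data.csv")                     # hypothetical file name
table = pd.crosstab(df["Gender"], df["Loan_Status"])  # contingency table
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print("chi2:", chi2, "df:", dof, "p-value:", p)
# Retain H0 when p > 0.05 (no association); reject otherwise.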
In the bivariate analysis we saw that the conditional probability of getting a loan is not equal across different property areas. We now conduct a statistical test to confirm this observation. Null hypothesis: every property area has an equal chance of getting a loan [m1 == m2]. Alternate hypothesis: not equal chances [m1 != m2].
contingency_table :-
Loan_Status N Y
Property_Area
Rural 69 110
Semiurban 54 179
Urban 69 133
Observed Values :-
[[ 69 110]
[ 54 179]
[ 69 133]]
Expected Values :-
[[ 55.97394137 123.02605863]
[ 72.85993485 160.14006515]
[ 63.16612378 138.83387622]]
Significance level: 0.05
Degree of Freedom: 1
chi-square statistic: 12.297623130485677
critical_value: 3.841458820694124
p-value: 0.0004535354999599672
Reject H0: there is a relationship between the two categorical variables, since the chi-square statistic >= critical value and the p-value <= alpha.
In the bivariate analysis we saw that the conditional probability of getting a loan is not equal across different credit histories. Null hypothesis: equal chances of getting a loan regardless of credit history. Alternate hypothesis: not equal chances.
contingency_table :-
Loan_Status N Y
Credit_History
0.0 82 7
1.0 97 378
Observed Values :-
[[ 82 7]
[ 97 378]]
Expected Values :-
[[ 28.2464539 60.7535461]
[150.7535461 324.2464539]]
Significance level: 0.05
Degree of Freedom: 1
chi-square statistic: 177.9320468515689
critical_value: 3.841458820694124
p-value: 0.0
Reject H0: there is a relationship between the two categorical variables, since the chi-square statistic >= critical value and the p-value <= alpha.
We assumed that there is an association between getting a loan and education, with a difference of just 9% in the conditional probabilities. Null hypothesis: equal chances of getting a loan regardless of education background. Alternate hypothesis: not equal chances.
contingency_table :-
Loan_Status N Y
Education
Graduate 140 340
Not Graduate 52 82
Observed Values :-
[[140 340]
[ 52 82]]
Expected Values :-
[[150.09771987 329.90228013]
[ 41.90228013 92.09771987]]
Significance level: 0.05
Degree of Freedom: 1
chi-square statistic: 4.5288927351787684
critical_value: 3.841458820694124
p-value: 0.03332717442347588
Reject H0: there is a relationship between the two categorical variables, since the chi-square statistic >= critical value and the p-value <= alpha.
Earlier we found that, except for Dependents = 0, the other categories had an association with Loan_Status. Null hypothesis: equal chances of getting a loan for different numbers of dependents. Alternate hypothesis: not equal chances.
Earlier we assumed that there is no association between Self_Employed and Loan_Status. Null hypothesis: equal chances of getting a loan regardless of self-employment. Alternate hypothesis: not equal chances.
contingency_table :-
Loan_Status N Y
Self_Employed
No 157 343
Yes 26 56
Observed Values :-
[[157 343]
[ 26 56]]
Expected Values :-
[[157.21649485 342.78350515]
[ 25.78350515 56.21649485]]
Significance level: 0.05
Degree of Freedom: 1
chi-square statistic: 0.0030864285864601513
critical_value: 3.841458820694124
p-value: 0.955695807982182
Retain H0: there is no relationship between the two categorical variables, since the chi-square statistic < critical value and the p-value > alpha.
Earlier we saw that the probability of getting a loan was around 9% higher for married people. Null hypothesis: equal chances of getting a loan regardless of marital status. Alternate hypothesis: not equal chances.
contingency_table :-
Loan_Status N Y
Married
No 79 134
Yes 113 285
Observed Values :-
[[ 79 134]
[113 285]]
Expected Values :-
[[ 66.93289689 146.06710311]
[125.06710311 272.93289689]]
Significance level: 0.05
Degree of Freedom: 1
chi-square statistic: 4.870255560503662
critical_value: 3.841458820694124
p-value: 0.027323453771962658
Reject H0: there is a relationship between the two categorical variables, since the chi-square statistic >= critical value and the p-value <= alpha.
The first five rows of the processed data (with Loan_Status encoded as 1/0 and an Income_bin feature added) are:
Gender Married Dependents Education Self_Employed ApplicantIncome \
0 Male No 0 Graduate No 5849
1 Male Yes 1 Graduate No 4583
2 Male Yes 0 Graduate Yes 3000
3 Male Yes 0 Not Graduate No 2583
4 Male No 0 Graduate No 6000
CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History \
0 0.0 128.0 360.0 1.0
1 1508.0 128.0 360.0 1.0
2 0.0 66.0 360.0 1.0
3 2358.0 120.0 360.0 1.0
4 0.0 141.0 360.0 1.0
Property_Area Loan_Status Income_bin
0 Urban 1 High
1 Rural 0 High
2 Urban 1 Average
3 Urban 1 Average
4 Urban 1 High
Correlation between Quantitative Variables
Due to outliers, LoanAmount is skewed to the right, meaning the bulk of the data lies towards the left. We remove this skewness with a log transformation, which barely affects the smaller values but compresses the larger ones, so the distribution becomes closer to normal.
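A minimal sketch of this transformation, assuming the DataFrame loaded earlier:

# Reduce the right skew of LoanAmount with a log transformation.
import numpy as np

# log barely moves small values but compresses large ones (outliers)
df["LoanAmount_log"] = np.log(df["LoanAmount"])
df["LoanAmount_log"].hist(bins=20)  # distribution is now approximately normal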
We then fit a logistic regression (Logit) model on all the features:
Optimization terminated successfully.
Current function value: 0.464602
Iterations 6
Logit Regression Results
==============================================================================
Dep. Variable: Loan_Status No. Observations: 614
Model: Logit Df Residuals: 599
Method: MLE Df Model: 14
Date: Wed, 14 Apr 2021 Pseudo R-squ.: 0.2521
Time: 04:43:39 Log-Likelihood: -285.27
converged: True LL-Null: -381.45
Covariance Type: nonrobust LLR p-value: 1.988e-33
========================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------
Gender -0.0718 0.291 -0.247 0.805 -0.641 0.498
Married 0.5921 0.241 2.460 0.014 0.120 1.064
Dependents 0.0347 0.131 0.265 0.791 -0.221 0.291
Education -0.4570 0.257 -1.778 0.075 -0.961 0.047
Self_Employed 0.0969 0.318 0.305 0.761 -0.526 0.720
ApplicantIncome 2.445e-05 2.44e-05 1.003 0.316 -2.33e-05 7.22e-05
CoapplicantIncome -6.088e-05 3.62e-05 -1.684 0.092 -0.000 9.99e-06
LoanAmount -0.0012 0.002 -0.713 0.476 -0.004 0.002
Loan_Amount_Term -0.0008 0.002 -0.469 0.639 -0.004 0.003
Credit_History 3.8867 0.417 9.315 0.000 3.069 4.705
Property_Area 0.0821 0.136 0.603 0.547 -0.185 0.349
Income_bin_Low -2.3331 0.884 -2.640 0.008 -4.065 -0.601
Income_bin_Average -2.2427 0.828 -2.708 0.007 -3.866 -0.619
Income_bin_High -2.5717 0.866 -2.969 0.003 -4.269 -0.874
Income_bin_Very high -2.8351 0.911 -3.112 0.002 -4.621 -1.050
========================================================================================
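These summaries were produced with statsmodels' Logit; a minimal sketch of the call, assuming the categorical columns have already been encoded to numeric in a frame called df_encoded (a hypothetical name):

# Sketch of the statsmodels call behind the summaries above; df_encoded
# (an all-numeric encoding of the features) is a hypothetical name.
import statsmodels.api as sm

X = df_encoded.drop(columns=["Loan_Status"])
y = df_encoded["Loan_Status"]          # 1 = approved, 0 = not approved
model = sm.Logit(y, X).fit()           # MLE; prints the convergence message
print(model.summary())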
Next, we fit a model on Credit_History alone:
Optimization terminated successfully.
Current function value: 0.539390
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: Loan_Status No. Observations: 614
Model: Logit Df Residuals: 613
Method: MLE Df Model: 0
Date: Wed, 14 Apr 2021 Pseudo R-squ.: 0.1318
Time: 04:43:39 Log-Likelihood: -331.19
converged: True LL-Null: -381.45
Covariance Type: nonrobust LLR p-value: nan
==================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------
Credit_History 1.3278 0.107 12.381 0.000 1.118 1.538
==================================================================================
Optimization terminated successfully.
Current function value: 0.465379
Iterations 6
Logit Regression Results
======