Credit Card Fraud Detection using Gradient Boosting Trees
Import libraries
Read the data
The data was obtained from Kaggle.
This dataset contains transactions that occurred over two days. The column Class takes the values 0 and 1, where 1 indicates a fraudulent transaction.
Explore
We are going to explore the dataset. First we are going to see the head of it.
0  0.0  -1.3598071336738
1  0.0   1.19185711131486
2  1.0  -1.35835406159823
3  1.0  -0.966271711572087
4  2.0  -1.15823309349523
We can see that there are no null values.
Now we are going to plot the Class Balance.
We have an imbalanced dataset. Our majority class is far bigger than our minority class.
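The two checks above (null values and class balance) can be sketched as follows. This is a minimal illustration on a tiny made-up data frame; only the column name Class comes from the text, and the values and the Amount column are assumptions for the example.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only).
df = pd.DataFrame(
    {"Amount": [10.0, 25.5, 3.2, 99.9, 1.0], "Class": [0, 0, 0, 0, 1]}
)

# Number of missing values per column; all zeros means no nulls.
null_counts = df.isnull().sum()

# Relative frequency of each class; a very small share for class 1 signals imbalance.
class_balance = df["Class"].value_counts(normalize=True)

print(null_counts)
print(class_balance)
```

With matplotlib installed, chaining `.plot(kind="bar")` onto `class_balance` would give a bar chart of the same proportions.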
Split
We are going to split the data frame into X and y, and then use train_test_split to obtain X_train, X_test, y_train and y_test.
X shape: (284807, 30)
y shape: (284807,)
X_train shape: (227845, 30)
y_train shape: (227845,)
X_test shape: (56962, 30)
y_test shape: (56962,)
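A minimal sketch of this split is shown here on synthetic data. The shapes printed above are consistent with an 80/20 split, so test_size=0.2 is used; the random_state value and the synthetic column names are assumptions, not taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data frame (assumption: the target column is "Class").
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
df["Class"] = (rng.random(1000) < 0.05).astype(int)

# Separate the feature matrix from the target vector.
target = "Class"
X = df.drop(columns=target)
y = df[target]

# Hold out 20% of the rows for testing (random_state is an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
```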
Resample
We see that the data set is imbalanced.
We are going to create a new feature matrix X_train_over and target vector y_train_over by performing random over-sampling on our training data. We choose over-sampling because we saw in the previous project that under-sampling did not perform well.
(454902, 30)
0  143352.0   1.95504092199146
1  117173.0  -0.400975238728654
2  149565.0   0.0725090163689562
3   93670.0  -0.535045380949255
4   82655.0  -4.02693795043132
Model
Baseline
We calculate the baseline accuracy score for our model, that is, the accuracy we would get by always predicting the majority class.
Baseline Accuracy: 0.9983
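One common way to obtain such a baseline is to take the share of the majority class in the training target. The sketch below assumes that approach and uses a small made-up target vector.

```python
import pandas as pd

# Made-up target vector: 95 negatives and 5 positives.
y_train = pd.Series([0] * 95 + [1] * 5, name="Class")

# Share of the most frequent class, i.e. the accuracy of always predicting it.
acc_baseline = y_train.value_counts(normalize=True).max()

print("Baseline Accuracy:", round(acc_baseline, 4))
```

With a class this dominant, a high accuracy score alone says little about how well fraud is detected.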
Iterate
We are going to create a model named clf that contains a GradientBoostingClassifier.
Pipeline(steps=[('gradientboostingclassifier',
GradientBoostingClassifier(random_state=42))])
Now we create a dictionary with the range of hyperparameters that we are going to evaluate for our classifier.
We create a GridSearchCV named model that includes clf and the hyperparameter grid.
We fit model to the over-sampled training data.
Fitting 5 folds for each of 9 candidates, totalling 45 fits
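The pipeline, hyperparameter grid and grid search described above might look like the sketch below. The original grid is not shown, so the hyperparameter names and values here are assumptions, chosen only so that the grid has 9 candidates and 5 folds; the data is synthetic and kept small so the search runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Small synthetic stand-in for the over-sampled training data.
X_train_over, y_train_over = make_classification(
    n_samples=300, n_features=5, random_state=42
)

# Pipeline with a single step, automatically named "gradientboostingclassifier".
clf = make_pipeline(GradientBoostingClassifier(random_state=42))

# Hypothetical grid: 3 x 3 = 9 candidates (the values are assumptions).
params = {
    "gradientboostingclassifier__n_estimators": [5, 10, 15],
    "gradientboostingclassifier__max_depth": [1, 2, 3],
}

# 5-fold cross-validation over every combination in the grid.
model = GridSearchCV(clf, param_grid=params, cv=5, verbose=1)
model.fit(X_train_over, y_train_over)
```

The step-name prefix followed by a double underscore is how GridSearchCV routes each hyperparameter to the right step of the pipeline.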
Now we extract the cross-validation results from the model.
8  114.17910232543946  4.266099356385528
7   96.36223621368408  0.36948240608282357
6   76.98166213035583  0.6307917268536979
5   90.68267741203309  0.9608229506759278
4   74.09340267181396  0.39072676349905877
3  60.154942464828494  0.3031443893893782
2   62.33220171928406  0.5278362134150776
1   51.12425374984741  0.6441260278476304
0   40.29000315666199  0.17956618130221969
Now we extract the best hyperparameters from model.
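A fitted GridSearchCV exposes both pieces of information through its cv_results_ and best_params_ attributes. The sketch below fits a deliberately tiny search on synthetic data (the grid values are assumptions) and then reads both; sorting by rank_test_score is one possible ordering, not necessarily the one used for the table above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Tiny synthetic problem and a hypothetical 2 x 2 grid, just to have a fitted search.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
clf = make_pipeline(GradientBoostingClassifier(random_state=42))
params = {
    "gradientboostingclassifier__n_estimators": [5, 10],
    "gradientboostingclassifier__max_depth": [1, 2],
}
model = GridSearchCV(clf, param_grid=params, cv=3).fit(X, y)

# One row per candidate, with fit times and cross-validated scores.
cv_results = pd.DataFrame(model.cv_results_).sort_values("rank_test_score")
print(cv_results[["mean_fit_time", "std_fit_time", "mean_test_score", "rank_test_score"]])

# Hyperparameter combination with the best mean cross-validated score.
print(model.best_params_)
```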
Evaluate
We are going to evaluate the model.
We calculate the training and test accuracy scores for model.
Training Accuracy: 0.9894
Test Accuracy: 0.9891
Now we make the confusion matrix. First we count how many observations in y_test belong to the positive and negative classes.
Now we plot the confusion matrix.
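The accuracy scores, class counts and confusion matrix can be computed as in the sketch below. It uses synthetic data and a directly fitted GradientBoostingClassifier in place of the tuned model; all values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real transactions.
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = GradientBoostingClassifier(n_estimators=20, random_state=42)
model.fit(X_train, y_train)

# Accuracy on the data the model saw and on held-out data.
acc_train = model.score(X_train, y_train)
acc_test = model.score(X_test, y_test)
print("Training Accuracy:", round(acc_train, 4))
print("Test Accuracy:", round(acc_test, 4))

# Number of negative (index 0) and positive (index 1) observations in the test set.
print(np.bincount(y_test, minlength=2))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
```

With matplotlib available, `ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)` from sklearn.metrics draws the same matrix as a plot.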
We can see that there are more true positives than when we used the Random Forest Classifier. In this kind of project, we need to understand what we need: whether we want to reduce the false positives or the false negatives. The choice of algorithm depends on this.
Now we are going to look at the precision and the recall of this model; these are going to be more important for evaluating it than the accuracy score.
Now we print the classification report for the model, using the test set.
              precision    recall  f1-score   support
           0       1.00      0.99      0.99     56864
           1       0.13      0.92      0.22        98
    accuracy                           0.99     56962
   macro avg       0.56      0.95      0.61     56962
weighted avg       1.00      0.99      0.99     56962
We can see that the recall is higher than the precision for the positive class. Recall is high because the model produces few false negatives relative to its true positives. On the other hand, the false positives are numerous, which is why the precision is low.
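To make the relationship concrete, here is the arithmetic behind both metrics. The counts below are illustrative assumptions, not the notebook's actual confusion matrix (which is not reproduced in the text); they were picked only to be compatible with a support of 98 positives and the rounded scores in the report.

```python
# Illustrative counts for the positive (fraud) class; assumed, not measured.
tp = 90   # frauds correctly flagged
fn = 8    # frauds missed
fp = 620  # legitimate transactions wrongly flagged

# Recall: share of actual frauds that were caught.
recall = tp / (tp + fn)

# Precision: share of flagged transactions that were really fraud.
precision = tp / (tp + fp)

print("recall:", round(recall, 2))
print("precision:", round(precision, 2))
```

Because false negatives appear only in the recall denominator and false positives only in the precision denominator, a model can have high recall and low precision at the same time.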
Finally, we are going to write a function that makes predictions with an adjustable threshold. This parameter affects both the recall and the precision.
If the threshold is high, the recall is low and the precision tends to be high. If the threshold is low, the recall is high and the precision tends to be low.
The threshold we choose depends on which one we want to be higher, recall or precision. That decision depends on how we are going to use the results.
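Such a function could be sketched as follows. The name make_predictions and its signature are hypothetical, and the data and model are synthetic stand-ins; the idea is simply to compare the predicted probability of the positive class against a chosen threshold instead of the default 0.5.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def make_predictions(model, X, threshold=0.5):
    """Label an observation as positive when its predicted probability exceeds threshold."""
    proba_positive = model.predict_proba(X)[:, 1]
    return (proba_positive > threshold).astype(int)


# Synthetic imbalanced data and a quickly fitted model (assumptions for the demo).
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = GradientBoostingClassifier(n_estimators=20, random_state=42)
model.fit(X_train, y_train)

# Compare precision and recall across a few thresholds.
recalls = {}
for threshold in (0.1, 0.5, 0.9):
    y_pred = make_predictions(model, X_test, threshold)
    recalls[threshold] = recall_score(y_test, y_pred, zero_division=0)
    precision = precision_score(y_test, y_pred, zero_division=0)
    print(threshold, "precision:", round(precision, 2), "recall:", round(recalls[threshold], 2))
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases; precision usually rises but is not guaranteed to.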
Communicate
We obtain the features and importances of our model and plot them.
We can see that the most important feature is V14.
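Feature importances can be read from the fitted classifier inside the pipeline, as in this sketch. The data and feature names are synthetic assumptions; with a fitted grid search, the pipeline would be reached through model.best_estimator_ instead of clf.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Synthetic data with made-up feature names V1..V5.
X_arr, y = make_classification(n_samples=500, n_features=5, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"V{i}" for i in range(1, 6)])

clf = make_pipeline(GradientBoostingClassifier(n_estimators=20, random_state=42))
clf.fit(X, y)

# Impurity-based importances from the classifier step, labelled by feature name.
importances = clf.named_steps["gradientboostingclassifier"].feature_importances_
feat_imp = pd.Series(importances, index=X.columns).sort_values()

print(feat_imp)
```

With matplotlib installed, `feat_imp.plot(kind="barh")` would show the same ranking as a horizontal bar chart.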