Credit Card Fraud Detection using Decision Tree Classifier
Import libraries
Read the data
The data was obtained from Kaggle.
This dataset contains transactions that occurred over two days. The column Class takes the values 0 and 1; it is 1 if the transaction is fraudulent.
Explore
We begin by exploring the dataset, starting with its first rows.
   Time                  V1
0   0.0    -1.3598071336738
1   0.0    1.19185711131486
2   1.0   -1.35835406159823
3   1.0  -0.966271711572087
4   2.0   -1.15823309349523
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Time 284807 non-null float64
1 V1 284807 non-null float64
2 V2 284807 non-null float64
3 V3 284807 non-null float64
4 V4 284807 non-null float64
5 V5 284807 non-null float64
6 V6 284807 non-null float64
7 V7 284807 non-null float64
8 V8 284807 non-null float64
9 V9 284807 non-null float64
10 V10 284807 non-null float64
11 V11 284807 non-null float64
12 V12 284807 non-null float64
13 V13 284807 non-null float64
14 V14 284807 non-null float64
15 V15 284807 non-null float64
16 V16 284807 non-null float64
17 V17 284807 non-null float64
18 V18 284807 non-null float64
19 V19 284807 non-null float64
20 V20 284807 non-null float64
21 V21 284807 non-null float64
22 V22 284807 non-null float64
23 V23 284807 non-null float64
24 V24 284807 non-null float64
25 V25 284807 non-null float64
26 V26 284807 non-null float64
27 V27 284807 non-null float64
28 V28 284807 non-null float64
29 Amount 284807 non-null float64
30 Class 284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
The output shows that no column contains null values.
Next we plot the class balance.
The dataset is imbalanced: the majority class is far larger than the minority class.
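The class balance described above can be computed with a relative-frequency count. This is a minimal sketch on a small synthetic frame, not the Kaggle data; the row counts are made up for illustration, and the plotting call is left as a comment because the notebook's own plotting code is not shown.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data: 1,000 transactions, 2 of them fraud.
df = pd.DataFrame({"Class": [0] * 998 + [1] * 2})

# Relative frequency of each class (0 = legitimate, 1 = fraud).
class_balance = df["Class"].value_counts(normalize=True)
print(class_balance)

# A bar chart of the same numbers (requires matplotlib):
# class_balance.plot(kind="bar", xlabel="Class", ylabel="Relative frequency")
```

On an imbalanced dataset the bar for class 1 is barely visible, which is what motivates the resampling step later on.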
Split
We split the data frame into a feature matrix X and a target vector y, and then use train_test_split to obtain X_train, X_test, y_train and y_test.
X shape: (284807, 30)
y shape: (284807,)
X_train shape: (227845, 30)
y_train shape: (227845,)
X_test shape: (56962, 30)
y_test shape: (56962,)
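The shapes above are consistent with an 80/20 split (56962 of 284807 rows in the test set). A minimal sketch of such a split follows, on synthetic data. The value test_size=0.2 is inferred from the shapes, while random_state=42 and the synthetic columns are assumptions, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 3 features and a binary target.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
df["Class"] = (rng.random(1000) < 0.05).astype(int)

# Vertical split: features and target.
target = "Class"
X = df.drop(columns=target)
y = df[target]

# Horizontal split: 80% training, 20% test (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
```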
Resample
We saw that the dataset is imbalanced, so we resample the training data.
First we create a new feature matrix X_train_under and target vector y_train_under by performing random under-sampling on the training data.
(788, 30)
       Time                  V1
0   69950.0    1.28421326355027
1  149502.0    0.15540367513572
2   70393.0    1.13587766408429
3  165837.0    1.99283274579382
4  164684.0  -0.385450720632189
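The shape (788, 30) suggests that under-sampling kept 394 rows per class. The notebook's own resampling call is not shown (libraries such as imbalanced-learn provide a RandomUnderSampler for this), so the following is a pandas-only sketch of the same idea on synthetic data. The data and random_state=42 are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, imbalanced training data: 980 legitimate rows, 20 fraud rows.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["V1", "V2"])
y_train = pd.Series([0] * 980 + [1] * 20, name="Class")

# Size of the minority class.
n_minority = int(y_train.value_counts().min())

# Draw n_minority rows from every class, without replacement.
under_idx = y_train.groupby(y_train).sample(n=n_minority, random_state=42).index

X_train_under = X_train.loc[under_idx]
y_train_under = y_train.loc[under_idx]

print(X_train_under.shape)
```

Under-sampling discards most of the majority class, so the resulting training set is balanced but much smaller.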
Next we create a new feature matrix X_train_over and target vector y_train_over by performing random over-sampling on the training data.
(454902, 30)
       Time                  V1
0  143352.0    1.95504092199146
1  117173.0  -0.400975238728654
2  149565.0  0.0725090163689562
3   93670.0  -0.535045380949255
4   82655.0   -4.02693795043132
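The shape (454902, 30) is twice 227451, which matches the majority-class count implied by the training set (227845 rows minus 394 minority rows). Random over-sampling reaches that balance by duplicating minority rows. The notebook's own call is not shown (imbalanced-learn's RandomOverSampler is one common choice), so this is a pandas-only sketch on synthetic data, with random_state=42 as an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic, imbalanced training data: 980 legitimate rows, 20 fraud rows.
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(1000, 2)), columns=["V1", "V2"])
y_train = pd.Series([0] * 980 + [1] * 20, name="Class")

counts = y_train.value_counts()
majority_label = counts.idxmax()
n_majority = int(counts.max())

# Keep every original row and draw extra minority rows with replacement
# until both classes have n_majority rows.
minority = y_train[y_train != majority_label]
extra_idx = minority.sample(
    n=n_majority - len(minority), replace=True, random_state=42
).index
over_idx = y_train.index.append(extra_idx)

X_train_over = X_train.loc[over_idx].reset_index(drop=True)
y_train_over = y_train.loc[over_idx].reset_index(drop=True)

print(X_train_over.shape)
```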
Model
Baseline
We calculate the baseline accuracy score for our model, that is, the accuracy obtained by always predicting the majority class.
Baseline Accuracy: 0.9983
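A baseline of this kind is simply the relative frequency of the majority class in the training target. A minimal sketch, using a made-up target vector rather than the real y_train:

```python
import pandas as pd

# Hypothetical training target: 998 legitimate transactions, 2 frauds.
y_train = pd.Series([0] * 998 + [1] * 2, name="Class")

# Accuracy of a model that always predicts the most frequent class.
acc_baseline = y_train.value_counts(normalize=True).max()
print("Baseline Accuracy:", round(acc_baseline, 4))
```

With classes this imbalanced the baseline is already very high, so a model has to exceed it by a clear margin to be useful.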
Iterate
We fit three decision tree models. The first, model_reg, is fit on X_train and y_train. The second, model_under, is fit on X_train_under and y_train_under. The third, model_over, is fit on X_train_over and y_train_over.
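The three fits can be sketched as follows. The data here is synthetic, the resampling is done with pandas instead of whatever the notebook used, and the estimator settings (scikit-learn's DecisionTreeClassifier with default depth and random_state=42) are assumptions, since the original code cells are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (not the Kaggle dataset).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=["V1", "V2", "Amount"])
y_train = (X_train["V1"] > 1.5).astype(int).rename("Class")

# Random under-sampling: as many rows per class as the minority class has.
n_minority = int(y_train.value_counts().min())
under_idx = y_train.groupby(y_train).sample(n=n_minority, random_state=42).index
X_train_under, y_train_under = X_train.loc[under_idx], y_train.loc[under_idx]

# Random over-sampling: add duplicated minority rows until classes are equal.
minority = y_train[y_train == 1]
n_extra = int((y_train == 0).sum()) - len(minority)
extra_idx = minority.sample(n=n_extra, replace=True, random_state=42).index
over_idx = y_train.index.append(extra_idx)
X_train_over, y_train_over = X_train.loc[over_idx], y_train.loc[over_idx]

# One decision tree per training set.
model_reg = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
model_under = DecisionTreeClassifier(random_state=42).fit(X_train_under, y_train_under)
model_over = DecisionTreeClassifier(random_state=42).fit(X_train_over, y_train_over)
```

An unrestricted tree can keep splitting until every training row is classified correctly, which is consistent with the training accuracy of 1.0 reported for two of the models.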
Evaluate
We evaluate the three models on the training and test sets.
model_reg
Training Accuracy: 1.0
Test Accuracy: 0.9991
model_under
Training Accuracy: 0.8945
Test Accuracy: 0.895
model_over
Training Accuracy: 1.0
Test Accuracy: 0.9992
The training and test accuracy of model_under (about 0.89) fall well below the baseline of 0.9983, so it does not perform well. The other two models beat the baseline, with test accuracies of 0.9991 and 0.9992.
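Accuracy figures like these can be obtained from the estimator's score method, which returns mean accuracy for classifiers. The sketch below does this for a single tree on synthetic data; the same pattern would apply to each of the three models. The data, split and random_state values are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem (not the Kaggle dataset).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
noise = 0.5 * rng.normal(size=1000)
y = (X["V1"] + noise > 1.5).astype(int).rename("Class")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Baseline: always predict the majority class of the training target.
acc_baseline = y_train.value_counts(normalize=True).max()
acc_train = model.score(X_train, y_train)
acc_test = model.score(X_test, y_test)

print("Baseline Accuracy:", round(acc_baseline, 4))
print("Training Accuracy:", round(acc_train, 4))
print("Test Accuracy:", round(acc_test, 4))
```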
Next we plot a confusion matrix that shows how model_over performs on the test set.
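A confusion matrix breaks the test predictions into true negatives, false positives, false negatives and true positives, which is more informative than accuracy on imbalanced data. A minimal sketch on synthetic data follows; the model here is a plain decision tree standing in for model_over, and all data and parameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem (not the Kaggle dataset).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
noise = 0.5 * rng.normal(size=1000)
y = (X["V1"] + noise > 1.5).astype(int).rename("Class")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)

# To draw the matrix instead of printing it (requires matplotlib):
# from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
```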
Communicate
We extract the feature names and importances from our model and plot them.
The plot shows that V14 is the feature with the highest Gini importance.
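Gini importances are available on a fitted tree through its feature_importances_ attribute. The sketch below pairs them with the column names and sorts them. The data is synthetic and deliberately built so that the column named V14 determines the label, so the ranking it prints illustrates the mechanics only and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data in which the label depends only on the column named "V14"
# (a hypothetical construction for illustration).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=["V4", "V14", "Amount"])
y_train = (X_train["V14"] < -1.5).astype(int)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pair each importance with its feature name and sort ascending.
feat_imp = pd.Series(
    model.feature_importances_, index=X_train.columns
).sort_values()
print(feat_imp.tail(5))

# A horizontal bar chart of the same values (requires matplotlib):
# feat_imp.tail(15).plot(kind="barh", xlabel="Gini Importance", ylabel="Feature")
```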