Credit Card Fraud Detection using Decision Tree Classifier

Import libraries

import sys !{sys.executable} -m pip install imblearn | grep -v 'Requirement already satisfied' #from imblearn import RandomOverSampler, RandomUnderSampler;

from imblearn.over_sampling import RandomOverSampler from imblearn.under_sampling import RandomUnderSampler

import matplotlib.pyplot as plt import pandas as pd import seaborn as sns from sklearn.impute import SimpleImputer from sklearn.metrics import ConfusionMatrixDisplay from sklearn.model_selection import train_test_split from sklearn.pipeline import make_pipeline from sklearn.tree import DecisionTreeClassifier

Read the data

df = pd.read_csv("/work/creditcard.csv")

The data was obtained from Kaggle.

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

This dataset presents transactions that occurred in two days. The column Class take the values 0 an 1, 1 if there is a fraud in the transaction.

Explore

We are going to explore the dataset. First we are going to see the head of it.

df.head()

df.shape

df.info()

df.isnull().sum().sum()

We can see that there is not null values.

Now we are going to plot the Class Balance.

# Plot class balance df["Class"].value_counts(normalize=True).plot( kind="bar", xlabel="Fraud", ylabel="Frequency", title="Class Balance" );

We have an imbalanced dataset. Our majority class is far bigger than our minority class.

Split

We are going to split the data frame in X and y and then we are going to use train test plit to obtain X train, X test, y train and y test.

target = "Class" X = df.drop(columns=target) y = df[target] print("X shape:", X.shape) print("y shape:", y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) print("X_train shape:", X_train.shape) print("y_train shape:", y_train.shape) print("X_test shape:", X_test.shape) print("y_test shape:", y_test.shape)

Resample

We see that the data set es imbalanced.

y_train.value_counts()

We are going to resample the training data. We are going to cretae a new feature matrix X_train_under and target vector y_train_under by performing random under-sampling on our training data.

under_sampler = RandomUnderSampler(random_state=42) X_train_under, y_train_under = under_sampler.fit_resample(X_train, y_train) print(X_train_under.shape) X_train_under.head()

y_train_under.value_counts(normalize=True)

Now we are going to cretae a new feature matrix X_train_over and target vector y_train_over by performing random over-sampling on our training data.

over_sampler = RandomOverSampler(random_state=42) X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train) print(X_train_over.shape) X_train_over.head()

y_train_over.value_counts(normalize=True)

Model

Baseline

We calculate the baseline accuracy score for our model.

acc_baseline = y_train.value_counts(normalize=True).max() print("Baseline Accuracy:", round(acc_baseline, 4))

Iterate

We build the model and fit three models. The first, model_reg, it is fit in X_train and y_train. The second, model_under, it is fit in X_train_under and y_train_under. Finally, the third, model_over, it is fit in X_train_over and y_train_over.

# Fit on `X_train`, `y_train` model_reg = DecisionTreeClassifier(random_state=42) model_reg.fit(X_train, y_train) # Fit on `X_train_under`, `y_train_under` model_under = DecisionTreeClassifier(random_state=42) model_under.fit(X_train_under, y_train_under) # Fit on `X_train_over`, `y_train_over` model_over = DecisionTreeClassifier(random_state=42) model_over.fit(X_train_over, y_train_over)

Evaluate

We are going to evaluate the three models.

for m in [model_reg, model_under, model_over]: acc_train = m.score(X_train, y_train) acc_test = m.score(X_test, y_test) print("Training Accuracy:", round(acc_train, 4)) print("Test Accuracy:", round(acc_test, 4))

The training and test accuracy for model_under is not good. It does not perform well. The other two models are good. These beat the baseline.

Now, we are going to plot a confusion matrix that shows how the model_over performs on our test set.

# Plot confusion matrix ConfusionMatrixDisplay.from_estimator(model_over, X_test, y_test)

Communicate

We obtain the features and importances of our model and plot them.

# Get importances importances = model_over.feature_importances_ # Put importances into a Series feat_imp = pd.Series(importances, index=X_train_over.columns).sort_values() # Plot series feat_imp.tail(15).plot(kind="barh") plt.xlabel("Gini Importance") plt.ylabel("Feature") plt.title("model_over Feature Importance");

We can see that the feature V14 is the one with the highest Gini Importance.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Credit Card Fraud Detection using Decision Tree Classifier