Credit Card Fraud Detection using Random Forest Classifier

Import libraries

import sys !{sys.executable} -m pip install imblearn | grep -v 'Requirement already satisfied' #from imblearn import RandomOverSampler, RandomUnderSampler;

from imblearn.over_sampling import RandomOverSampler

import matplotlib.pyplot as plt import pandas as pd import seaborn as sns from sklearn.impute import SimpleImputer from sklearn.metrics import ConfusionMatrixDisplay from sklearn.model_selection import train_test_split from sklearn.pipeline import make_pipeline from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

Read the data

df = pd.read_csv("/work/creditcard.csv")

The data was obtained from Kaggle.

https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

This dataset presents transactions that occurred in two days. The column Class take the values 0 an 1, 1 if there is a fraud in the transaction.

Explore

We are going to explore the dataset. First we are going to see the head of it.

df.head()

df.shape

df.isnull().sum().sum()

We can see that there is not null values.

Now we are going to plot the Class Balance.

# Plot class balance df["Class"].value_counts(normalize=True).plot( kind="bar", xlabel="Fraud", ylabel="Frequency", title="Class Balance" );

We have an imbalanced dataset. Our majority class is far bigger than our minority class.

Split

We are going to split the data frame in X and y and then we are going to use train test plit to obtain X train, X test, y train and y test.

target = "Class" X = df.drop(columns=target) y = df[target] print("X shape:", X.shape) print("y shape:", y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) print("X_train shape:", X_train.shape) print("y_train shape:", y_train.shape) print("X_test shape:", X_test.shape) print("y_test shape:", y_test.shape)

Resample

We see that the data set es imbalanced.

y_train.value_counts()

We are going to cretae a new feature matrix X_train_over and target vector y_train_over by performing random over-sampling on our training data. We choose over sampling because we saw in the previous project that under sampling was not good.

over_sampler = RandomOverSampler(random_state=42) X_train_over, y_train_over = over_sampler.fit_resample(X_train, y_train) print(X_train_over.shape) X_train_over.head()

y_train_over.value_counts(normalize=True)

Model

Baseline

We calculate the baseline accuracy score for our model.

acc_baseline = y_train.value_counts(normalize=True).max() print("Baseline Accuracy:", round(acc_baseline, 4))

Iterate

We are going to create a model named clf, it is going to contain Random Forest Classifier.

clf = make_pipeline(RandomForestClassifier(random_state=42)) print(clf)

We are going to perform cross-validation with the classifier, using the over-sampled training data.

We are going to have five folds, we set cv to 5. We want to speed up training, to set n_jobs to -1.

cv_acc_scores = cross_val_score(clf, X_train_over, y_train_over, cv=5, n_jobs=-1) print(cv_acc_scores)

Now we create a dictionary with the range of hyperparameters that we are going to evaluate for our classifier.

params = { "randomforestclassifier__max_depth": range(10,50,10), "randomforestclassifier__n_estimators": range(25,100,25) } params

We create a GridSearchCV to which we called model. This is going to include our classifier and hyperparameter grid.

model = GridSearchCV( clf, param_grid = params, cv = 5, n_jobs = -1, verbose = 1 ) model

We fit model to the over-sampled training data.

model.fit(X_train_over, y_train_over)

Now we extract the cross-validation results from the model

cv_results = pd.DataFrame(model.cv_results_) cv_results

We are going to analyze what happen with model with max depth equals 10.

We plot param_randomforestclassifier__n_estimators" on the x-axis and "mean_fit_time" on the y-axis.

# Create mask mask = cv_results["param_randomforestclassifier__max_depth"]==10 # Plot fit time vs n_estimators plt.plot(cv_results[mask]["param_randomforestclassifier__n_estimators"], cv_results[mask]["mean_fit_time"] ) # Label axes plt.xlabel("Number of Estimators") plt.ylabel("Mean Fit Time [seconds]") plt.title("Training Time vs Estimators (max_depth=10)");

We can see that when number of estimators is higher the mean fit time is higher too.

Now we are going to look at max_depth. We are going to analyze the n_estimators equals 25.

We plot "param_randomforestclassifier__max_depth" on the x-axis and "mean_fit_time" on the y-axis.

# Create mask mask = cv_results["param_randomforestclassifier__n_estimators"]==25 # Plot fit time vs max_depth plt.plot(cv_results[mask]["param_randomforestclassifier__max_depth"], cv_results[mask]["mean_fit_time"] ) # Label axes plt.xlabel("Max Depth") plt.ylabel("Mean Fit Time [seconds]") plt.title("Training Time vs Max Depth (n_estimators=25)");

We can see that when max depth is higher, mean fit time is higher too.

Now we extract the best hyperparameters from model.

# Extract best hyperparameters model.best_params_

model.best_score_

model.best_estimator_

Now we are going to predict with the best model.

model.predict(X_train_over)

Evaluate

We are going to evaluate the model.

Calculate the training and test accuracy scores for model.

acc_train = model.score(X_train, y_train) acc_test = model.score(X_test, y_test) print("Training Accuracy:", round(acc_train, 4)) print("Test Accuracy:", round(acc_test, 4))

We beat the baseline.

Now using confucion matrix we are going to see how the model performs.

First we count how many observation in y_test belong to positive and negative class.

y_test.value_counts()

Now we plot hte confusion matrix.

ConfusionMatrixDisplay.from_estimator(model, X_test, y_test);

This model decrease the number of False Positives.

Communicate

We obtain the features and importances of our model and plot them.

# Get feature names from training data features = X_train_over.columns # Extract importances from model importances = model.best_estimator_.named_steps[ "randomforestclassifier" ].feature_importances_ # Create a series with feature names and importances feat_imp = pd.Series(importances, index=features).sort_values() # Plot 10 most important features feat_imp.tail(10).plot(kind="barh") plt.xlabel("Gini Importance") plt.ylabel("Feature") plt.title("Feature Importance");

We can see that the most importan feature is V14.

.css-15w88e5{color:var(--chakra-colors-fg-neutral-primary);font-weight:inherit;letter-spacing:-0.09px;}Credit Card Fraud Detection using Random Forest Classifier