Fraud detector - Which is better? Logistic Regression vs Random Forest Classifier
Introduction
Libraries
As always, we need libraries: NumPy and Pandas for data management, Matplotlib and Seaborn for the graphics. At the same time, and looking for some order, let's import all the scikit-learn tools we are going to use; I will explain each one as we use it.
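A minimal import cell could look like this sketch (the aliases and the exact grouping are my reading of what's used below):

```python
import numpy as np                 # numerics
import pandas as pd                # data management
import matplotlib.pyplot as plt    # graphics
import seaborn as sns              # graphics

# Scikit-learn tools, gathered here for order; each is explained when used.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```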
Data
Now we need some data... But where is it? Google Drive. Why? Because it is a heavy dataset.
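Loading it could look like this; the filename assumes the Kaggle credit-card CSV has already been downloaded from Drive into the working directory (the path is hypothetical):

```python
# Hypothetical path: adjust to wherever the Drive download landed.
df = pd.read_csv('creditcard.csv')
df.head()
```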
Looks good. Before we start... let's talk about the features.
'Time' is the first column. I know what you're thinking: time series. But not this time. For the purposes of this analysis, we don't consider time a relevant variable. Just for the record, this feature records the seconds elapsed between each transaction and the first one in the dataset.
We have 28 columns that contain "hidden" variables. This, of course, is for the protection of clients' information. Although we can't see it directly, we can deduce that they are combinations of characteristics of the clients' bank accounts. As stated on Kaggle, where the data comes from, the columns (V1-V28) "may be the result of a PCA Dimensionality reduction to protect user identities and sensitive features". However, we don't need to know exactly what they are. Why? Because we have the 'Class' column: a boolean that tells us whether a movement is fraudulent or not.
We also have an 'Amount' column. It simply shows the money spent on that specific movement. A question naturally emerges: do fraudsters steal small amounts in many transactions, or large amounts in a few? Well, we've got something: maybe we should check an amount-vs-fraud scatterplot. But let's not get ahead of ourselves.
Exploring the data
Let's look at some details of the data that will help us create better models. We can start with 'columns' and 'describe'.
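Something along these lines, assuming the DataFrame is called df:

```python
print(df.columns.tolist())  # 'Time', 'V1'..'V28', 'Amount', 'Class'
df.describe()
```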
Describe gives us some important information.
First of all, the mean of every 'V' column is very close to zero. So, as you can guess, they could follow a normal distribution, the result of standardizing the data before the PCA dimensionality reduction. 'Amount', however, isn't standardized. Also, the min-max range is huge in every single 'V' column, so it would be better to rescale all features. Per the scikit-learn documentation, the Logistic Regression solver converges better with standardized data (Random Forests are largely insensitive to scale, but scaling does no harm).
Anyway, we need to know for sure what kind of distribution the 'V' columns and the 'Amount' column follow. So we calculate skewness, kurtosis, and variance.
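A quick way to get all three at once (a sketch, assuming the same df as above):

```python
dist_stats = pd.DataFrame({
    'skewness': df.skew(),
    'kurtosis': df.kurtosis(),  # excess kurtosis: 0 for a normal distribution
    'variance': df.var(),
})
dist_stats
```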
Almost every column is well distributed, except 'V28' and 'Amount'. We will talk about them later.
As you can see, kurtosis exposes a high concentration in some variables. This tells us that the tails of those features are very heavy, which makes it harder for models to compute accurate predictions.
My intuition says that V19 to V28 are the least significant components of the PCA mentioned above. That would explain their near-zero variance. But we don't know that for sure, so let's keep playing with the data.
Now we will look at some variables graphically; not all of them, for simplicity, but believe me, they all behave the same way except for 'Amount'.
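A sketch of the kind of plot I mean; the specific columns shown are an arbitrary sample of mine:

```python
cols = ['V1', 'V10', 'V20', 'Amount']  # arbitrary sample of columns
fig, axes = plt.subplots(1, len(cols), figsize=(16, 3))
for ax, col in zip(axes, cols):
    sns.histplot(df[col], kde=True, ax=ax)  # histogram with a density estimate
    ax.set_title(col)
plt.tight_layout()
plt.show()
```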
It seems we guessed right: the V columns follow a zero-centered normal distribution, though with a wide range of values. It could be the work of standardization before the PCA.
What about 'Amount'...
People mostly spent small amounts; the distribution is concentrated near zero with a long tail. So I think StandardScaler is the best option to standardize the data. We are ready for the next step.
Preprocessing
Before modeling, we are going to prepare our data: splitting it into train and test sub-datasets and applying the StandardScaler.
Train - Test Split
To choose a sensible train-test split, I think we should first know how many frauds there were out of the total number of transactions.
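A couple of lines of pandas answers that (sketch, same df assumption):

```python
counts = df['Class'].value_counts()
print(counts)
print(f"Fraud share: {counts[1] / counts.sum():.2%}")  # ~0.17%
```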
Fraudulent transactions represent just 0.17% of the total, so we are going to need a large training set. I propose an 80/20 train-test split.
Once we've decided on the percentage, we use 'train_test_split' with random_state=0 and test_size=0.2.
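Something like this; dropping 'Time' is my assumption, based on the discussion of the features above:

```python
X = df.drop(columns=['Class', 'Time'])  # assumption: 'Time' is left out, as discussed
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```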
Scale
We've almost reached the models, but as mentioned before, we need to scale. StandardScaler was selected based on the analysis we've done. But first, one useful note:
Reading StandardScaler's documentation, I noticed it has a parameter called 'with_mean' which, when set to False, skips the centering step; that is appropriate when the data is already zero-centered. So if we only had the 'V' columns we could use this parameter. When I ran this modeling for the first time I tried it (just for fun) and it didn't change the final result, so this parameter won't be used. End of note.
The scaler is ready. Let's fit it to the data.
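A sketch of the whole scaling step: fit on the training set only, then reuse those statistics on the test set.

```python
scaler = StandardScaler()  # with_mean stays at its default (True), per the note above
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # same training statistics applied to test data
```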
Models
The main event. In this section we will create the models, fit them, and predict with them. After that, we will measure performance in three ways: 'accuracy_score', 'confusion_matrix', and 'classification_report'. Time could be measured with the 'timeit' module, but since I'm running this notebook in Deepnote, it's easier to read the execution time in the sidebar.
Logistic Regression
'max_iter' sets the maximum number of iterations for the solver to converge. 'random_state', as I mentioned before, fixes the random seed so we get the same solution every time.
The Logistic Regression model is ready. Let's fit it.
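A sketch of both steps; the exact max_iter value is my assumption, since the text doesn't state one:

```python
model_lr = LogisticRegression(max_iter=1000, random_state=0)  # max_iter=1000 is an assumption
model_lr.fit(X_train_scaled, y_train)
```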
Now, let's predict using the 20% test dataset.
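That is a single call (the variable name is mine):

```python
y_pred_lr = model_lr.predict(X_test_scaled)
```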
Ok, everything seems fine. Once we have the predictions, all that's left is to compute the result metrics. Remember, we are going to use three different instructions.
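First instruction, the confusion matrix:

```python
cm_lr = confusion_matrix(y_test, y_pred_lr)
print(cm_lr)  # rows = actual class, columns = predicted class
```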
Calculating from the confusion matrix, we can see that 99.93% of the transactions predicted NON-FRAUD and 87.67% of those predicted FRAUD were correct.
Ok, let's see the accuracy score...
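Second instruction:

```python
print(accuracy_score(y_test, y_pred_lr))
```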
Almost perfect; it works great. Not much to say: logistic regression is a very good choice. Keep in mind, though, that with frauds at just 0.17% of the data, a model that always predicted NON-FRAUD would already score about 99.83% accuracy, so accuracy alone is not very informative here.
On to the next step, the classification report... (For me, this is the best way to see all the classification results.)
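Third instruction (the target_names labels are mine, for readability):

```python
print(classification_report(y_test, y_pred_lr, target_names=['NON-FRAUD', 'FRAUD']))
```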
First of all, we can see the previous (confusion matrix) results in the precision column. We can say that this model is nearly perfect at predicting NON-FRAUDS, but not as good at predicting FRAUDS.
'Recall' is very important because it tells us, of all the instances that were actually positive, what percentage was classified correctly. As you can see, 63% of FRAUD transactions were correctly classified. It might sound confusing, so put it this way: there were 101 fraud transactions, and logistic regression was able to detect 64 of them. I think it is a good start, but not reliable enough.
Last but not least, the F1 score: the harmonic mean of precision and recall, 2·(precision·recall)/(precision+recall), which we can read as an all-around grade for classification performance. It will be a very good indicator for comparing models, as we will do in the comparison chapter. We can also take the 87% macro average to account for both classes.
I think this model would be a great candidate for optimization. But I insist: it is nearly perfect at predicting NON-FRAUDS, not so much at predicting FRAUDS. (I'm not saying those results wouldn't be useful.)
Random Forest Classifier
Here we have another type of algorithm, so there is no max_iter; we use n_estimators instead, which sets the number of decision trees in the forest.
The Random Forest Classifier model is ready. Let's fit it.
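A sketch, using the 20 trees mentioned later in the timing chapter:

```python
model_rfc = RandomForestClassifier(n_estimators=20, random_state=0)  # 20 trees, per the timing chapter
model_rfc.fit(X_train_scaled, y_train)
```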
Now, let's predict using the same 20% test dataset.
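Again a single call (variable name is mine):

```python
y_pred_rfc = model_rfc.predict(X_test_scaled)
```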
Once we have the predictions, all that's left is to compute the result metrics.
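Same three instructions as before:

```python
print(confusion_matrix(y_test, y_pred_rfc))
print(accuracy_score(y_test, y_pred_rfc))
print(classification_report(y_test, y_pred_rfc, target_names=['NON-FRAUD', 'FRAUD']))
```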
At first glance, you can see the effectiveness of the model: only 27 misclassified transactions, which represents just 0.047% of all tested data. Of the transactions predicted NON-FRAUD, 99.98% were truly non-fraud, and of those predicted FRAUD, 93.02% were truly fraud. A really good performance.
Ok, let's see the accuracy score...
Closer to perfection than the Logistic Regression.
Now, the interesting view: the classification report...
Precision of 1.00 for NON-FRAUDS and 0.93 for FRAUDS: awesome. For recall, following the same logic as before (see the logistic regression model), there were 101 fraud transactions and the Random Forest Classifier was able to detect 80 of them. All of this results in a 0.93 macro-average F1 score. That is, indeed, an impressive performance.
But (and it is a big 'but') the execution time was huge compared to model_lr, and the difference comes from fitting the models. This will be the main topic of the 'comparison by time of execution' chapter.
Comparison
In this chapter we will figure out which of our models to use in which situations. How? By comparing the classification performance and the execution time of both models.
By classification
We have seen the performance of both models and their classification reports. Now we will look at the differences between them to decide which one classifies better. Note: LR means Logistic Regression and RFC means Random Forest Classifier.
Let's begin with NON-FRAUD classification. Both models perform excellently in precision and recall, and consequently in F1 score. Looking more closely, the RFC is slightly superior; but in practice we can consider that both LR and RFC correctly predict almost 100% of NON-FRAUD transactions. It is not the same with FRAUDS.
One of the moments we have all been waiting for has arrived. One key difference between LR and RFC is FRAUD detection. In precision terms, RFC scores 6 percentage points higher than the LR model.
Recall underlines RFC's superiority, scoring 79% against LR's 63%. That means that for every 100 FRAUD transactions, the RFC model identifies 16 more than the LR model. I think this is crucial, because what you want is to identify fraud as well as possible. Of course, that is reflected in the F1 score.
Finally, to close this subchapter, let's look at the macro-average F1 score of both models so we can compare them in general terms.
Ok, let's sum up. If you had to choose one model based only on classification performance, it would be the Random Forest Classifier. Both models still need to be optimized to find their best configuration. But you need to take into account that a big gap exists in execution time.
By time of execution
To compare execution times in a script I usually use the 'timeit' module, but here in Deepnote we can see the execution time just by clicking on the sidebar menu. If you are following this notebook by running it, you can verify it yourself; it shows you something like this:
The main difference in execution time between the LR and RFC models was in the 'fit' instruction. I'm going to show you the times, and after that we will comment on why this happened.
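If you are running outside Deepnote, a sketch of how you could reproduce the timings with timeit (a single run each, so numbers will vary by machine):

```python
import timeit

# One-shot wall-clock timing of each fit, as a stand-in for Deepnote's sidebar readout.
lr_seconds = timeit.timeit(lambda: model_lr.fit(X_train_scaled, y_train), number=1)
rfc_seconds = timeit.timeit(lambda: model_rfc.fit(X_train_scaled, y_train), number=1)
print(f"LR fit: {lr_seconds:.1f} s | RFC fit: {rfc_seconds:.1f} s")
```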
The Logistic Regression took 3.3 seconds to fit the data. In other words, it took 3.3 seconds to map the data and adjust the parameters of its function to learn how the features determine whether a transaction is fraud or not. This comes down to how the algorithm operates.
As you can see, the difference is enormous: more than 10x the Logistic Regression's fit time, and using only 20 trees!
Ok, two things:
1.- The gap is explained by how each algorithm works: the RFC has to grow 20 full decision trees, each one scanning the features for the best splits, while the LR only adjusts a single vector of coefficients.
2.- It needs to be optimized. Both times were measured on un-tuned models. For example, the RFC's time could change if we found the smallest number of trees that still gives an acceptable prediction.
In the end, and following the purpose of this notebook, I declare the Logistic Regression the winner of the time comparison. Of course, you should never choose a model based on a single criterion.
Summary
We took the transactions data, split it into train and test subsets, and scaled it.
Then we created a Logistic Regression model, fitted it, and predicted with it. With those predictions we measured the F1 score, which came out at 0.86: very good for classification. In time terms, this model has the best score, taking just 3.3 seconds to fit.
On the other hand, we created and fitted a Random Forest Classifier model, then used it to predict on the test subset. Measuring its effectiveness, we found an F1 score of 0.92, way better than LR's. But fitting it took 39.2 seconds, more than 10x what the LR model took.
In summary: by classification performance, RFC is the best; by time, LR is the better option.
Conclusion
Depending on your constraints, you could choose one or the other. If you want a quick solution, or if you need to retrain over and over, Logistic Regression would be the best choice. But if you want a more reliable fraud detector (a more robust model) and time doesn't matter much, your model is the Random Forest Classifier.
Remember, this analysis was made with non-optimized models. Maybe we could tune the Logistic Regression to reach a higher F1 score; analogously, we could improve the Random Forest by finding the minimum number of trees that performs as well as we need.