Experimenting with SVM vs. LightGBM
Introduction.
This is a highly imbalanced dataset which will benefit from a good dose of downsampling; there are only around 500 fraudulent transactions, so the non-fraudulent data will be trimmed down from roughly 285,000 rows to a manageable size for training and validation, and the leftover portion of the data will be set aside for testing.
Initially, this was a project to showcase the lasting clout of the SVM. As one of my favourite models, it still holds its own against the majority of other models today for classification tasks, and one-class SVM remains a reliable choice for unsupervised anomaly detection...
However (there is always one), the flipside of this project is to highlight the power of gradient boosting algorithms... mostly because I cop flak for using LightGBM for practically every tabular data project I tackle. Lately I have seen some newbies on a certain professional networking website advocating for SVM et al., so I thought this would be a good project to show why SVM is still a fantastic choice for jobs like this, and also to show the prowess of gradient boosting algos in comparison.

Bear in mind that the SVM here has undergone several rounds of GridSearchCV as well as RandomizedSearchCV, and has been fine-tuned using the best parameters from those rounds. LightGBM returned better scores across the board with no parameter tuning whatsoever, although I wanted to use more test data at the end of the project, so it underwent a threefold cross-validation round. For this data, the results were better with LightGBM than they were with a PyTorch graph NN, autoencoders and isolation forests. I haven't visualised the data pre-model because df.head() gave enough information to guide model choice.
First is to separate the fraudulent rows, then set a handful of them aside for testing. The remaining fraudulent rows will be combined with the downsampled non-fraudulent rows and passed to train_test_split for training and validation:
And 100 samples have been set aside for testing:
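A minimal sketch of that preparation, assuming the raw data sits in a dataframe `df` with the target in a `Class` column (1 = fraud); the variable names and the exact sample sizes below are placeholders rather than the values used in this notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

fraud = df[df["Class"] == 1]        # ~500 fraudulent rows
non_fraud = df[df["Class"] == 0]    # ~285,000 non-fraudulent rows

# Hold 100 rows back for the final test set (a mix of fraud and non-fraud)
fraud_test = fraud.sample(n=30, random_state=42)
non_fraud_test = non_fraud.sample(n=70, random_state=42)
test_df = pd.concat([fraud_test, non_fraud_test])

# Downsample the remaining non-fraudulent rows to a workable size
fraud_rest = fraud.drop(fraud_test.index)
non_fraud_rest = non_fraud.drop(non_fraud_test.index).sample(n=2000, random_state=42)

model_df = pd.concat([fraud_rest, non_fraud_rest]).sample(frac=1, random_state=42)
X = model_df.drop("Class", axis=1)
y = model_df["Class"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```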
Zero null values in the data:
Scaling the features:
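Something along these lines for the null check and the scaling, fitting the scaler on the training split only (variable names carried over from the sketch above):

```python
from sklearn.preprocessing import StandardScaler

# Confirm there are no missing values anywhere in the dataframe
print(df.isnull().sum().sum())   # expect 0

# Fit on the training features only, then reuse the fitted scaler elsewhere
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
```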
SVC and LightGBM models.
SVC model vs. the test data.
100 samples from the original dataframe as test data, saved earlier:
Test set SVM metrics:
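Roughly what the SVC run against the held-out 100 samples looks like; the hyperparameters below are placeholders standing in for whatever the GridSearchCV / RandomizedSearchCV rounds actually settled on:

```python
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder hyperparameters, not the tuned values from the search rounds
svc = SVC(C=10, gamma="scale", kernel="rbf", class_weight="balanced")
svc.fit(X_train_scaled, y_train)

# Scale the 100 held-out rows with the scaler fitted on the training data
X_test_scaled = scaler.transform(test_df.drop("Class", axis=1))
y_test = test_df["Class"]

svc_preds = svc.predict(X_test_scaled)
print(confusion_matrix(y_test, svc_preds))
print(classification_report(y_test, svc_preds))
```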
LightGBM trained, then tested using the test dataset.
LightGBM test metrics:
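And the LightGBM equivalent, left on default parameters apart from a fixed seed, with the threefold cross-validation mentioned in the introduction (a sketch, using the same placeholder names as above):

```python
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

lgbm = LGBMClassifier(random_state=42)

# Threefold cross-validation on the training data before the final fit
print(cross_val_score(lgbm, X_train_scaled, y_train, cv=3, scoring="roc_auc"))

lgbm.fit(X_train_scaled, y_train)
lgbm_preds = lgbm.predict(X_test_scaled)
print(confusion_matrix(y_test, lgbm_preds))
print(classification_report(y_test, lgbm_preds))
```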
Difference between SVC and LGBM performance, ROC AUC (validation data).
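The comparison itself is just a pair of roc_auc_score calls on the validation split, something like:

```python
from sklearn.metrics import roc_auc_score

# SVC scores by distance from the hyperplane; LightGBM by predicted probability
svc_auc = roc_auc_score(y_val, svc.decision_function(X_val_scaled))
lgbm_auc = roc_auc_score(y_val, lgbm.predict_proba(X_val_scaled)[:, 1])
print(f"SVC ROC AUC:  {svc_auc:.4f}")
print(f"LGBM ROC AUC: {lgbm_auc:.4f}")
print(f"Difference:   {lgbm_auc - svc_auc:.4f}")
```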
Fraudulent samples prepared for investigation.
SVC:
LightGBM:
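A sketch of that preparation: take the fraudulent rows used for training and validation, scale them with the same fitted scaler, and run both models over them (names reused from the earlier sketches):

```python
# Fraud-only rows (excluding the held-out test frauds), scaled consistently
X_fraud = scaler.transform(fraud_rest.drop("Class", axis=1))
y_fraud = fraud_rest["Class"]

svc_fraud_preds = svc.predict(X_fraud)
lgbm_fraud_preds = lgbm.predict(X_fraud)
```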
False negatives from each model for investigation.
SVC:
LightGBM:
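Pulling the false negatives out is then a matter of boolean indexing on those predictions:

```python
# Fraudulent rows each model predicted as non-fraud (the costly mistakes)
svc_fn = fraud_rest[svc_fraud_preds == 0]
lgbm_fn = fraud_rest[lgbm_fraud_preds == 0]
print(len(svc_fn), len(lgbm_fn))
```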
Comparison of averages between the fraudulent samples and the false negatives.
The total amount of money lost due to the false negatives:
The sum of money rescued in the true positives:
Percentage of capital missed per 1048 samples through this model:
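The money figures come from the Amount column; a sketch using the SVC false negatives from above (the actual figures in this write-up come from the notebook's own run):

```python
# Money lost to false negatives vs. money rescued by true positives (SVC)
lost = svc_fn["Amount"].sum()
rescued = fraud_rest[svc_fraud_preds == 1]["Amount"].sum()

pct_missed = 100 * lost / (lost + rescued)
print(f"Lost: {lost:.2f}  Rescued: {rescued:.2f}  Missed: {pct_missed:.2f}%")
```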
Both models using 50% of the original dataframe.
Going big with 142,158 samples to see how well the models have generalised:
Augmenting 20 rows of "fraudulent" data, as all of the original fraudulent samples were used in training and validation:
Inserting them into the test dataframe:
Converting to a workable format:
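One way to put that bigger test together, hedged as before: take half of the original dataframe, jitter 20 existing fraud rows with a little Gaussian noise to act as the "fraudulent" augmentations (a stand-in for however the rows were actually built), insert them, and convert to the scaled arrays the models expect:

```python
import numpy as np
import pandas as pd

# Roughly half of the original dataframe as a "real world" test set
big_test = df.sample(frac=0.5, random_state=1)

# Augment 20 "fraudulent" rows by adding small Gaussian noise to real frauds
rng = np.random.default_rng(1)
feature_cols = [c for c in df.columns if c != "Class"]
aug = fraud.sample(n=20, random_state=1).copy()
aug[feature_cols] += rng.normal(0, 0.05, size=(20, len(feature_cols)))
aug["Class"] = 1

# Insert them into the test dataframe, then convert to a workable format
big_test = pd.concat([big_test, aug], ignore_index=True)
X_big = scaler.transform(big_test[feature_cols])
y_big = big_test["Class"]

svc_big_preds = svc.predict(X_big)
lgbm_big_preds = lgbm.predict(X_big)
```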
The SVM model on the augmented "real world" test dataset has returned many more false positives than the LightGBM model, as well as one false negative:
The LGBM model, although trained on a small amount of data, has correctly flagged all 20 fraudulent examples with zero false negatives. There are, however, 368 false positives, which won't cause the business to lose money the way false negatives would, although they will cost money in wages due to the time required for an employee to verify the flagged transactions.
Getting those numbers down as much as possible will be a work in progress from here, so changes will be noticeable in this project (training / validation data size, etc.), but this outlines the power of LightGBM in instances where strong models such as SVM still occasionally miss, especially where false negatives are concerned.