Introduction.
Currently (September - October 2024) a work in progress. A second notebook is underway, focusing primarily on the email subject column. That column has presented its own unique set of linguistic challenges, with many more subjects written in foreign languages than the corresponding email bodies. Once the linguistic modelling and PCA / clustering process was implemented post-POS-tagging, Somali came back as the most influential subject language construct-wise, which suggests that simple translation tools were used to translate the subject across many emails of different languages. The distance to Somali was not natural as far as traditional linguistic distance is concerned, with languages such as Romanian and Welsh among its nearest neighbours.
This email dataset - if you didn't already know - is from 2008, which means things have since changed in both the realism and the translation methods of most phishing emails. I'm aware of the advancements in phishing email defence (especially Google's "pretty impressive" Gmail defence), and of the fact that the model at the end of this project will likely not be as effective in modern settings as it is on this ancient dataset. But this project is being undertaken to get to the bottom of a specific problem within this dataset that has evaded some people for quite a long time.
Cybersecurity Dive brief:
• The financial impact of phishing attacks quadrupled over the past six years, with the average cost rising to $14.8 million per year for U.S. companies in 2021, compared with $3.8 million in 2015, according to a study from the Ponemon Institute on behalf of Proofpoint released Tuesday. Researchers surveyed 591 IT and IT security professionals.
• Companies spent almost $6 million per year on business email compromise (BEC) recovery, which includes about $1.17 million in illicit payments made to attackers annually. Ransomware costs large organizations about $5.66 million per year, including $790,000 in ransom payments.
• The cost of protecting credentials from compromise has also risen sharply, from $381,920 in 2015 to $692,531 in 2021. Organizations are currently seeing about 5.3 credential compromises over a 12-month period, according to the research.
The data.
Reasonably well-balanced classes.
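A quick way to confirm that balance, assuming the dataset is loaded into a pandas DataFrame `df` with hypothetical `body` and `label` columns (the same assumed names are reused in the sketches below):

```python
import pandas as pd

# Proportion of each class; values near 0.5 / 0.5 mean well-balanced.
print(df["label"].value_counts(normalize=True))
```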
An example of fraudulent and non-fraudulent data:
Topic modeling.
Topic modeling using Latent Dirichlet Allocation (LDA):
Here are the top words associated with each topic. Judging by these alone, an ML model of practically any type should have no issue separating the two classes.
The majority of these fraud emails appear to target aspects of male insecurity, such as having a small pee-pee and / or being poor. So as these emails were gathered from an office somewhere, the likely targets will have been Wall St. execs.
Fraudulent Emails:
1. Penis enlargement, pills, and timepieces.
2. Enlargement products and libido.
3. Luxurious items and satisfaction.
4. Replica watches and online stores.
5. Health and enhancement products.
6. CNN alerts and shopping.
7. Rolex and quality watches.
8. Fashionable items and pleasure.
9. Erection and love-related topics.

Non-Fraudulent Emails:
1. Python development and files.
2. Workshops and power consumption.
3. Spam messages and technical issues.
4. Learning events and updates.
5. Python updates and patches.
6. Development rules and releases.
7. Documentation and management.
8. Buildbot and reminders.
9. CPU and technical updates.
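For reference, a minimal sketch of the LDA step with scikit-learn, fitting one model per class so the topics aren't dominated by whichever class is larger; the column names and `label == 1` meaning fraud are assumptions carried over from above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_words_per_topic(texts, n_topics=9, n_words=8):
    # Bag-of-words counts, trimming very rare and very common terms.
    vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
    dtm = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(dtm)
    vocab = vec.get_feature_names_out()
    # Highest-weighted words in each topic-word distribution.
    return [[vocab[i] for i in comp.argsort()[-n_words:][::-1]]
            for comp in lda.components_]

fraud_topics = top_words_per_topic(df.loc[df["label"] == 1, "body"].astype(str))
ham_topics = top_words_per_topic(df.loc[df["label"] == 0, "body"].astype(str))
```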
Polarity.
The average sentiment for non-fraudulent emails is approximately 31.7, while for fraudulent emails it's around 68.3, so fraudulent emails are a lot more positive in tone. The same pattern is visible in many posts on social media, with (countless) examples such as "Thank you Mister [insert name here], you changed my life with your trading knowledge" etc. These are basic social engineering tactics which don't have much effect in the comments section of a FB post, but they often work in phishing emails because they offer hope, or the promise of positivity, in an otherwise dull (or flaccid) life.
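A minimal sketch of how such scores could be produced with TextBlob; the 0-100 scale quoted above suggests polarity was rescaled from TextBlob's native [-1, 1] range, but that rescaling is my assumption:

```python
from textblob import TextBlob

def sentiment_0_100(text):
    polarity = TextBlob(str(text)).sentiment.polarity  # native range: -1 .. 1
    return (polarity + 1) * 50                         # rescaled to 0 .. 100

df["sentiment"] = df["body"].apply(sentiment_0_100)
print(df.groupby("label")["sentiment"].mean())
```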
Languages found in both classes.
The most notable languages in the email body are English, Tagalog, French, Dutch, Afrikaans, Catalan, Danish and Somali.
And the top languages by frequency in the email body:
English, Unknown, Korean, Afrikaans, Dutch, Romanian.
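The notebook's actual detector isn't shown here, so langdetect is an assumption, but the detection step boils down to something like this, with "Unknown" standing in for anything the detector gives up on:

```python
from langdetect import detect, LangDetectException

def detect_lang(text):
    try:
        return detect(str(text))
    except LangDetectException:  # empty or undecipherable text
        return "Unknown"

df["language"] = df["body"].apply(detect_lang)
print(df["language"].value_counts().head(10))
```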
With a chi-squared statistic of 119 and a p-value far below 0.05, there is a significant relationship between the language used in emails and their classification as fraudulent or non-fraudulent.
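That test is a straightforward contingency-table chi-squared, along these lines:

```python
import pandas as pd
from scipy.stats import chi2_contingency

contingency = pd.crosstab(df["language"], df["label"])
chi2, p, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, dof = {dof}")
```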
The average sentiment scores for emails in different languages vary significantly! For example, emails in Slovak (sk), Croatian (hr), and Somali (so) have the highest average sentiment scores, while emails in Japanese (ja), Swahili (sw), and Swedish (sv) have the lowest average sentiment scores.
Typos.
Here are the ten most common languages for typos (note: "sl" is the code for Slovenian; "sk", used in the sentiment list above, is Slovak):
1. Slovenian (sl): 45.0 typos
2. English (en): 18.67 typos
3. German (de): 16.33 typos
4. Albanian (sq): 14.0 typos
5. Welsh (cy): 13.5 typos
6. Italian (it): 10.0 typos
7. French (fr): 5.03 typos
8. Croatian (hr): 5.0 typos
9. Catalan (ca): 4.38 typos
10. Polish (pl): 4.0 typos
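A rough sketch of one way to get numbers like these with pyspellchecker, counting words the checker doesn't recognise. pyspellchecker only ships dictionaries for a handful of languages, so the fallback to English below is a crude assumption, not necessarily what the notebook did:

```python
from functools import lru_cache
from spellchecker import SpellChecker

SUPPORTED = {"en", "es", "fr", "pt", "de", "ru"}  # subset of shipped dictionaries

@lru_cache(maxsize=None)
def get_checker(lang):
    return SpellChecker(language=lang if lang in SUPPORTED else "en")

def typo_count(text, lang):
    words = str(text).lower().split()
    return len(get_checker(lang).unknown(words))  # words not in the dictionary

df["typos"] = [typo_count(t, l) for t, l in zip(df["body"], df["language"])]
print(df.groupby("language")["typos"].mean().sort_values(ascending=False).head(10))
```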
Popular words in four randomly selected languages across both classes.
Linguistic features.
Linguistic correlations for both fraud and non-fraud classes.
Quite visible differences between the two classes here. The most informative point is that the phishing emails contain considerably more adverbs, proper nouns and pronouns per sentence than the non-fraudulent emails. The increase in stopwords and punctuation *could* reflect more basic grammar compared to the non-fraudulent emails written by technology professionals.
This greater use of adverbs is quite telling (as it usually is in instances of emotional manipulation / persuasive language). These messages are designed to persuade people in as little time as possible and get to the point, hence the heavier use of nouns and proper nouns. Not shown here is the overall shorter sentence length in the subject column, which reflects the need for a "grabby" headline: a much greater adverb count packed into a much shorter sentence.
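A sketch of how per-sentence POS rates like these can be extracted with spaCy; the small English model is used purely for illustration, and the notebook's multilingual handling is out of scope here:

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_rates(text):
    doc = nlp(str(text))
    n_sents = max(sum(1 for _ in doc.sents), 1)
    counts = {"ADV": 0, "PROPN": 0, "PRON": 0, "PUNCT": 0, "stop": 0}
    for tok in doc:
        if tok.pos_ in counts:
            counts[tok.pos_] += 1
        if tok.is_stop:
            counts["stop"] += 1
    return {k: v / n_sents for k, v in counts.items()}  # per-sentence rates

features = df["body"].apply(pos_rates).apply(pd.Series)
print(features.groupby(df["label"]).mean())
```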
URL analysis.
The spoofing of media alerts seems to be the primary delivery system of choice. This has been a successful element of phishing campaigns since the mid-2000s, even down to reusing the same spoofed addresses. If you need any explanation as to why it's still a common occurrence (where spammers can circumvent some email filters): "if it ain't broke, don't fix it". Humans are still the weakest link in the chain, and that's why, unbelievably, URL spoofing is still going on today.
Non-fraud domains are mostly tech-related, which is going to add more beef to any ML model's capability. The classes are so obviously separable that I'd be surprised if the resulting accuracy wasn't 100%.
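The domain comparison boils down to pulling hosts out of each body with a (deliberately loose) regex and counting them per class; a sketch:

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://([^\s/\"'>]+)", re.IGNORECASE)

def domains(text):
    return [d.lower() for d in URL_RE.findall(str(text))]

fraud = Counter(d for t in df.loc[df["label"] == 1, "body"] for d in domains(t))
ham = Counter(d for t in df.loc[df["label"] == 0, "body"] for d in domains(t))
print(fraud.most_common(10))
print(ham.most_common(10))
```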
Email body.
Fraudulent emails have an average of 110 special characters, while non-fraudulent emails have an average of 257 special characters.
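Counting special characters is a one-liner along these lines; the regex treats anything that isn't a word character or whitespace as "special", which is an assumption about the notebook's definition:

```python
df["special_chars"] = df["body"].astype(str).str.count(r"[^\w\s]")
print(df.groupby("label")["special_chars"].mean())
```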
Judging by the KMeans PCA plot, the clusters appear to be well-separated. This suggests that most ML algorithms should find it relatively easy to parse and classify the fraudulent emails.
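A reconstruction of roughly that pipeline, with TruncatedSVD standing in for PCA since it works directly on the sparse TF-IDF matrix (the vectoriser settings are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

X = TfidfVectorizer(max_features=5000).fit_transform(df["body"].astype(str))
coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(coords)

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, s=4, alpha=0.5)
plt.title("KMeans clusters in 2-D TF-IDF space")
plt.show()
```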
The difference in the most influential languages between the subject and the email body can be attributed to several factors:
1. Content variation: The subject and body of emails often serve different purposes. Subjects are typically concise and may use different language patterns compared to the body, which can be more detailed and varied.
2. Linguistic features: The linguistic features extracted from the subject and body might emphasise different aspects. For example, the subject might focus more on keywords and sentiment, while the body might include more complex sentence structures and vocabulary.
3. Data distribution: The distribution of languages in the subject and body might differ. Some languages might be more prevalent or have more distinct features in one part of the email compared to the other.
4. Translation and templates: If translation tools or templates are used differently for subjects and bodies, this could lead to variations in linguistic influence.
5. Purpose and tone: The tone and purpose of the subject line (e.g., to grab attention) might differ from the body (e.g., to provide detailed information), leading to different influential languages.
Modeling.
A recent university research team used only the English emails and a smaller portion of this dataset; after stopword removal (plus a couple of other preprocessing techniques), an Extra Trees model returned an FP count of 4 and an FN count of around 8, with accuracy somewhere around 99%. Most models will return high accuracy scores on this data due to the more obvious linguistic patterns noted in the EDA (most notably the topics). However, those FP and FN counts will be difficult to drive down across every language with traditional ML methods because of the more nuanced linguistic patterns, which may require a more advanced architecture to unearth. With such high accuracy coming from comparatively simple models, there is a risk that too complex a model will fit too closely to the training data; in that case I'll be left with the option of *adding* data as opposed to removing it. I've noticed certain high correlations with the phishing email stopwords, so I will be leaving stopwords in place for the final model. From there, I dare say there won't be any data cleaning at all: in addition to the correlations mentioned, and the more advanced model's thirst for data, I think it could be a good idea to vectorise the punctuation as well. In a way this is a good exercise in efficiency, because you would want to reduce real-world email preprocessing to a minimum: cleaning, de-punctuating and adding whatever other preprocessing steps are needed to run an email through a pre-trained model at the hourly rate most email providers receive emails == painful.
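For the no-cleaning route, one way to vectorise punctuation alongside words is a token pattern that treats each punctuation mark as its own token; a sketch (parameter values are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(
    lowercase=True,
    stop_words=None,               # stopwords stay in, per the EDA correlations
    token_pattern=r"\w+|[^\w\s]",  # words OR single punctuation marks
    max_features=20000,
)
X = vectoriser.fit_transform(df["body"].astype(str))
```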
I experimented with all sorts, but the primary algorithms I wanted to implement from the beginning were Dynamic Markov Chains (which might struggle with the weirdness of some of the linguistic issues analysed above), LLM ensembles and LSTMs. Chief among them was a hierarchical LSTM (a bidirectional H-LSTM in this case), adding or subtracting features as I progressed, specifically attention mechanisms, due to those weird linguistic patterns. I ended up seeing good CV results using an attention layer, the Adam optimiser and, naturally for a binary classification problem, a basic-ass sigmoid function.
Considering the ease with which an LR model found the required patterns to return a very respectable accuracy score, the ease with which an ET model returned a good FP count, and the complexity of the H-LSTM model plus MultiHeadAttention layer, it's likely that this will only require one or two epochs with a decent batch size while experimenting with the attention layer's num_heads value and the dropout rate.
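A sketch of how that stack could be wired in Keras; the layer sizes, num_heads and dropout rate are placeholders to be tuned, and this is my reading of the architecture rather than the notebook's exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, MAXLEN, EMB = 20000, 300, 128  # placeholder hyperparameters

inputs = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, EMB)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)  # self-attention
x = layers.GlobalAveragePooling1D()(attn)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary classification

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```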
Hope you got all that.
The baseline FP figure for an LR model on the email body is around 360, with an FN of 280 or more.
Here are the results from the confusion matrix:
- True Positives (TP): 3473
- False Positives (FP): 17
- True Negatives (TN): 4326
- False Negatives (FN): 15

True Positives are the fraudulent emails correctly identified by the model; a high TP count indicates that the model is effective at detecting fraudulent emails. False Positives are non-fraudulent emails incorrectly classified as fraudulent; a low FP count is desirable, as it means fewer legitimate emails are mistakenly flagged as fraud. True Negatives are non-fraudulent emails correctly identified by the model; a high TN count shows the model's ability to accurately recognise legitimate emails. False Negatives are fraudulent emails that the model failed to identify; a low FN count is crucial, as it means fewer fraudulent emails slip through undetected.

Overall, the model performs well, with high TP and TN counts and low FP and FN counts, indicating effective classification of both fraudulent and non-fraudulent emails.
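Those four numbers come straight off the held-out predictions; with the `model`, `X_test` and `y_test` names assumed from the training step above, the readout looks like this:

```python
from sklearn.metrics import confusion_matrix

y_prob = model.predict(X_test).ravel()
y_pred = (y_prob >= 0.5).astype(int)  # sigmoid output -> hard 0/1 labels
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```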