Introduction.
Currently (September - October 2024) a work in progress. A second notebook is underway, focusing primarily on the email subject column. That column has presented its own unique set of linguistic challenges, with many more subjects written in foreign languages than the corresponding email bodies. Once the linguistic modelling and PCA / clustering process was implemented post-POS-tagging, Somali came back as the most influential subject language construct-wise, which suggests that simple translation tools were used to translate the subject across many emails of different languages. The distance to Somali was not natural as far as traditional linguistic distance is concerned, with languages such as Romanian and Welsh among its nearest neighbours.
This email dataset - if you didn't already know - is from 2008, which means things have since changed in both the realism and the translation methods of most phishing emails. I'm aware of the advancements in phishing email defence (especially Google's "pretty impressive" Gmail defence), and of the fact that the model at the end of this project will likely not be as effective in modern settings as it is on this ancient dataset. But this project is being undertaken to get to the bottom of a specific problem within this dataset that has evaded some people for quite a long time.
Cybersecurity Dive brief:
• The financial impact of phishing attacks quadrupled over the past six years, with the average cost rising to $14.8 million per year for U.S. companies in 2021, compared with $3.8 million in 2015, according to a study from the Ponemon Institute on behalf of Proofpoint released Tuesday. Researchers surveyed 591 IT and IT security professionals.
• Companies spent almost $6 million per year on business email compromise (BEC) recovery, which includes about $1.17 million in illicit payments made to attackers annually. Ransomware costs large organizations about $5.66 million per year, including $790,000 in ransom payments.
• The cost of protecting credentials from compromise has also risen sharply, from $381,920 in 2015 to $692,531 in 2021. Organizations are currently seeing about 5.3 credential compromises over a 12-month period, according to the research.
The data.
Reasonably well-balanced classes.
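A quick way to confirm that balance, assuming the dataset is loaded into a pandas DataFrame `df` with hypothetical `body` and `label` columns (the same assumed names are reused in the sketches below):

```python
import pandas as pd

# Proportion of each class; values near 0.5 / 0.5 mean well-balanced.
print(df["label"].value_counts(normalize=True))
```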
An example of fraudulent and non-fraudulent data:
Topic modeling.
Topic modeling using Latent Dirichlet Allocation (LDA):
Here are the top words associated with each topic. Judging by these alone, an ML model of practically any type should have no issue separating the two classes.
The majority of these fraud emails appear to target aspects of male insecurity, such as having a small pee-pee and / or being poor. So as these emails were gathered from an office somewhere, the likely targets will have been Wall St. execs.
Fraudulent Emails:
1. Penis enlargement, pills, and timepieces.
2. Enlargement products and libido.
3. Luxurious items and satisfaction.
4. Replica watches and online stores.
5. Health and enhancement products.
6. CNN alerts and shopping.
7. Rolex and quality watches.
8. Fashionable items and pleasure.
9. Erection and love-related topics.

Non-Fraudulent Emails:
1. Python development and files.
2. Workshops and power consumption.
3. Spam messages and technical issues.
4. Learning events and updates.
5. Python updates and patches.
6. Development rules and releases.
7. Documentation and management.
8. Buildbot and reminders.
9. CPU and technical updates.
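For reference, a minimal sketch of the LDA step with scikit-learn, fitting one model per class so the topics aren't dominated by whichever class is larger; the column names and `label == 1` meaning fraud are assumptions carried over from above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_words_per_topic(texts, n_topics=9, n_words=8):
    # Bag-of-words counts, trimming very rare and very common terms.
    vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
    dtm = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    lda.fit(dtm)
    vocab = vec.get_feature_names_out()
    # Highest-weighted words in each topic-word distribution.
    return [[vocab[i] for i in comp.argsort()[-n_words:][::-1]]
            for comp in lda.components_]

fraud_topics = top_words_per_topic(df.loc[df["label"] == 1, "body"].astype(str))
ham_topics = top_words_per_topic(df.loc[df["label"] == 0, "body"].astype(str))
```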
Polarity.
The average sentiment for non-fraudulent emails is approximately 31.7, while for fraudulent emails it's around 68.3, so fraudulent emails are a lot more positive in tone. The same pattern is visible in many posts on social media, with (countless) examples such as "Thank you Mister [insert name here], you changed my life with your trading knowledge" etc. These are basic social engineering tactics which don't have much effect in the comments section of a FB post, but they often work in phishing emails because they offer hope, or the promise of positivity, in an otherwise dull (or flaccid) life.
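A minimal sketch of how such scores could be produced with TextBlob; the 0-100 scale quoted above suggests polarity was rescaled from TextBlob's native [-1, 1] range, but that rescaling is my assumption:

```python
from textblob import TextBlob

def sentiment_0_100(text):
    polarity = TextBlob(str(text)).sentiment.polarity  # native range: -1 .. 1
    return (polarity + 1) * 50                         # rescaled to 0 .. 100

df["sentiment"] = df["body"].apply(sentiment_0_100)
print(df.groupby("label")["sentiment"].mean())
```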
Languages found in both classes.
The most notable languages in the email body are English, Tagalog, French, Dutch, Afrikaans, Catalan, Danish and Somali.
And the top languages by frequency in the email body:
English, Unknown, Korean, Afrikaans, Dutch, Romanian.
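The notebook's actual detector isn't shown here, so langdetect is an assumption, but the detection step boils down to something like this, with "Unknown" standing in for anything the detector gives up on:

```python
from langdetect import detect, LangDetectException

def detect_lang(text):
    try:
        return detect(str(text))
    except LangDetectException:  # empty or undecipherable text
        return "Unknown"

df["language"] = df["body"].apply(detect_lang)
print(df["language"].value_counts().head(10))
```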
With a chi-squared statistic of 119 and a p-value far below 0.05, there is a significant relationship between the language used in emails and their classification as fraudulent or non-fraudulent.
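That test is a straightforward contingency-table chi-squared, along these lines:

```python
import pandas as pd
from scipy.stats import chi2_contingency

contingency = pd.crosstab(df["language"], df["label"])
chi2, p, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}, dof = {dof}")
```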
The average sentiment scores for emails in different languages vary significantly! For example, emails in Slovak (sk), Croatian (hr), and Somali (so) have the highest average sentiment scores, while emails in Japanese (ja), Swahili (sw), and Swedish (sv) have the lowest average sentiment scores.
Typos.
Here are the ten most common languages for typos (note: "sl" is the code for Slovenian; "sk", used in the sentiment list above, is Slovak):
1. Slovenian (sl): 45.0 typos
2. English (en): 18.67 typos
3. German (de): 16.33 typos
4. Albanian (sq): 14.0 typos
5. Welsh (cy): 13.5 typos
6. Italian (it): 10.0 typos
7. French (fr): 5.03 typos
8. Croatian (hr): 5.0 typos
9. Catalan (ca): 4.38 typos
10. Polish (pl): 4.0 typos
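A rough sketch of one way to get numbers like these with pyspellchecker, counting words the checker doesn't recognise. pyspellchecker only ships dictionaries for a handful of languages, so the fallback to English below is a crude assumption, not necessarily what the notebook did:

```python
from functools import lru_cache
from spellchecker import SpellChecker

SUPPORTED = {"en", "es", "fr", "pt", "de", "ru"}  # subset of shipped dictionaries

@lru_cache(maxsize=None)
def get_checker(lang):
    return SpellChecker(language=lang if lang in SUPPORTED else "en")

def typo_count(text, lang):
    words = str(text).lower().split()
    return len(get_checker(lang).unknown(words))  # words not in the dictionary

df["typos"] = [typo_count(t, l) for t, l in zip(df["body"], df["language"])]
print(df.groupby("language")["typos"].mean().sort_values(ascending=False).head(10))
```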
Popular words in four randomly selected languages across both classes.
Linguistic features.
Linguistic correlations for both fraud and non-fraud classes.
Quite visible differences between the two classes here. The most informative point is that the phishing emails contain considerably more adverbs, proper nouns and pronouns per sentence than the non-fraudulent emails. The increase in stopwords and punctuation *could* reflect more basic grammar compared to the non-fraudulent emails written by technology professionals.
This greater use of adverbs is quite telling (as it usually is in instances of emotional manipulation / persuasive language). These messages are designed to persuade people in as little time as possible and get to the point, hence the heavier use of nouns and proper nouns. Not shown here is the overall shorter sentence length in the subject column, which reflects the need for a "grabby" headline: a much greater adverb count packed into a much shorter sentence.
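A sketch of how per-sentence POS rates like these can be extracted with spaCy; the small English model is used purely for illustration, and the notebook's multilingual handling is out of scope here:

```python
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_rates(text):
    doc = nlp(str(text))
    n_sents = max(sum(1 for _ in doc.sents), 1)
    counts = {"ADV": 0, "PROPN": 0, "PRON": 0, "PUNCT": 0, "stop": 0}
    for tok in doc:
        if tok.pos_ in counts:
            counts[tok.pos_] += 1
        if tok.is_stop:
            counts["stop"] += 1
    return {k: v / n_sents for k, v in counts.items()}  # per-sentence rates

features = df["body"].apply(pos_rates).apply(pd.Series)
print(features.groupby(df["label"]).mean())
```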
URL analysis.
The spoofing of media alerts seems to be the primary delivery system of choice. This has been a successful element of phishing campaigns since the mid-2000s, even down to reusing the same spoofed addresses. If you need any explanation as to why it's still a common occurrence (where spammers can circumvent some email filters): "if it ain't broke, don't fix it". Humans are still the weakest link in the chain, and that's why, unbelievably, URL spoofing is still going on today.
Non-fraud domains are mostly tech-related, which is going to add more beef to any ML model's capability. The classes are so obviously separable that I'd be surprised if the resulting accuracy wasn't 100%.
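The domain comparison boils down to pulling hosts out of each body with a (deliberately loose) regex and counting them per class; a sketch:

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://([^\s/\"'>]+)", re.IGNORECASE)

def domains(text):
    return [d.lower() for d in URL_RE.findall(str(text))]

fraud = Counter(d for t in df.loc[df["label"] == 1, "body"] for d in domains(t))
ham = Counter(d for t in df.loc[df["label"] == 0, "body"] for d in domains(t))
print(fraud.most_common(10))
print(ham.most_common(10))
```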
Email body.
Fraudulent emails have an average of 110 special characters, while non-fraudulent emails have an average of 257 special characters.
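Counting special characters is a one-liner along these lines; the regex treats anything that isn't a word character or whitespace as "special", which is an assumption about the notebook's definition:

```python
df["special_chars"] = df["body"].astype(str).str.count(r"[^\w\s]")
print(df.groupby("label")["special_chars"].mean())
```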
Judging by the KMeans PCA plot, the clusters appear to be well-separated. This suggests that most ML algorithms should find it relatively easy to parse and classify the fraudulent emails.
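A reconstruction of roughly that pipeline, with TruncatedSVD standing in for PCA since it works directly on the sparse TF-IDF matrix (the vectoriser settings are placeholders):

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

X = TfidfVectorizer(max_features=5000).fit_transform(df["body"].astype(str))
coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(X)
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(coords)

plt.scatter(coords[:, 0], coords[:, 1], c=clusters, s=4, alpha=0.5)
plt.title("KMeans clusters in 2-D TF-IDF space")
plt.show()
```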
The difference in the most influential languages between the subject and the email body can be attributed to several factors:
1. Content variation: The subject and body of emails often serve different purposes. Subjects are typically concise and may use different language patterns compared to the body, which can be more detailed and varied.
2. Linguistic features: The linguistic features extracted from the subject and body might emphasise different aspects. For example, the subject might focus more on keywords and sentiment, while the body might include more complex sentence structures and vocabulary.
3. Data distribution: The distribution of languages in the subject and body might differ. Some languages might be more prevalent or have more distinct features in one part of the email compared to the other.
4. Translation and templates: If translation tools or templates are used differently for subjects and bodies, this could lead to variations in linguistic influence.
5. Purpose and tone: The tone and purpose of the subject line (e.g., to grab attention) might differ from the body (e.g., to provide detailed information), leading to different influential languages.
Modeling.
A recent university research team used only the English emails and a smaller portion of this dataset; after stopword removal (plus a couple of other preprocessing techniques), an Extra Trees model returned an FP count of 4 and an FN count of around 8, with accuracy somewhere around 99%. Most models will return high accuracy scores on this data due to the more obvious linguistic patterns noted in the EDA (most notably the topics). However, those FP and FN counts will be difficult to drive down across every language with traditional ML methods because of the more nuanced linguistic patterns, which may require a more advanced architecture to unearth. With such high accuracy coming from comparatively simple models, there is a risk that too complex a model will fit too closely to the training data; in that case I'll be left with the option of *adding* data as opposed to removing it. I've noticed certain high correlations with the phishing email stopwords, so I will be leaving stopwords in place for the final model. From there, I dare say there won't be any data cleaning at all: in addition to the correlations mentioned, and the more advanced model's thirst for data, I think it could be a good idea to vectorise the punctuation as well. In a way this is a good exercise in efficiency, because you would want to reduce real-world email preprocessing to a minimum: cleaning, de-punctuating and adding whatever other preprocessing steps are needed to run an email through a pre-trained model at the hourly rate most email providers receive emails == painful.
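For the no-cleaning route, one way to vectorise punctuation alongside words is a token pattern that treats each punctuation mark as its own token; a sketch (parameter values are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectoriser = TfidfVectorizer(
    lowercase=True,
    stop_words=None,               # stopwords stay in, per the EDA correlations
    token_pattern=r"\w+|[^\w\s]",  # words OR single punctuation marks
    max_features=20000,
)
X = vectoriser.fit_transform(df["body"].astype(str))
```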
I experimented with all sorts, but the primary algorithms I wanted to implement from the beginning were Dynamic Markov Chains (which might struggle with the weirdness of some of the linguistic issues analysed above), LLM ensembles and LSTMs. Chief among them was a hierarchical LSTM (a bidirectional H-LSTM in this case), adding or subtracting features as I progressed, specifically attention mechanisms, due to those weird linguistic patterns. I ended up seeing good CV results using an attention layer, the Adam optimiser and, naturally for a binary classification problem, a basic-ass sigmoid function.
Considering the ease with which an LR model found the required patterns to return a very respectable accuracy score, the ease with which an ET model returned a good FP count, and the complexity of the H-LSTM model plus MultiHeadAttention layer, it's likely that this will only require one or two epochs with a decent batch size while experimenting with the attention layer's num_heads value and the dropout rate.
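A sketch of how that stack could be wired in Keras; the layer sizes, num_heads and dropout rate are placeholders to be tuned, and this is my reading of the architecture rather than the notebook's exact code:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, MAXLEN, EMB = 20000, 300, 128  # placeholder hyperparameters

inputs = layers.Input(shape=(MAXLEN,))
x = layers.Embedding(VOCAB, EMB)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(x, x)  # self-attention
x = layers.GlobalAveragePooling1D()(attn)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # binary classification

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```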
Hope you got all that.
The baseline FP figure for an LR model on the email body is around 360, with an FN of 280 or more.
Here are the results from the confusion matrix:
- True Positives (TP): 3473
- False Positives (FP): 17
- True Negatives (TN): 4326
- False Negatives (FN): 15

True Positives are the fraudulent emails correctly identified by the model; a high TP count indicates that the model is effective at detecting fraudulent emails. False Positives are non-fraudulent emails incorrectly classified as fraudulent; a low FP count is desirable, as it means fewer legitimate emails are mistakenly flagged as fraud. True Negatives are non-fraudulent emails correctly identified by the model; a high TN count shows the model's ability to accurately recognise legitimate emails. False Negatives are fraudulent emails that the model failed to identify; a low FN count is crucial, as it means fewer fraudulent emails slip through undetected.

Overall, the model performs well, with high TP and TN counts and low FP and FN counts, indicating effective classification of both fraudulent and non-fraudulent emails.
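Those four numbers come straight off the held-out predictions; with the `model`, `X_test` and `y_test` names assumed from the training step above, the readout looks like this:

```python
from sklearn.metrics import confusion_matrix

y_prob = model.predict(X_test).ravel()
y_pred = (y_prob >= 0.5).astype(int)  # sigmoid output -> hard 0/1 labels
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
```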