from collections import defaultdict, Counter import difflib import enchant import gensim from gensim import corpora from gensim.utils import simple_preprocess from imblearn.over_sampling import RandomOverSampler, SMOTE import matplotlib.pyplot as plt import numpy as np import nlpaug.augmenter.word as naw import nltk from nltk import FreqDist from nltk.corpus import stopwords, wordnet from nltk.sentiment.vader import SentimentIntensityAnalyzer from nltk.stem.wordnet import WordNetLemmatizer import os import pandas as pd import plotly.express as px import pprint import re import seaborn as sns from sklearn import metrics from sklearn.model_selection import cross_val_score from sklearn.linear_model import LogisticRegression from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer from sklearn.metrics import accuracy_score, classification_report from sklearn.model_selection import train_test_split, GridSearchCV from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB from sklearn.neural_network import MLPClassifier from sklearn.preprocessing import LabelEncoder import string from textblob import TextBlob, Word, Blobber from transformers import AutoTokenizer, AutoModelForCausalLM import wordcloud from wordcloud import WordCloud, STOPWORDS import warnings warnings.filterwarnings("ignore")

Introduction.

I'm interested in psychology and ultimately would like to see if there is anything "in it" with MBTI typology and the machine learning side of predicting character type via chat content. I know people will talk about different subjects based on their socio-economic status, upbringing and an array of other variables, and as far as I know, MBTI typology was more of a loose method by which employers and psychologists typed people, so I'm slightly sceptical about the accuracy of predicting a user's personality from their online chat content.

What I wanted from this project initially was for the model to predict a user's type by asking the user to enter a string of chat content of their choosing outside of thte context of the training data, and if not, what can be done to make it work. As far as I can see from importing the data and eyeballing a few of the values, personality could be very difficult for ML to predict when the user inputs a string of text to be analysed because there will be a high probability of the user entering text that's outside the context of the chat data used in this model (such is the nature of machine learning). Ergo: Me, as an ENTP, could have a conversation with you - whatever your MBTI type may be - about a subject and use a lot of the same words while discussing it, so for these personality types to be properly ascertained, the type-specific chat content would likely require type-specific identifiers within those chats and that could really be a bit of a tall order for this dataset. Writing this paragraph after completing half of the EDA to provide a better example, for my MBTI type to be recognised by this model I would have to talk about drugs, loneliness and sex a lot for this model to identify my type (*facepalm*). So please be aware of this if you're going to use the model to predict your type using the function at the end of the project; even returning a high degree of accuracy from the model, bear in mind that it will be trained on a certain type of chat written by people who claim to be of a certain type, discussing issues that have a high probability of being of a different context to what you are about to input to the model. Meaning even though you're an ISTJ, the thing you input could be exactly the same as an ENFP's chat here. This has spurred me on to build a bigger, better dataset of the same nature so that I can help people better understand their type because I feel that some of the typology websites questions aren't as accurate as they could be, but this a long process and I won't be featuring that for a good while yet.

Additionally, we have to consider the data source; online chats written by members of the public who may not be the MBTI type they profess to be, in part due to the accuracy of the typology website used to gauge their MBTI type and more likely, the user mis-typing themselves for whatever reason.

That being said, properly parsing personality from chat data does stand to reason with a couple of the aspects of MBTI typology; NT types are definitely more logical and analytical than not, etcetera. Introverts who gain their energy from solitary pastimes are more likely to spend time online absorbing and discussing internet culture and what-not, but thinking about the big picture, I still don't think there are too many linguistic differences here for a model to be reliably trained on but I would like to be sure.,

Finally, there is much class imbalance here so some form of oversampling will be required for some of the types, so training a model with oversampling and without oversampling but with plenty of cross-validation will be another interesting part of the project.

Hokay, let's git 'er done...

df = pd.read_csv("/work/mbti_1.csv")

The data.

Dataframe head.

Nothing out of the rdinary here - just the two main columns, one containing the MBTI type and the other containing the corresponding chat info.

Dataframe info.

And nothing in the way of null values.

Analysis.

Distribution of values.

The value counts for each character type. Introverted intuitive types ('IN') are the most common posters here, with the extroverted sensors ('ES') being the least common, likely due to having day jobs ;-)

A histogram of the distribution of values across each MBTI type showing that the introverted intuitive types INFP, INFJ, INTP and INTJ hold the four top counts.

The four MBTI types holding the least counts are the extroverted 'sensors' (?... I think I have that right) ESTJ, ESFJ, ESFP and ESTP.

So with between 39 and 89 counts for the four least common types here, and between 1091 and 1832 counts for the most common types, I'm going to stick my neck out and say that the accuracy will be quite low and the model will require plenty of experimentation. Plus, I don't think 39 chats will bee enough to help an oversampling method if the end function is to accurately gauge a user by inputting some text. That text input would have to be extremely accurate context-wise if ESTJ is to be predicted even with the help of SMOTE. So, the model accuracy could well be taken as gospel, but the accuracy of predicting the user's text will more than likely be inaccurate unless it's predicting one of the top four or five MBTI types here.

A word cloud containing the most common words used by all character types shows more evidence of a difficult end prediction, with plenty of common words being distributed among the types. "Think", "People" and "Want" will be spread around all types by a large degree.

Sentiment.

Creating dataframes for posts written by the introverts and the extroverts so I can chart the polarity for both.

The introverts:

The extroverts:

def polarity_plot(string): def _polarity(string): return TextBlob(string).sentiment.polarity polarity_score_result = string.apply(lambda x: _polarity(x)) polarity_score_result.hist()

Polarity plot of the entire dataset.

Polarity for the overall posts shows the majority value sitting in the 0.1 - 0.2 region, so, pretty positive overall.

Introvert and extrovert polarity plots.

The introverts vs. the extroverts.

This shows the introverts posted marginally less positive content than their extraverted counterparts which is a good start.

Querying the 'NT' types and the 'ST' types.

Querying the 'NT' data.

Querying the 'ST' data.

Interesting, the 'NT' types' posts are quite a bit less positive than the 'ST' types. I suppose this is the curse of being analytical.

Querying the 'NF' and 'SF' types.

'NF' vs. 'SF' type polarity plots.

Who'da thunk it, the polarity for the 'NF' and 'SF' types seems to be the most positive overall, with the 'SF''s pipping the 'NF''s to the 0.2 range and straddling that range a tad more than the 'SF''s.

Querying the perceiving types and the judging types.

Perceiving type vs. judging type polarity plots.

And the Judging types are more positive than the perceivers.

Data cleaning.

Removing URLs.

df.posts = df.posts.str.lower()

def remove_urls(text): url_pattern = r'https?://\S+|www\.\S+' text = re.sub(url_pattern, '', text) return text

df['posts'] = df['posts'].apply(lambda x: remove_urls(x))

Stripping any extra whitespace.

df = df.applymap(lambda x: x.strip())

Regex query to leave only alpha chars:

df.posts = df.posts.str.replace('[^a-zA-Z]', ' ', regex=True).str.strip()

Checking the stopwords are kosher and applying the method to the data.

stop_words = list(stopwords.words('english'))

df.posts = df.posts.apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

Average word length by type.

Looking at the average word length for each type, there isn't much in it, the average length ranges between 5.50 and 5.74, so this isn't a huge amount of information with which a ML model can work with as far as pattern recognition goes.

The NT types hold the top four most popular entries here.

Keyword averages.

As an ENTP, I'm ashamed to say the counts for words such as 'bored' and 'drugs' are quite outstanding compared to any other word, but I use motorbikes and data analysis in place of both.

There do seem to be stereotypical patterns here though, with ENFP, ENFJ and INFP mentioning the word 'love' more than most.

ISTP and ISTJ mention the word 'hate' more than most, something I've recognised among my friends who type as ISTP especially is that they don't hate on people per-se, they just use the word to describe something that's a bit crap, like jogging (strange considering a lot of them are military).

ENTP, INTP, INTJ and ENTJ are below average for the word 'happy' and 'sad', another thing that isn't exactly inaccurate.

The 'NF' types are above average for the word 'sad' as well which is somewhat stereotypical.

ENTJ has shuffled up the chart like a rat up a drainpipe once the word 'angry' was introduced, for... reasons.... unbeknownst to anyone...

.....

ESFJ flying the flag for the duality of man, holding the top average for both 'excited' and 'depressed', respectively.

And the INTPs holding some of the highest averages for the words 'smoke', 'drugs' and 'alcohol'. No further comment, your honour.

Cussword analysis.

Those of a sensitive nature may want to look away now.

A chart visualising the amount of sweary words for each type.

INTP, ENTP and INTJ are the winners here, obviously. INFP and ISTP aren't far behind.

Cussword distribution.

The percentage of swearwords per type vs. actual word percentage per type.

Non-cussword distribution.

Displaying the percentage of dataframe words for each type for comparison. The INTPs do quite well here, with 15% of the total word count and 17.4% of the total swear word count!

df['word_count'] = [len(x.split()) for x in df['posts'].tolist()]

df['word_percentage'] = df['word_count'] / df['posts'].str.split().str.len() * 100

Cusswords vs. positive words.

An interesting personal experiment; I'd like to take 22 nice, positive words to counter the list of 22 swearwords and do an overlay or violin plot for both wordlists.

It seems the positive words begin to dwindle as the swearwords add-up...

Nice vs. cussword distribution by type.

We can see the INTP drop down a position when it comes to the use of (certain) positive words, sitting at 15% post count, 17.4% swearwords and 11.8% positive / 'nice' words. Granted, the INFPs are responsible for the majority post counts so they will always be hitting the numero uno count in these instances, but it's good to see that their use of positive words outweighs their use of swearwords by around 6%. The INTJs account for 12.6% of the total post count, 13.3% of their posts are swearwords and 10% are positive words. The ENFPs score reasonably well here, with 7.78% of the total post count, they only swear 6.77% of the time and appear to use 9.74% positive words.

O' the thirst.

Checking out which types are the thirstiest between introverts and extroverts by selecting some choice words of a sexual nature, hidden in case of younger viewers present. There are only five or six words and they are quite immature but I think this will help parse out some character types.

Distribution of words of a sexual nature by type.

INFP, INTP, INFJ and INTJ are the most common users of (childish) words of a sexual nature.

And no surprises here whatsoever.

Common words by MBTI type.

The confusion for the model is visible here, with similar word distributions being present in each types' chat, especially the words 'Think' and 'People', 'Know' and 'One' etc.. Experimenting running different models with the removal of some of these words will take some time.

My initial line of thought was to remove the mention of certain MBTI types from the posts, but it seems that specific types address their own type more than some others, so I will leave the types in the posts for the sake of the types that don't have many posts here.

MBTI types are more of an indication of how we process information so gauging type by text without the author mentioning their MBTI type will be difficult for most types, picking up on stress and anxiety among authors in other projects has proved easy, but as already mentioned, an ESFJ could have a very similar vocabulary to an INTP depending on their life experience, providing the biggest challenge here without a lot of data.

Will the removal of commonly-shared words help?

The percentage of these words is very similar between types, a figure ranging between 19% - 24%, so unless it's in the case of the ESFJ who has a relatively high figure here, I feel that the inclusion of some of these words could harm accuracy.

Removing some of these common words from the chat data. The position of these words within the chat may help accuracy so it would be best to not be too cavalier with this. for example, the word 'feel' can be used in many different instances and could even be quite type-specific.

Least-common words.

I thought I would take a look into the least common words because I feel that typos will probably appear less frequently than most other words. The manner in which I do this will not be perfect but it will be computationally inexpensive in comparison to TextBlob et. al.. Additionally, I think some non-dictionary words normally classed as typos will be popular among some character types due to the words being synonymous with internet culture or whatever, so removing too many infrequent words could prove to be inaccurate in some cases.

Creating a list of the thirty least common words for each type, converting that list of lists into one long list, iterating through that to compare the words to words which aren't in the English dictionary, placing those words into their own list called 'not_in_dict' then removing those words from df.posts.

least_common_words = [] for x in df.type.unique(): least_common_words.append(df[df['type'] == x].posts.str.split(expand=True).stack().value_counts().tail(30).index.tolist())

least_common_words = [item for sublist in least_common_words for item in sublist]

not_in_dict = [] new_line = '\n' for x in least_common_words: if not wordnet.synsets(x): not_in_dict.append(x) print(f"The words that are not in the dictionary are: {new_line}{not_in_dict}{new_line}")

Typos.

A wordcloud of the typos. There are some words such as babcock and californians which are lowercase so will likely be treated as typos as a result.

A chart representing the 'typo' counts and a dictionary below the chart with the indices of the words. I will experiment with the removal of all or some of these.

non_english_words_count = [] for x in not_in_dict: non_english_words_count.append(df.posts.str.count(x).sum()) non_english_words_count = pd.DataFrame(non_english_words_count, columns=['count'])

Average of typos for each MBTI type.

It seems accuracy is better when the words from not_in_dict are left in the chat data. I was experimenting with different strategies such as removing the 'www''s and 'youtube''s which were still in the chats after the URL treatment had been applied, but these words were synonymous with certain types and their presence helped accuracy.

Looking at the distribution of other typos for each MBTI type though (below), we do see some identifiers such as the ESTJ and ISTJ more commonly using what one could class as 'typos', and the INTJ being a little more precise with their grammar. The good thing is that the typo average is relatively high for the majority class (INFP), almost 15%, so this should go some way to help accuracy.

Let's also not forget the internet is a global thing and some of these words could be a non-English speaker misspelling certain words. And I have seen many instances of words like 'of' being spelled 'pf' due to keyboard proximity. So there are strategies which will cure and kill accuracy both depending on MBTI type and / or geolocation.

df.posts = df.posts.str.replace('eas', 'was') df.posts = df.posts.str.replace('pf', 'of') df.posts = df.posts.str.replace('ons', 'one') df.posts = df.posts.str.replace('vo', '') df.posts = df.posts.str.replace('lo', '') df.posts = df.posts.str.replace('om', '') df.posts = df.posts.str.replace('una', '')

The most common words.

(Post data-cleaning)

corpus = [] words = df['posts'].str.split() words = words.values.tolist() corpus = [x for y in words for x in y]

counter = Counter(corpus) common = counter.most_common() x, y = [], [] for string, count in common[:30]: if string not in stop_words: x.append(string) y.append(count)

Lemmatizing the text.

lemmatize = WordNetLemmatizer() df['posts'] = df['posts'].apply(lambda x: lemmatize.lemmatize(x))

nltk.download('averaged_perceptron_tagger')

POS tagging.

# corpora = [] # for x in df.posts: # corpora.append(x) # corpora = ' '.join(corpora) # corpus = nltk.word_tokenize(corpora) # pos = nltk.pos_tag(corpus) # print(pos[:10])

# pos = pd.DataFrame(pos, columns=['Word', 'POS']) # df['pos'] = pos['POS']

Modeling.

As this is a large-ish dataset, I was admittedly strapped for hardware. Cross-validating with five 'C' values, five max_iter values and a couple of solvers took an extremely long time even with a relatively good, modern GPU without SMOTE imputation for the imbalance. So this result could be a good 10%-20% better with the right hardware. And ultimately I was quite surprised at how little data cleaning was required to get the accuracy up from an initial low value to almost 70%.

After experimenting with them all (Word2Vec, Doc2Vec, Linear SVM, Multinomial NB, BOW with TF etc.), Logistic Regression came out head & shoulders above the rest after a lengthy training session for optimal parameters.

Label encoder.

LabelEncoding the type column into 'type_label', dropping the type column, printing the types and their corresponding labels to refer back to once I've made a prediction.

le = LabelEncoder() df['type_label'] = le.fit_transform(df.type.values) type_dict = dict(zip(le.classes_, le.transform(le.classes_))) print(type_dict) df = df.drop(["type"], axis = 1)

vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(df['posts'])

y = df['type_label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

smote = SMOTE(random_state=42) X_train, y_train = smote.fit_resample(X_train, y_train)

import lightgbm from lightgbm import LGBMClassifier lgbm_model = LGBMClassifier() lgbm_model.fit(X_train, y_train)

lgbm_model.score(X_test, y_test)

Classification report.

So as expected, one or two of these are very rarely recognised, and with macros between 0.53 and 0.64 the overall model is 'just about good enough with around half being junk content' which isn't what I would like to settle for in any case. However, the ISFP is the most correctly identified with a precision of 0.76, that isn't particularly being very well validated with its 0.52 recall so its f1 score represents a half-decent harmonic balance. The INTP has the best f1 score, followed by the INFP and the INFJ, all of which have f1 scores above 0.70. So this model will struggle to parse quite a few character types (especially those with low f1 scores as was to be completely expected from the EDA) and shouldn't be trusted completely, but it will do "ok" half of the time.

auc = metrics.roc_auc_score(y_test, lgbm_pred) fpr, tpr, thresholds = metrics.roc_curve(y_test, lgbm_pred) plt.plot(fpr, tpr, label='ROC curve (area = %0.3f)' % auc) plt.plot([0, 1], [0, 1], 'k--') # random predictions curve plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.0]) plt.xlabel('False Positive Rate or (1 - Specifity)') plt.ylabel('True Positive Rate or (Sensitivity)') plt.title('Receiver Operating Characteristic') plt.legend(loc="lower right") plt.show()

Predictions on text.

Making a prediction on a line of text using a quote from Nikola Tesla (INTJ). Whether this quote is in the data is unknown to me, if it is then this will be easy, and if it isn't in the data, I'll be surprised if the model gets it.

my_sample = ["I don’t care that they stole my ideas… I care that they don’t have any of their own."]

def get_personality_type(index): for key, value in type_dict.items(): if index == value: return key type_result = get_personality_type(lr_model.predict(vectorizer.transform(my_sample))[0]) print(f"The predicted personality type is: {type_result}. ")