Decoding Mental Health Discussions on Reddit: A Text Classification and Topic Modeling Analysis of "Dreaddit"

by Vikash Giritharan, an undergraduate student at the University of California, Berkeley. BA in Data Science, BS in Business Administration, and Certificate in Entrepreneurship & Technology.

Background

Mental health has received increased attention in recent years, particularly in relation to social media as a channel for talking about mental health issues. Platforms such as Reddit have become a focal point for individuals to share their experiences, seek support, and discuss various mental health issues. As a student researcher with a keen interest in mental health, I set out to analyze the relationship between the text of social media posts and the subreddit to which they belong, with a view to determining whether the text of a post can be used to predict the type of mental health issue the user is experiencing. By identifying trends in the way individuals discuss their mental health on social media, such as keywords, common phrases, and posting practices, this research aims to provide useful information both for medical professionals and for users seeking assistance. I used "Stress Analysis in Social Media; Dreaddit: A Reddit Dataset" from Kaggle for the analysis. Through this project, I aim to uncover larger trends in the use of social media as a tool for addressing mental health issues.

Installing & Importing Modules for Analysis

Breakdown of all Python modules and packages imported for the project, grouped by use: six text classifiers and one topic modeling tool.

# for all analysis
import pandas as pd
import numpy as np

# for neural network text-classifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# for gradient boosting text-classifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# for multinomialNB text-classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# for SVM text-classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report

# for logistic regression text-classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# for random forest text-classifier
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# for text modeling analysis
!pip install gensim==4.3.0
import gensim
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import pyLDAvis
import pyLDAvis.gensim_models

Merging Train and Test CSVs

Concatenating two datasets into one dataframe since the Kaggle Dreaddit includes two CSV files: the train and test sets. Then isolating the following columns: "subreddit", "text", "social_karma", "social_num_comments", "social_upvote_ratio", "sentiment".

df1 = pd.read_csv("dreaddit-train.csv")
df2 = pd.read_csv("dreaddit-test.csv")

# concatenate the dataframes vertically (row-wise);
# ignore_index=True avoids duplicate row labels from the two files
df = pd.concat([df1, df2], ignore_index=True)

# keep only the columns used in the analysis
df = df[["subreddit", "text", "social_karma", "social_num_comments", "social_upvote_ratio", "sentiment"]]
df

Text-Classifiers

A text-classifier was created in order to understand if the Dreaddit dataset can be deconstructed to look for trends between the "text" column and the "subreddit" column. More specifically, can the text from a post about someone's experience with mental health be easily classified by their subreddit/mental health concern?
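Every classifier in this section repeats the same four steps: split the data, vectorize the text with TF-IDF, fit a model, and score it on the held-out posts. As a rough sketch of how those steps could be wrapped once, the helper below uses a scikit-learn pipeline. The function name `evaluate_text_classifier` and the toy posts are my own illustrative assumptions, not part of the original notebook or the Dreaddit data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def evaluate_text_classifier(clf, texts, labels, test_size=0.2, random_state=42):
    """Split, vectorize with TF-IDF, fit clf, and return held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state
    )
    # The pipeline fits the vectorizer on the training split only
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Hypothetical toy posts, not taken from Dreaddit
toy_texts = [
    "flashbacks and nightmares after the accident",
    "nightmares again and flashbacks all week",
    "trauma memories and nightmares keep coming back",
    "flashbacks of the trauma every night",
    "the trauma still gives me nightmares",
    "constant worry and panic before exams",
    "panic attacks and worry at work",
    "racing heart and worry all day",
    "panic and a racing heart in crowds",
    "worry keeps me awake with a racing heart",
]
toy_labels = ["ptsd"] * 5 + ["anxiety"] * 5

toy_accuracy = evaluate_text_classifier(LogisticRegression(), toy_texts, toy_labels)
print("Toy accuracy:", toy_accuracy)
```

With the real data, the same call would take `df['text']` and `df['subreddit']`, and any of the six classifiers could be passed in as `clf`.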

Neural Network Text-Classifier

Starting off with a neural network text classifier in hopes of finding high prediction accuracy. A neural network text classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it can automatically learn complex patterns and relationships in the data, even in large and unstructured text data. It can effectively classify the subreddit based on the text, potentially providing more accurate predictions and understanding of the underlying themes and topics being discussed in the posts.

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['subreddit'], test_size=0.2, random_state=42)

# create a TfidfVectorizer object and fit it to the training data
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)

# transform the testing data using the TfidfVectorizer object
X_test = tfidf.transform(X_test)

# create a MLPClassifier object and fit it to the training data
clf = MLPClassifier()
clf.fit(X_train, y_train)

# predict the subreddit for the testing data
y_pred = clf.predict(X_test)

# print the accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

This model has an accuracy of 0.53 or 53%. While by conventional standards, this is quite ineffective, it may point to the fact that many mental health concerns can include very similar keywords about one's experience. To establish this, additional text classifiers will be made.
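One way to put an accuracy figure like this in context is to compare it with a trivial baseline that always predicts the most common subreddit. The sketch below shows the idea in plain Python. The labels are hypothetical, and `majority_baseline_accuracy` is a name I made up for illustration; the baseline for the real Dreaddit split would need to be computed from `y_train` and `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Hypothetical labels, not taken from Dreaddit
toy_train = ["ptsd", "ptsd", "anxiety", "relationships", "ptsd", "anxiety"]
toy_test = ["ptsd", "anxiety", "ptsd", "relationships"]

baseline_label, baseline_accuracy = majority_baseline_accuracy(toy_train, toy_test)
print(baseline_label, baseline_accuracy)
```

A classifier only adds value to the extent that it beats this kind of baseline on the same test split.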

Gradient Boosting Text-Classifier

The second text classifier was a gradient boosting classifier, tried in the hope of higher prediction accuracy. A gradient boosting text classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it builds trees sequentially, with each new tree fitted to the errors of the ensemble so far, so the predictions improve iteratively. It can work with high-dimensional data, although it can still overfit unless the learning rate, tree depth, and number of trees are kept in check. It can also combine different types of features, such as text, numeric, and categorical, once they are encoded numerically, so it could perform well on this dataset, potentially providing more accurate predictions and a better understanding of the underlying themes and topics being discussed in the posts.

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['subreddit'], test_size=0.2, random_state=42)

# create a TfidfVectorizer object and fit it to the training data
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)

# transform the testing data using the TfidfVectorizer object
X_test = tfidf.transform(X_test)

# create a GradientBoostingClassifier object and fit it to the training data
clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)

# predict the subreddit for the testing data
y_pred = clf.predict(X_test)

# print the accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Unfortunately, this model has an accuracy of 0.49 or 49%, lower than the previous neural network classifier.
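To see the "iterative improvement" of gradient boosting more concretely, scikit-learn's `staged_predict` yields the ensemble's predictions after each boosting stage. The sketch below tracks training accuracy stage by stage on a handful of made-up posts. The data and the choice of 20 estimators are assumptions for illustration only, and training accuracy on toy data says nothing about performance on Dreaddit.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Hypothetical toy posts, not taken from Dreaddit
toy_texts = [
    "flashbacks and nightmares after the accident",
    "nightmares again and flashbacks all week",
    "trauma memories keep coming back",
    "the trauma still gives me nightmares",
    "constant worry and panic before exams",
    "panic attacks and worry at work",
    "racing heart and worry all day",
    "panic and a racing heart in crowds",
]
toy_labels = ["ptsd"] * 4 + ["anxiety"] * 4

# Dense array is fine here because the toy vocabulary is tiny
X = TfidfVectorizer().fit_transform(toy_texts).toarray()

clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
clf.fit(X, toy_labels)

# One accuracy value per boosting stage (i.e. per added tree)
staged_accuracy = [
    accuracy_score(toy_labels, stage_pred) for stage_pred in clf.staged_predict(X)
]
print(staged_accuracy)
```

On the real data, the same loop run on `X_test` rather than the training set would show whether adding more trees is still helping or has started to overfit.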

MultinomialNB Text-Classifier

The third text classifier was through the multinomialNB classifier in order to find higher prediction accuracy. A Multinomial Naive Bayes text classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it is a simple and fast algorithm that can effectively handle large datasets and high-dimensional text data. It is based on the assumption of independence of the features, which is suitable for text classification tasks. It can handle features such as the frequency of words, making it well-suited for text data. This classifier can provide a good performance on this dataset, potentially providing more accurate predictions and understanding of the underlying themes and topics being discussed in the posts.

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['subreddit'], test_size=0.2, random_state=42)

# create a TfidfVectorizer object and fit it to the training data
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)

# transform the testing data using the TfidfVectorizer object
X_test = tfidf.transform(X_test)

# create a MultinomialNB object and fit it to the training data
clf = MultinomialNB()
clf.fit(X_train, y_train)

# predict the subreddit for the testing data
y_pred = clf.predict(X_test)

# print the accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Unfortunately, this model has an accuracy of 0.45 or 45%, lower than the previous neural network and gradient boosting classifiers.
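For readers curious about what the independence assumption means in practice, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It works on raw word counts rather than the TF-IDF weights used above, the function names are my own, and the posts are invented, so it illustrates the idea rather than reproducing scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab, alpha


def predict_multinomial_nb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_docs, word_counts, vocab, alpha = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in doc.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Each word contributes independently (the "naive" assumption)
            likelihood = (word_counts[label][token] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy posts, not taken from Dreaddit
toy_docs = [
    "nightmares flashbacks trauma",
    "trauma flashbacks again",
    "panic worry racing heart",
    "worry panic attack",
]
toy_labels = ["ptsd", "ptsd", "anxiety", "anxiety"]

nb_model = train_multinomial_nb(toy_docs, toy_labels)
prediction_a = predict_multinomial_nb(nb_model, "flashbacks and trauma")
prediction_b = predict_multinomial_nb(nb_model, "panic attack")
print(prediction_a, prediction_b)
```

Because each word is scored on its own, posts from different subreddits that share most of their vocabulary end up with very similar scores, which may be part of why this model struggles here.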

Support Vector Machines (SVM) Text-Classifier

The fourth text classifier was a support vector machine (SVM), tried in the hope of higher prediction accuracy. An SVM text classifier can be useful for identifying trends between the text and subreddit columns of the Dreaddit dataset because it copes well with high-dimensional, sparse text data. The LinearSVC used here fits a linear decision boundary; kernel SVMs can also handle data that is not linearly separable, but linear models are a common choice for TF-IDF features. It could perform well on this dataset, potentially providing more accurate predictions and a better understanding of the underlying themes and topics being discussed in the posts.

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['subreddit'], test_size=0.2, random_state=42)

# create a TfidfVectorizer object and fit it to the training data
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)

# transform the testing data using the TfidfVectorizer object
X_test = tfidf.transform(X_test)

# create a LinearSVC object and fit it to the training data
clf = LinearSVC()
clf.fit(X_train, y_train)

# predict the subreddit for the testing data
y_pred = clf.predict(X_test)

# print the accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fortunately, this model has an accuracy of 0.533 or 53.3%, higher than the previous neural network, gradient boosting, and multinomialNB classifiers. Overall, however, such models are not effective for the classification of text from posts on mental health.

Logistic Regression Text-Classifier

The fifth text classifier was a logistic regression classifier, tried in the hope of higher prediction accuracy. A logistic regression text classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it can handle large datasets and high-dimensional text data. It can also combine different types of features, such as text, numeric, and categorical, once they are encoded numerically. In addition, it provides a probability score for each class, which can be used to identify trends between the text and subreddit columns, potentially providing more accurate predictions and a better understanding of the underlying themes and topics being discussed in the posts.

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['subreddit'], test_size=0.2, random_state=42)

# create a TfidfVectorizer object and fit it to the training data
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)

# transform the testing data using the TfidfVectorizer object
X_test = tfidf.transform(X_test)

# create a LogisticRegression object and fit it to the training data
clf = LogisticRegression()
clf.fit(X_train, y_train)

# predict the subreddit for the testing data
y_pred = clf.predict(X_test)

# print the accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Fortunately, this model has an accuracy of 0.539 or 53.9%, higher than the previous neural network, gradient boosting, multinomialNB, and support vector machine classifiers. Overall, however, such models are not effective for the classification of text from posts on mental health.
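The per-class probability scores mentioned above are available through `predict_proba`. As a small sketch on invented posts, the snippet below ranks the subreddits for one new post by predicted probability. The texts, labels, and the example post are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy posts, not taken from Dreaddit
toy_texts = [
    "flashbacks and nightmares after the accident",
    "nightmares and flashbacks of the trauma",
    "panic attacks and constant worry",
    "worry and panic before every meeting",
    "my partner and I keep arguing",
    "arguing with my partner about money",
]
toy_labels = ["ptsd", "ptsd", "anxiety", "anxiety", "relationships", "relationships"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(toy_texts, toy_labels)

# A post that mixes vocabulary from two communities
new_post = "nightmares and worry keep me up"
probabilities = model.predict_proba([new_post])[0]

# Pair each class with its probability and sort from most to least likely
ranked = sorted(
    zip(model.classes_, probabilities), key=lambda pair: pair[1], reverse=True
)
for label, probability in ranked:
    print(label, round(float(probability), 3))
```

Looking at the second-ranked class for misclassified posts is one way to see which subreddits the model tends to confuse with each other.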

Random Forest Text-Classifier

The sixth and final text classifier was through the random forest classifier in order to find higher prediction accuracy. A Random Forest text classifier can be a valuable tool for uncovering trends between the text and subreddit columns in the Dreaddit dataset. Its ability to handle large and complex text data, as well as various types of features, makes it well-suited for identifying patterns and relationships within the dataset. The feature importance it provides can also aid in understanding the key factors that contribute to classifying the subreddit of a post. This can lead to a more accurate prediction of the underlying themes and topics being discussed in the posts, providing valuable insights about mental health discussions on social media.

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['subreddit'], test_size=0.2, random_state=42)

# create a TfidfVectorizer object and fit it to the training data
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)

# transform the testing data using the TfidfVectorizer object
X_test = tfidf.transform(X_test)

# create a RandomForestClassifier object and fit it to the training data
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# predict the subreddit for the testing data
y_pred = clf.predict(X_test)

# print the accuracy and classification report
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Unfortunately, this model has an accuracy of 0.46 or 46%, lower than the previous classifiers. All in all, it did not prove an effective classification process between the text and the corresponding subreddit.

Topic Modeling

Rather than using a text-classification model to predict which subreddit a post belongs to, this part of the analysis uses topic modeling, a technique for automatically identifying the underlying themes or topics present in a large collection of text data. It is a form of unsupervised learning where the goal is to discover the abstract "topics" that occur in a collection of documents. This allows us to understand the main topics that are being discussed in the text data and how they are distributed across the documents. It can be useful for many applications, such as text summarization, text classification, or gaining insight into the content of a dataset.

Topic Modeling Analysis

Isolating keywords from the subreddit posts through topic modeling analysis.

np.random.seed(2023)

# create a list of lists where each sublist contains the preprocessed words of a post
texts = [simple_preprocess(post, deacc=True) for post in df['text']]

# remove stopwords
texts = [[word for word in text if word not in STOPWORDS] for text in texts]

# build the dictionary and drop words that appear in fewer than 5 posts
# (filter_extremes also applies its default no_above and keep_n limits)
dictionary = corpora.Dictionary(texts)
dictionary.filter_extremes(no_below=5)

# create a bag-of-words representation of the texts
corpus = [dictionary.doc2bow(text) for text in texts]

# train the LDA model
lda_model = LdaModel(corpus=corpus, num_topics=10, id2word=dictionary)

# print the topics and their top words
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

# visualize the topics using pyLDAvis
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)
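The dictionary filtering and bag-of-words steps can be easy to misread, so here is a plain-Python sketch of what they do conceptually. It filters on document frequency (the number of posts a word appears in, not its total count) and then turns each post into (word id, count) pairs. The helper names `build_vocab` and `doc_to_bow` are mine, the posts are invented, and the sketch leaves out gensim's other default limits (`no_above` and `keep_n`).

```python
from collections import Counter


def build_vocab(tokenized_docs, no_below=2):
    """Keep tokens that occur in at least `no_below` documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    kept = sorted(tok for tok, df in doc_freq.items() if df >= no_below)
    return {tok: idx for idx, tok in enumerate(kept)}


def doc_to_bow(tokens, vocab):
    """Convert one tokenized document into sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in tokens if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())


# Hypothetical tokenized posts, not taken from Dreaddit
toy_docs = [
    ["nightmare", "nightmare", "nightmare", "therapy"],
    ["anxiety", "therapy"],
    ["anxiety", "sleep"],
]

vocab = build_vocab(toy_docs, no_below=2)
bows = [doc_to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(bows)
```

Here "nightmare" is dropped even though it occurs three times, because it appears in only one post, while "therapy" and "anxiety" survive because each appears in two posts.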

From the topic modeling technique applied to the Dreaddit dataset, it was possible to identify the main themes and topics discussed in the "text" column, which contains the posts from the subreddits 'ptsd' and 'depression'. The technique revealed that several different topics were being discussed within the posts, such as symptoms, treatment, and coping mechanisms for mental illnesses, personal experiences, and support seeking. Additionally, it was able to uncover the most probable groups of words present across the documents, such as "anxiety", "therapy", "medication", and "trauma", which can be useful for gaining insight into the content of the dataset and understanding the specific mental health issues that the users are discussing. This analysis also showed that there are hidden patterns or relationships in the data that may not have been immediately obvious, highlighting the complexity of mental health and the importance of considering multiple perspectives when studying it.

Conclusion

In conclusion, this analysis of the Dreaddit dataset using text classification and topic modeling techniques has shown that it is not easy to identify which subreddit a post belongs to based solely on the text of the post. The text classifiers had varying levels of accuracy, with some achieving higher performance than others. However, by using text modeling to gain a better understanding of the underlying themes and topics present in the text data, I was able to uncover a larger set of trends and patterns in the types of mental illnesses being discussed in the posts. This highlights the complexity and difficulty of identifying, diagnosing, and treating mental illnesses, even when individuals are sharing honest accounts of their experiences. The analysis also brings attention to the importance of considering multiple perspectives and utilizing different techniques when studying mental health, as it can provide a more comprehensive understanding of the issues at hand.
