Decoding Mental Health Discussions on Reddit: A Text Classification and Topic Modeling Analysis of "Dreaddit"
by Vikash Giritharan, an undergraduate student at the University of California, Berkeley. BA in Data Science, BS in Business Administration, and Certificate in Entrepreneurship & Technology.
Background
Mental health has received increased attention in recent years, particularly in relation to social media as a channel for talking about mental health issues. Platforms such as Reddit have become a focal point for individuals to share their experiences, seek support, and discuss various mental health issues. As a student researcher with a keen interest in the field, I set out to analyze the relationship between the text of a social media post and the subreddit it belongs to, in order to determine whether the text of a post can be used to predict the type of mental health issue the user is experiencing. By identifying trends in how individuals discuss their mental health online, such as keywords, common phrases, and posting practices, this research aims to provide useful information for both medical professionals and users seeking assistance. I used the "Stress Analysis in Social Media; Dreaddit: A Reddit Dataset" dataset from Kaggle for my analysis. Through this project, I hope to uncover broader trends in how social media is used as a tool for addressing mental health issues.
Installing & Importing Modules for Analysis
This step installs and imports all of the Python modules and packages used in the project: six text classifiers and one topic modeling tool.
Merging Train and Test CSVs
The Kaggle Dreaddit dataset ships as two CSV files, a train set and a test set, so I concatenated them into a single dataframe. I then isolated the following columns: "subreddit", "text", "social_karma", "social_num_comments", "social_upvote_ratio", "sentiment".
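As a minimal sketch of this merge step, assuming pandas: the two in-memory CSV snippets below are made-up stand-ins for the Kaggle train and test files, and the extra "id" column is a hypothetical placeholder for the columns that get dropped. Only the six column names come from the list above.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the two Kaggle CSV files (rows are invented).
train_csv = io.StringIO(
    "subreddit,text,social_karma,social_num_comments,social_upvote_ratio,sentiment,id\n"
    "ptsd,I keep having flashbacks,5,2,0.9,-0.3,1\n"
    "anxiety,My heart races before class,3,1,0.8,-0.1,2\n"
)
test_csv = io.StringIO(
    "subreddit,text,social_karma,social_num_comments,social_upvote_ratio,sentiment,id\n"
    "stress,Deadlines are piling up,7,4,1.0,-0.2,3\n"
)

columns = ["subreddit", "text", "social_karma", "social_num_comments",
           "social_upvote_ratio", "sentiment"]

# With the real dataset, pd.read_csv would be given the two file paths instead.
train_df = pd.read_csv(train_csv)
test_df = pd.read_csv(test_csv)

# Stack the two sets row-wise, renumber the index, and keep the columns of interest.
df = pd.concat([train_df, test_df], ignore_index=True)[columns]
print(df.shape)  # (3, 6)
```

Passing ignore_index=True avoids duplicate row labels, which would otherwise repeat because each file starts counting from zero.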
Text-Classifiers
I built text classifiers to find out whether the Dreaddit dataset shows a usable relationship between the "text" column and the "subreddit" column. More specifically, can the text of a post about someone's experience with mental health be reliably classified by its subreddit, and therefore by the mental health concern it addresses?
Neural Network Text-Classifier
I started with a neural network text classifier in the hope of achieving high prediction accuracy. A neural network can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it can automatically learn complex patterns and relationships, even in large and unstructured text data. In principle, this lets it classify the subreddit from the text and offer insight into the underlying themes and topics being discussed in the posts.
This model has an accuracy of 0.53 or 53%. While this is quite ineffective by conventional standards, it may point to the fact that posts about different mental health concerns often share very similar keywords. To test this, I built additional text classifiers.
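A minimal sketch of this kind of classifier, assuming TF-IDF features and scikit-learn's MLPClassifier. The toy posts, labels, split size, and hyperparameters are illustrative assumptions, not the setup behind the reported accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Invented toy posts standing in for the "text" and "subreddit" columns.
texts = [
    "flashbacks and nightmares after the accident",
    "nightmares keep me awake every night",
    "loud noises trigger flashbacks",
    "I avoid the street where the accident happened",
    "my heart races before every exam",
    "constant worry and racing thoughts",
    "panic before meetings, heart pounding",
    "I worry about everything all day",
]
labels = ["ptsd"] * 4 + ["anxiety"] * 4

# Hold out a quarter of the posts so accuracy is measured on unseen text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# TF-IDF turns each post into a sparse vector; a small MLP learns from those vectors.
model = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.2f}")
```

On a corpus this small the score is not meaningful; the point is only the shape of the workflow: split, vectorize, fit, and score on held-out posts.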
Gradient Boosting Text-Classifier
For the second text classifier, I used gradient boosting in search of higher prediction accuracy. A gradient boosting classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it builds its model in stages, with each new tree trained to correct the mistakes of the trees before it. It can cope with large, high-dimensional datasets, and with suitable regularization (for example a small learning rate or shallow trees) its tendency to overfit can be kept in check. It can also combine vectorized text with numeric and encoded categorical features, which makes it a reasonable candidate for this dataset and for understanding the underlying themes and topics being discussed in the posts.
Unfortunately, this model has an accuracy of 0.49 or 49%, lower than the previous neural network classifier.
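The stage-by-stage behaviour of gradient boosting can be sketched as follows, assuming scikit-learn's GradientBoostingClassifier on TF-IDF features. The toy posts and the choice of 20 stages are assumptions for illustration only.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Invented toy posts standing in for the "text" and "subreddit" columns.
texts = [
    "flashbacks and nightmares after the accident",
    "nightmares keep me awake every night",
    "loud noises trigger flashbacks",
    "my heart races before every exam",
    "constant worry and racing thoughts",
    "panic before meetings, heart pounding",
]
labels = ["ptsd", "ptsd", "ptsd", "anxiety", "anxiety", "anxiety"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Each of the 20 stages adds one small tree fitted to the current errors.
model = GradientBoostingClassifier(n_estimators=20, random_state=0)
model.fit(X, labels)

# staged_predict yields the ensemble's predictions after each boosting stage,
# so this list tracks training accuracy as trees are added.
stage_accuracy = [
    accuracy_score(labels, predicted) for predicted in model.staged_predict(X)
]
print(f"after first stage: {stage_accuracy[0]:.2f}")
print(f"after last stage:  {stage_accuracy[-1]:.2f}")
```

These are training accuracies, so they say nothing about generalization; on real data the same loop run over a held-out set is one way to pick the number of stages.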
MultinomialNB Text-Classifier
For the third text classifier, I used multinomial Naive Bayes (MultinomialNB) in search of higher prediction accuracy. A multinomial Naive Bayes classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it is a simple and fast algorithm that scales to large datasets and high-dimensional text data. It assumes that features are independent of one another given the class, a simplification that does not strictly hold for language but often works well for text classification. It operates directly on word frequencies, which makes it well suited to text data, and so it is a reasonable candidate for this dataset and for understanding the underlying themes and topics being discussed in the posts.
Unfortunately, this model has an accuracy of 0.45 or 45%, lower than the previous neural network and gradient boosting classifiers.
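A minimal sketch of a word-count Naive Bayes classifier, assuming scikit-learn's CountVectorizer and MultinomialNB with default settings. The toy posts are invented and chosen so that the arithmetic is easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy posts standing in for the "text" and "subreddit" columns.
texts = [
    "flashbacks and nightmares after the accident",
    "nightmares keep me awake every night",
    "loud noises trigger flashbacks",
    "my heart races before every exam",
    "constant worry and racing thoughts",
    "panic before meetings heart pounding",
]
labels = ["ptsd", "ptsd", "ptsd", "anxiety", "anxiety", "anxiety"]

# CountVectorizer produces raw word counts, the input MultinomialNB expects.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# "flashbacks" and "night" occur only in the ptsd posts, so that class wins.
predicted = model.predict(["flashbacks at night"])
print(predicted[0])  # ptsd
```

The prediction follows from the counts alone: both classes have the same number of posts and words here, so the class in which the query words were seen more often receives the higher probability.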
Support Vector Machines (SVM) Text-Classifier
For the fourth text classifier, I used a support vector machine (SVM) in search of higher prediction accuracy. An SVM text classifier can be useful for identifying trends between the text and subreddit columns of the Dreaddit dataset because it copes well with high-dimensional data such as vectorized text. With a non-linear kernel it can also separate classes that are not linearly separable, which makes it a common choice for text classification tasks such as this one and for understanding the underlying themes and topics being discussed in the posts.
Fortunately, this model has an accuracy of 0.533 or 53.3%, marginally higher than the previous neural network and higher than the gradient boosting and multinomialNB classifiers. Overall, however, such models are not effective for the classification of text from posts on mental health.
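The difference between a linear and a non-linear SVM comes down to the kernel argument. A minimal sketch, assuming scikit-learn's SVC on TF-IDF features; the toy posts are invented, and which kernel was used for the reported result is not stated, so both are shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented toy posts standing in for the "text" and "subreddit" columns.
texts = [
    "flashbacks and nightmares after the accident",
    "nightmares keep me awake every night",
    "loud noises trigger flashbacks",
    "my heart races before every exam",
    "constant worry and racing thoughts",
    "panic before meetings, heart pounding",
]
labels = ["ptsd", "ptsd", "ptsd", "anxiety", "anxiety", "anxiety"]
new_posts = ["nightmares about the accident", "worry and panic before my exam"]

# Fit the same pipeline twice, changing only the kernel.
predictions = {}
for kernel in ("linear", "rbf"):
    model = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
    model.fit(texts, labels)
    predictions[kernel] = model.predict(new_posts).tolist()

for kernel, predicted in predictions.items():
    print(kernel, predicted)
```

For sparse, high-dimensional text vectors a linear kernel is often a sensible first choice, with the RBF kernel worth trying when a linear boundary underfits.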
Logistic Regression Text-Classifier
For the fifth text classifier, I used logistic regression in search of higher prediction accuracy. A logistic regression classifier can be useful for finding trends between the text and subreddit columns of the Dreaddit dataset because it scales to large datasets and high-dimensional text data, and it can combine vectorized text with numeric and encoded categorical features. It also produces a probability score for each class, which shows how confidently a post is assigned to each subreddit and can help in understanding the underlying themes and topics being discussed in the posts.
Fortunately, this model has an accuracy of 0.539 or 53.9%, higher than the previous neural network, gradient boosting, multinomialNB, and support vector machine classifiers. Overall, however, such models are not effective for the classification of text from posts on mental health.
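The per-class probability scores mentioned above can be sketched as follows, assuming scikit-learn's LogisticRegression on TF-IDF features. The toy posts and the max_iter setting are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy posts standing in for the "text" and "subreddit" columns.
texts = [
    "flashbacks and nightmares after the accident",
    "nightmares keep me awake every night",
    "loud noises trigger flashbacks",
    "my heart races before every exam",
    "constant worry and racing thoughts",
    "panic before meetings, heart pounding",
]
labels = ["ptsd", "ptsd", "ptsd", "anxiety", "anxiety", "anxiety"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# predict_proba returns one probability per class, ordered as in model.classes_.
probabilities = model.predict_proba(["worry and panic before the exam"])[0]
for subreddit, probability in zip(model.classes_, probabilities):
    print(f"{subreddit}: {probability:.2f}")
```

Probabilities that sit close together for a given post are one sign that its wording does not clearly separate the subreddits.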
Random Forest Text-Classifier
For the sixth and final text classifier, I used a random forest in search of higher prediction accuracy. A random forest classifier can be a valuable tool for uncovering trends between the text and subreddit columns in the Dreaddit dataset. Its ability to handle large and complex text data, as well as various types of features, makes it well suited to identifying patterns and relationships within the dataset. The feature importances it provides can also aid in understanding which words contribute most to classifying the subreddit of a post. This can offer valuable insight into the underlying themes and topics in mental health discussions on social media.
Unfortunately, this model has an accuracy of 0.46 or 46%, lower than every previous classifier except multinomialNB (0.45). All in all, it did not prove effective at classifying posts into their corresponding subreddit.
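A minimal sketch of reading feature importances from a random forest, assuming scikit-learn's RandomForestClassifier on TF-IDF features. The toy posts and the number of trees are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy posts standing in for the "text" and "subreddit" columns.
texts = [
    "flashbacks and nightmares after the accident",
    "nightmares keep me awake every night",
    "loud noises trigger flashbacks",
    "my heart races before every exam",
    "constant worry and racing thoughts",
    "panic before meetings, heart pounding",
]
labels = ["ptsd", "ptsd", "ptsd", "anxiety", "anxiety", "anxiety"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X, labels)

# vocabulary_ maps each word to its column index; sorting by that index
# lines the words up with the entries of feature_importances_.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked = sorted(zip(forest.feature_importances_, words), reverse=True)
top_words = [word for _, word in ranked[:5]]
print(top_words)
```

The importances are normalized to sum to one, so each value reads as a word's share of the splitting work done across the forest.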
Topic Modeling
Rather than using a text-classification model to predict which subreddit a post belongs to, this part of the analysis uses topic modeling, a technique for automatically identifying the underlying themes or topics in a large collection of text data. It is a form of unsupervised learning whose goal is to discover the abstract "topics" that occur in a collection of documents. This allows us to understand the main topics being discussed in the text data and how they are distributed across the documents. It can be useful for many applications, such as text summarization, text classification, or gaining insight into the content of a dataset.
Topic Modeling Analysis
This step isolates keywords from the subreddit posts through topic modeling.
By applying topic modeling to the Dreaddit dataset, I was able to identify the main themes and topics discussed in the "text" column, which contains the posts from the subreddits 'ptsd' and 'depression'. The technique revealed several different topics within the posts, such as symptoms, treatment, and coping mechanisms for mental illnesses, personal experiences, and support seeking. It also surfaced the most probable words across the documents, such as "anxiety", "therapy", "medication" and "trauma", which give insight into the content of the dataset and the specific mental health issues that users are discussing. This analysis also showed that there are patterns or relationships in the data that are not immediately obvious, highlighting the complexity of mental health and the importance of considering multiple perspectives when studying it.
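The topic modeling tool is not named in the text, so the following is only a sketch under the assumption that Latent Dirichlet Allocation (LDA) from scikit-learn is used. The toy posts, the number of topics, and the number of keywords per topic are all invented for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy posts standing in for the "text" column.
posts = [
    "therapy and medication helped my anxiety",
    "my therapist changed my medication last month",
    "trauma flashbacks and nightmares every night",
    "nightmares about the trauma keep coming back",
    "anxiety medication makes me tired",
    "talking about trauma in therapy is hard",
]

# LDA works on raw word counts; common English stop words are dropped first.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Each row of components_ holds one topic's weight for every vocabulary word;
# the highest-weighted words serve as that topic's keywords.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_words = [
    [words[i] for i in topic.argsort()[::-1][:3]] for topic in lda.components_
]
for number, keywords in enumerate(top_words):
    print(f"topic {number}: {', '.join(keywords)}")
```

Because the method is unsupervised, the topics carry no labels of their own; naming a topic "treatment" or "symptoms" is an interpretation made by reading its keywords.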
Conclusion
In conclusion, this analysis of the Dreaddit dataset using text classification and topic modeling techniques has shown that it is not easy to identify which subreddit a post belongs to based solely on the text of the post. The six text classifiers reached accuracies between 45% and 53.9%, with logistic regression performing best. However, by using topic modeling to gain a better understanding of the underlying themes and topics present in the text data, I was able to uncover a larger set of trends and patterns in the types of mental illnesses being discussed in the posts. This highlights the complexity and difficulty of identifying, diagnosing, and treating mental illnesses, even when individuals are sharing honest accounts of their experiences. The analysis also brings attention to the importance of considering multiple perspectives and utilizing different techniques when studying mental health, as it can provide a more comprehensive understanding of the issues at hand.
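For reference, the accuracies reported in the classifier sections can be collected and ranked with a few lines of plain Python. The numbers are the ones stated in the text; the dictionary keys are just descriptive labels.

```python
# Accuracies as reported in the classifier sections of this article.
reported_accuracy = {
    "neural network": 0.53,
    "gradient boosting": 0.49,
    "multinomial naive Bayes": 0.45,
    "support vector machine": 0.533,
    "logistic regression": 0.539,
    "random forest": 0.46,
}

# Sort from most to least accurate.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranking:
    print(f"{name:<24} {accuracy:.1%}")

# Gap between the best and worst classifier.
spread = ranking[0][1] - ranking[-1][1]
print(f"spread: {spread:.1%}")  # spread: 8.9%
```

The spread of under nine percentage points across six quite different algorithms is consistent with the idea that the limiting factor is the overlap in vocabulary between subreddits rather than the choice of model.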