What moms really want

In the week before Australia's Mothers Day, I came across so many articles on what moms want that seemed to completely miss the point. As a mom, I had a strong feeling that I am not alone in wanting something different from a new kettle or a necklace. I decided to test my hypothesis using data from a Reddit post. This post discusses both the process and the results of my little NLP experiment.

Libraries

First, I imported some Python libraries I was going to need. I used Pandas, a data manipulation and analysis Python library, to structure the outputs of the classification model and save them as an .csv file.

import pandas as pd

Second, Praw - Python Reddit API Wrapper - helped me extract the data I needed from the relevant Reddit post.

import praw

Third, transformers, a HuggingFace library allowing to easily download and train state-of-the-art pre-trained language models. The pipeline wrapper providing a simple interface for inference, conveniently combining a tokeniser and a model.

from transformers import pipeline

Scraping Reddit

The quick start documentation is great and I follow it to set everything up (if you use it replace the capslock values below with your authorisation details as described in the docs).

reddit = praw.Reddit(client_id='CLIENT_ID', client_secret='CLIENT_SECRET', user_agent='USER_AGENT')

Next step was to scrape the post, that is to extract the relevant information from it for further analysis.

submission = reddit.submission(url="https://www.reddit.com/r/Parenting/comments/135yq5x/what_do_moms_really_want_for_mothers_day/")

comments = [] submission.comments.replace_more(limit=0) for comment in submission.comments.list(): comments.append(comment.body)

Let's check how many text samples it extracted.

len(comments)

Nice! 494 samples is a reasonably sized dataset for my little experiment.

Zero-shot classification

As I did not know in advance into which classes the responses can be classified, I decided to use zero-shot classification pipeline from Transformers. The model I chose - "facebook/bart-large-mnli" - is most commonly used for this task. It was trained for the Natural Language Inference (NLI) task - predicting if one text sequence (so-called premise) supports or contradicts another text sequence (hypothesis). Using a model trained on one task for another task is called transfer learning. The NLI model reframes each input sequence as a NLI premise and use creates hypotheses from the labels I provide to it to then check these hypotheses' relationship with each of the premises. The probabilities for entailment and contradiction are then converted into confidence (probability) scores for each of the labels.

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

How to choose labels? I did very basic Internet research and also drew from my own motherhood experience. What do people usually think of as appropriate Mothers Day gifts? What do moms really want? For the first experiment, I created a mix of both.

candidate_labels = ['clean house', 'home-cooked meal', 'restaurant', 'hotel', 'rest and sleep', 'flowers', 'jewellery', 'alcohol', 'spa, massage, pampering', 'time alone', 'gift', 'card', 'peace and quiet', 'no decision making']

To classify the data, I passed the last of comments and the list of labels to the classified along with multi_label parameter set to True. Each text sample may have more than one topic, and I want the classification results to reflect this fact.

Without using CPU, it took the model over 4 hours to finish the task.

results = classifier(comments, candidate_labels, multi_label=True)

Visualisation

One way to present the results of the experiment is creating a Pandas DataFrame.

multi_topic_prediction = pd.DataFrame(results)

To exclude low-probability predictions from the DataFrame, I set the probability threshold to 0.6 (this value is arbitrary). This means that the predictions in which the model was not so sure will not be shown in the final results. These low-probability predictions can then be used to add some extra labels to the label list for the next classification iteration (but this would require a manual review of the results).

threshold = 0.6

multi_topic_prediction = multi_topic_prediction.set_index('sequence')\ .apply(pd.Series.explode)\ .reset_index() multi_topic_prediction = multi_topic_prediction[multi_topic_prediction['scores'] >= threshold] multi_topic_predictioqn

Another visualisation method for the classification results is a bar chart. It is, of course, more interpretable than just a table:

Now I save the output as a .csv file and head to Tableau to experiment with other, more advanced visualisation methods.

Oone of the numerous ways to represent the outputs of the model is a treemap (you can view the visualisation here). A categorical palette of colours represents labels (classes), each colour assigned to a specific class. Each of the coloured rectangles contains multiple smaller nested rectangles, each representing a text sample. The samples with higher-score prediction probability are located at the top left and the prediction score discreases in the direction of the bottom right corner (with the nested rectangle size also reflecting the value of score).

Summary

multi_topic_prediction.to_csv('moms_want.csv')