Topic Modelling Demo
This demo uses nltk to remove stopwords and tokenize,
gensim for the LDA and bigram models, and
pyLDAvis for the interactive topic visualization.
Prepare the data
- Remove stopwords, emails, newline characters, apostrophes, etc.
- Tokenize (turn each document into a list of individual words)
- Group words that frequently occur together into phrases using bigram and trigram models
- Lemmatize the tokens (reduce each word to its base form)
Build the model
Create an id-to-word dictionary from the lemmatized data, use it to convert each document into a bag-of-words corpus, then build and train the LDA model with the gensim library.
Visualize the model
You can print the top keywords of each topic to identify the key themes, or explore the topics in an interactive graph using the pyLDAvis library.