Topic Modelling Demo
Uses: nltk to remove stopwords and tokenize, gensim for the LDA and bigram models, and pyLDAvis for the interactive topic visualization.
Prepare libraries
Prepare data
Steps:
- Remove stopwords, emails, newline characters, apostrophes, etc.
- Tokenize (turn each document into a list of individual words)
- Join frequently co-occurring words into phrases with bigram and trigram models
- Lemmatize the tokens (reduce each word to its base form)
Build the model
First create an id-to-word dictionary from the lemmatized data, then convert each document into a bag-of-words vector to form the corpus. With the dictionary and corpus in place, train the LDA model from the gensim library.
Visualize the model
You can print each topic's top keywords to work out which theme the topic represents, or explore the topics in an interactive graph using the pyLDAvis library.