Text Mining
Lesson 5: Topic Modeling
These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.
The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.
The material is offered under the Creative Commons CC BY-NC-ND licence.
Chapter 1: Latent Dirichlet Allocation
Another way to organize unlabelled documents is with topic modelling, a technique that aims to discover latent topics in texts. Unlike clustering, topic modelling looks at word distributions and infers topics from them. The more frequently words appear together, the more likely they are to form a topic.
One of the most popular topic modelling techniques is LDA (short for Latent Dirichlet Allocation, not to be confused with Linear Discriminant Analysis). LDA is a generative model that starts with randomly assigned topics, which are iteratively updated based on the probabilities of words in a topic.
The procedure is as follows:
- Each word in each document is randomly assigned to a topic.
- Compute the word-topic matrix.
- Compute the document-topic matrix.
- Repeat until the algorithm converges:
  - Select a word in a document and set its topic assignment to unknown.
  - Update the document-topic matrix and the word-topic matrix.
  - Use Dirichlet priors to compute the new probability of the word belonging to each topic (the probability is a product of the document's topic probability and the word's topic probability).
  - Update the document-topic matrix and the word-topic matrix.
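The sampling loop above can be sketched in plain Python. The toy corpus, the choice of two topics, and the hyperparameters `alpha` and `beta` are all made up for illustration; real implementations (e.g., gensim's `LdaModel`) are far more efficient.

```python
# Toy collapsed Gibbs sampler illustrating the LDA procedure above.
import random
from collections import defaultdict

random.seed(0)

docs = [
    ["gene", "mutation", "dna", "gene"],
    ["lung", "cough", "breath", "lung"],
    ["dna", "mutation", "gene", "dna"],
    ["breath", "cough", "lung", "cough"],
]
K = 2                      # assumed number of topics
alpha, beta = 0.1, 0.01    # Dirichlet priors (made-up values)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Step 1: randomly assign each word occurrence to a topic.
z = [[random.randrange(K) for _ in d] for d in docs]

# Steps 2-3: build the word-topic and document-topic count matrices.
word_topic = defaultdict(int)   # (word, topic) -> count
doc_topic = defaultdict(int)    # (doc index, topic) -> count
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        word_topic[w, t] += 1
        doc_topic[d, t] += 1
        topic_total[t] += 1

# Step 4: repeatedly resample each word's topic from the product of its
# document-topic and word-topic probabilities, smoothed by the priors.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # "Set its topic assignment to unknown": remove its counts.
            word_topic[w, t] -= 1
            doc_topic[d, t] -= 1
            topic_total[t] -= 1
            # New probability ∝ P(topic | document) * P(word | topic).
            weights = [
                (doc_topic[d, k] + alpha)
                * (word_topic[w, k] + beta) / (topic_total[k] + V * beta)
                for k in range(K)
            ]
            t = random.choices(range(K), weights=weights)[0]
            z[d][i] = t
            word_topic[w, t] += 1
            doc_topic[d, t] += 1
            topic_total[t] += 1

# Most probable words per topic.
for k in range(K):
    top = sorted(vocab, key=lambda w: word_topic[w, k], reverse=True)[:3]
    print(f"topic {k}: {top}")
```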
We can observe an example from the PMC-Patients-children data set. After preprocessing and the preparation of a TF-IDF weighted bag-of-words matrix, we can run LDA topic modelling. Say we go with 10 topics.
LDA returns a few of the most probable words for each topic. The next step is to name the topics. We can do that by reading the most probable words, or we can explore each topic further with the LDAvis visualization. This visualization helps determine the characteristic words of each topic by showing the relationship between a word's probability within the topic and its frequency in the corpus overall.
In this way, we can name the uncovered topics, for example:
- Topic 1: neural issues
- Topic 2: dermatological issues
- Topic 3: developmental issues
- Topic 4: genetic mutations
- Topic 5: cancerous symptoms
- Topic 6: urine data
- Topic 7: respiratory issues
- Topic 8: issues with circulatory system
- Topic 9: intravenous treatment
- Topic 10: blood data
We can observe marginal topic distributions in a 2D projection computed with multidimensional scaling (MDS). This shows the relationship between topics (how similar their word probabilities are) and how frequent they are in the corpus.
Estimating the number of topics
A frequent question concerning LDA is: how can we know the right number of topics? Surely, there must be a method that aids in this decision?
There are (at least) two, but they have their own set of shortcomings. One is called log perplexity. It is a measure of how well the model estimated the probability distribution over words compared to the actual probability distribution. In other words, log perplexity measures how well the model predicts the test set. If the model assigns high probability to words that actually occur frequently in the test set, the log perplexity will be lower.
The other method is topic coherence. It aims to provide a quantitative estimate of how interpretable the topics are. The method is based on the pointwise mutual information score. In other words, topic coherence measures how related the words are within a topic. It calculates the similarity between pairs of words and then aggregates these similarities to get an overall coherence score for the topic. Higher coherence values indicate a better model, meaning the words within each topic are more semantically related.
For both methods, one would typically train several topic models (with topic numbers from, say, 5 to 15) and select the best performing number of topics. However, log perplexity is sensitive to corpus size, meaning larger corpora will have lower perplexity than small ones. Topic coherence is subjective and domain-dependent. Both are sensitive to noisy data (e.g., tweets) and are difficult to interpret. With this in mind, the two scores can guide the decision on the number of topics, but the interpretability of topics remains paramount.
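Topic coherence can be illustrated with a small sketch of the UMass variant, which scores word pairs by how often they co-occur in documents. The toy corpus and the two candidate topics are invented for the example; real toolkits (e.g., gensim's `CoherenceModel`) offer several coherence measures over full corpora.

```python
# Minimal UMass-style topic coherence from document co-occurrence counts.
import math
from itertools import combinations

# Each toy "document" is its set of words.
docs = [
    {"gene", "mutation", "dna"},
    {"gene", "dna", "protein"},
    {"lung", "cough", "breath"},
    {"lung", "breath", "gene"},
]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(word in d for d in docs)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(w1 in d and w2 in d for d in docs)

def umass_coherence(topic_words):
    # Sum log((D(wi, wj) + 1) / D(wj)) over word pairs in ranked order.
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += math.log((co_doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score

# A semantically tight topic scores higher than a mixed one.
coherent = umass_coherence(["gene", "dna", "mutation"])
mixed = umass_coherence(["gene", "cough", "mutation"])
print(coherent, mixed)
```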
Chapter 2: BERTopic
BERTopic is a modern transformer-based topic modelling technique that leverages word embeddings to determine topics. How it works:
- It creates document embeddings with a transformer model.
- It reduces the dimensionality of the embeddings with UMAP.
- It clusters the reduced embeddings with HDBSCAN to form topic clusters.
- It uses a mixture of silhouette score and topic coherence to determine the number of clusters.
- It assigns topics to documents and keywords to topics.
- It runs a refinement procedure to prevent topic overlap and increase topic diversity.
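The keyword step can be illustrated with a stdlib-only sketch of class-based TF-IDF (c-TF-IDF), the weighting scheme BERTopic uses to pick topic keywords. The two clusters here are assigned by hand, standing in for the embedding, UMAP, and HDBSCAN stages, and the documents are made up; the formula is a simplified variant for illustration.

```python
# Simplified c-TF-IDF: treat each cluster as one "class document",
# then weight terms by in-class frequency against cross-class frequency.
import math
from collections import Counter

# Pretend these clusters came out of embedding + UMAP + HDBSCAN.
clusters = {
    0: ["cough breath lung", "lung cough wheeze"],
    1: ["gene dna mutation", "mutation gene protein"],
}

# Concatenate each cluster's documents and count terms per class.
class_counts = {c: Counter(" ".join(ds).split()) for c, ds in clusters.items()}
avg_len = sum(sum(cnt.values()) for cnt in class_counts.values()) / len(class_counts)

def topic_keywords(c, top_n=3):
    counts = class_counts[c]
    total = sum(counts.values())
    scores = {}
    for term, tf in counts.items():
        # Frequency of the term across all classes.
        f_t = sum(cnt.get(term, 0) for cnt in class_counts.values())
        scores[term] = (tf / total) * math.log(1 + avg_len / f_t)
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]]

for c in clusters:
    print(f"topic {c}: {topic_keywords(c)}")
```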
BERTopic considers context and is thus more suitable for context-dependent tasks. It can also handle short texts, unlike LDA. The user does not have to specify the number of topics in advance, which is another advantage over LDA. However, the model is fairly slow to compute. BERTopic achieves marginally better quantitative performance, while LDA aligns much better with human interpretation.
Chapter 3: Topic Modeling Comparison
| Aspect | LDA | BERTopic |
|---|---|---|
| input | bag-of-words | word embeddings |
| preprocessing | important | not necessary |
| procedure | iterative topic assignment | clustering-based |
| number of topics | required | not required |
| topics formed on | word co-occurrence | context + similarity |
| text type | works well on longer texts | can handle short texts, too |
| scalability | good | poor |
In summary, LDA is a traditional statistical method for topic modeling, best suited for simpler, less context-dependent tasks, and is relatively straightforward to implement. BERTopic is a modern approach that leverages advanced transformer models to capture deep semantic meanings and is more powerful for complex and dynamic text data but at the cost of higher computational requirements.
Chapter 4: Other Approaches
Besides LDA and BERTopic, there are a few other topic modeling techniques.
Non-Negative Matrix Factorization (NMF)
NMF is a matrix factorization technique that approximates the document-term matrix by two lower-dimensional non-negative matrices. This factorization leads to an easy interpretation where one matrix represents the topics and the other represents the topic distributions within documents.
Latent Semantic Analysis (LSA)
LSA, also known as Latent Semantic Indexing (LSI), is a technique based on singular value decomposition (SVD) of the term-document matrix. It reduces the dimensionality of the matrix and captures the underlying structure in the data, which can be interpreted as topics.
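LSA can be sketched with scikit-learn's `TruncatedSVD`; the toy corpus and the choice of two components are invented for illustration.

```python
# LSA: truncated SVD of the TF-IDF matrix projects documents into a
# low-dimensional latent space whose axes can be read as topics.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "gene dna mutation gene",
    "lung cough breath lung",
    "dna mutation gene dna",
    "breath cough lung cough",
]
X = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_topics = lsa.fit_transform(X)     # documents in the 2-D latent space
print(X_topics.shape)
```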
Hierarchical Dirichlet Process (HDP)
HDP is a non-parametric Bayesian approach to topic modeling that allows for an unknown number of topics. It is an extension of the Dirichlet Process and can infer the number of topics from the data.
Top2Vec
Top2Vec combines topic modeling and word embeddings. It uses semantic similarity to group documents into topics and finds topics in a vector space, often yielding more coherent topics compared to traditional methods.
Special techniques
Dynamic Topic Models (DTM)
DTM is useful for analyzing the evolution of topics over time. It extends LDA to model how topics change in a time series of documents.
Structural Topic Model (STM)
STM allows incorporating document-level metadata (e.g., author, date) into the topic modeling process. This helps in understanding how external variables influence topic distributions.