
Text Mining

Lesson 4: Document Clustering

These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.

The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Chapter 1: Document Clustering

A common task in text mining is finding interesting groups of similar documents. That is, we would like to identify documents that are similar to each other.

Distances

Once the data is described with features (i.e. word counts), we can compute the similarity between documents. Perhaps you are familiar with the Euclidean distance which, in two dimensions, is the square root of the squared difference on the x axis plus the squared difference on the y axis. In general:

d\left(p, q\right) = \sqrt{\sum_{i=1}^{n} \left(q_{i} - p_{i}\right)^2}
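In code, the formula translates into a couple of lines of Python; the sketch below assumes the documents are already represented as numeric word-count vectors.

```python
import numpy as np

def euclidean_distance(p, q):
    # square root of the sum of squared differences across all features
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((q - p) ** 2))
```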

However, Euclidean distance has two disadvantages:

  • if the data are not normalized, features with a larger span dominate the result
  • in high-dimensional spaces, distances become less informative

The way to overcome this is to use a different distance metric, specifically the cosine distance. Cosine distance is normalized by nature, because it does not consider the magnitude of a document vector, only the direction in which it points.

Cosine distance considers the angle between document vectors. If two documents point in the same (or a similar) direction, they are considered close, no matter how many word occurrences they contain. Only the distribution of words matters.

The plot shows the vectors of three documents. Each vector corresponds to the number of times the words "fox" and "wolf" appear in the text. By Euclidean distance, the two most similar documents would be Doc2 and Doc3, even though Doc2 mentions wolves more frequently than foxes. If we take the angle between the vectors instead, Doc2 is most similar (closest) to Doc1, which makes sense, since they both talk predominantly about wolves. This is why we use the cosine distance.
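Here is a minimal sketch of this comparison in Python. The word counts below are made up for illustration (they are not the values from the plot), but they reproduce the same effect.

```python
import numpy as np

def euclidean(p, q):
    return np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float))

def cosine_distance(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # one minus the cosine of the angle between the two document vectors
    return 1 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

# hypothetical counts of ("fox", "wolf") in three documents
doc1 = [2, 10]  # a long document, predominantly about wolves
doc2 = [1, 2]   # a short document, also mostly about wolves
doc3 = [3, 1]   # a document mostly about foxes

print(euclidean(doc2, doc3), euclidean(doc1, doc2))              # Doc2 is closer to Doc3
print(cosine_distance(doc2, doc3), cosine_distance(doc1, doc2))  # but its angle is closer to Doc1
```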

Example

Now, let us go back to our Grimm's Tales and construct the following workflow:

You can try the same workflow on a different corpus, say bookexcerpt.tab, which contains excerpts from adult and children's books.

The Hierarchical Clustering widget visualizes the clustering in a dendrogram. Connect Corpus Viewer to Hierarchical Clustering and open both widgets. Now click on a cluster in the dendrogram and observe the documents from the selected cluster in Corpus Viewer.

Explore different clusters. Why are some Tales of Magic mixed with Animal Tales? What do they have in common?

Hierarchical Clustering builds a hierarchy of documents, and it is up to us to decide what is similar enough to belong to the same cluster. We can set the appropriate degree of similarity by dragging the vertical line left or right in the visualization.

We went with two groups, since the distance between the clusters increases at that particular cut-off. Compare this to the cut-off we would require for three, four or five clusters. The clusters we get also correspond nicely with the designated Aarne-Thompson type (ATU Topic).
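Outside Orange, the same two-cluster cut can be sketched with SciPy. The tiny term-count matrix below is invented for illustration; in practice it would come from the Bag of Words step.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# hypothetical bag-of-words matrix: one row per document, one column per word
X = np.array([[2, 10, 0],
              [1, 3, 0],
              [3, 1, 5],
              [4, 0, 6]], dtype=float)

distances = pdist(X, metric="cosine")     # pairwise cosine distances between documents
Z = linkage(distances, method="average")  # build the hierarchy of documents

# cut the dendrogram so that exactly two clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster membership for each document
```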

But how close are the animal tales from the third cluster to those from the last cluster? Let us see the documents on a plane, where similar documents lie close to each other. This visualization is called Multidimensional Scaling, or MDS for short.
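A rough Python equivalent of the MDS step, using the same made-up bag-of-words matrix as in the clustering sketch above, could look like this:

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

# the same hypothetical bag-of-words matrix as in the clustering sketch
X = np.array([[2, 10, 0],
              [1, 3, 0],
              [3, 1, 5],
              [4, 0, 6]], dtype=float)

# cosine distances between documents, passed to MDS as precomputed dissimilarities
D = pairwise_distances(X, metric="cosine")

# place documents on a 2D plane so that similar documents end up close together
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coordinates = mds.fit_transform(D)
print(coordinates)
```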

Tales of Magic form one group and Animal Tales another - just as we expected. Interestingly, Tales of Magic seem to be more similar to each other than Animal Tales are (they are connected). Inspect similar tales by selecting them in the visualization and reading them in Corpus Viewer.

Chapter 2: Comparison of clustering methods

There are several clustering methods one can use. Hierarchical clustering, k-Means, DBSCAN … It is easy to get lost in the abundance of options, so in this chapter, we will look at the specifics of each approach.

Hierarchical clustering is a simple, easily interpretable method. It is suitable for smaller datasets that can be inspected in a dendrogram, which shows the hierarchy of clusters. It does not require a predefined number of clusters; instead, one can determine the number of clusters by selecting a suitable cut-off in the dendrogram.

k-Means is a fast and efficient method, suitable for larger datasets. It is biased towards spherical, similar-sized clusters. k-Means requires setting the number of clusters in advance; however, the silhouette score can help determine the optimal number of clusters.
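As a small illustration of choosing k with the silhouette score, here is a sketch on synthetic data standing in for a document-term matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data standing in for a bag-of-words matrix
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# try several cluster counts; the k with the highest silhouette score is preferred
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```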

DBSCAN is a clustering method that finds clusters based on densely populated areas. Where the data space is sparsely populated, it creates a cut-off. DBSCAN can also decide not to place a data point in a cluster — it considers it an outlier.
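A minimal DBSCAN sketch on the same kind of synthetic data; the eps value is arbitrary and would need tuning on real documents.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# eps sets how close points must be to count as a densely populated area
labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(X)

# points labelled -1 were not placed in any cluster; DBSCAN treats them as outliers
print("clusters:", sorted(set(labels) - {-1}), "outliers:", int((labels == -1).sum()))
```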

Gaussian Mixture Models (GMM) are a soft clustering method, which means each data point belongs to each cluster with a certain probability. The method is similar to k-Means; however, GMM is not biased towards spherical clusters, as it can adjust its parameters accordingly. It fits several Gaussian distributions over the data space. The Bayesian Information Criterion (BIC) is used to find a model that is simple enough yet fits the data well.
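A sketch of fitting GMMs with different numbers of components and comparing them by BIC, again on synthetic stand-in data:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# fit mixtures of increasing size; the model with the lowest BIC balances fit and simplicity
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gmm.bic(X), 1))

# soft clustering: each row gives the probability of the point belonging to each component
best = GaussianMixture(n_components=3, random_state=0).fit(X)
print(best.predict_proba(X)[:3].round(3))
```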

In summary

Chapter 3: Word Enrichment

We can explore clusters in several ways. Typically, we'd like to know what the characteristic words of a given cluster are. This can be achieved with the Word Enrichment widget.

Select a cluster of fairy tales in the Hierarchical Clustering widget. I have selected a large cluster of Animal Tales, which outputs 16 documents. Now, I'll connect these documents to Word Enrichment. The widget requires two inputs — a data subset and the entire data. Hence, I'll connect the entire data from the Bag of Words to Word Enrichment, too.

Note the connection types. Hierarchical clustering must input Selected Data into Word Enrichment, while Bag of Words must input Data.

Word Enrichment compares a subset of documents against the entire corpus and finds statistically significant words for the selected subset.

To display any results, I need to change the settings in the widget: turn off the FDR filter and turn on the p-value filter. Why? Because FDR correction will typically filter out everything on small datasets, while the p-value filter still retains only the significant words.
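Under the hood, this kind of word enrichment is commonly scored with a hypergeometric test. The sketch below is one possible formulation, not necessarily the exact computation the widget performs, and the example counts are made up.

```python
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

def word_enrichment(subset_counts, corpus_counts, subset_total, corpus_total):
    """P-value that a word appears in the subset at least as often as observed,
    under a hypergeometric null, together with its FDR-corrected value."""
    words = list(subset_counts)
    p_values = [
        hypergeom.sf(subset_counts[w] - 1,  # at least this many occurrences in the subset
                     corpus_total,          # total word occurrences in the corpus
                     corpus_counts[w],      # occurrences of the word in the whole corpus
                     subset_total)          # total word occurrences in the subset
        for w in words
    ]
    # FDR (Benjamini-Hochberg) correction for testing many words at once
    _, fdr, _, _ = multipletests(p_values, method="fdr_bh")
    return {w: (p, q) for w, p, q in zip(words, p_values, fdr)}

# toy example: "fox" appears 30 times in a 500-word subset and 40 times in a 5000-word corpus
print(word_enrichment({"fox": 30}, {"fox": 40}, 500, 5000))
```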

The most significant word for the cluster is fox. This doesn't surprise us — foxes are in the titles of many tales from the cluster. Moreover, these are animal tales, so an animal being the most significant word is expected.

Clicking on different clusters reveals their characteristics. For example, the top cluster of Tales of Magic is defined by the word king.