Clustering

Extra material about clustering methods

The additional material this time focuses primarily on the background of the algorithms. It is not necessary to read it for conducting the lessons, though it certainly won’t hurt. The material also explains in more detail the workings of the Orange building blocks related to clustering. It is not essential to master this down to the last detail, as the lesson descriptions already include working workflows.

Comments on the lessons are brief this time, as there is not much more to add to the descriptions available on the Pumice.si website.


This material is being developed as part of the DALI4US project.

Chapter 1: Introduction

Predictive models, which we discussed in the previous volume, are used when dealing with data that belongs to different groups (which we called classes, types, or categories—in practice, these were professions of dwarves, types of quadrilaterals, or animals). Our task was to discover rules that allow us to determine, based on known characteristics, which group an unknown object belongs to.

Now we face a different problem. The groups are not known—we only have the characteristics of the objects. Our task is to find groups of similar items. This process is called clustering.

Learning predictive models comes naturally to us; it is so natural, in fact, that we easily overlook it. We don't just learn to predict the weather; we also learn that hitting a finger hurts, something even babies pick up quickly.

In our introduction to predictive models, we told stories about farmers from Vipava, who, for generations, observed the winds, the cloud caps over Nanos, and the temperatures at St. Pancras to predict the weather. Now, let’s look at things from another angle: our ability to perceive clusters is so innate that we cannot even avoid it. Gestalt psychologists have thoroughly studied when and how we recognize clusters. There are several principles of grouping, but the one most relevant here is proximity: we tend to group things that are close together. However, in this case, “close together” won’t refer to physical distance but rather to similarity—two things will be considered close if they are similar.

In the image on the left, we all see four groups. No computer is needed for this. We naturally group things together. A simple example is found in the lesson Hierarchical Clustering, where it’s easy to spot three or four groups of students based on their math skills and soccer abilities.

Our ability to recognize groups fails when the objects we want to cluster are described by more than two variables—since we can’t visualize them. Additionally, we are limited in how we perceive distance; we only see “straight-line distance,” known in mathematics as Euclidean distance, calculated using the Pythagorean theorem. However, Euclidean distance is often not the best measure of similarity.

In such cases, we must rely on computers.

Like prediction, clustering is one of the fundamental tasks in data analysis, and there are many clustering methods. Here, we will explore two: hierarchical clustering and k-means clustering.

Chapter 2: Hierarchical Clustering

The hierarchical clustering algorithm is straightforward.

  1. Initially, each item is its own group.
  2. Merge the two closest groups.
  3. Repeat Step 2 until there is only a single group left.

That's it. :) You can find a more detailed description in our lesson Hierarchical Clustering.
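For readers who want to peek behind the curtain, the whole procedure can be reproduced in a few lines of Python. Below is a minimal sketch using SciPy; the names and the maths/football scores are invented for illustration and are not the data from the lesson.

```python
# A minimal sketch of hierarchical clustering with SciPy; the names and
# scores (maths, football) are invented for illustration.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

names = ["Andrew", "Benjamin", "Cecilia", "Daniel", "Emma", "Fanny"]
scores = np.array([
    [9, 2],   # Andrew:   maths 9, football 2
    [8, 3],   # Benjamin
    [2, 9],   # Cecilia
    [3, 8],   # Daniel
    [9, 1],   # Emma
    [2, 7],   # Fanny
])

# Every student starts as their own group; linkage() then repeatedly merges
# the two closest groups until only one is left (steps 1-3 above).
merges = linkage(scores, method="ward")

dendrogram(merges, labels=names)
plt.show()
```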

Dendrogram

At first glance, the result of hierarchical clustering is a single large group. That would clearly be pointless.

As the clustering progresses, the algorithm records which two groups were merged at each step and how far apart they were. This record is displayed in a dendrogram.

δένδρον (déndron) is a Greek word for tree, and γράμμα (grámma) means writing or letter. So the word dendrogram can be translated as a tree diagram or a graphical representation of a tree.

The tree has been created from right to left. From it, we can see the order in which the groups were clustered together: Benjamin and Hannah were the most similar to each other, followed by Ginny and Cecilia, then Andrew and Emma, and so on. The lengths of the lines represent the distances between the groups that were merged.

The dendrogram helps us decide how many groups to create. While the process always runs to the end, we can decide afterwards that we want to keep, for example, three groups. We "cut" the dendrogram at that point and read off the members of each group. In the dendrogram above, Henry, Ivy, Cecilia, and Ginny form the first group, Joseph, Daniel, and Fanny the second, and Benjamin, Hannah, Andrew, and Emma the third.

We could also decide on just two groups; in that case, Joseph, Daniel, Fanny, Benjamin, Hannah, Andrew, and Emma would all be in the same group. Or we could form four groups; in that case the first group would be split in two: Henry would be separated from Ivy, Cecilia, and Ginny. Actually, looking at the image above, which shows how the students are distributed in space, three or four groups would indeed make sense, but not two.

The numbers — 35, 50, 110 — on their own don't tell us anything. What we are observing in the dendrogram is the point where the jump occurs and the lines suddenly become longer.

However, in practice, we don't really have the luxury of looking at a chart, since the data can be multidimensional; perhaps we also want to take into account the students' knowledge of English and history. How do we decide on the number of groups in such cases? With the help of the dendrogram. Typically, groups are similar to each other at the start, but then the distances suddenly increase and the lines in the dendrogram stretch out. That is where we cut the dendrogram. In the seventh step of the process (do we know why the seventh? do we know how to count the steps?) we merge the cluster containing Benjamin and Hannah with the one containing Andrew and Emma; the distance between those groups was a little over 35 (see the axis above and below the dendrogram). In the eighth step, we connected Henry with Ivy, Cecilia, and Ginny; the distance between those groups was 50. In comparison, the distance between the closest two of the remaining three groups is almost 110, which is considerably more than 50.

Distances

How do we calculate distances?

We measure them on a graph. Or, if we have the coordinates, we simply use Pythagoras' theorem, right?

Let's first remember: there can be many dimensions (variables). We can, for instance, observe students' grades in all school subjects. Or similarities between countries based on socioeconomic data. We can also observe similarities between animals based on their characteristics. When we were playing with building a recommender system we were interested in how similar the students were based on their selection of favourite cartoons. How do we measure distance in such cases?

Obviously, we measure distances between items based on their characteristics. For a start, let's imagine that all the data is numerical. The difference between two items in a single characteristic is simply the difference between the numbers that describe it. And the total difference between two items is then some sort of sum of those differences, right?

Well, yes, roughly... There are many different definitions of distances. Here, we will look at four that will probably come in handy in the classroom.

Euclidean and Manhattan distance

Euclidean and Manhattan distance are the most natural, but there are a couple of details to get right.

The scales of variables must be comparable!

The first detail is different scales. Take two students, for example. The first is 1.85 m tall and weighs 63 kilos, the second is 1.60 m tall and weighs 65 kilos. How much do they differ? Their height difference is 0.25 and their weight difference is 2. Does that mean they are almost the same height but very different in weight? That their weight difference is 8 times bigger than their height difference? Clearly not: the two numbers are on completely different scales. To be able to somehow "add up" differences that reflect completely different characteristics (weight, height, hair length, number of siblings, grade in biology), all the variables need to be put on the same scale. If we subtract the weight of the lightest student from each student's weight and then divide by the difference in weight between the lightest and the heaviest student, the lightest student will get the value 0 and the heaviest the value 1. But that is just one type of normalisation; a longer list can be found on Wikipedia.
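Here is a small Python sketch of this min-max normalisation, with invented weights and heights:

```python
# Min-max normalisation: the smallest value becomes 0, the largest 1.
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

weights = [63, 65, 52, 80, 71]            # kilograms (invented)
heights = [1.85, 1.60, 1.55, 1.92, 1.70]  # metres (invented)

print(min_max(weights))   # the lightest student gets 0, the heaviest 1
print(min_max(heights))   # both variables now live on the same 0-to-1 scale
```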

We should normalise values when they are measured on different scales, such as in the example above. If the scales are the same, normalisation isn't necessary or even desired. If, for example, we compared high school students based on their grades in different subjects and we believe that the difference between 3 and 5, i.e. 2, means the same regardless of whether we are talking about PE or Maths (where grades are typically lower), then normalisation isn't necessary.

Total difference is not necessarily simply a sum of individual differences.

The second detail we should consider is what exactly we are adding. We can add up the differences. More precisely, the absolute values of the differences: in the above example the raw differences are 0.25 and -2, because one student is taller while the other is heavier, so we add 0.25 and 2. The difference between the two students (ignoring normalisation) is therefore 2.25.

Alternatively, we can add the squares of the differences and then take the square root of the sum. So, $0.25^2 + 2^2 = 4.0625$, and the square root of that is 2.016.

We call the first variant Manhattan distance and the second one Euclidean distance. The first one was named after Manhattan. If we stand at the intersection of 10th Street and 11th Avenue, and we want to get to 7th Street and 15th Avenue, we will need to move 3 streets and 4 avenues. Since the grid is rectangular (and, let's say, even square-shaped), the distance we will need to walk is 3+4=7 blocks, in whichever order.

The second type of distance was named after Euclid, but it could have also been called "Pythagoras distance". If we have a helicopter available for our trip from (10th, 11th) to (7th, 15th), Pythagoras is the one who can tell us the distance we need to fly: it's the square root of $3^2 + 4^2$, which, what a coincidence, is exactly 5.

Both definitions make sense; each has its advantages and downsides. And both can be calculated with any number of dimensions (ahem, variables). If there are two, like in our case, we add two (squared) differences and take the square root of the sum. If there are a hundred, we simply add up a hundred (squared) differences and, for the Euclidean distance, take the square root of the sum, just like before.
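Here is a small Python sketch of both calculations for the two students from the example above, ignoring normalisation:

```python
# Manhattan and Euclidean distance between the two students from the text,
# ignoring normalisation.
import math

student_a = [1.85, 63]   # height in metres, weight in kilograms
student_b = [1.60, 65]

manhattan = sum(abs(a - b) for a, b in zip(student_a, student_b))
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(student_a, student_b)))

print(manhattan)   # 2.25 (up to floating-point rounding)
print(euclidean)   # about 2.016
```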

But what if our variables aren't numbers? In that case, we add 1 to the distance when the values are different, and 0 if they are the same. If, in the case of our two students above, we also consider eye colour, we will add 0 if both students are blue-eyed, brown-eyed, or green-eyed, and 1 if, say, one of them has green eyes and the other brown.

Cosine distance

Let's say we are comparing towns according to their livestock profile.

                  Cows   Pigs   Sheep
Littlewick          40     20       5
Sprockleton        400    200      50
Upper Nibsworth    150    200     100

Which two towns are most similar to each other? That depends on how you look at it, but we would dare to claim that it's Littlewick and Sprockleton. Littlewick, as the name suggests, is a rather small town; in fact, we know it is ten times smaller than Sprockleton. And, who would have thought: it also has exactly ten times less livestock. But the livestock profiles of the two towns are exactly the same: half as many pigs as cows and a quarter as many sheep as pigs. In Upper Nibsworth, on the other hand, they mostly raise pigs.

Geometrically speaking, Littlewick and Sprockleton lie "in the same direction"; except that Sprockleton is ten times farther away. The angle between Littlewick and Sprockleton is 0.

The cosine distance imagines the data as two vectors in space, and the "distance" between them is based on the cosine of the angle between them (strictly speaking, it is one minus the cosine, so that two vectors pointing in the same direction are at distance 0). Take a look at the image on the left: the distance corresponds to the angle, that is, the length of the arc; and since we are calculating the cosine of the angle, it corresponds to the length of the red line below the arc.

We use cosine distance when we are not interested in numbers themselves but in the ratio between them.
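Here is a small Python sketch of the cosine distance for the three towns from the table above, computed as one minus the cosine of the angle between the profiles:

```python
# Cosine distance between the livestock profiles of the three towns.
import numpy as np

towns = {
    "Littlewick":      np.array([40, 20, 5]),
    "Sprockleton":     np.array([400, 200, 50]),
    "Upper Nibsworth": np.array([150, 200, 100]),
}

def cosine_distance(u, v):
    # one minus the cosine of the angle between the two vectors
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance(towns["Littlewick"], towns["Sprockleton"]))       # essentially 0: same profile
print(cosine_distance(towns["Littlewick"], towns["Upper Nibsworth"]))   # > 0: different profile
```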

Jaccard distance

Jaccard distance is the distance between sets. We use it, for instance, in our lesson about recommendation systems, where we measure similarity in preferences based on students' choices of favourite cartoons.

The Jaccard index measures the similarity between two sets: we calculate it by dividing the size of the intersection by the size of the union, so $|A \cap B| \,/\, |A \cup B|$. Since we are interested in distance, not similarity, we subtract that value from 1: $1 - |A \cap B| \,/\, |A \cup B|$.
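Here is a small Python sketch with two invented sets of favourite cartoons:

```python
# Jaccard distance between two students' sets of favourite cartoons
# (the titles are invented for illustration).
ann = {"Gravity Falls", "Adventure Time", "Bluey"}
ben = {"Adventure Time", "Bluey", "DuckTales", "Hilda"}

jaccard_index = len(ann & ben) / len(ann | ben)   # |intersection| / |union|
jaccard_distance = 1 - jaccard_index

print(jaccard_index)      # 2 / 5 = 0.4
print(jaccard_distance)   # 0.6
```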

Distances between groups

Earlier, we kept quiet about another important detail. We talked about the distance between Henry and the group containing Ivy, Cecilia, and Ginny. What do we actually mean by that? The distance between Henry and Ivy, between Henry and Cecilia, or between Henry and Ginny?! And what about the distance between the group containing Benjamin and Hannah and the one containing Andrew and Emma? Between which pair, specifically?

What is the distance between the blue and red crosses in the image below? How could we "measure" it?

  1. If the image was only slightly different (the groups would just need to be circled!), the answer to that question would be practically unanimous at first glance: we would "see" the distance between the groups as the distance between their two closest elements.

  2. Taking the image as it is now, though, some would perhaps find it more appropriate to calculate some sort of "average distance" or the distance between the centres of the groups. That sounds reasonable.

  3. Others - almost certainly a minority - might argue that we should measure the distance between the two most distant elements of the groups.

If someone suggested that for each group, we calculate "the middle" (or "the center"), some kind of centroid, we should remind them that the clustering method is based on distances that aren't necessarily geometrical. How can we calculate the "centroid" based on the Jaccard distance?

According to the first recipe, we should look for the pair with the shortest distance between them; that will then be our distance between the groups. The third approach is similar, except it looks for the pair that is the farthest apart. According to the second method, we should simply calculate the average distance between all the pairs.
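To make the three recipes concrete, here is a small Python sketch that computes all three for two invented groups of points; in the literature they are known as single, complete, and average linkage:

```python
# Three ways to turn pairwise distances into a distance between two groups:
# the closest pair (single linkage), the farthest pair (complete linkage)
# and the average over all pairs (average linkage). Points are invented.
import math
from itertools import product

blue = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2)]
red = [(5.0, 4.0), (5.5, 5.0), (6.0, 4.5)]

pairwise = [math.dist(p, q) for p, q in product(blue, red)]

print("closest pair: ", min(pairwise))
print("farthest pair:", max(pairwise))
print("average:      ", sum(pairwise) / len(pairwise))
```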

So, then, which method is "the right" one? Which one is actually used in practice? We can pick any of them, but in almost all cases the best choice is actually a fourth one: Ward's method. It is mathematically more involved; roughly, at each step it merges the two clusters whose merger increases the variance within the clusters the least. For most readers, that explanation will do; the rest can dig further, for instance in the Wikipedia article on Ward's method.

Different linkages

Why Ward's distance? Well, with that method, we typically get a dendrogram with nicely separated clusters. We said earlier that distances are usually small at first, and then suddenly increase dramatically. That is true for Ward's distance, but for the others … not so much.

Chapter 3: Widgets for Hierarchical Clustering in Orange

Hierarchical clustering requires two widgets: one that computes the distances and another that performs the clustering and displays the dendrogram.

Widget Distances

In the Distances widget, we select how we want to measure distances. It offers all the methods discussed in the previous chapter, along with many others. Additionally, we decide whether to compare rows (items) or columns (features). If we choose the latter, the next step will produce, for example, hierarchical clustering of subjects instead of students. This way, we might discover that math and physics are similar because students who excel in math tend to do well in physics too.

For this widget, the settings are not as interesting as the inputs and outputs. Like most widgets, it takes a data table as input. However, its output is not a data table but a distance matrix.
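For those who prefer scripting to widgets, roughly the same step is available in Orange's Python interface. The sketch below assumes Orange3 is installed and uses the bundled iris data set in place of our own table:

```python
# A sketch of the same step in Orange's scripting interface, assuming
# Orange3 is installed; the bundled "iris" data stands in for our own table.
from Orange.data import Table
from Orange.distance import Euclidean

data = Table("iris")           # a data table, like the output of File
distances = Euclidean(data)    # a distance matrix, not a data table

print(distances.shape)         # (150, 150): one row and one column per item
```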

Connection between Distances and Table

The Distances widget cannot be connected to Table, Tree, or most other widgets, as they require a data table and have no use for a distance matrix.

Widget Hierarchical Clustering

The Hierarchical Clustering widget requires a distance matrix; it computes the hierarchy of clusters and displays it as a dendrogram. It will also accept a table of data instances, but it uses it for other purposes, which we won't delve into here. So we will always connect Hierarchical Clustering to the Distances widget. If we link it directly to File, Data Collections, or another widget that returns a data table, clustering won't work.

In the widget, we are mainly interested in two settings.

We can specify the label that appears as the "name" of the item. The widget is relatively smart and will automatically find the appropriate variable. If it makes a mistake — or if we want to observe something specific — we can change it.

Setting a label in Hierarchical Clustering

Additionally, we can display a small coloured marker next to each label, with the colour reflecting the value of a chosen variable.

Setting a color in Hierarchical Clustering

By clicking on the groups, we can select the items within them. We can also select multiple groups by clicking on the first one, then holding Ctrl (or Cmd) while clicking on the others. The Hierarchical Clustering widget will pass the data to the widgets connected to it, such as the Table widget. A new column will appear in the data with the cluster labels.

Selection in Hierarchical Clustering

In the Dendrogram, we can determine where to cut the tree by clicking on the axis. If we click at a specific height, we will get clusters that are similar up to that height. The threshold can also be adjusted by clicking and dragging. In this case, the widget will pass the data to the connected widgets again — this time with an additional column indicating the cluster each item belongs to.

Setting a threshold in Hierarchical Clustering

The number of clusters can also be set explicitly.

Setting the number of clusters in the Hierarchical Clustering widget
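Outside the widget, the same cut can be made in code. The sketch below uses SciPy rather than Orange, with the same invented scores as in the earlier sketch:

```python
# Cutting the tree into a chosen number of clusters, as a scripting analogue
# of the widget (SciPy, not Orange; scores invented).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

scores = np.array([[9, 2], [8, 3], [2, 9], [3, 8], [9, 1], [2, 7]])

merges = linkage(scores, method="ward")
labels = fcluster(merges, t=3, criterion="maxclust")   # ask for three clusters

print(labels)   # one cluster label per row
```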

Chapter 4: K-means Clustering

K-means clustering is entirely different. Here’s how it works:

  1. Randomly(?) select k coordinates. These will be the cluster centers.
  2. Assign each item to the cluster whose center is closest.
  3. For each cluster, determine where its actual center is based on the items within it.

Since the cluster centers change in step three, we must repeat step two and reassign the items. But after repeating step two, the clusters change again, meaning we need to recalculate the centers. Once the centers are updated, the items must be reassigned… which, in turn, requires another recalculation of the centers.

In short, steps two and three are repeated until the centers stop moving and the clusters remain stable. This always happens (because both steps reduce the total distance between items and their centers) and usually quite quickly.
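To make the alternation of steps two and three concrete, here is a deliberately naive Python sketch; real implementations add smarter initialisation, several restarts, and handling of empty clusters:

```python
# A deliberately naive k-means: alternate between assigning each point to
# the nearest centre and moving each centre to the mean of its points.
import numpy as np

def k_means(points, k, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k of the points as the initial centres
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_steps):
        # step 2: assign each point to the nearest centre (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each centre to the mean of the points assigned to it
        new_centres = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):   # centres stopped moving: done
            break
        centres = new_centres
    return labels, centres

points = np.random.default_rng(1).random((200, 2))   # randomly scattered points
labels, centres = k_means(points, k=3)
print(centres)
```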

The number k is predetermined. For example, it could be 3. We can see the process in action on a set of randomly scattered points.

K-means clustering with three clusters

It could also be 4.

K-means clustering with four clusters

The success of the process depends on the initial placement of the centers. If we're unlucky, we might end up with a situation like the one on the left, where the orange and red centers split the upper-left cluster, causing the green and blue centers to take over the lower-right points. Algorithms address this issue by not selecting the initial coordinates entirely at random and, most importantly, by repeating the entire process multiple times, ultimately choosing the solution where the clusters are the most compact.

When to Use Hierarchical Clustering and When to Use the Center-Based Method?

The animations above might give the impression that the center-based method is more visual and appealing. It’s not. Quite the opposite. The result of clustering is simply a list of cluster members — no scatter plots with points (since we don’t have just two variables) and no dendrograms.

If we used a different distance metric instead of Euclidean distance, the process could fail to converge and get stuck in an endless loop — unless we also changed how the centers are determined, which leads to further complications.

K-means clustering requires a table of data instances, not a distance matrix. As for distance, there’s no choice: when determining which center an item belongs to, the algorithm uses Euclidean distance; otherwise, the process wouldn’t work correctly.

The advantage of the center-based method is its lower memory usage when dealing with large datasets. This doesn't concern us in the classroom. However, you might notice that we used k-means clustering in the lesson Climate Zones of Europe. The simple reason? Hierarchical clustering didn't give us good results. Why not? Who knows. So we keep the center-based method as a backup option.

Chapter 5: K-means Clustering in Orange

Since k-means clustering is not as visual as hierarchical clustering, the widget is also simpler.

The input for the K-Means widget is not a distance matrix but a table of data. The output is the same table with an additional column indicating which cluster each row belongs to.

In the widget, we can set a predefined number of clusters or let it test different numbers within a given range. In this case, it will display a table on the right with clustering quality scores for each number of clusters. The widget will automatically choose the best number, but we can also manually select the number of clusters by clicking on the corresponding row in the table.
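Roughly the same search can be sketched in code with scikit-learn, scoring each number of clusters with the silhouette score, a common measure of cluster quality (we are not claiming this reproduces the widget's exact numbers):

```python
# Trying several numbers of clusters and scoring each with the silhouette
# score, similar in spirit to the table shown by the k-Means widget.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

points = np.random.default_rng(0).random((200, 2))   # invented data

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    print(k, round(silhouette_score(points, labels), 3))
```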

Chapter 6: Comments Regarding Lesson Plans

Clusters in the Classroom

Clusters in the Classroom can be done without a computer, but it makes sense to demonstrate that the exact same process can also be performed by a computer.

The workflow is ready and easy to use. More details are described in the previous chapter on widgets for hierarchical clustering in Orange.

It's worth mentioning that the Distances widget is set to compute non-normalized Euclidean distances, because we want the distances, as seen by the clustering, to match the distances in the graph. However, we need to ensure that both axes in the Scatter Plot widget have approximately the same scale. This can be adjusted by making the widget wider or narrower.

Discovering Animal Groups

The use of the computer in Discovering Animal Groups is minimal and is thoroughly explained in the lesson description.

Climate Zones of Europe

Unlike other lessons, in Climate Zones of Europe we use k-means clustering, as hierarchical clustering does not give good results: in the best case, it recognizes the Mediterranean climate, but it groups Paris and London with Warsaw and Moscow instead of Berlin and Amsterdam. Why? Who knows. Maybe the problem lies in the insufficient amount of data.

The workflow is large, but only because it contains four line-chart widgets that display different things.

To make the workflow clearer, we have renamed some of the widgets so that they have names that indicate what they do in this specific workflow, instead of using their default names that indicate their general function.


The workflow includes four "Select Columns" widgets, which we use to select variables related to temperature or precipitation. The first two receive data from the table so that we can observe temperatures and precipitation in individual cities. The other two receive data from the clustering process, allowing us to observe temperatures and precipitation in individual groups.

In the k-Means widget, we have set the number of clusters to 4. As explained in the lesson description, we would actually like to have three clusters, but we set the widget to 4 because Podgorica is so different from the other cities that the algorithm assigns it a group of its own.

Socio-Economic Characteristics of Countries

The lesson in which we explore the Socio-Economic Characteristics of Countries was tested in high school. In this lesson, students also create the workflow themselves, with the help of the teacher. We will also simplify the description of this lesson a bit more.

In the lesson, with various visualizations, we explore the relationships between different characteristics of countries, such as, for example, the relationship between average years of education and life expectancy. We then create clusters and observe which countries are similar to each other and how they differ from countries in other clusters.