Hands-on Training about Data Clustering
Lecture Notes and Material for Trainers
These lecture notes describe a set of pedagogical units for hands-on training in data clustering. The training includes:
- motivation for hierarchical clustering and presentation of the agglomerative clustering algorithm,
- exploration of clustering results,
- analysis of inliers and outliers,
- k-means clustering, as one of the alternative clustering algorithms,
- a few extra topics on common gotchas in cluster analysis.
Suggested pedagogical units focus on practical problem solving and mini-challenges and require no special mathematics, statistics, computer science, or data science background. The training covers loading and profiling of the data of interest, measuring similarities and distances, hierarchical clustering and dendrograms, cluster explanation, outlier detection using the silhouette score, k-means clustering, quantitative evaluation of clustering, and inferring the appropriate number of clusters.
Pedagogical Approach
The training can be implemented through a frontal lecture. A better alternative is to carry out the training in a computer classroom, with a senior instructor leading the course and students independently performing all data analysis on their computers. Such a setting may require additional teaching assistants to mingle in the classroom, help students when they get stuck, and interact with the instructor when noticing, say, an exciting challenge in a student's work.
To facilitate data science training where users can select and interact with any part of a visual representation of the data or results, lessons are based on Orange Data Mining, a data science toolbox. Orange uses workflows to design visual analysis procedures in which changes in one component, such as the selection of a part of a dendrogram, propagate to downstream components of the workflow, for example, to explain the selected cluster.
Software platform
Data science training in which users can select and interact with any part of a visual representation of the data or results, including bar charts, scatter plots, and dendrograms, for exploratory, analytical, or simply curiosity-driven purposes, requires an appropriate tool. Orange Data Mining is a free, open-source data science toolbox whose workflows define visual analysis procedures: changes in one component, such as the selection of a part of a dendrogram, are propagated to downstream components of the workflow, for example, to explain the selected cluster. Orange uses interpreted visual programming, so any user interaction immediately affects related downstream components and visual representations.
While all examples in these lecture notes use Orange, the same approach to conceptual teaching of clustering could — in principle and with the necessary adaptations — be implemented in other excellent workflow-based data mining systems, such as KNIME and RapidMiner. Unlike Orange, the workflows of these tools are either compiled and executed on demand or lack visual interactions with data or model visualizations, so their support for visual analysis is limited.
Orange Data Mining is freely available and released as open source. The software comes with documentation, workflow examples, and an extensive collection of instructional video material. The latter can serve as complementary material to the pedagogical units in these notes. For example, here are a few short videos to get familiar with Orange,
and a few short videos to familiarize yourself with basic exploratory data analysis in Orange:
Acknowledgments and Licence
These notes are designed as accompanying material for the manuscript "Hands-on Training about Data Clustering with Orange Data Mining Toolbox," submitted to the Education section of PLOS Computational Biology and prepared by Blaž Zupan and Janez Demšar. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.
The material is offered under the Creative Commons CC-BY 4.0 licence.
Chapter 1: Hierarchical Clustering
Data
Humans can easily observe a two-dimensional map and consider distances between points, densities, and potential groupings. It is, therefore, convenient to start explaining data clustering on data sets comprising only two features and then show that the same concepts apply to data sets of arbitrary dimensions. Through the Datasets widget, Orange can access readily available data sets on its data server. Here, we will consider one of them, the Course Grades data set, comprising sixteen (imaginary) students graded in seven school subjects. The workflow loads the data, selects two features, English and Algebra, and displays them in the scatter plot. Considering these two features alone, how many clusters are there?
We propose that the training on clustering starts with this workflow. It loads a small data set on student grades (the Datasets widget) and passes it to the Data Table for review and to Select Columns to construct a new, reduced data set. In the Select Columns widget, we retain only the features with English and Algebra grades and keep student names as meta-features. The visualization of the reduced data set in the Scatter Plot is the starting point for the discussion of clusters and clustering procedures.
Distances between data items
Check out the accompanying video on Euclidean Distances in 2D and corresponding Orange Workflows.
Answering the question on the number of clusters in the Scatter Plot is not trivial, and trainees typically propose anywhere from two to eight clusters. Despite the simplicity of the data set, we require a principled, algorithmic approach to tackle this problem. The class would agree that Demi and Nash should be in the same cluster, and so should Phil, Lea, and George. Why? Because they are close to each other in the scatter plot. In the classroom, we would use a ruler to assess the "closeness" through the distances between the data points in the scatter plot. The computer, though, has only the data coordinates. Given a pair of students, say $a$ and $b$, and their coordinates in the scatter plot ($x$ for English and $y$ for Algebra), the ruler-estimated distance can be computed with the Euclidean distance,

$$d(a, b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}.$$
For Demi and Nash, for example, we read the English and Algebra grades from the data table, plug them into the formula, and obtain a small distance. Take Cynthia and Ian instead, who are obviously further apart in the scatter plot; their computed distance is correspondingly larger. A sensible clustering would consider placing Demi and Nash in the same cluster, but probably leave Cynthia and Ian separated.
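For trainees who prefer to see the computation spelled out, here is a minimal Python sketch of the two-dimensional Euclidean distance. The grade pairs below are made up for illustration only; the actual values come from the Course Grades data set loaded in Orange.

```python
import math

def euclidean_2d(p, q):
    """Distance between two students given as (English, Algebra) grade pairs."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Hypothetical grade pairs, for illustration only; the real values are read
# from the Course Grades data table in Orange.
demi, nash = (9, 8), (10, 8)
cynthia, ian = (2, 9), (9, 2)

print(euclidean_2d(demi, nash))     # a small distance: candidates for the same cluster
print(euclidean_2d(cynthia, ian))   # a much larger distance: likely separate clusters
```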
Linkage
See the video on Cluster Distances that explains the concept of linkage and introduces it through a scatter plot visualization in Orange.
Starting from distances between data items, we can devise a procedure that starts with individual data points, considering each as a separate cluster, and then iteratively merges the closest clusters. Let us cluster Demi and Nash, and George, Lea, and Phil. But what do we do with Maya? Should we form a cluster with Eve or merge her with Lea, Phil, and George? How do we measure the distances between clusters that comprise more than one data point? When asked, trainees most often propose measuring the distance between the centers of clusters. We seize the opportunity to remind them that the algorithm works with distances, not coordinates.
The coordinates of the cluster's center thus cannot be computed; indeed, in general, there is no such thing as 'the center of a cluster'. We, however, follow their lead and at this stage propose computing the average pairwise distance between elements of the two clusters, also referred to as average linkage. Alternatives include representing cluster distance with the distance between two of their closest elements (single linkage), farthest elements (complete linkage), or the sum of squared deviations from the mean within each cluster (Ward linkage). The use of different linkages may have a significant impact on clustering results, but we propose to explore these differences a bit later, once the students already understand the mechanics of the clustering and the ways to interpret the results.
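As a complement to the verbal definitions above, the following sketch shows how the three pairwise-distance-based linkages can be computed from a distance matrix. The function and variable names are ours, chosen for illustration, not Orange's internals.

```python
import numpy as np

def cluster_distance(dist, cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters, given as lists of row indices into the
    pairwise distance matrix `dist`, for three of the linkages discussed above."""
    pairwise = dist[np.ix_(cluster_a, cluster_b)]
    if linkage == "average":     # mean over all pairs of elements
        return pairwise.mean()
    if linkage == "single":      # closest pair of elements
        return pairwise.min()
    if linkage == "complete":    # farthest pair of elements
        return pairwise.max()
    raise ValueError(f"unknown linkage: {linkage}")
```

With a distance matrix over the student grades, for example, `cluster_distance(dist, [0, 1], [2, 3, 4], "average")` would return the average-linkage distance between a two-member and a three-member cluster.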
Dendrogram
Check out the video on Dendrograms, their construction and visualization in Orange.
At this stage, we are equipped to perform hierarchical clustering by hand, say, on the blackboard. We can circle the data points we join in a cluster and record the distances between merged clusters in the dendrogram. We would start with various branches of the dendrogram, say, with Nash and Demi, and Lea and George, then add Phil to the Lea-George cluster and Katherine to the Demi-Nash cluster, and so on. There is no need to be exact: we can estimate the distances visually and then compare the result with the one obtained with Orange.
This workflow retains the Datasets and Select Columns widgets, but feeds the data into the Distances widget, which uses the Euclidean distance to construct the distance matrix. It is instructive to look at the distance matrix (not shown here) and compare the distances computed manually with those from Orange. For this comparison, the feature normalization in the Distances widget was turned off. The distances are passed to Hierarchical Clustering, which visualizes the dendrogram and supports the interactive choice of the clustering cut-off point. In this way, Hierarchical Clustering outputs the whole data set with an added feature that indicates cluster membership, allowing us to view and discuss the clustering results in the Scatter Plot.
The dendrogram provides a neat visualization of the clustering structure but does not answer the question of the number of clusters. Choosing the number of clusters is often challenging, and we can warn the trainees that we usually find it by combining quantitative results with domain knowledge. Here, the dendrogram suggests the data may have three clusters because the distances between them, measured by the chosen linkage function, are greater than the distances between their constituent parts.
Clustering in multi-dimensional space
So far, we have worked with only two features of the student data, which includes grades in seven school subjects. To generalize the algorithm to multivariate data, we return to the hierarchical clustering algorithm and summarize it to stress that it depends on three key ingredients: a procedure to measure distances between data items, a procedure to measure distances between clusters, and an agglomerative procedure to construct a clustering model.
Measuring distances between data instances is the only element of hierarchical clustering we need to revisit when moving to multi-dimensional data. It is easy to extend the Euclidean distance to deal with multivariate data. For example, if we add two more subjects, say, History and Biology, and denote their grades with $z$ and $w$, the formula for computing the distances becomes:

$$d(a, b) = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2 + (w_a - w_b)^2}.$$
We could apply the Euclidean distance to a feature space of arbitrary dimensions. In practice, the Euclidean distance becomes useless with many data features due to the curse of dimensionality, but it is best to set this topic aside at this stage.
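The extension to arbitrary dimensions is a one-liner with numpy. This is an illustrative sketch; the seven-subject grade vectors are made up and stand in for rows of the Course Grades table.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between feature vectors of any dimension."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return np.sqrt(np.sum((u - v) ** 2))

# Hypothetical grade vectors over seven school subjects (illustration only).
student_a = [7, 9, 6, 8, 9, 5, 7]
student_b = [6, 8, 7, 9, 8, 6, 7]
print(euclidean(student_a, student_b))
```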
The linkage and clustering procedures are not affected by the data dimensionality and can stay the same. At this point, however, it is good to explain the Ward linkage, which estimates the inter-cluster distance through the increase in within-cluster variance caused by merging the two clusters. With this, we are ready to run the clustering in Orange and inspect the results.
In this workflow, we use the Data Table to inspect the data and stress that it now consists of seven variables. The computed distances between students are passed to Hierarchical Clustering, where the dendrogram indicates that we could structure the data into three clusters. In the Distances widget, we normalized the variables, and in Hierarchical Clustering we used the Ward linkage.
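For instructors who want to show the same analysis in code, here is a rough scipy-based equivalent of the workflow. The file name and the "Student" column are placeholders for data exported from Orange; the normalization and linkage settings mirror those described above.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

grades = pd.read_csv("course_grades.csv")        # hypothetical export of the data set
names = grades["Student"]                        # assumed name of the meta column
X = grades.drop(columns=["Student"])
X = (X - X.mean()) / X.std()                     # normalization, as in the Distances widget

Z = linkage(pdist(X, metric="euclidean"), method="ward")    # Ward linkage
dendrogram(Z, labels=list(names))
plt.show()

clusters = fcluster(Z, t=3, criterion="maxclust")            # cut into three clusters
print(dict(zip(names, clusters)))
```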
Our data likely contains three distinct clusters. The question is, why and what characterizes them?
Chapter 2: Cluster Explanation
Describing clusters through their defining characteristics is an important step in understanding the data. Considering that this may be the first time our trainees encounter the problem of cluster explanation, we are looking for the simplest means of cluster explanation that still provides interpretable results. Below, we propose the steps to introduce the explanation of hierarchical clustering, while noting that the described procedure applies to any type of clustering, or to characterizing any data subgroup.
The workflow loads the data (Datasets widget), computes pairwise distances between data instances (Distances), and constructs hierarchical clustering from a distance matrix. In the Hierarchical Clustering widget, the user can select the cluster of choice by clicking the branch of the dendrogram (here marked with blue). The widget outputs the whole data set augmented with an indicator variable that marks instances from this cluster. We use the Data Table to inspect the resulting data and Box Plot to find features that characterize the selected cluster.
To find features that characterize that cluster, we need to augment the data with a column that indicates whether the data instance is a member of the selected cluster. In Orange, Hierarchical Clustering, by default, outputs only the data instances in the selected cluster, but the widget has another output channel with the whole data set and indicator feature. To use this output, we need to rewire the connection between Hierarchical Clustering and two downstream widgets, Data Table and Box Plot. We use the Data Table to show the students that the indicator variable is included as an additional column in the data.
Orange's computational units can have multiple input and output channels. The Hierarchical Clustering outputs data from the selected cluster and, separately, the entire data set augmented with a variable that indicates if the data instance is a cluster member. The default output channel for Hierarchical Clustering is the former, and to change it to the latter, we need to double-click the edge connecting the two widgets in the workflow and change the wiring in the resulting dialog by dragging a line from Data output in Hierarchical Clustering to Data input in Data Table.

Characteristic features
By this point, students understand how the dendrogram shows the overall structure of the data, and can be used to determine the appropriate number of groups and group membership. A natural question that should arise from discussing the dendrogram is: how do we describe a group? In many cases, experts can recognize the objects belonging to a group and use their prior knowledge to provide its description. Can we do it objectively?
In other cases - like the one we have presented so far - we have no prior knowledge about the objects being classified. How do we describe a cluster in such cases? With sufficient prompting, students should realize that clusters should be described by the features that distinguish them from the rest of the data.
A simple way to do so is to use a box plot to find variables with different distributions of values within and without the target cluster. Box Plot can display distributions of values for chosen variables, and we can set the widget to compare the distributions in the data items within a chosen cluster and outside it.
For our selected cluster, Algebra characterizes the cluster much better than Physical, as the score means for in-cluster and out-of-cluster data instances are further apart and there is less overlap in the distributions of grades.
Framework for visual analytics of hierarchical clustering
The next question that should arise in the classroom is: how do we do this in high-dimensional data? Can we formally define and compute the difference between in-cluster and out-of-cluster distributions?
Note that we use the t-statistic only as an indicator of the difference, not to perform the eponymous test. If necessary, we explain to students why computing p-values and using them to argue for the statistical significance of differences would be incorrect.
Here we make a short excursion into statistics and describe the difference by the difference of means divided by the variance. Depending on the background of the trainees, we can do this with more or less rigour, either explaining that we are computing Student's t-statistic, or only intuitively explaining that the difference in means must be considered relative to the overall spread of the data.
The t-statistics are displayed in the Box Plot. The value for the variable Physical is much lower than the one for Algebra. The higher the value of t, the bigger the difference in distributions, and the better the variable characterizes our target cluster.
We can sort the variables in the Box Plot by their scores by checking "Order by relevance to subgroups". For our choice of the target cluster, the highest-ranking feature is History, where the students in the cluster perform poorly. However, they perform very well in the next-ranked feature, Algebra, and in the following one, Physics. The described feature ordering and inspection of the box plot visualization support cluster interpretation: the students in the chosen cluster are talented in natural science subjects.
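The feature ranking performed by the Box Plot can be approximated in a few lines of Python. This sketch assumes a data frame of grades and a boolean series marking membership in the selected cluster, and uses the absolute t-statistic purely as a score, exactly as discussed above.

```python
import pandas as pd
from scipy import stats

def rank_features_by_t(data: pd.DataFrame, in_cluster: pd.Series) -> pd.Series:
    """Order features by how strongly their in-cluster distribution differs
    from the out-of-cluster distribution; |t| serves as a score, not a test."""
    scores = {}
    for column in data.columns:
        inside = data.loc[in_cluster, column]
        outside = data.loc[~in_cluster, column]
        t, _ = stats.ttest_ind(inside, outside, equal_var=False)
        scores[column] = abs(t)
    return pd.Series(scores).sort_values(ascending=False)
```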
Workflows like the one above support interactive visual analytics of the clusters revealed in the dendrogram. With both the Hierarchical Clustering and Box Plot widgets open, we select one cluster at a time and check the variables visualized in the Box Plot as ordered by the t-statistic.
We already know that the selected cluster of students has an interest in natural sciences. It turns out that the cluster that includes Ana, Phil, and Bill grades high in subjects related to language and social science. Ana and Henry, students in our third cluster, are good at sports. At this stage, we show the students how a combination of interactive cluster selection, box plot visualization, and scoring by the t-statistic yields a powerful tool for cluster exploration and understanding.
Change to a more complex data set
It is instructive to apply the knowledge gained so far to a more complex example. Here, instructors should find a data set that fits the audience's interests. A data set worth considering is Zoo, a data set from the early years of machine learning that contains 101 animals (including some fictional creatures) described by 16 categorical, mostly binary features, such as being airborne, giving milk, and having fins. The Zoo data set classifies animals into groups like mammals, insects, and invertebrates. Note that the class feature is not used when computing the distances and can also be removed using the Select Columns widget. An interested audience would also be curious about using categorical features in computing the Euclidean distance, and the instructor can explain the one-hot encoding that Orange automatically uses for such features; a sketch of this encoding is shown below.
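If the audience wants to see what one-hot encoding does before distances are computed, a short pandas sketch makes it concrete. The file name and column names are placeholders for an export of the Zoo data.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

zoo = pd.read_csv("zoo.csv")                     # hypothetical export of the Zoo data
X = zoo.drop(columns=["name", "type"])           # drop the animal name and the class
X_encoded = pd.get_dummies(X).astype(float)      # one-hot encode categorical features
D = squareform(pdist(X_encoded, metric="euclidean"))   # full pairwise distance matrix
print(X_encoded.shape, D.shape)
```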
We use the same workflow as before to explore clusters in the Zoo data set and just change the data in the Datasets widget. This data set has at least four distinct clusters exposed in the dendrogram and an interesting set of features to characterize them.
The figure displays a section of the resulting dendrogram, where we have selected a cluster with birds. The most characteristic feature is, to no surprise, having feathers. The Box Plot also shows that there is one animal in the cluster that does not have feathers, and further inspection by selecting this animal (click on the blue box in the "Yes" bar) shows that it is a tortoise, a reptile. Detecting such an outlier may uncover potentially erroneous clustering results worth investigating, possibly with the Orange widgets introduced so far (hint: the tortoise lays eggs).
The data also includes some cluster outliers that we can spot through a combination of cluster selection and Box Plot display and inspect them in the Data Table. This can serve as a cliffhanger for the next pedagogical unit.
Chapter 3: Inliers and Outliers
Some data instances fit their corresponding cluster better than others. We can easily spot the inliers and outliers in two-dimensional data by observing them in a scatter plot. In higher dimensions, measuring how typical a data instance is of its own cluster requires a formal treatment. In this unit, we introduce one such measure, the silhouette score, which will later give rise to cluster quality scoring in the pedagogical unit on k-means clustering.
A silhouette score
We start with a hand-drawn data set and discuss what makes a data item typical or atypical for a given cluster. In discussion with trainees, we reinvent a data item's silhouette score.
Trainees could, and perhaps should, reinvent this score on their own, with some hints and help from the instructor. A question to explore is what makes a data instance typical of its cluster. To help answer it, we consider a two-dimensional data set visualized in a scatter plot and mark, say, three data points: one in the center of a cluster, one a bit offset from the center, and one on the border, close to some other cluster.
Trainees will propose that a data instance typical to the cluster should lie close to the center and, with some help, also recognize that it should be well removed from any foreign cluster. Given a data point, we can quantify both terms by computing the average distance from points in its fellow cluster and the average distance from the closest other cluster, where the closest cluster is the one for which this average is the smallest.
For a given data instance, the silhouette score is computed from the average distance to all other data instances in the same cluster (panel a, intra-cluster distance) and from the average distance to the data instances in the closest other cluster (panel b, inter-cluster distance).
Let us denote these two averages with $a$ and $b$. Data instances that fit well into the cluster have a small $a$ and a large $b$. Hence, $b - a$ should be large. The silhouette score is thus defined as $s = (b - a) / \max(a, b)$. Data instances with scores closer to 1 are inliers, and those closer to 0 are outliers on the cluster's outskirts. Some data instances may have a negative silhouette score; judging by the silhouette score alone, such instances could be moved to another cluster.
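The definition translates directly into code. The following sketch computes the silhouette of a single data instance from raw coordinates and cluster labels; in practice one would use scikit-learn's silhouette_samples, but writing it out reinforces the meaning of $a$ and $b$.

```python
import numpy as np

def silhouette(i, X, labels):
    """Silhouette s = (b - a) / max(a, b) for data instance i.
    Assumes every cluster has at least two members."""
    own = labels[i]
    dists = np.linalg.norm(X - X[i], axis=1)              # distances from instance i to all points
    others = np.arange(len(X)) != i
    a = dists[(labels == own) & others].mean()            # average intra-cluster distance
    b = min(dists[labels == c].mean()                     # average distance to the closest
            for c in np.unique(labels) if c != own)       # other cluster
    return (b - a) / max(a, b)
```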
Painted data and silhouette plot
We can use Orange's Paint Data widget to paint clusters of points. We aim to gain further intuition behind the silhouette and see if the score identifies inliers and outliers. Here is the workflow we can use for this task. It features the Silhouette Plot widget that computes silhouette scores of all data items and presents them in a bar plot. Trainees can brush through the plot to select data instances with various silhouette scores and observe them in the scatter plot. The red cluster, for instance, has one outlier and several data instances that border the blue cluster.
An exercise that helps in understanding the silhouette score involves painting the data in the Paint Data widget, computation of the silhouette scores in the Silhouette Plot, and selection of inliers or outliers identified in this plot to be exposed as a subset in the Scatter Plot widget.
Outliers in the Zoo data set
Painting data to create a two-dimensional data set served only to gain an understanding of the silhouette. Students need to understand that the silhouette score relies only on the computation of distances, which naturally scales to higher dimensions. To demonstrate this, we show a workflow where we apply hierarchical clustering to a larger data set. In our example, the Zoo data, we cut the dendrogram to expose four clusters.
The workflow loads the Zoo data set, performs the clustering, and then uses silhouette scoring to find the outliers. Data instances with low silhouette scores selected in the Silhouette Plot are reported in the Data Table.
In the largest, blue cluster, which contains all the mammals, we can spot several animals with low silhouette scores. We find that dolphins and porpoises have negative silhouette scores, and seals and the platypus have scores just above zero. These are aquatic animals and are probably quite similar to the animals in the cluster with fish.
Chapter 4: K-Means Clustering
In the last pedagogical unit, we discuss some weaknesses of hierarchical clustering and present one of the alternative methods, k-means clustering.
Hierarchical clustering shines in its visual depiction of clustering results: the dendrogram allows for data brushing, that is, the selection of a cluster and further exploration and characterization of the selected data. Its weakness is computational complexity when applied to larger data sets. The distance matrix alone, the input to hierarchical clustering, can be of prohibitive size and expensive to compute. For a data set with on the order of ten thousand instances, the distance matrix would already require close to 1 GB of memory. Hierarchical clustering on such data sets would take substantial time. Besides, dendrograms remain visually useful only with smaller data sets. We thus require an alternative clustering technique for larger data sets.
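The memory estimate is easy to verify with a back-of-the-envelope calculation; the instance count below is purely illustrative.

```python
n = 11_000                              # illustrative number of data instances
bytes_full = n * n * 8                  # full matrix of 64-bit floats
print(f"{bytes_full / 1e9:.2f} GB")     # roughly 0.97 GB
```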
A suitable alternative to present to students is the k-means clustering, in particular because it is easier to explain than the more advanced procedures like DBSCAN.
The algorithm
In this pedagogical unit, we use a painted data set to introduce the k-means algorithm. Orange comes with an educational add-on that includes the Interactive k-Means widget. The widget takes the first two numeric features of the input data set, plots a scatter plot, allows the user to place an arbitrary number of centroids, and executes the clustering algorithm step by step. To grasp the algorithm's workings, we encourage the trainees to draw their own data set and use Interactive k-Means to monitor the progress and convergence of the algorithm.
A simple workflow that starts with painted data and feeds it into the Interactive k-Means widget. Users can set the initial positions of the centroids and execute the algorithm by alternating between recomputing the positions of the centroids and reassigning data instances to their closest centroid.
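A bare-bones implementation of the loop that Interactive k-Means animates may help trainees connect the two steps. This is our sketch of the standard (Lloyd's) procedure, not Orange's code, and it assumes no cluster ever becomes empty.

```python
import numpy as np

def k_means(X, centroids, max_iter=100):
    """Alternate assignment and centroid-update steps until the centroids stop moving."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its members
        # (assumes every cluster keeps at least one member)
        new_centroids = np.array([X[labels == k].mean(axis=0)
                                  for k in range(len(centroids))])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```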

Centroid initialization
Convergence to the optimal solution, where the average distance between data instances and their closest centroids is minimal, depends on appropriate centroid initialization. Unfortunate initial positions of the centroids can lead to a non-optimal solution. Here, we can challenge trainees to design a data set and find an initial placement of centroids that tricks the algorithm into misbehaving.
Finding a set of initial positions that lead to a non-optimal solution may be tricky …

... but not impossible.

So far, we have assumed that the centroids are placed arbitrarily. In the discussion with trainees, we can find procedures to initialize the centroids better and avoid non-optimal solutions. One possible alternative is to repeat k-means several times with random placements of centroids and then choose the best-obtained clustering. Another is to use the data instance farthest from all other instances as the first centroid and then iteratively place further centroids at the data instances farthest away from the centroids already placed. A combination of the two techniques, with some randomization of centroid placement, is called k-means++ and is used for centroid initialization in the scikit-learn library that is wrapped within Orange's k-Means widget.
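The farthest-point idea described above can be sketched in a few lines. This simplified variant starts from a random instance rather than the one farthest from all others, and it is not the exact k-means++ procedure used by scikit-learn.

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Greedy initialization: repeatedly add the instance farthest from the
    centroids chosen so far (a simplified, deterministic cousin of k-means++)."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                 # random first centroid
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[dists.argmax()])               # farthest from all chosen so far
    return np.array(centroids)
```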
Scoring of the clustering and search for the appropriate k
A deficiency of k-means clustering is the requirement to define the number of clusters ahead of the clustering procedure. At this stage, we can engage students in a discussion about how we could overcome this limitation. The discussion should steer toward estimating the quality of the clustering result. We have already discussed an estimate of how central a data instance is for a given clustering structure — the silhouette — and in the discussion, students often discover that the average silhouette over all data points could be a reasonable estimate of clustering quality. We can use this estimate, try a range of different values of k, and pick the one with the best silhouette score. Here we encourage students to paint the data, possibly with a range of clusters, and see if Orange, which implements this approach, can guess the number of clusters correctly.
The k-Means widget in Orange can estimate the number of clusters by evaluating different values of k through the silhouette score of the resulting clustering. The figure shows the workflow where we painted the data set, let the k-Means widget repeat the clustering procedure multiple times to find the optimal number of clusters through silhouette scoring, and showed the resulting clustering in the Scatter Plot.
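In code, the same idea amounts to re-running k-means for each candidate k and keeping the one with the highest average silhouette. This sketch uses scikit-learn, which also powers Orange's k-Means widget, although the widget's exact settings may differ.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X, candidates=range(2, 9)):
    """Return the k with the highest average silhouette, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best = max(scores, key=scores.get)
    return best, scores
```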
Running k-means on a multivariate data set
At this point, trainees should be familiar with the procedures and can use k-means on some multi-dimensional data set and interpret the clusters. In the workflow below, we reuse the Zoo data set, find out that k-means with silhouette scoring suggests either four or five clusters, and characterize the clusters, one by one, using the Box Plot and ranking of variables by the t-statistic.
This linear workflow reads the data, infers the clustering by k-means, and then selects the target cluster using the Select Rows widget. Note that the connection between Select Rows and the Box Plot requires a similar rewiring as when connecting Hierarchical Clustering to the Data Table and Box Plot, so that the whole data set is passed to the Box Plot and the indicator variable "Selected" is added to the data set.
Chapter 5: Common Gotchas
Like any data analytical procedure, clustering has its limitations: we cannot use it as a magic wand and trust its results without scrutiny. Here is a list of topics that we customarily discuss with students.
Invalid reasoning about pairwise distances from dendrograms
A dendrogram should not be used to reason about pairwise distances between data instances.
There is a common — and wrong — implicit assumption that clustering identifies groups such that "a data instance is similar to other data instances in its cluster and different from instances in other clusters." While k-means optimizes this in a certain sense (each instance belongs to the cluster with the nearest centroid), hierarchical clustering optimizes it only indirectly and depending on the chosen linkage. However, even in the case of k-means — and any other clustering method we do not cover here — some data instances are bound to be closer to some instances from another cluster than to some (or even most) instances from their own. There are methods, like t-SNE and multi-dimensional scaling, which we can use to identify such cases, but we do not cover them in this lesson.
K-means assumes spherical clusters of comparable size
Students can easily identify the two clusters from the painted data in this figure.
We use the usual workflow to paint the data, pass it to k-means clustering, and display the results in the scatter plot. Although clusters are clearly separated, some points that should belong to the bigger cluster are closer to the center of the smaller one. Hierarchical clustering finds the expected clusters unless we use complete linkage and the difference in cluster sizes is very large.
We can then show them that the k-means algorithm fails at this task. With such unequal cluster sizes, some points that should belong to the bigger cluster are closer to the center of the smaller one.
A similar problem occurs in a case of an elongated, non-spherical cluster, which is also not discoverable by k-means. We let the students explore whether any of these clusters can be recognized by hierarchical clustering, with which linkage, and why.
We use data like this to show that k-means cannot find non-spherical clusters. Fixing the number of clusters to two does not help. Students can extend this workflow by passing the data to Distances, then to Hierarchical Clustering and the Scatter Plot widget, observing that only hierarchical clustering with single linkage can encapsulate clusters of such elongated shapes.
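To reproduce this gotcha outside Orange, one can replace the painted data with scikit-learn's two-moons generator; the sketch below contrasts k-means with single-linkage agglomerative clustering under that assumption.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering

# Two elongated, interleaved clusters as a stand-in for the painted data.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
# km cuts straight across both moons; single follows their elongated shapes.
```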
Finding clusters where there are none
We show a scatter plot that seems to consist of three visible — although not very distinct — clusters identified by the k-means algorithm.
Coloring the scatter plot according to cluster assignment shows illusory clusters.
Next, we show them the same plot without colors.
In the same picture as the one above, just without colors, we no longer perceive any clusters.
Clusters that we perceive in the first plot are just an illusion produced by the algorithm. To further demonstrate that clustering always "succeeds," we paint a cloud of points, run the k-means algorithm, and show the results.
K-means discovers clusters even though there are no discernible gaps in the data. The number of clusters in this workflow is not set manually; k-means may find an arbitrary number of clusters.
We then stress that one can easily spot the problem in these figures of two-dimensional data sets. However, real-world data is multidimensional, which makes it much harder, or even visually impossible, to uncover that data points actually lie within a formless, unstructured blob.
Assessing the quality of clustering is difficult
The above discussion naturally leads to the question of how we can know whether the clusters are real. Here, we must first warn any students with some training in statistics against the temptation to compare the clusters using statistical tests — more so because we have been using Student's t-statistic in our lesson. We must remind them that the statistic served exclusively to measure the difference of means relative to the variance and not as a step toward the eponymous test. Computing p-values to confirm the significance of differences between clusters would be invalid because hypotheses cannot be tested on the data from which they were derived.
Instead, we again ask students how they would (intuitively) define a cluster. After we get a useful answer (for example, "a group of points clearly separated from others"), we ask whether "good clusters" can be recognized using the exploratory methods that we have introduced so far.
The answer for hierarchical clustering seems to be checking the lengths of branches in a dendrogram: one could reason that a good clustering is one in which branches are initially short but, after a certain point, rapidly grow longer. To see the problem with this, we show a dendrogram that seems to clearly indicate three well-separated clusters.
A dendrogram of the hierarchical clustering of the blob-like data above, using the Ward linkage.
We then show that this is the dendrogram of the clustering from the above example in which we showed how clustering algorithms can find clusters even when there are none.
Why does this happen? The lengths of the branches represent the values of the linkage function. Clustering uses the linkage function to define distances between clusters, but — as we discussed earlier in the lesson — different situations require different linkages, and not all linkages correspond to the natural notion of distance. In particular, the popular Ward method minimizes the intra-cluster variance at each merge. This only indirectly corresponds to distances and typically begins rising rapidly at some point, even though there may not be any corresponding gap in the spatial structure of the data. We can also show that single linkage produces a dendrogram in which the lengths of branches better match intuitive distances.
A dendrogram for the clustering of the same data using single linkage.
A better indicator of clustering quality is the silhouette plot. We have so far shown its usefulness for detecting outliers. We let the students explore the silhouette of another clustering.
The blue cluster is clearly defined because all its members are much closer to its center than to any other cluster. Silhouettes of the other three clusters do not have such a sharp boundary, which indicates that they are closer to each other. The data for the silhouette comes from the above example with non-spherical clusters.
Which clusters are well-defined and which are not? The desired answer is that the red, green, and orange clusters have some points with low silhouette scores, while the silhouette of all points in the blue cluster is high. We then show that this is the silhouette for data from the above example with non-spherical clusters. The blue cluster is indeed separate, while the others should have formed one common cluster. Silhouette visualization can thus show the quality of individual clusters rather than only the quality of the clustering in general. We emphasize that not finding any cluster could indicate that there are no clusters or that we have not used the appropriate method to identify them.
Linkages for Hierarchical Clustering
In the section on hierarchical clustering, we briefly introduced different linkages for estimating the distance between clusters, i.e., between groups of data instances. These included average, single, complete, and Ward linkage. In our practical examples, we used average linkage (also because it is easy to explain) and later, with larger data sets, switched to Ward linkage. If time permits, we discuss some of the differences between these distance estimation methods. For example, single linkage, also known as minimum linkage, considers the shortest distance between elements of the clusters, which can lead to "chaining" effects where clusters become elongated and less compact. Complete linkage, or maximum linkage, uses the longest distance between elements, often resulting in more compact clusters, but it can be sensitive to outliers. Average linkage, on the other hand, uses the average distance between all pairs of points from the two clusters, balancing the extremes of single and complete linkage, but sometimes producing clusters that are not as distinct. Ward linkage minimizes the variance within clusters, aiming for clusters that are more similar internally. This often results in more balanced and interpretable clusters, especially with larger data sets, making it a preferred method in many applications despite its higher computational cost.
It may be easiest to explore the difference between various types of linkages using the painted data. Instructors may, at this stage, ask course attendees to paint the data where, for example, single linkage would succeed and Ward linkage would fail.
Comparison of Different Clustering Methods
While the essence of this course is to introduce clustering and the exploratory data analysis methods for understanding its results, the course also introduces various clustering approaches, including hierarchical clustering, k-means, and DBSCAN. These approaches can produce different, and sometimes very different, results depending on the values of their parameters. For example, hierarchical clustering is well-suited for smaller data sets and allows for easy visualization of the dendrogram, making it useful for understanding hierarchical relationships, but it may struggle with larger data sets. K-means clustering works efficiently on larger data sets but is sensitive to outliers and requires the number of clusters to be specified beforehand. DBSCAN is robust to outliers and can find clusters of arbitrary shapes, making it suitable for data sets with noise and varying densities, but it can be challenging to visualize and interpret its results directly.
If time permits, instructors can include a short lesson on comparing the results of clustering using these methods. Such a comparison requires the introduction of at least two other widgets, one for merging the data tables with the clustering results, and the other for constructing and displaying a pivot table that shows the confusion matrix of the two clustering methods. For example, the following workflow compares k-means and hierarchical clustering. We have used the Zoo dataset, where hierarchical clustering with Euclidean distance and Ward linkage and k-means clustering discover a similar clustering structure. Students can explore and comment on the differences by selecting cells in the confusion matrix and observing the corresponding data instances in, for example, the data table.
This workflow uses Edit Domain widgets, in which we renamed the feature that stores the clustering results: the variable Cluster was renamed to "k-means" in the output of k-means clustering and to "hierarchical" in the output of hierarchical clustering.
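The same comparison can be scripted: cluster the one-hot encoded Zoo features with both methods and cross-tabulate the labels, with the cross-tabulation playing the role of the Pivot widget's confusion matrix. File and column names are placeholders, as before, and the chosen parameters only mirror the settings described above.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

zoo = pd.read_csv("zoo.csv")                                    # hypothetical export
X = pd.get_dummies(zoo.drop(columns=["name", "type"])).astype(float)

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
hier_labels = fcluster(linkage(pdist(X), method="ward"), t=4, criterion="maxclust")

print(pd.crosstab(pd.Series(kmeans_labels, name="k-means"),
                  pd.Series(hier_labels, name="hierarchical")))
```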
Clustering vs. Classification
Clustering is about finding groups in the data — unlike predictive (or supervised) modeling, which is about describing predefined groups, in machine learning referred to as classes.
We noticed that students are often confused and use clustering "to discover classes in (supervised) data" or, for instance, find clusters and check how well they correspond to actual classes — with a good match supposedly indicating that "the classes are assigned correctly." We keep stressing that clustering and predictive modeling are two distinct tasks. When the data consists of several groups (classes), one uses predictive modeling to learn these classes' properties and make predictions on future data. Clustering provides neither of these and, in this context, only discovers what we already know.
In a task that we often give as homework after students learn about clustering and predictive modeling, students have to use logistic regression and k-means clustering on data set GDS4168 from NCBI GEO (classification of peripheral blood B cells isolated from untreated chronic lymphocytic leukemia; the file prepared for loading into Orange can be downloaded from http://file.biolab.si/files/GDS4168.zip). One question they must consider is which method finds a better fit for the given classes and why. This data is interesting because the two classes are easily separable, yet the separation goes across clusters and not along them.
A common wrong answer — which provides us with useful feedback about students' understanding of the difference between supervised and unsupervised learning — is that logistic regression worked well, while k-means failed "because clusters contain a mixture of data instances from both classes." In the follow-up lesson, we discuss that predictive modeling finds a boundary between points of different "colors," while clustering is color-blind and finds a boundary that follows the gaps. Clustering thus did not fail to find clusters that match the two classes: it was never supposed to look for them.
Logistic regression (which we cover in other lessons that are not the topic of this article) shows almost perfect performance under cross-validation. The two clusters found by k-means contain a mixture of instances from both classes, as shown in the Box Plot. The fact that the clusters do not correspond to the classes, although the classes are easy to separate, demonstrates the difference between supervised and unsupervised modeling.