
Data Exploration and Clustering

A gentle introduction to AI and machine learning.

Which NBA player handles the ball the most like Luka Doncic? Ever wondered which tennis player's serving style is most like Serena Williams? Which cheese has the closest flavor profile to Brie? In the vast array of tropical forests, which has the most in common with the Amazon? Which contemporary film director can be compared to the legendary Alfred Hitchcock in terms of cinematic style? Is watermelon really similar in nutritional value to grapes and cherries? Is the world really divided socio-economically from North to South? For those entranced by the prose of Jane Austen, which modern author comes closest in narrative style?

In this course, we will not find answers to all of these questions. But we will learn approaches that might. All of these questions are easy to answer if we have the right data. We can use the data to represent objects of interest. For example, we can represent foods with their nutritional values, tennis players with their service statistics, and countries with their socioeconomic indicators. From here it is easy: we can use the data to assess similarities, find groups of similar objects, and interpret these groups to find their characteristics.

Data profiling, similarity measurement, and data clustering are essential tasks in machine learning. Therefore, they are a good place to start our series of tutorials on AI.

The tutorial is hands-on: to complete it, you will watch the videos and answer quizzes. To get started, please download and install Orange, our choice for data science software, which is used throughout the tutorial series. There are many other similar machine learning tools out there, but we chose Orange because of its simplicity and ability to combine visualization with machine learning. And, ok, because we at the University of Ljubljana are its authors. However, the course is not about Orange, but about the core concepts of machine learning.

If you get stuck at any point, there is a Discord chatroom where you can ask us for help.

Welcome to the Data Exploration and Clustering class! And yes, watermelons are nutritionally similar to grapes and cherries. And the world is still a divided place. Read on and watch the videos to see how we can figure that out from the data.

The material in this course is offered under the Creative Commons CC BY-NC-ND license.

Chapter 1: Data exploration

... where we start using Orange, learn the mechanics of visual programming, and get started with some basic data analysis.

We begin our hands-on tutorial with socioeconomic data from the Human Development Index. Which country has the longest life expectancy? Is life expectancy related to years spent in school? In which part of the world are there countries where life is too short? Here is a video where we use Orange to answer some of these questions.

The video showed how to use Orange and its visual programming interface to load and explore the data. Now it is up to you, dear trainee: use Orange, load the Human Development Index data using the Datasets widget, and display the data in the Data Table. You will need to construct a workflow similar to the one shown here:

Open the Data Table widget (double-click the widget icon) and use it to answer the following questions:

How many countries does our data include? (1pt.)

You have 3 attempts.

According to our data (published in 2015), which is the country with the longest lifespan? (1pt.)

You have 3 attempts.

Which is the country with the largest gross national income per capita? (1pt.)

You have 3 attempts.

Now use the Scatter Plot widget to show the relationship between life expectancy and average years of schooling.

Which country stands out with a relatively high number of years its citizens spend in school and a relatively short life span? (1pt.)

You have 3 attempts.

Use the Scatter Plot widget to find out which of the following indicators best correlates with mean years of schooling. (1pt.)

You have 3 attempts.

If you are wondering how to save your work in Orange, here is a video with an explanation. We will continue to look at relationships between pairs of variables. You have already thought about correlations for the question above. Correlations between variables can be quantified, and the following video shows how.

What is the Pearson correlation between life expectancy and mean years of schooling? (1pt.)

You have 3 attempts.

Check out the Wikipedia page for Pearson correlation coefficient.
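If you are curious what the widget computes behind the scenes, the Pearson correlation follows directly from its definition: the covariance of the two variables divided by the product of their standard deviations. Here is a minimal Python sketch with made-up toy numbers (not the actual Human Development Index data):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

# Hypothetical numbers for illustration: life expectancy vs. years of schooling
life = [55, 60, 65, 72, 80, 83]
school = [4, 6, 8, 10, 12, 13]
r = pearson(life, school)
print(round(r, 3))  # close to +1: the two variables rise together
```

A correlation near +1 means the points in the scatter plot lie close to an upward-sloping line; a value near -1 would mean a downward-sloping line, and values near 0 mean no linear relationship.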

The more positively the two socioeconomic indicators are correlated, the closer the Pearson correlation should be to one of the following values. (1pt.)

You have 3 attempts.

Let us change the dataset and look at the employee data. Open the Datasets widget and load the employee turnover data.

Just a note: for now, we will use this data only because it profiles the employees in an organization. We will not yet explicitly relate the employee profiles to the question of whether an employee will stay or leave the organization, as captured by the Attrition attribute.

How many variables does the employee attrition data set use to profile the employees? (1pt.)

You have 3 attempts.

Which of the following variables is monthly income most closely related to? (1pt.)

You have 3 attempts.

Often the answers we get from the data are not so surprising, but they are still valuable if they confirm our beliefs or are consistent with common knowledge. Looking at the data in a spreadsheet or visualizing it in different graphs allows us to become familiar with it and helps us spot any data errors. It pays to check the data thoroughly before we do any machine learning. So far, we have only used the scatter plot for visualization, and it is time to familiarize ourselves with two other visualizations, the bar chart (the Distributions widget) and the box plot.

Using the attrition data set and the Distributions widget, answer the following questions. Please note: when the class variable is present, Orange automatically uses it to split all the distributions, so that they can be compared by the class value. For now, disregard the class by setting "Split by" in the Columns pane to None.

Are there more males or females in the attrition data set? (1pt.)

You have 2 attempts.

What is the educational background of most of the employees in this data set? (1pt.)

You have 2 attempts.

Where do most employees live in relation to the company's headquarters? Answer this by looking at the distribution of distance from home. (1pt.)

You have 2 attempts.

Now use the Box Plot. Set Subgroups to None, as in the following screenshot; again, we are not interested in the differences between those who leave and those who stay. Orange's Box Plot uses standard visual notation to show things like the mean (blue bar), the median (yellow bar), and the values of the 1st and 3rd quartiles. You can always get help on a particular widget by clicking the "?" icon in the widget's status bar at the bottom left of its window.
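Behind the visual notation, the statistics shown by the Box Plot are straightforward to compute. Below is a small Python sketch, using hypothetical ages rather than the actual attrition data, that computes the quantities Orange draws:

```python
import numpy as np

# Hypothetical employee ages, invented for illustration
ages = np.array([22, 25, 29, 31, 34, 36, 36, 40, 45, 52, 58])

mean = ages.mean()                      # the blue bar in Orange's Box Plot
median = np.median(ages)                # the yellow bar
q1, q3 = np.percentile(ages, [25, 75])  # the box edges: 1st and 3rd quartiles
print(mean, median, q1, q3)
```

The median is the middle value when the data is sorted, so half of the employees are younger and half are older; the box between the quartiles covers the middle 50% of the values.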

Using the Box Plot widget, answer the following questions:

What is the mean age of the employees in the attrition data set? (1pt.)

You have 2 attempts.

What is the median age of the employees in the attrition data set? (1pt.)

You have 2 attempts.

Chapter 2: Distances

... where we use the data to find similar data items, and quantify similarities by measuring distances in multi-dimensional spaces.

We agree that the term "multidimensional" sounds kind of sophisticated. But measuring distances between data points, as we will learn, is just like measuring distances in real life with rulers and tape measures. In data science, we measure distances to find neighbors and outliers. We ask questions like: Is the planet Venus more like Jupiter or Mars? Is the company Siemens more like Procter & Gamble or Google? Which shampoo is like Head & Shoulders, but maybe cheaper? And nutritionally, are tomatoes more like cauliflower than sweet potatoes? If we have data to profile each of these items, finding answers to all of these questions is not only easy, it is domain independent.

Let's dive into measurement of distances first. We start with distances in two dimensions.

It turns out, as we will see in future videos, that measuring distances in multiple dimensions is just as easy as it is with two-dimensional data. For now, we will pretend that we know how to do this and consider a multi-dimensional data set and the same set of widgets as in the video. Let us use the food nutrition data ("Food Nutrition Information") available from the Datasets widget.
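To make the idea concrete, here is a minimal Python sketch of Euclidean distance on normalized features. The tiny food table is made up for illustration (the nutrient values are invented, not taken from the course data set):

```python
import numpy as np

# Tiny hypothetical table: three foods described by two nutrients
# (calories per 100 g, protein in g); values invented for illustration
foods = {"asparagus": [20, 2.2], "green beans": [31, 1.8], "cheddar": [403, 25.0]}
X = np.array(list(foods.values()), dtype=float)

# Normalize each column to [0, 1] so no single feature dominates the distance
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def euclidean(a, b):
    # square root of the sum of squared feature differences
    return np.sqrt(((a - b) ** 2).sum())

d_beans = euclidean(Xn[0], Xn[1])   # asparagus vs. green beans
d_cheese = euclidean(Xn[0], Xn[2])  # asparagus vs. cheddar
print(d_beans < d_cheese)  # True: asparagus is closer to green beans
```

Normalization matters here: without it, the calories column (hundreds) would completely swamp the protein column (tens of grams), which is why the Distances widget offers the normalized variant.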

How many food items are in the nutritional information data? (1pt.)

You have 3 attempts.

Which of the food types is best represented in this data set (hint: use Box Plot)? (1pt.)

You have 3 attempts.

Estimate the distances between different food items. For the distance metric, use Euclidean distance on normalized data features (open the Distances widget and make sure this is the selected metric). Visualize the distances in the Distance Matrix widget, and answer the following question.

Which of the following foods is asparagus most similar to in terms of nutritional value? (1pt.)

You have 3 attempts.

Even for such small data, the distance matrix is large, and as the data gets larger, it becomes difficult to summarize anything from it. For example, it takes some patience and time to find out which foods from our database are most similar in nutritional value to, say, asparagus. It turns out that they are green beans, summer squash, and peaches, in this order, with the last one being somewhat unexpected. While we can check these out from the distance matrix, there is another widget in Orange called Neighbors that we can use. Here is the workflow:

Notice that the Neighbors widget receives two inputs: all the nutritional information from the Datasets widget, and the foods we selected from the Data Table. We set up this workflow so that we first connected Datasets and Neighbors to let Orange know that this is where all the data is coming from. We also instructed Neighbors to output only the three closest data instances.
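Under the hood, finding nearest neighbors is just sorting distances. Here is a small Python sketch with a hypothetical normalized feature matrix (the values are invented for illustration, not read from the actual data set):

```python
import numpy as np

# Hypothetical normalized feature matrix: each row is a food item
names = ["asparagus", "green beans", "summer squash", "peach", "cheddar"]
X = np.array([
    [0.05, 0.10],
    [0.07, 0.09],
    [0.06, 0.13],
    [0.15, 0.05],
    [0.95, 0.90],
])

def neighbors(query_idx, k=3):
    """Return the k items closest to the query, excluding the query itself."""
    d = np.sqrt(((X - X[query_idx]) ** 2).sum(axis=1))  # Euclidean distances
    order = np.argsort(d)                               # sort by distance
    return [names[i] for i in order if i != query_idx][:k]

print(neighbors(0))  # the three items closest to asparagus
```

This is exactly the contract of the Neighbors widget: one input holds all the data, the other holds the selected reference items, and the output is the closest data instances.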

Construct the same workflow to answer the following questions. You can also play with the layout of your screen and the many windows Orange creates, one for each widget, using Orange's Window menu. For example, there is a useful shortcut to bring all widget windows to the front.

Nutritionally, which is the most similar food to the grapefruit? (1pt.)

You have 3 attempts.

Nutritionally, which is the most similar food to the lobster? (1pt.)

You have 3 attempts.

Nutritionally, which food item is not in the top three most similar items to plums? (1pt.)

You have 3 attempts.

Works great, right? With just a few nutritional characteristics describing the food, we can actually find similar foods that make sense. The last workflow was also interesting, as it combined several widgets to create a nearest neighbor browser that reacted instantly to any change in the data selection. Visual programming and visual analytics in action. And we have only just begun our data science education!

Chapter 3: Clustering

... where we learn about our first machine learning approach, a hierarchical clustering, and use it to find interesting groups of foods and countries.

The nearest neighbors we find also suggest that there are likely groups of foods in our dietary information data. Instead of just nearest neighbors, would it be possible to find the entire clustering structure? And display it in a nice visualization? There is such an algorithm, and the result is a visualization called a dendrogram. Watch the videos below to learn about it. The first video talks about measuring distances (and with these, similarities) between clusters.

The second video introduces the visual representation of the clustering result, called a dendrogram.

And the final video in our hierarchical clustering series discusses something we already know intuitively and have used before: estimating distance in multidimensional data.
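For the curious, hierarchical clustering with Ward linkage, the default in Orange's Hierarchical Clustering widget, can also be run in a few lines of Python using scipy. The data below is made up so that two clusters exist by construction:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up normalized food profiles: two tight groups by construction
X = np.array([
    [0.10, 0.20], [0.12, 0.22], [0.09, 0.18],   # group one
    [0.80, 0.90], [0.82, 0.88], [0.79, 0.91],   # group two
])

# Ward linkage merges the pair of clusters that least increases
# the within-cluster variance at each step
Z = linkage(X, method="ward")

# Cut the tree into two clusters; in Orange you would instead drag
# the cutoff line across the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first three items share one label, the last three the other
```

The linkage matrix `Z` records the whole merge history, which is exactly what the dendrogram visualizes: each merge is one horizontal join, drawn at a height equal to the distance between the merged clusters.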

Here we will work again with the food information data from our previous chapter. We start with a simple question, just to remind ourselves that the data is multidimensional.

Our nutritional data describes foods with numerical features that place each food in a multidimensional space. What is the dimensionality of that space? (1pt.)

You have 3 attempts.

Now use hierarchical clustering and the resulting dendrogram to answer the following questions. Keep the widget settings at their defaults, that is, the Euclidean (normalized) distance metric in Distances and Ward linkage in Hierarchical Clustering.

Which other foods are in the cluster with watermelon? (1pt.)

You have 3 attempts.

Are all fish in one big cluster? (1pt.)

You have 2 attempts.

Are all fruits in one big cluster? (1pt.)

You have 2 attempts.

Which food clusters with mushrooms? (1pt.)

You have 2 attempts.

Chapter 4: Explanations

... where we will find out that it is not only important to develop models, but also to understand them.

Nearest neighbors and clustering on our food nutrient data set produced results that were intuitive and made sense. But to better understand what was happening, we would need to know why the clustering grouped certain foods together. That is, what is common to the items in a selected group? For example, what makes salmon, catfish, and tilapia different from everything else? What features define these groups, and how? Is the cluster of fish specific in terms of fiber, cholesterol, or potassium content? Actually, it is none of these :).

We need to find ways to characterize clusters. Here is a video that shows how.

Use workflows that combine hierarchical clustering and box plots to explain the food clusters found in the nutritional data. An example of such a workflow is shown below. Make sure to send all the data from the Hierarchical Clustering widget to the Box Plot, and to choose "Selected" for the subgrouping in the Box Plot.
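The idea behind such explanations is simple: a cluster is characterized by the features whose values inside the cluster differ most from the rest of the data. Below is a minimal Python sketch with invented nutrient values (hypothetical, not the course data set):

```python
import numpy as np

# Hypothetical profiles: rows are foods, columns are (protein g, fiber g);
# all values are invented for illustration
names = ["salmon", "trout", "tilapia", "radish", "cucumber", "tomato"]
X = np.array([
    [20.0, 0.0], [19.5, 0.0], [21.0, 0.0],   # a "fish" cluster
    [0.7, 1.6], [0.6, 0.5], [0.9, 1.2],      # a "vegetable" cluster
])
cluster = np.array([0, 0, 0, 1, 1, 1])  # labels read off a dendrogram cut

# A feature characterizes a cluster when its mean inside the cluster differs
# clearly from its mean outside -- the same contrast a Box Plot with
# subgrouping set to "Selected" shows visually
means = {}
for feat, col in zip(["protein", "fiber"], X.T):
    means[feat] = (col[cluster == 0].mean(), col[cluster == 1].mean())
    print(f"{feat}: fish cluster mean={means[feat][0]:.2f}, "
          f"vegetable cluster mean={means[feat][1]:.2f}")
```

In this toy example, the fish cluster stands out by its high protein and low fiber, which is the kind of characterization the Box Plot lets you spot at a glance.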

Using a workflow similar to the one above, try answering the following questions.

What characteristic best describes the cluster containing rainbow trout, salmon, and tilapia (a group of six different fish species)? (1pt.)

You have 2 attempts.

What characteristic best describes the cluster containing radishes, cucumber, tomato, asparagus, and summer squash? (1pt.)

You have 2 attempts.

Nice, right? The widgets in Orange, in our case Hierarchical Clustering and Box Plot, can be arranged to form an interactive browser of clusters with explanations, as shown in the screenshot below.

We have come a long way from the simple data exploration we started with at the beginning of our course. One more chapter and we are done with our introductory course. While hierarchical clustering is one of the most popular approaches to visually discovering groups of data, it is, of course, only one of many different techniques for finding clusters. In the next chapter, we will look at a slightly different set of approaches that can also lead to beautiful visualizations.