Classification Trees
Introduction to Orange and Machine Learning
This book includes questions. The number of attempts is unlimited to encourage you to try to find an answer yourself. Afterwards (or after giving up), read the explanation because it usually contains more than just the answer.
The material is offered under the Creative Commons CC BY-NC-ND licence.
Chapter 1: Orange and Widgets
This part has two goals: one is to learn more about the algorithm for construction of decision trees, which we will introduce a bit more formally. The other is to finally put your hands on Orange and learn how to use it.
We use the algorithm for construction of classification trees on a similar data set, the Zoo data set. It is one of the standard data sets used for demonstrating machine learning algorithms. It contains 101 instances (rows, examples) and 17 variables (columns). The first 16 are features (attributes) of animals, and the last is the class label (target, outcome). The class label has 7 distinct values: mammal, bird, fish, insect, invertebrate, amphibian, and reptile. The features are binary, except for the number of legs. The data was prepared by computer scientists from Carnegie Mellon University, which explains why an attribute value is "0" if an animal does not possess a feature and "1" if it does.
When we open Orange and close the welcome window, we find ourselves in front of a (mostly) empty window. On the left side is a drawer with various building blocks. We place those blocks on the right side, the canvas, and connect them together.
If we feel that the drawer is taking up too much space, we can hide or shrink it: press the (barely visible) double arrow on the top left above the drawers.
Right-clicking on the canvas brings up a menu with a few widgets. Start typing a name, part of a name or a keyword and the selection of widgets narrows down. Here we need the File widget, so we type "file" and when we see the widget we want, we click on it.
A drop-down at the top of the widget shows the currently loaded data. Click it to see the list of most recently used data sets. Initially it contains the documentation data sets.
Double-click the File widget icon to open the widget. We can now load the Zoo data set: it is distributed as a documentation data set for Orange, so we can click the Browse documentation data sets button at the bottom and select the Zoo data set.
The data is now loaded. And? The File widget shows us a short description and lists the column names ... but it doesn't show the data.
Each widget serves a particular function. The File widget loads the data, but it doesn't show it. So we add a Data Table widget. We can add it like the File widget. Right-click somewhere on the canvas, preferably just to the right of File, type the start of the name and select the widget from the menu.
Now we have to connect the two widgets. Widgets have "antennas"; most have two, some - including the File widget - have only one. The antennas are for data transmission: we use the mouse to connect the output (and only) antenna of File to the input of Data Table.
Double-click the Data Table widget and it will open the table with the data that it got from the File widget. Whatever is loaded by File is shown in Data Table.
We can sort the data by clicking on the column names. There are 101 animals, but only one whose name starts with 'n'. Which?
Another way to add and connect a widget is to, ekhm, connect it before adding. :) Just drag a line from File output to an empty space. Orange will show you a list of widgets that can accept the data. You can then select the widget you want to add. Let us add a Box Plot widget.
Box plot (also called a whisker plot) shows statistical properties of numeric variables. Data about animals is categorical, and Orange's Box plot shows it as a bar plot. Initially it shows the distribution of the first attribute in the data (in this case the hair attribute), and each bar corresponds to a category. We see that only two kinds of animals have hair: mammals and insects. Half of insects are hairy, and only a few mammals are not.
To discover which, we can click on the bar. Let us do the mammals first: click on the short, blue part of the bar for mammals and ... Not much happens. The bar is selected but we don't see anything new. Sure: this widget will not show the data - it only shows bars. Luckily, we already know a widget that shows a table, therefore: connect the Box Plot to a new Data Table and open it.
Which two mammals included in this data do not have hair?
How many insects in this data are hairy?
What about the number 101 following the number 4? The Box plot outputs two data tables, but the other one doesn't concern us yet.
Warning: spoiler! There is a more convenient way to answer this last question. See the bottom row of the widget. It shows three icons (feel free to click them, they aren't dangerous), followed by information on the number of input and output data instances. The widget receives 101 data instances (from the File widget) and outputs 4 -- which is the answer to the question.
But there's more! This part of the widget also acts as buttons! Click on it and get a preview of the output table(s). We could use this to answer the first question without adding a new Data Table widget.
Let us forget about the outputs for a while. And about hair, too.
Fish are well known for their attachment to water, but there are mammals that prefer an aquatic lifestyle, too. Not only mammals. In fact, only one animal group (in this data) contains no aquatic animals. Which?
In the initial setup, the Box plot widget shows the proportion of each animal within a group that has a certain property, say hair or being aquatic. All amphibians and fish are aquatic, and so is a weak majority of invertebrates. Hovering with a mouse over the bar for aquatic invertebrates reveals that their proportion is 60 %. (Remember this trick, you'll need it soon!)
As a literal side note: if the intent of a visualization is to compare proportions, never use pie charts. Putting two bars parallel to each other allows us to compare proportions at a glance, while pie charts force the viewer to jump from one to another.
By being drawn in parallel, the bars also allow us to compare proportions of aquatic animals in each group: after the insects, the least water-loving animals are mammals, followed by reptiles and then birds.
We learned: widgets have just one job -- one may load the data, another show a table, and another show bars. But they do their job well, and they are versatile. Uncheck "Stretch bars" and the same widget will use the bars to show something else.
How many mammals are in this data?
How many aquatic mammals are in this data?
We could have answered this question with what we knew before: click on that part of the bar. The point of the question is that we now have another way to find this number: hovering over the bar no longer shows the proportion but the absolute number of animals. This is because the visualization is no longer about proportions but about numbers. We can no longer say whether the proportion of aquatic birds is larger or smaller than that of aquatic reptiles, but we see that there are many more aquatic birds than reptiles -- at least in this data.
We can also reverse the roles of variables: select "type" as the variable and "aquatic" as the value. Now we see that we have 65 non-aquatic animals and 36 aquatic animals. We can stretch bars to compare the proportions or keep them as they are to compare the numbers -- at any rate, we see that this feature distinguishes insects from amphibians and fish, but otherwise it is not very informative.
This will immediately lead us to our main topic, classification trees, but let us stop for a while and summarize in two points what we have learned in this introduction. First, Orange is used by connecting widgets that send data to each other. This is a very powerful idea and what we've seen so far is just the beginning. Second, each widget by itself is a versatile tool that we can twist and turn to extract different information from the data.
Chapter 2: Splitting hairs and feathers
Knowing whether an animal is aquatic or not is not a very informative feature if we want to know its type. Which feature is?
The way our box plot is set now, type is the variable, and we would like to find subgroups (based on the values of features) that are as different as possible. Try airborne, for example. (This works best with stretched bars.)
Not bad, not bad at all: if it flies it's a bird or an insect, otherwise it is not. Well, almost. Is there a better feature?
Which of the following features is the most informative?
The Box plot can save us the work by automatically ordering the features: check "Order by relevance to variable".
The feature at the top is a bit unfortunate: the number of legs. Features with many different values tend to be scored higher; this is a known effect, but we shall not dwell on it. Just ignore the legs.
The following features are more interesting.
Having feathers is exclusive to birds. Giving milk is exclusive to ...?
On a(n even more) serious note, would you say having a backbone is an informative feature?
Let us start formulating a set of rules -- which will take a form of a tree -- with having milk. This separates the largest group of animals -- mammals -- from the rest. What now?
Before we continue, let us reveal where we are going. In a classroom activity we are adding individual animals to the data, which forces the tree to adapt. This is not how the algorithm actually works. The algorithm is not incremental (there are such implementations, but they are used for specific circumstances), but gets all the data at once and works by splitting the data into ever smaller subgroups. Here we shall use a box plot to simulate it manually. Later we shall describe it a bit more formally.
There are simpler ways to select data subsets in Orange, but let's stick to what we know.
So: our tree so far consists of a root node in which we split the animals according to whether they give milk (mammals) or not (everybody else). We now need to find the most informative feature for the latter. One way to do so is to connect the Box plot to ... another Box plot! In the first one, we need to select all animals that do not give milk. Click on one of them and then Ctrl-click (or Cmd-click on macOS) on the others. The second Box plot will contain the selected animals only.
We set variable in the second Box plot to type and order subgroups by relevance to the variable.
Ignoring the number of legs, which of the following features is the most informative for the animals that do not give milk?
Let us split the animals into those with and without feathers. This separates birds from the rest. We could add another Box plot, or we can take a shortcut and just unselect birds in the first Box plot by Ctrl-clicking (Cmd-clicking) them.
Just to check that you are still with us: after excluding mammals and birds, which feature follows legs and backbone?
Let us continue with backbone. It splits the data into two subgroups: insects and invertebrates on one side and everybody else on the other. Unlike before, when we were done with one group and only had to deal with the rest of the data, we have to continue the procedure on both sides -- the animals (that don't give milk and don't have feathers) with and without backbone. In the former group, which includes amphibians, reptiles and fish, having fins would separate the fish, while amphibians and reptiles can mostly be distinguished by whether they are aquatic or not.
But let us not go there, it could get messy or boring or, most likely, both. We have computers for this.
Chapter 3: Classification trees
The widget that constructs classification trees is called Tree. We shall connect it to the File widget. Given the data, the widget outputs a tree. Opening it shows nothing interesting for now, so we shall just connect its output to a widget that shows the tree: Tree Viewer.
Note something new happening here: previously, widgets sent data to each other. The Tree widget does not send (and the Tree Viewer does not expect) data but a tree. The output of Tree cannot be connected to a widget that expects data, and the input of the Tree Viewer cannot be connected to the output of File or of Box Plot.
How does the Tree widget construct a tree? It uses the algorithm we have used to manually construct the tree. It looks for the best feature to split the data and then splits it. It then looks for the best feature to split the data in each of the two subgroups and splits them again. It continues the splitting until it cannot or should not split the data any further.
As with any algorithm implemented on a machine, we cannot be vague. What is "the best feature", and what do we mean by "cannot or should not split the data any further"?
Well, for that matter, we haven't been very curious about how the Box Plot orders by relevance, have we? For those who know some statistics: the box plot orders variables by their p-values, as computed by a t-test or ANOVA. For others: it doesn't matter, because trees use other measures that turned out to work better for them. They are all based on observing some kind of impurity: our goal is to split the data into subgroups that eventually contain just instances of one class. The impurity measures how much the classes are mixed in a subgroup.
The simplest definition of impurity is the Gini index, named after the Italian sociologist Corrado Gini. In a group of instances, we pick a random pair. What is the probability that the two belong to different classes? The higher the probability, the greater the "impurity" of the group. The best variable is thus the one that results in subgroups with the lowest average impurity.
You may have heard of the Gini coefficient, which is used to measure inequality. Same Gini and a similar idea, except that the coefficient measures inequality over a numeric variable, such as wealth, and not over discrete categories, such as animal types.
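Readers who like to see formulas at work can compute this in a few lines of plain Python (an illustrative sketch, not Orange's code; the random pair here is drawn with replacement, which gives the standard formula 1 - sum of squared class proportions):

```python
from collections import Counter

def gini(labels):
    """Probability that two randomly drawn instances (with replacement)
    belong to different classes: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure group has impurity 0; a 50:50 mix of two classes has impurity 0.5.
print(gini(["mammal"] * 10))                 # 0.0
print(gini(["mammal"] * 5 + ["bird"] * 5))   # 0.5
```

The best feature is then the one whose subgroups have the lowest average impurity, weighted by subgroup size.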
Another popular measure of attribute quality is information gain. It is based on entropy, a measure of uncertainty. Uncertainty is the highest when all events are equally probable, lower when some events are more probable than others, and the lowest when one event is certain to happen. The entropy of a coin toss is 1 (it may be heads or tails) and the entropy of a Slovenian train coming on time is close to 0.
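For readers who want to see the numbers, here is an illustrative computation in plain Python (the base-2 logarithm gives the result in bits, matching the coin-toss example):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: highest when classes are evenly mixed,
    zero when only one class is present."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["heads", "tails"]))  # 1.0 -- a fair coin toss
```

Information gain is the drop in entropy from the parent group to the (weighted average of the) subgroups produced by a split.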
Both measures -- and most of the dozen or so others -- exhibit the phenomenon that we mentioned when ordering features: they tend to assign higher scores to features with more values.
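A toy computation shows why (plain Python with made-up labels, not Orange's scoring code): a feature with a distinct value for every instance splits the data into perfectly pure one-instance subgroups, so by average impurity it looks just as good as a genuinely informative feature, even though it is useless for prediction.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def avg_impurity(values, labels):
    """Size-weighted average impurity of subgroups formed by a feature."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    return sum(len(g) / n * gini(g) for g in groups.values())

labels = ["mammal", "mammal", "bird", "bird"]
binary = [1, 1, 0, 0]   # a genuinely informative binary feature
unique = [1, 2, 3, 4]   # a useless "ID"-like feature, one value per row

print(avg_impurity(binary, labels))  # 0.0
print(avg_impurity(unique, labels))  # 0.0 -- scores just as well!
```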
This settles the question of variable selection. What about stopping?
One thing is obvious: there's no need to continue when the subgroup is pure. We may be able to tolerate a certain proportion, like 5 % of instances from another class. We may also limit the depth of the tree, or the number of instances in a subgroup. These are all parameters of the algorithm, which we can see and set in the Tree widget. We can also tell it to induce a binary tree: this forces it to split the number of legs into two subsets of values, and prevents preferring multi-valued features over the binary ones. In the above tree, the number of legs is used in this way for insects and invertebrates.
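Putting the pieces together -- impurity scoring, splitting, and stopping -- a toy version of such an algorithm can be sketched in plain Python. This is only an illustration with made-up animals, not Orange's implementation; the `min_instances` parameter stands in for the stopping criteria discussed above.

```python
from collections import Counter

def gini(rows, target):
    n = len(rows)
    counts = Counter(row[target] for row in rows)
    return 1 - sum((c / n) ** 2 for c in counts.values())

def best_feature(rows, features, target):
    """Feature whose split gives the lowest weighted average impurity."""
    def avg_impurity(feature):
        groups = {}
        for row in rows:
            groups.setdefault(row[feature], []).append(row)
        return sum(len(g) / len(rows) * gini(g, target)
                   for g in groups.values())
    return min(features, key=avg_impurity)

def build_tree(rows, features, target, min_instances=2):
    classes = Counter(row[target] for row in rows)
    majority = classes.most_common(1)[0][0]
    # Stop when the subgroup is pure, too small, or out of features.
    if len(classes) == 1 or len(rows) < min_instances or not features:
        return majority
    feature = best_feature(rows, features, target)
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row)
    remaining = [f for f in features if f != feature]
    return {feature: {value: build_tree(g, remaining, target, min_instances)
                      for value, g in groups.items()}}

# Tiny zoo-like toy data, made up for illustration.
animals = [
    {"milk": 1, "feathers": 0, "type": "mammal"},
    {"milk": 1, "feathers": 0, "type": "mammal"},
    {"milk": 0, "feathers": 1, "type": "bird"},
    {"milk": 0, "feathers": 1, "type": "bird"},
    {"milk": 0, "feathers": 0, "type": "fish"},
]
print(build_tree(animals, ["milk", "feathers"], "type"))
```

On this toy data the root splits on milk, separating mammals, and the rest splits on feathers -- the same shape of tree we built by hand with the box plots.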
Quadrilaterals
Classroom simulation
This task is based on lesson Quadrilaterals from Pumice.
In the past, we have printed out the quadrilaterals and used Google forms for entering the data. Just before this workshop we have prepared a web-based app, but it is not well tested yet. We apologize in advance for any problems, and we have printed the quadrilaterals as a fallback plan.
In this first part, the group should use a single device (computer or tablet).
The URL is created by the teacher for a particular lesson.
- Enter the provided URL in the browser. Set the name of the group.
- You'll be shown a number of quadrilaterals. Carefully check their properties and categorize them. Remember that others will see your mistakes. No pressure. :)
- When done, proceed to the next task (even if other groups have not finished yet -- we will use the results of this task later).
Tree construction with natural intelligence ...
In the morning, we constructed a tree for distinguishing between different kinds of animals. Quadrilaterals can be distinguished in a similar way. Individually or as a group, draw a tree for classification of quadrilaterals based on the same set of properties that we used above:
- all sides have the same length,
- two pairs of equal sides,
- one pair of parallel sides,
- two pairs of parallel sides,
- all angles are right.
This time, do not use any data. Construct a tree that matches your (hopefully correct :) knowledge of quadrilaterals.
The message of this part is that classification trees are "natural": they were invented by early researchers in AI and were fashionable for a long time because they mimic our way of thinking about some problems. We could (and did) use such trees long before the age of computers and AI.
... and with artificial intelligence
Now for individual work.
- Open Orange. If you haven't done so already, install add-ons that we will need for the workshop. Go to Options > Add-ons, and select "Educational", "Geo" and "Image Analytics". Click OK and let Orange restart.
- Download the data about quadrilaterals and their images, and unzip it.
- In Orange, use the File widget to load the data from quadrilaterals-correct.xlsx. This file refers to the same shapes that you described and classified in the previous task, and it is (most probably) correct. Set shape (at the bottom of the list) to class (not feature). This will tell the Tree widget that this is the target variable that is to be predicted. Click Apply.
- Construct a classification tree. In the Tree widget, uncheck Min. number of instances in leaves and Do not split subsets smaller than. With this, the algorithm will not prematurely stop splitting the data.
Compare the tree with the one you have drawn. Are they the same? If not, which one is correct?
The message now is that computers can "learn" from data by abstracting it in such models. While the data is about some concrete shapes, the model is abstract and can be used to classify new shapes. This is the essence of machine learning.
... analysis of our own data
For this part, you will have to wait for all groups to finish the first task, classification of quadrilaterals.
- Use the provided URL to download the data. Put this file into the same folder that contains the files from the zip. (This step is necessary to simplify loading of images. We could also load the file into Orange directly from the URL, but Orange wouldn't know where to look for the images.)
- In the File widget, load this file instead of the one with the correct data. (Don't change the workflow, just choose a different file in the File widget.) What does the tree look like? Correct or not?
If incorrect, it must be due to wrong data. Can you find the reason? Inconsistencies may come from two sources: either wrong categorization of a shape or wrong description.
Connect an Image Viewer (if it is missing, you have not installed the add-on) to the Tree Viewer. Click on suspicious leaves and look at the images. (The first setting in the widget, "Image file name attribute", must be set to the name of the column that contains the file names of images. For this data, it must be set to "image". If this is not done automatically, change it manually.)
If the leaf says "square" and the image is a rectangle, the categorization is wrong. Similarly, the leaf may say that, for instance, two out of three shapes are kites, but one is a rhombus; observing the images may show that they are all kites.
On the other hand, there may be an error in the description, that is, in the properties assigned to the shape. Connect a Data Table widget to the Tree widget or the Image Viewer and try to find inconsistencies in descriptions of shapes. The shape may have two pairs of parallel lines, while the description says that it has only one.
... and of students' data
If you and the other groups have been too diligent, you didn't have much to do in the previous part. Now load the table recording the answers from an actual classroom: instead of the table with your data, load quadrilaterals-classroom.xlsx and observe the resulting tree.
The following questions will lead you through the analysis of the tree. Read and answer them in sequence; many questions contain spoilers for the previous question. Also, do not skip reading the explanations!
The first split in the tree is based on the number of parallel sides. True or false?
The first split separated two shapes from the rest. (They are most probably shown on the right-hand side from the root; the following questions will assume this is the case. Flip the sides if your tree is different for any reason.) Which ones?
The right-hand branch is split into two leaves that are pure, that is, one contains just shapes classified as squares and the other contains just rectangles. But do they contain *all* instances that the children recognized as one of those shapes? Or do any squares or rectangles appear (perhaps due to wrong data) elsewhere in the tree?
This classification was made by children, so it doesn't mean that it contains all shapes that are in fact squares and rectangles. What it *does* tell us is that when children described a shape as having all right angles, they classified it as a square or a rectangle. Do these two leaves contain any shapes that are actually *not* rectangles or squares? (Hint: the data also contains a column with correct shape.)
There is a nice way to explore this: the Image Viewer, which allows us to see all images that the students' descriptions place in a particular leaf. Connect the Image Viewer to the Tree Viewer and click on nodes within the branch with squares and rectangles. You will see that one of the images is indeed a kite. Connecting the Image Viewer to a Data Table allows us to see the description of the shape. Students have described a kite as having all right angles and thus misclassified it as a square.
Note that the correct shape column is placed among "meta attributes". This tells algorithms like Tree not to include it in the modelling.
Is there a way to identify all misclassified shapes, see them in the Image Viewer, and explore their properties in the Data Table? You bet.
Connect a Pivot Table to File. Set Rows to Shape and Columns to Correct shape. Can you guess what the table shows?
The number in each cell shows how many instances with a particular correct shape (the column) were classified as a particular shape (the row). For instance, the cell in the row "square" and the column "rectangle" shows how many rectangles were misclassified as squares by the children.
The table only counts instances for which both values, shape and correct shape, are defined. The total number of instances is 42, but the sum of all cells is 35. This means that seven instances are missing from the table. These are the instances that children did not classify.
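The counting the Pivot Table performs can be mimicked in a few lines of plain Python (an illustrative sketch with made-up rows; `None` marks a shape the children did not classify):

```python
from collections import Counter

# Made-up (shape, correct_shape) pairs; None marks an unclassified instance.
answers = [
    ("square", "square"), ("square", "rectangle"),
    ("rectangle", "rectangle"), (None, "kite"),
]

# Count only pairs where both values are defined, as the Pivot Table does.
table = Counter((shape, correct) for shape, correct in answers
                if shape is not None and correct is not None)

print(table[("square", "rectangle")])  # 1 rectangle classified as a square
```

The gap between the total number of rows and the sum of the cells is exactly the number of instances with an undefined value, like the seven unclassified shapes above.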
Based on the bottom line of the Pivot table, we can be sure that they sometimes missed at least three shapes. Which?
Which was the most common mistake?
Click the cell in the Pivot table that corresponds to this error. Connect the Pivot table to the Image Viewer and you'll be able to see these four shapes! You can do the same for other mistakes.
The Pivot Table has multiple outputs. The default contains the data shown in the widget. We would like to have the filtered (selected) data, so we have to rewire the connection by double clicking it and selecting the other output.
A temporary note: due to a recently discovered bug, the Image Viewer may sometimes show an error when it first receives the data. Should this happen, change Image file name attribute (top left) to "correct shape" and then back to "image".
Just to force you to try it: the four misclassified trapezoids include some hedgehogs, fish and frogs. Which of them is the most common? :)
Interesting fact: the children did not use the same images. These four trapezoids are now drawn so that the parallel lines run horizontally. Previously, they were rotated, and it was indeed too easy to overlook that the two sides are parallel -- not only for children but also for teachers. We thus decided to simplify the task by making them horizontal.
What about them hedgehogs?
The reasons for including the animals are threefold.
They represent redundant information. They are distributed so that the same number of each animal appears with each shape. Carrying no information, they thus shouldn't appear in any tree - and they indeed don't in the tree induced from the correct data. The tree induction algorithm is not told that the animals are irrelevant, but it still doesn't include them in the model because there is no situation in which they would be of any use.
Not so for the tree induced from the data prepared by the children. This is because of the mistakes they made. The usual scenario would be that they assigned the same properties but different categories to two shapes, say a square and a kite. If the square contained a hedgehog but the kite did not, the algorithm is (mis)led to believe that the presence of a hedgehog is a good predictor of a square...
The third and the real reason is that they are cute.