Image Analytics + Clustering + Classification

Group work using images of typical houses in different Slovenian regions

This part of the workshop will again be (loosely) based on a lesson plan, https://pumice.si/en/activities/country-development/, followed by a more formal description of the algorithm.

This part will connect most of the topics we covered: classification, clustering, image analytics ... and then some. We begin with a short lesson on Slovenian traditional architecture.

Slovenian geography and climate are diverse despite the small size of the country. This is reflected in the architecture of traditional houses. The most distinct examples are Pannonian, Alpine and Karstian houses.

The Pannonian region (north-east of the country) is flat, with fertile soil, plenty of water and sun, and mild winters. The architecture of traditional houses reflects these conditions: a typical Pannonian house is a one-story building covered with a straw roof. They are often - but not always - built in the shape of the letter L.

The Alpine region (north-west of the country) is mountainous, with a lot of snow (we are talking about the 19th, not the 21st century, of course). The climate is alpine, with cold winters and mild summers. A typical Alpine house is a two-story building with a steep roof. The roof is often covered with shingles made of wood or schist (a kind of greyish stone, cut into rhomboids). The walls are made of wood or stone.

The Karst (south-west) has a lot of limestone and little soil, not much water, and a lot of wind, which makes summers hot and dry and winters harsh even without snow. A Karstian house is made of stone, with a heavy roof (sometimes weighted with stones to prevent the wind from blowing it away) and small windows. Karstian houses look rough, sturdy and durable.

Manual classification of houses

This part is based on a school activity, which has not been published yet.

  1. Split into groups of, preferably, five people.

  2. Each group will get a set of cards with images of houses. Classify them into four groups: Pannonian, Alpine, Karstian and other. (You may want to include a Slovenian in the group, but it may be more fun if you don't. Perhaps (s)he would provide too many spoilers.)

  3. If there are five people, you can split the work so that four members are responsible for the four types of houses. Each of them takes one group and orders the cards by their numbers. The fifth member opens the provided URL on a computer or a tablet and enters the group name. Then (s)he calls out the numbers, and the person who holds the card with that particular number tells the name of the region.

    If the group has fewer (or more) members, organize the work as you wish.

    This task could obviously be done without the printed cards, but it is more fun this way. Also, seeing all the images at once may make the task easier because it helps us spot the differences.

Check the results

From here on, groups are dissolved. We'll do this individually.

  1. In Orange, install the Geo and Image Analytics add-ons if you haven't already. (You can do this by clicking Options -> Add-ons, selecting Geo and Image Analytics, and installing them.)

  2. Download the zip file with images of houses and unzip it into some folder.

  3. Download the data from the provided link (it will include a two-word identifier created when we begin the activity). The downloaded data will contain all answers by your group and by others (so far), and also the correct answers (column house type).

    Put the file into the same folder as images. (This is necessary so that you can see the images. Orange could pull the data directly from the server, but there is no good way to tell it where to look for images.)

  4. Open Orange and load the data. Check it in the Data Table to see if it looks OK.

    A quick check: verify that all columns are marked as meta (not feature), except for house type, which must be the target. This will prevent clustering and classification from using your answers or the geographical positions of the houses, and force them to use only the images. (If you prefer scripting, the sketch after this list shows how to check this from Python.)

  5. Connect the File widget to the Geo Map widget. Latitude and longitude must be set to the corresponding columns (they have probably been set automatically, but you should check). For Color, you can choose the name of your group, of another group (for bragging purposes, if possible) or the correct answer (house type).

    Note that some houses may be located in the same or very close places. To drag them apart, add some Jittering. In particular, you may have misclassified a house in Novo mesto. You may want to increase the opacity of the dots, too.

    [Screenshot: Geo Map with dots colored by house type, grouped by region]

    So, how were you doing? Are the colors of your dots grouped by regions, like above, or are they all over the place?

  6. Your task now is to play with the data. For instance, can you find a way to see the images of individual houses that you missed? (Hint: Geo Map is clickable, like everything else in Orange. What happens if you select one or more dots on the map?)

  7. Remember something else that we have been doing while learning about classification. Can you remember a widget with which you can see how many houses of each kind you misclassified - and into which regions you have put them? (Hint: it is a widget that shows a kind of table, but not the Data Table widget.)

  8. Pick one or more houses that you misclassified. Which groups classified it/them correctly? (Hint: use the output Filtered data, not the default output.)
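If you prefer to double-check the data from a script, here is a minimal sketch using Orange's Python API. The file name houses.xlsx is a placeholder for whatever your downloaded file is called; run the script from the folder that contains it (and the images).

```python
from Orange.data import Table

# "houses.xlsx" is a placeholder: use the name of the file you downloaded,
# saved in the same folder as the unzipped images
data = Table("houses.xlsx")

print("target:", data.domain.class_var)     # should be house type
print("features:", data.domain.attributes)  # should be empty at this point
print("metas:", data.domain.metas)          # group answers, coordinates, image names ...
```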

Clustering

The initial version of this activity was meant to teach clustering: students would be given a set of cards and their task would be to group them according to their appearance. We have only tested it once so far, with children who already knew the differences, so we turned their manual task into classification. On the computer, we showed both clustering and classification. Let us begin with the former.

Take the data directly from the File widget. Use Image Embedder to construct vector representations of the images. Compute Distances; cosine distance is the appropriate distance type for embeddings. Finish the workflow with Hierarchical Clustering. In Hierarchical Clustering, set the annotation to house type - the actual type of the house.
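For those who like to peek behind the widgets, here is a rough scripting equivalent of this part of the workflow. It is only a sketch: it assumes the 2048-dimensional embeddings are already stored in a NumPy file (embeddings.npy is a hypothetical name), and it uses average linkage, one of the options the widget offers.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# one row per house, one column per embedding dimension (from the Image Embedder)
X = np.load("embeddings.npy")  # hypothetical file with precomputed embeddings

d = pdist(X, metric="cosine")     # pairwise cosine distances (Distances widget)
Z = linkage(d, method="average")  # build the dendrogram (Hierarchical Clustering)
clusters = fcluster(Z, t=4, criterion="maxclust")  # cut it into four clusters
print(clusters)
```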

If you split the data into four clusters, three of them contain houses from three groups (with perhaps a few houses that don't belong there), and the fourth cluster is a mixture of two groups. Which group doesn't have its own cluster?

There is a Karstian house in the Alpine cluster. From which town?

There are many ways to find the answer. One is to connect a Geo Map to the Hierarchical Clustering widget, and a Data Table to the Geo Map. In the dendrogram, we cannot click on individual instances, but we can select a pair or the entire cluster of alpine houses. We then observe the houses in the Geo Map: it's the red dot (red points are Karstian houses) in the south. We click this house and find its place in the Data Table.

[Screenshot: the selected Alpine cluster shown in the Geo Map and the Data Table]

Nicer still: connect an Image Viewer and set the Title attribute to Place. With this, you will see the house (or houses) that you select in the map, as well as their names (or regions, or your classification, if you change the Title attribute).

[Screenshot: Image Viewer showing the selected house]

The mistake is interesting: with the wooden fence on the first floor, one could easily mistake it for an Alpine house. But the roof is not steep enough, and the pillars as well as the stone stairs are a giveaway. Still, it is a reasonable mistake for an algorithm that has never been told about the nuances that even most Slovenians don't know.

Check: have you classified this house correctly? What about the other groups?

The Pannonian house misclassified as an Alpine house is also interesting. I misclassified it, too: I overlooked the straw roof, and with its wooden upper floor it looked like a somewhat comfier Alpine house. Situated in Rogatec, it is separated from the Pannonian plain by the last leg of the Karavanks (fun fact for Slovenians: Boč and Donačka gora are part of the Karavanks). Yet the house is certainly far from the Alps, and the straw roof is a dead giveaway.

Classification

Let us see how much we remember about classification.

Train logistic regression and a classification tree on this data. (For this task, configure the tree as we did for the zoo and quadrilaterals: disable all pruning options.) Then use the Predictions widget to see how well they classify the data. Both algorithms have a comparable classification accuracy. In which interval is it?

The above result is much less amazing than it seems. The algorithm is classifying houses that it has seen before: if you are shown an image of a house and you are told that it is Alpine, and if you have the memory of a computer, you will be able to classify it correctly. The question is, how well will you do with a new house? This is what we will test now.

We shall use a widget called Test and Score. Its input is the data (same as above) and learning algorithms. Widgets like Classification Tree and Logistic Regression have two outputs (and some of them have a few more), called Model and Learner. The Model output gives a constructed model, like a tree; the Predictions widget uses the model to make and show predictions. The other output, Learner, does not give a trained model but a recipe for building a model. The Tree widget outputs a recipe for building trees (with all the settings we chose in the widget), and the Logistic Regression widget outputs a recipe for building logistic regression models.
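The same distinction exists in Orange's scripting API, where learners are callable objects that return models. A minimal sketch, using the bundled iris data just to show the two roles:

```python
from Orange.data import Table
from Orange.classification import LogisticRegressionLearner

data = Table("iris")                   # any classification data will do

learner = LogisticRegressionLearner()  # the recipe (the Learner output)
model = learner(data)                  # applying the recipe to data builds a model
predictions = model(data)              # the model (the Model output) predicts classes
print(predictions[:5])                 # predicted class indices for the first five rows
```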

The Test and Score widget uses these recipes to first build a model and then test it. But unlike what we did in Predictions, Test and Score splits the data into a training and a testing set. The former is used to build a model and the latter to test it. The data can be split in different fashions: the widget either splits it once, with a prescribed proportion of the data going to the training set, or it splits it multiple times. The latter is called cross-validation. One form of cross-validation splits the data into 5 or 10 parts; each part is used as the testing set once, with the rest serving as the training set, and the results are averaged. An extreme form of cross-validation is called leave-one-out: each instance is used as the testing set once, with the rest as the training set. This is very slow, but it is also very accurate.

The data is not very large and we have time: let us use leave-one-out cross-validation.
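A rough scripting counterpart of Test and Score with leave-one-out, assuming a recent Orange 3 (where the evaluation classes are instantiated first and then called with the data and a list of learners); houses.xlsx is again a placeholder name:

```python
from Orange.data import Table
from Orange.classification import LogisticRegressionLearner, TreeLearner
from Orange.evaluation import LeaveOneOut, CA

data = Table("houses.xlsx")  # placeholder name, as before
learners = [LogisticRegressionLearner(), TreeLearner()]

results = LeaveOneOut()(data, learners)  # each instance is the test set exactly once
print(CA(results))                       # one classification accuracy per learner
```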

How would you comment on the results?

The problem is that with sooooo many variables at hand (each image is described by 2048 numeric variables), the tree induction algorithm will easily find some random thresholds that perfectly separate the training data. But these thresholds will not work on new data. This is called overfitting: the algorithm has learned the training data by heart, but it has not learned the underlying structure. The effect is even stronger when all checks against overfitting are turned off, as we did.

Therefore, when the model is tested on the same data it was built on, it will work perfectly (if the model has the capacity to remember the data and the data is suitable for this). That's why the tree had a perfect classification accuracy on the training data.

When it was tested on new data, it was a flop. The information about the content of the image is split across all 2048 variables, while trees, which split the data into smaller chunks at each step, run out of data well before they use all the necessary variables.

Not so for logistic regression. Logistic regression computes a combination, a weighted sum of all available variables; it can use them all. Furthermore, the image embeddings that we use are designed to be good for logistic regression: a neural network is essentially a network of logistic regressions, so it is only natural to put a logistic regression at the end of it.
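To see the overfitting in numbers from a script, compare the tree's accuracy on its own training data with its leave-one-out accuracy (same assumptions and placeholder file name as in the sketch above; as described, expect the former to be near perfect and the latter much lower):

```python
from Orange.data import Table
from Orange.classification import TreeLearner
from Orange.evaluation import TestOnTrainingData, LeaveOneOut, CA

data = Table("houses.xlsx")  # placeholder name
tree = TreeLearner()

print(CA(TestOnTrainingData()(data, [tree])))  # tested on the data it was trained on
print(CA(LeaveOneOut()(data, [tree])))         # tested on unseen instances
```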

Connect a Confusion Matrix to the Test and Score widget. In the left-hand part, choose Logistic regression. Does the widget remind you of something? Do you understand what it tells you?
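The widget essentially cross-tabulates the actual and the predicted classes stored in the evaluation results. A minimal sketch of the same computation, borrowing scikit-learn's confusion_matrix (houses.xlsx is a placeholder, as before):

```python
from Orange.data import Table
from Orange.classification import LogisticRegressionLearner
from Orange.evaluation import LeaveOneOut
from sklearn.metrics import confusion_matrix

data = Table("houses.xlsx")  # placeholder name
results = LeaveOneOut()(data, [LogisticRegressionLearner()])

# rows are actual house types, columns are predicted ones
print(confusion_matrix(results.actual, results.predicted[0]))
```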

What is the most common mistake of logistic regression?

Can you find the two misclassified houses? (Hint: Confusion Matrix is clickable, too.) They are in places whose initial letter is ...

You can click on other off-diagonal cells. Or click Select Misclassified and observe all misclassified houses. Did Logistic regression make the same mistakes as you, or have you classified these houses correctly?

Who made more mistakes? You or logistic regression?

Conclusion

First, consider the difference between your classification task and the computer's. You based yours on a set of given rules; the computer's task was to find the rules. In the lessons on classification trees and logistic regression we were able to check the model manually -- does the tree make sense? Here, images are encoded by deep neural networks, so we cannot interpret the 2048 features used by the model and, consequently, the model itself. To show that the model is "correct", we tried it on new data.

This is one of our newest lesson plans. We are still developing it, but we included it here because it nicely ties together everything we have learned. We have tested it with a single group of children, and the hour was really enjoyable for all of us. It integrated well with their curriculum, but also led to quite a deep discussion about AI.

We discussed the difference between clustering and classification. Asked which problem would be easier for the computer, grouping houses or classifying them, they mostly voted for the former, but some of the sharper students understood the difference and said that in the first case the computer would not know what to look for. This insight is amazing for 11-year-old kids.

We explained how deep neural networks make the model impossible to interpret, as discussed above.

Finally, at one point the students said that the computer knows this and that, so we challenged them: can a machine indeed know anything in any stronger sense than an English-Slovenian dictionary knows how to say cat in English? Need I stress again how amazing it is to have such a discussion with a classroom of 11-year-old kids?