
Text Mining

Lesson 3: Document Classification

These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.

The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Chapter 1: Document Classification

Aarne and Thompson were two folklorists who invented and perfected the motif-based classification system for folk tales. The system dates back to 1910 and is commonly used in comparative folkloristics. The final U in ATU stands for Uther, who was the last to update the index, in 2004.

The Grimm's tales corpus contains an attribute with the Aarne-Thompson type (ATU), an index of folk-tale motifs. Each tale has a high-level ATU type (genre) and a mid-level one (subgenre).

Could we perhaps predict the ATU type based on the content of the tale? Let us see.

First, we need a target variable. This is the feature we are trying to predict, in our case an ATU type. We also need a numerical representation of each document - something we already have from the Bag of Words.

Now we will build a predictive model. A predictive model considers the tokens (words) and predicts the target variable (ATU Topic). Every model also needs a learner, a procedure that constructs the model from the tokens. A commonly used classifier in text mining is Logistic Regression.
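Outside Orange, the same pipeline can be sketched in a few lines of Python. Below is a minimal illustration using scikit-learn rather than Orange's widgets; the miniature corpus and its labels are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up miniature corpus: texts and their ATU genre labels
tales = [
    "the fox met the wolf in the forest",
    "the cat and the bird tricked the wolf",
    "a little girl received a magic ring",
    "the prince broke the spell with a kiss",
]
atu_type = ["Animal Tales", "Animal Tales", "Tales of Magic", "Tales of Magic"]

# Bag of Words: each document becomes a vector of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(tales)

# The learner fits a model that maps token counts to the target variable
model = LogisticRegression().fit(X, atu_type)

# Predict the ATU type of a new sentence
print(model.predict(vectorizer.transform(["the fox chased the bird"])))
```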

In Predictions, we can see a column with predicted values from Logistic Regression. Seems like our model got most of the tale types right.

Chapter 2: Logistic Regression

In the workflow above we have used logistic regression, a popular machine learning method. It is often used in text mining for its speed and predictive performance. How does it work?

This visualization is called a nomogram. It displays the scores (votes) of each attribute for the target value selected at the top left.

It lets the words vote. For example, the word 'fox' in the text votes for the tale being an animal tale. So do cats, birds, and wolves, but not as strongly (their lines in the visualization are much shorter). We can see that the word 'fox' is the best clue that the text is an animal tale.

The word 'little' votes against. So does 'came' (see how they have zeros at the right end of the scale?). The more little things there are in a story, the less likely it is an animal tale.

In Nomogram, you can interactively observe the model's classification. Drag the blue dots left or right so that the accumulated sum of points (Total) is as high as possible.

Each word gives a score. If there are 29 foxes in the text, the model will give it 3 points for being an animal tale. And if there are no occurrences of the word 'little', it will give it an additional 4.5 points.

Of course, the real method is a bit more complicated - it has to find appropriate vote weights and thresholds. But this involves some linear algebra and other scarier-than-wolves words, so let's not stroll down this path.
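Still, the voting idea itself is easy to see in code. In the following sketch (again scikit-learn, with an invented toy corpus), the fitted coefficients play the role of the nomogram's votes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Same toy setup as before: made-up tales with two ATU genres
tales = [
    "the fox met the wolf in the forest",
    "the cat and the bird tricked the wolf",
    "a little girl received a magic ring",
    "the little prince broke the spell",
]
atu_type = ["Animal Tales", "Animal Tales", "Tales of Magic", "Tales of Magic"]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(tales), atu_type)

# Each word's coefficient is its vote: negative values pull toward
# model.classes_[0] ("Animal Tales"), positive toward model.classes_[1].
for word, weight in sorted(zip(vectorizer.get_feature_names_out(),
                               model.coef_[0]), key=lambda p: p[1]):
    print(f"{word:>10s}  {weight:+.3f}")
```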

Chapter 3: Model Evaluation

We have inspected the logistic regression model - foxes, birds, little things and all that. The voting scheme looked reasonable. Before that, we saw the model's predictions, and they looked good. But how would we quantify the performance of the model?

Perhaps we should just compute the proportion of the stories for which the model gave the correct answer? This score is called classification accuracy. For example, if we correctly predicted 40 tales out of 44, the classification accuracy would be 40/44 or 91%.

AUC, the area under the ROC curve, is another good measure to consider. In essence, the closer these two measures are to 1, the better the model's performance.

The widget that computes the classification accuracy is called Test and Score. It needs two inputs: the data to test the model on and the modeling algorithm.

It would make little sense to ask whether Rapunzel is about animals after already telling the model that it is not. Didn't we do just this above in Predictions? Indeed, and this is why the predictions there were so excellent. Models should never be tested on the data from which they were constructed.

This time, Logistic Regression doesn't need a data input. Instead, it provides a learner, the procedure for constructing models. Test and Score then applies the learner multiple times on different data subsets. Each time, it constructs a model on the selected subset and uses the left-out data for testing.
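The following sketch mimics what Test and Score does, using scikit-learn's cross-validation on an invented toy corpus; each fold's model is scored only on tales it has never seen.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labelled corpus (made-up lines standing in for the Grimm tales)
tales = [
    "the fox met the wolf in the forest",
    "the cat and the bird tricked the wolf",
    "the hare raced the hedgehog across the field",
    "a little girl received a magic ring",
    "the prince broke the spell with a kiss",
    "a witch turned the boy into a swan",
]
atu_type = ["Animal Tales"] * 3 + ["Tales of Magic"] * 3

# The pipeline plays the role of the learner: a recipe that can be
# refitted from scratch on every training subset.
learner = make_pipeline(CountVectorizer(), LogisticRegression())

# 3-fold cross-validation: every tale is scored by a model that was
# trained without it, just as Test and Score holds out the test data.
scores = cross_val_score(learner, tales, atu_type, cv=3, scoring="accuracy")
print("classification accuracy per fold:", scores)
print("mean:", scores.mean())
```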

Chapter 4: Predictions

All we have done so far is predict what we already knew - the tale type of each text. The right answers were already in our ATU Topic column. But what if we don't have this information? Could we perhaps predict tale types for unlabelled texts?

Open a new Corpus widget and load the andersen.tab corpus. Here we have three tales from H. C. Andersen. Inspect them in Corpus Viewer and try to guess the tale type yourself.

Now connect them to Predictions in the same way as before - with Logistic Regression passing the constructed model and the new Corpus widget passing the data for prediction. Do not forget to copy and paste the preprocessing widgets, namely Preprocess Text and Bag of Words, to repeat the data preparation.
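In code, this step looks like the sketch below (scikit-learn again, with invented toy data). The crucial detail is that the new texts must pass through the same preprocessing that was fitted on the training corpus - which is exactly why the preprocessing widgets are copied over in the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Train on the labelled toy corpus (standing in for the Grimm tales)
tales = [
    "the fox met the wolf in the forest",
    "the cat and the bird tricked the wolf",
    "a little girl received a magic ring",
    "the prince broke the spell with a kiss",
]
atu_type = ["Animal Tales", "Animal Tales", "Tales of Magic", "Tales of Magic"]

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(tales), atu_type)

# New, unlabelled text (standing in for the Andersen tales). We reuse the
# *fitted* vectorizer - transform, not fit_transform - so the new texts
# are represented with exactly the same vocabulary as the training data.
new_texts = ["an ugly duckling swam away from the farm"]
X_new = vectorizer.transform(new_texts)
print(model.predict(X_new))        # predicted ATU type
print(model.predict_proba(X_new))  # the model's confidence
```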

Logistic Regression predicted all the tales to be Tales of Magic.

Hint: Think about what kind of data the model was trained on and what were the most important words for the model.

The Ugly Duckling as a Tale of Magic? Sounds strange! Why do you think this happens?