Text Mining
Lesson 1: Text Preprocessing
These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.
The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.
The material is offered under the Creative Commons CC BY-NC-ND licence.
Chapter 1: Introduction to Text Mining
Text mining is a computational analysis of texts. It uses statistics, natural language processing, and machine learning to extract information from documents.
Installing the Text add-on
You will need the Text add-on, which introduces components for text preprocessing, analytics, visualization, and deep-learning-based embeddings to Orange. To install the Text add-on, go to Options --> Add-ons and select Text from the list. You will have to restart Orange for the Text widgets to appear.
A new pane with widgets from the Text add-on will appear on the left side of the canvas.
Loading corpora
A collection of text documents is called a corpus. The widget for loading corpora is called Corpus. We will use the grimm-tales-selected corpus, which you can select from the drop-down menu in the widget. The corpus contains 44 folk tales collected by the Grimm brothers.
The widget is very similar to the File widget, with one important distinction: Orange needs to know which attribute contains the content of the documents. The Grimm corpus has the attribute Content, which is already placed in the *Used text features* section. Alternatively, drag one or more attributes you would like to use for text mining to the box on the left.
Chapter 2: Text Preprocessing
One of the first steps, if not the first, when working with text data is to preprocess the data. This means defining the core units of the analysis. In standard data mining we usually already work with tabular data, where the instances are in rows and are described by features; with text, this is often not the case. We therefore need a few steps to prepare our texts for downstream analysis, and this is called preprocessing.
Terminology
- token: a unit of analysis in text mining, usually a word, but it can also be a character, a sentence, or a phrase
- lemma: the base form of a word, e.g. the lemma of cried is cry
- stem: the root of a word, e.g. the stem of cried is cri
- POS tag: a part-of-speech (POS) tag that defines the type of the word, for example cry_VERB
- n-gram: a token of length N; a bigram is a token of length two, e.g. inconsolable crying, and a trigram is a token of length three, e.g. tears of joy
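To make these terms concrete, here is a minimal Python sketch using NLTK (an illustrative aside; the course itself works with Orange widgets rather than scripts):

```python
# Illustrates tokens, lemmas, stems, POS tags and n-grams with NLTK.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

for resource in ("punkt", "wordnet", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

text = "The children cried tears of joy."
tokens = nltk.word_tokenize(text)                                     # units of analysis
lemmas = [WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens]  # base forms: cried -> cry
stems = [PorterStemmer().stem(t) for t in tokens]                     # word roots: cried -> cri
pos_tags = nltk.pos_tag(tokens)                                       # e.g. ('cried', 'VBD')
bigrams = list(nltk.bigrams(tokens))                                  # n-grams with N = 2

print(tokens, lemmas, stems, pos_tags, bigrams, sep="\n")
```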
Example
Connect Word Cloud to Corpus. The visualization shows words, with their size corresponding to their frequency in the corpus. In this case, Word Cloud simply displays all the words and symbols found in the text. But this is often not what we want. We want to extract only meaningful units, such as semantically rich words. This is why we need text preprocessing.
But the word cloud we are looking at is a mess! We got a bunch of semantic junk in our visualization. Is there a way to clean this up?
Of course. We can achieve this with the Preprocess Text widget.
Preprocessing is executed sequentially. We start by lowercasing the text. This means words will be treated the same regardless of whether they appear at the beginning or middle of the sentence. However, words such as "apple" (a fruit) and "Apple" (a company) will also be treated the same, which is not always desirable.
Next, the text is split into tokens, which are the core units of analysis. They are usually words, but they can also be sentences, bigrams, and so on. The default option, Regexp, keeps only words, omitting the punctuation.
We also removed redundant words. As we saw in the word cloud above, the most frequent words in English texts are "the", "and", "of", and so on. While these words are important for syntax, they carry little meaning on their own, so they are often omitted from the analysis.
The order of preprocessing steps is crucial, because each step will have the results of the previous step as its input. The usual preprocessing pipeline would be:
- transformation, which converts the text to a desired form, e.g. lowercase
- tokenization, which creates core analytical units from the text. Typically, this means splitting the text into words, sometimes also omitting punctuation.
- normalization, which transforms the tokens into their base form, i.e. their lemmas
- filtering, which removes undesired tokens, usually stopwords
Other options are:
- n-grams, which creates tokens of a desired length
- POS tags, which tags tokens with corresponding part-of-speech tags
We see the results of our preprocessing in the Word Cloud. Two of the most frequent words are "would" and "could". If we decide these two words are not important for our analysis, it would be good to omit them. We have already filtered out generic stopwords, but perhaps that is not enough for our analysis; additional words can be removed with custom filtering.
To sum up, we transformed all words to lowercase, treated each word as a token (and omitted punctuation), and removed the stopwords (such as "in", "and", and "the"). This preprocessing outputs the following tokens:
"This is a sample sentence." --> "sample", "sentence"
We can always load our own custom stopword list. Open a plain text editor and create a custom list of stopwords. Write each word on its own line and save the file. A good plain text editor is Sublime Text, but you can easily work with Notepad++ as well.
Load the list of custom stopwords in the right-hand dropdown of the Filtering section.
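For example, a custom stopword file that removes the two words mentioned above ("would" and "could") would contain one word per line:

```
would
could
```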
Another preprocessing technique is to filter out words that are too rare and too frequent. Rare words are normally found in only a few documents and frequent words are likely stopwords or very general words. To retain only those words that truly represent the corpus and may distinguish between corpus documents, we use Document frequency filter with Relative frequencies. If we set the values to 0.1 and 0.9, we will retain only those words that appear in more than 10% of the documents and in fewer than 90%.
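As a rough scripting analogue (an assumption for illustration, not how Orange implements the filter), scikit-learn's CountVectorizer accepts relative document-frequency bounds:

```python
# Keep only words whose relative document frequency falls between 0.1 and 0.9.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the king lived in a castle",
    "the queen visited the castle",
    "a wolf lived in the forest",
]
# min_df=0.1 drops words appearing in less than 10% of documents,
# max_df=0.9 drops words appearing in more than 90% of them.
vectorizer = CountVectorizer(min_df=0.1, max_df=0.9)
vectorizer.fit(docs)
print(vectorizer.get_feature_names_out())  # 'the' is dropped: it appears in every document
```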
Preprocessing is really the key to a successful text analysis. We have only mentioned a few techniques, but you can experiment on your own with the following ones:
- normalization transforms all words into lemmas or stems (for example sons to son)
- n-grams are tokens of larger size, such as bigrams (pairs of consecutive words) and trigrams (word triplets)
- POS tagging tags each token with a corresponding part-of-speech tag (sons --> noun, plural, tag = NNS)
Chapter 3: Concordances
We have prepared our corpus and now it is time to visualize it.
Visualizations in Orange are designed to support selecting data and passing the corresponding subset downstream. Finding interesting data subsets and analyzing their similarities is a central part of exploratory data analysis.
We have already seen some of the preprocessing results in a word cloud. But we still don't know much about the use of a specific word in a text. Since we lowercased the text, there might be some conflation. For example, 'oh' could be a lowercase version of OH (the chemical formula of hydroxide), a simple exclamation 'Oh!', or an abbreviation for the state of Ohio.
To check the context of a particular word, we can use the Concordance widget. Concordance shows us the text around our word.
Connect Concordance to Corpus to pass the text to the widget. To browse a word, type it in the query line at the top or provide it with the Word Cloud. Here we have selected the word 'king' in the Word Cloud and observed its context in Concordance.
To inspect the documents containing a particular word, select them in Concordance and pass them to Corpus Viewer for a deeper analysis.
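Outside of Orange, NLTK offers a quick concordance view as well; the file name below is a hypothetical stand-in for one of the Grimm tales:

```python
# Print concordance lines for the word 'king' in a plain-text tale.
import nltk

nltk.download("punkt", quiet=True)

with open("tale.txt", encoding="utf-8") as f:       # hypothetical file name
    tokens = nltk.word_tokenize(f.read().lower())

nltk.Text(tokens).concordance("king", width=60, lines=10)
```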
Chapter 4: Collocations
Collocations are sequences of words or terms that frequently co-occur in a text, often with a particular meaning or significance that goes beyond the individual meanings of the words. In other words, collocations are words that tend to appear together more often than would be expected by chance alone.
For example, artificial intelligence, European Union, and fast food are all examples of collocations. These word combinations have become established in the language due to their frequent usage and convey specific meanings or concepts.
Some of the most significant collocations in the corpus of Grimm's tales are Herr Korbes (a protagonist), click clack (a sound of the mill) and wild beasts (from Hansel and Gretel). Note that we set the threshold at 5 — the collocation has to appear at least five times in the corpus.
There are several measures used to compute collocations:
- Pointwise Mutual Information (PMI): PMI measures the association between two words based on their co-occurrence in a corpus. Collocations with high PMI scores indicate strong associations between the words and are likely to be meaningful combinations.
- Log-Likelihood Ratio (LLR): LLR compares the likelihood of observing a word pair in a corpus to the likelihood of observing the same words independently. It is often used to identify statistically significant collocations by comparing their observed frequency to the expected frequency under a null hypothesis of independence.
- Frequency-based Measures: These measures assess the frequency of occurrence of word pairs in a corpus and compare it to the expected frequency under a random distribution. Collocations with higher observed frequencies than expected are considered significant.
- Dice's Coefficient: Dice's coefficient measures the similarity between two sets by calculating the ratio of twice the intersection of the sets to the sum of the sizes of the sets. In the context of collocations, it compares the co-occurrence of words in a pair to the total occurrences of each word individually.
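All of these measures are implemented in NLTK's collocation tools. The sketch below scores bigrams with PMI, the log-likelihood ratio, and Dice's coefficient, keeping only pairs that appear at least five times, as in the example above; the toy token list is just a stand-in for the preprocessed corpus tokens:

```python
# Score bigram collocations with several association measures.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ["herr", "korbes", "went", "home"] * 5   # stand-in for the corpus tokens

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)                       # collocation must appear at least five times

measures = BigramAssocMeasures()
print(finder.nbest(measures.pmi, 10))               # top bigrams by PMI
print(finder.nbest(measures.likelihood_ratio, 10))  # by log-likelihood ratio
print(finder.nbest(measures.dice, 10))              # by Dice's coefficient
```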
Pointwise Mutual Information
Pointwise Mutual Information (PMI) is a statistical measure that quantifies the association between two words based on their co-occurrence in a corpus. It compares the probability of observing the two words together in a document to the probability of observing each word independently.
The formula for PMI is as follows:

PMI(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}

Where:
- x and y are two words,
- P(x, y) is the probability of co-occurrence of x and y in the corpus,
- P(x) and P(y) are the probabilities of observing x and y independently in the corpus.
The PMI value tells us how much more likely it is to observe the two words together than we would expect by chance.
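For example, if P(x, y) = 0.01 while P(x) = 0.1 and P(y) = 0.05, then P(x)P(y) = 0.005 and PMI(x, y) = \log_2(0.01 / 0.005) = 1, meaning the two words co-occur twice as often as they would if they were independent (the base-2 logarithm is an assumption here; any base only rescales the value).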
Key points about PMI:
- Positive Values: If P(x, y) is greater than P(x)P(y), PMI will be positive, indicating that the words occur together more frequently than expected by chance. This suggests a positive association or dependency between the words.
- Negative Values: If P(x, y) is less than P(x)P(y), PMI will be negative, indicating that the words occur together less frequently than expected by chance. This suggests a negative association or dependency between the words.
- Symmetry: PMI is symmetric, meaning that PMI(x, y) = PMI(y, x). It measures the association between the two words regardless of their order.
PMI is commonly used in natural language processing tasks such as information retrieval, text mining, and topic modeling to capture the semantic relationship between words based on their co-occurrence patterns in a corpus. It helps identify meaningful associations between words that are likely to convey similar or related concepts.