Lesson 2
Document Vectorization
These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.
The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.
The material is offered under the Creative Commons CC BY-NC-ND licence.
Chapter 1: Bag of Words
Text is a complex data form that cannot be used for machine learning in its raw state. Hence we need to convert text into something the computer can work with, such as numbers.
The first step was text preprocessing, where we prepared the core units of our analysis, the tokens. Next, we need to describe documents with numbers; ideally, with numbers that sum up the content of the document. A simple approach to describing a document with numbers is called bag-of-words.
Bag-of-words is a text vectorization approach which takes the tokens identified in preprocessing and counts their occurrences in each document. If our tokens are words, the new numeric columns will correspond to words, and their values will be the number of times each word appears in the document.
Document | this | is | an | example | another | apple |
This is an example | 1 | 1 | 1 | 1 | 0 | 0 |
Another example | 0 | 0 | 0 | 1 | 1 | 0 |
This is another apple | 1 | 1 | 0 | 0 | 1 | 1 |
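Outside Orange, the same counts can be reproduced with a short script. The sketch below uses scikit-learn's CountVectorizer (an assumption; the course itself relies on Orange widgets) on the three example documents from the table:

```python
# Bag-of-words sketch with scikit-learn's CountVectorizer,
# reproducing the counts from the table above.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "This is an example",
    "Another example",
    "This is another apple",
]

vectorizer = CountVectorizer(lowercase=True)
counts = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # ['an' 'another' 'apple' 'example' 'is' 'this']
print(counts.toarray())                    # one row of word counts per document
```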
A more elegant approach to computing word frequencies is term frequency - inverse document frequency (TF-IDF), which down-weights words that appear frequently across all documents and up-weights words that are significant for a small number of documents.
For a term $t$ and a document $d$, the term frequency $\mathrm{tf}(t, d)$ is the number of times $t$ appears in $d$, and the inverse document frequency is $\mathrm{idf}(t) = \log \frac{N}{n_t}$, where $N$ is the number of documents in the corpus and $n_t$ the number of documents that contain $t$.
The TF-IDF measure is the product of the two, $\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$.
The term frequency (TF) counts the words, while the inverse document frequency (IDF) weighs them by how rarely they appear across the documents. Using TF-IDF, common words will have a low value because they appear in most documents, while significant words will have a high value because they appear frequently in only a small number of documents.
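The formulas above can be spelled out in a few lines of plain Python. The sketch below is only illustrative; libraries such as scikit-learn use slightly smoothed variants of IDF:

```python
# Plain-Python TF-IDF sketch following the formulas above.
import math

documents = [
    ["this", "is", "an", "example"],
    ["another", "example"],
    ["this", "is", "another", "apple"],
]
N = len(documents)

def tf(term, doc):
    return doc.count(term)                       # number of times the term appears

def idf(term):
    n_t = sum(term in doc for doc in documents)  # documents containing the term
    return math.log(N / n_t)

for doc in documents:
    print({term: round(tf(term, doc) * idf(term), 3) for term in doc})
```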
Pass the data through a Bag of Words widget and then again to a Data Table. We get new columns that contain word counts for each document. Now that we have numbers, we can finally perform some magic!
Chapter 2: Word Embedding
While bag-of-words is a very simple vectorization technique, there are more advanced ones, too. They are called document embeddings and are based on pre-trained models. These models take tokens and embed each word separately. To embed means to assign a vector which corresponds to the semantic position of a word in the data space. In other words, semantically similar words will lie closer together; for example, the word "orange" will be close to "apple", but also close to "fruit".
Word embedding is the first step in document vectorization. A word embedding is a learned representation of text in which words with similar meanings have similar representations. Importantly, an embedding vector is a dense representation, meaning that there are no missing or zero values in it, unlike in the bag of words.
word2vec
The first widely used word embedding model was word2vec, published in 2013 by Mikolov et al. There are two underlying methods in word2vec: CBOW and Skipgram.
CBOW
CBOW stands for continuous bag-of-words and its goal is to predict the target word based on surrounding words. For example, in a sentence "I like X weather.", we would predict X based on the tokens "I", "like", "weather", and ".". Note that the method does not consider the order of surrounding words, thus the BOW label. The approach is fast and works well with frequent words.
Skipgram
Skipgram, on the other hand, predicts surrounding words based on the given word. For the sentence "I like sunny weather.", we would thus keep only "sunny" and predict the context. The method works well for small datasets and less frequent words. Why? Because even if a word occurs infrequently, each of its occurrences contributes to its vector. CBOW, on the other hand, averages across contexts to predict the target word, thus creating a weaker (less representative) vector for rare words.
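Both methods are available in the gensim library (an assumption; Orange hides these details behind its widgets). The sketch below trains a CBOW and a Skipgram model on a toy corpus; real embeddings, of course, require much larger corpora:

```python
# word2vec sketch with gensim: sg=0 trains CBOW, sg=1 trains Skipgram.
from gensim.models import Word2Vec

sentences = [
    ["i", "like", "sunny", "weather"],
    ["i", "like", "oranges"],
    ["i", "enjoy", "flying"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["sunny"][:5])               # first five dimensions of the word vector
print(skipgram.wv.most_similar("sunny"))  # nearest words in the embedding space
```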
GloVe
GloVe is another word embedding approach that, unlike word2vec, is trained on a co-occurrence matrix. A co-occurrence matrix counts how often two words appear together, namely in how many documents. The matrix enables the algorithm to elicit connections between word pairs: if two words frequently appear together, their vectors will also lie close to each other.
For example, we have a small corpus of documents:
- I like sunny weather.
- I like oranges.
- I enjoy flying.
The co-occurrence matrix for this corpus counts, for each pair of words, the number of documents in which both words appear. Note that the matrix is symmetric; the sketch below computes it.
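A possible computation in plain Python (a sketch; lowercasing the words and stripping the final period are assumptions about the preprocessing):

```python
# Document-level co-occurrence: for each pair of words, count in how many
# documents of the toy corpus both words appear.
from collections import Counter
from itertools import combinations

documents = [
    "I like sunny weather.",
    "I like oranges.",
    "I enjoy flying.",
]

tokenized = [doc.lower().rstrip(".").split() for doc in documents]

cooccurrence = Counter()
for tokens in tokenized:
    for a, b in combinations(sorted(set(tokens)), 2):
        cooccurrence[(a, b)] += 1

for (a, b), count in sorted(cooccurrence.items()):
    print(f"{a:8s} {b:8s} {count}")   # e.g. "i" and "like" co-occur in 2 documents
```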
fastText
Like word2vec and GloVe, fastText provides word embeddings. However, instead of treating each word as an atomic unit, it represents words as bags of character n-grams, effectively capturing subword information. This makes it particularly good at handling low-resource languages and out-of-vocabulary (OOV) words.
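gensim also ships a fastText implementation (an assumption, as above). The sketch below shows that the model can produce a vector even for a word it has never seen, because the vector is assembled from character n-grams:

```python
# fastText sketch with gensim: out-of-vocabulary words still get a vector.
from gensim.models import FastText

sentences = [
    ["i", "like", "sunny", "weather"],
    ["i", "like", "oranges"],
    ["i", "enjoy", "flying"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# "rainy" never appears in the training corpus, yet fastText builds
# a vector for it from its character n-grams.
print(model.wv["rainy"][:5])
```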
ELMo
ELMo, or Embeddings from Language Models, also computes word embeddings. It uses a bi-directional long short-term memory (LSTM) network to produce contextual embeddings. An LSTM is a recurrent neural network that models sequences and can remember long-term dependencies. A typical LSTM runs from left to right, while a bi-directional LSTM runs in both directions (one pass left-to-right and one right-to-left). This bi-directional nature effectively models what comes before the target word and what comes after. ELMo will produce different embeddings for the same word depending on its context, thus effectively handling polysemy. For example, it can distinguish the word "culture" in the two examples below:
- "The blood sample was sent to the lab for culture to identify the bacterial infection."
- "Understanding a patient's culture is essential for providing culturally competent care."
The words "blood", "lab", and "bacterial" affect the vector for the word "culture" in the first example, thus suggesting its biological use instead of a social one. ELMo can also handle OOV words due to its character-level input (similar to fastText). However, it is computationally slower than fastText.
Transformers
In 2018, there was a big breakthrough in natural language processing: a group at Google published BERT, a model based on the Transformer architecture. Transformers use the attention mechanism, an additional layer in the deep neural network that models aspects of the language, for example grammar, meaning, or temporal aspects. Besides attention, the underlying network is bi-directional, just like in ELMo, which means BERT can model the context both to the left and to the right of the target word. BERT-style models are the state of the art in language modeling; however, they are computationally very expensive.
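Contextual BERT embeddings can be obtained with the HuggingFace transformers library (an assumption; this is not part of the Orange workflow). The sketch below produces one vector per (sub)word token; the same word would get a different vector in a different sentence:

```python
# Contextual word embeddings from BERT via HuggingFace transformers.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentence = "The blood sample was sent to the lab for culture."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, number_of_tokens, 768)
```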
Chapter 3: Document Embedding
Word embeddings are great! However, a document is usually composed of many words, and each word is its own vector. So we need a way to aggregate word vectors into a single document vector. Usually, this is done by averaging all the word vectors of a document, which works with any word embedding approach.
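A minimal sketch of this averaging, using a small gensim word2vec model (an assumption; any word embedding would do):

```python
# Document embedding by averaging the word vectors of a document.
import numpy as np
from gensim.models import Word2Vec

sentences = [["i", "like", "sunny", "weather"],
             ["i", "like", "oranges"],
             ["i", "enjoy", "flying"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

document = ["i", "like", "sunny", "weather"]
word_vectors = [model.wv[w] for w in document if w in model.wv]
document_vector = np.mean(word_vectors, axis=0)  # one vector for the whole document

print(document_vector.shape)  # (50,)
```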
Some modern approaches also work as sentence embedders, which embed entire sentences instead of individual words. This gives a more precise representation of a document, since the words in the sentence are weighted according to their grammatical role. A good place to find sentence embedders is the HuggingFace repository. BERT also has an extension called SBERT (Sentence-BERT), which outputs sentence embeddings instead of word embeddings.
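Sentence embeddings of this kind are easy to try with the sentence-transformers library (an assumption; the model name below is just one of many available on HuggingFace):

```python
# Sentence embeddings with sentence-transformers (SBERT-style models).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The blood sample was sent to the lab for culture.",
    "Understanding a patient's culture is essential for competent care.",
]

embeddings = model.encode(documents)  # one vector per document
print(embeddings.shape)               # (2, 384) for this particular model
```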
Large language models
Large language models, such as GPT, are models that learn text representations and transfer them to various tasks. They are typically based on the Transformer architecture and trained to predict the next word. Generally speaking, the texts are first preprocessed (split into tokens) and embedded, and the model is pre-trained on large amounts of text to learn the language, using attention mechanisms to capture its various aspects. The model is then fine-tuned on a specific task, for example question answering for a chatbot.
Comparison with BOW
The advantages of word embedding approaches are that they:
- capture word meaning and existing relationships between words
- can better represent smaller corpora
- overcome the issue of sparseness in BOW
- require little preprocessing
- have a constant vector size.
On the other hand, they are not explainable, are domain-dependent, mostly cannot handle out-of-vocabulary words, and are language-dependent (although novel approaches also include general multilingual models, such as multilingual BERT).
In contrast, the advantages of bag-of-words are:
- they are explainable, interpretable
- represent word importance
- are language-independent.
However, bag-of-words representations are also sparse and increasingly wide. Moreover, they miss the context in which each word appears.