
Text Mining

Lesson 2: Document Vectorization

These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.

The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Chapter 1: Bag of Words

Text is a complex data form that cannot be used for machine learning in its raw state. Hence we need to convert text into something the computer can work with, such as numbers.

The first step was text preprocessing, where we prepared the core units of our analysis - tokens. Next, we need to describe documents with numbers, ideally numbers that sum up the content of the document. A simple approach to describing a document with numbers is called bag-of-words.

Bag-of-words is a text vectorization approach that takes the tokens identified in preprocessing and counts their occurrences in each document. If our tokens are words, the new numeric columns correspond to words, and their values are the number of times each word appears in the document.

Document                 this   is   an   example   another   apple
This is an example        1     1    1      1         0         0
Another example           0     0    0      1         1         0
This is another apple     1     1    0      0         1         1
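For readers who prefer to see the counting step outside Orange, here is a minimal sketch using scikit-learn's CountVectorizer, applied to the three example documents from the table above; Orange performs the equivalent computation internally.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "This is an example",
    "Another example",
    "This is another apple",
]

# Count how many times each token appears in each document
vectorizer = CountVectorizer(lowercase=True)
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the tokens used as columns
print(counts.toarray())                    # one row of counts per document
```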

A more elegant approach to computing word weights is term frequency - inverse document frequency (TF-IDF), which lowers the weight of words that appear frequently across all documents and raises the weight of words that are significant for a small number of documents.

$TF = (\text{occurrences of word in document})$ and $IDF = \log{\frac{\text{number of documents}}{\text{documents that contain the word}}}$

The TF-IDF measure is the product of the two, $TF\text{-}IDF = TF \times IDF$.

We count the words (TF, or term frequency) and weigh them according to how many documents contain them (IDF, or inverse document frequency). Using TF-IDF, common words will have a low value as they appear across most documents, while significant words will have a high value because they appear frequently in a small number of documents.
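As a sanity check of the two formulas, here is a small sketch that computes TF-IDF by hand for the toy documents above. Note that libraries (including Orange and scikit-learn) often use smoothed variants of IDF, so their exact numbers can differ.

```python
import math

documents = [
    ["this", "is", "an", "example"],
    ["another", "example"],
    ["this", "is", "another", "apple"],
]

n_docs = len(documents)

def tf_idf(word, doc):
    # TF: number of occurrences of the word in this document
    tf = doc.count(word)
    # IDF: log of (number of documents / documents containing the word)
    docs_with_word = sum(1 for d in documents if word in d)
    idf = math.log(n_docs / docs_with_word)
    return tf * idf

# "example" appears in two of three documents, so it gets a low weight
print(tf_idf("example", documents[0]))  # 1 * log(3/2) ≈ 0.41
# "apple" appears in only one document, so it gets a higher weight
print(tf_idf("apple", documents[2]))    # 1 * log(3/1) ≈ 1.10
```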

Pass the data through a Bag of Words widget and then again to a Data Table. We get new columns with word counts for each document. Now that we have numbers, we can finally perform some magic!
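The same vectorization can also be done by scripting. The sketch below assumes the Orange text add-on (orangecontrib.text) is installed and uses its bundled grimm-tales-selected corpus; class and dataset names may differ slightly between add-on versions.

```python
# A sketch of the Bag of Words step as a script, assuming the Orange text add-on
from orangecontrib.text import Corpus
from orangecontrib.text.vectorization import BowVectorizer

corpus = Corpus.from_file("grimm-tales-selected")  # a corpus shipped with the add-on
bow = BowVectorizer().transform(corpus)            # adds one count column per token

print(bow.domain.attributes[:10])  # the first few word columns
```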

Chapter 2: Document Embedding

While bag-of-words is a very simple vectorization technique, there are more advanced ones, too. They are called document embeddings and they are based on pre-trained models. These models take tokens and embed each word separately. To embed means to assign a vector that corresponds to the semantic position of a word in the data space. In other words, semantically similar words will lie closer together, i.e. the word "orange" will be close to "apple", but also close to "fruit".
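To make "closer together" measurable, embedding vectors are usually compared with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding models produce vectors with hundreds of dimensions.

```python
import numpy as np

# Hypothetical word vectors, invented for illustration only
orange = np.array([0.8, 0.6, 0.1])
apple  = np.array([0.7, 0.7, 0.2])
car    = np.array([0.1, 0.2, 0.9])

def cosine_similarity(a, b):
    # 1.0 means identical direction, values near 0 mean unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(orange, apple))  # high: semantically related words
print(cosine_similarity(orange, car))    # lower: unrelated words
```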

Word embedding is the first step in document vectorization. A word embedding is a learned representation of text in which words with similar meaning have similar representations. However, a document is usually composed of many words and each word is its own vector, so we need a way to aggregate word vectors into a single document vector. Usually, this is done by averaging all the word vectors of a document.
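Here is a minimal sketch of that averaging step. The word_vectors dictionary is a made-up stand-in for a pre-trained embedding model; in practice the vectors would come from a model such as fastText.

```python
import numpy as np

# Stand-in for a pre-trained embedding model (real vectors have hundreds of dimensions)
word_vectors = {
    "this":    np.array([0.1, 0.3, 0.0]),
    "is":      np.array([0.2, 0.1, 0.1]),
    "another": np.array([0.0, 0.4, 0.2]),
    "apple":   np.array([0.7, 0.2, 0.9]),
}

def embed_document(tokens):
    # Average the vectors of all tokens we have an embedding for
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vectors, axis=0)

print(embed_document(["this", "is", "another", "apple"]))  # one vector per document
```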

The advantages of word embedding approaches are that they capture word meaning and the relationships between words, can better represent smaller corpora, overcome the issue of sparseness in BOW, require little preprocessing, and produce vectors of constant size. On the other hand, they are not explainable, are domain dependent, mostly cannot handle out-of-vocabulary words, and are language dependent (although novel approaches also include multilingual models, such as multilingual BERT).

In contrast, the advantages of bag-of-words are that it is explainable and interpretable, represents word importance, and is language independent. However, bag-of-words representations are also sparse and increasingly wide. Moreover, they miss the context in which each word appears.

One of the first word embedding approaches for English was Word2Vec, a model trained with one of two objectives: continuous bag-of-words (CBOW) or skip-gram. For context, CBOW is a model that tries to predict a word from its surrounding words, while skip-gram tries to predict the surrounding words from the word. Word2Vec was later improved upon by GloVe, which learns from a word co-occurrence matrix. Contemporary neural-network-based models are fastText, ELMo, and BERT. FastText learns from character n-grams and is good for unknown words. ELMo learns from context, which means words can have different embeddings based on surrounding words (a chair can be a piece of furniture or a person presiding over a committee). Both ELMo and BERT handle polysemy (the noun bear and the verb to bear).
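To make the CBOW versus skip-gram distinction concrete, here is a sketch of training a tiny Word2Vec model with the gensim library (assuming gensim 4.x, where the sg flag switches between the two objectives). A toy corpus this small will not produce meaningful vectors; the example only shows the mechanics.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; real training needs a much larger collection of sentences
sentences = [
    ["this", "is", "an", "example"],
    ["another", "example"],
    ["this", "is", "another", "apple"],
]

# sg=0 trains with CBOW (predict a word from its context),
# sg=1 trains with skip-gram (predict the context from a word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["apple"])                 # the learned vector for "apple"
print(model.wv.most_similar("example"))  # nearest words in the embedding space
```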