Lesson 8

Document Networks

These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.

The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Chapter 1: Document Networks

Co-occurrence networks are a popular computational linguistic method. They aim to discover how words co-occur across a corpus.

A network has nodes and edges. Nodes are data items, while edges connect nodes based on a specified condition. For example, a node can be a person, and an edge is created between two people if they know one another.

Nodes in co-occurrence networks are words. An edge is constructed if a pair of words appears in the same document.
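To make the definition concrete, here is a minimal sketch in plain Python that builds the nodes and edges of a co-occurrence network by hand. It is only an illustration of the idea, not how Orange implements it; the toy corpus and the function name cooccurrence_edges are made up for this example.

```python
from itertools import combinations

# Hypothetical toy corpus: each document is a list of already tokenized words.
documents = [
    ["king", "give", "daughter"],
    ["king", "go", "forest"],
    ["wolf", "go", "forest"],
]

def cooccurrence_edges(docs):
    """Return the set of word pairs that appear together in at least one document."""
    edges = set()
    for doc in docs:
        # sorted() gives every unordered pair a single canonical form
        for a, b in combinations(sorted(set(doc)), 2):
            edges.add((a, b))
    return edges

nodes = {word for doc in documents for word in doc}
edges = cooccurrence_edges(documents)
```

In this toy corpus, "king" and "forest" are connected because they share the second document, while "king" and "wolf" never appear together and therefore have no edge.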

In Orange, co-occurrence networks can be constructed using the Corpus to Network widget. To explore the network, one has to install the Network add-on (Options --> Add-ons).

For this example, we are using the grimm-tales-selected corpus of 44 tales from the brothers Grimm. We pass the data to Preprocess Text with default preprocessing, adding Lemmagen lemmatization before Filtering. Then we pass the data to Corpus to Network.

The Corpus to Network widget can construct two types of networks: a document network and a word network. In a document network, nodes are documents and edges are constructed if a pair of documents shares more than N words. In a word network, nodes are words and edges are constructed if a pair of words appears in more than N documents.
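The difference between the two network types can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the widget's actual code: documents are reduced to sets of distinct words, and the corpus, the document names, and both function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: document name -> set of distinct words.
documents = {
    "tale_a": {"king", "daughter", "forest", "go"},
    "tale_b": {"king", "forest", "go", "wolf"},
    "tale_c": {"wolf", "grandmother", "go"},
}

def document_edges(docs, n):
    """Connect two documents if they share more than n distinct words."""
    return {
        (a, b)
        for a, b in combinations(sorted(docs), 2)
        if len(docs[a] & docs[b]) > n
    }

def word_edges(docs, n):
    """Connect two words if they appear together in more than n documents."""
    pair_counts = Counter()
    for words in docs.values():
        pair_counts.update(combinations(sorted(words), 2))
    return {pair for pair, count in pair_counts.items() if count > n}

doc_net = document_edges(documents, 2)
word_net = word_edges(documents, 1)
```

With N = 2, only tale_a and tale_b are connected in the document network, since they share three words. With N = 1, the word network keeps only the pairs that occur together in two documents, such as "go" and "king".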

First, set Node type to Word. There are several parameters here. Threshold defines in at least how many documents a word pair has to appear for an edge to be created. Window size defines how close the two words have to be in a text. For example, a window size of 5 means that word B has to be within 5 words to the left or to the right of word A. Frequency threshold defines how frequent a given word has to be in order to appear as a node.

In our case, we want word pairs to appear in at least 10 documents within a window size of 5, and each word has to appear at least 10 times in the corpus. We get 600 nodes (words) and 754 edges.
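The interplay of the three parameters can be sketched in plain Python as follows. This is a rough approximation for illustration only: the function word_network and its argument names are hypothetical, and details such as whether infrequent words are removed before or after the window is applied are assumptions that may differ from what the widget does.

```python
from collections import Counter

def word_network(docs, threshold, window_size, freq_threshold):
    """Sketch of a windowed word co-occurrence network.

    docs is a list of token lists. Nodes are words that occur at least
    freq_threshold times in the corpus. An edge links two nodes if they are
    found within window_size tokens of each other in at least `threshold`
    documents.
    """
    word_counts = Counter(token for doc in docs for token in doc)
    nodes = {w for w, c in word_counts.items() if c >= freq_threshold}

    pair_docs = Counter()  # number of documents in which each pair co-occurs
    for doc in docs:
        pairs_in_doc = set()
        for i, a in enumerate(doc):
            # Look only to the right: pairs are stored unordered, so the
            # left-hand neighbours are covered when they take the role of `a`.
            for b in doc[i + 1 : i + 1 + window_size]:
                if a != b and a in nodes and b in nodes:
                    pairs_in_doc.add(tuple(sorted((a, b))))
        pair_docs.update(pairs_in_doc)

    edges = {pair for pair, n in pair_docs.items() if n >= threshold}
    return nodes, edges

# Hypothetical toy corpus with much smaller parameter values than in the lesson.
toy_docs = [
    ["king", "go", "forest", "and", "see", "wolf"],
    ["king", "go", "home"],
    ["wolf", "sleep", "forest"],
]
toy_nodes, toy_edges = word_network(
    toy_docs, threshold=2, window_size=1, freq_threshold=2
)
```

On the toy corpus, four words pass the frequency threshold, but only "go" and "king" are adjacent in two documents, so the network has four nodes and a single edge. This also shows why a network can contain many nodes without any edges.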

We can display the co-occurrence network with Network Explorer. Make sure to connect both the Network and Node Data outputs.

The network is not very easy to interpret. This is because many words appear as nodes (they are above a specified frequency threshold), but they don't have any edges (are not connected to other words based on our threshold). We can select the connected network and have a closer look in another Network Explorer widget. Once again, make sure to connect both outputs.
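Outside Orange, the same clean-up step could be approximated by dropping every node that is not an endpoint of any edge. The sketch below assumes that "connected" simply means "has at least one edge"; the function name drop_isolated and the toy data are hypothetical.

```python
def drop_isolated(nodes, edges):
    """Keep only the nodes that are an endpoint of at least one edge."""
    connected = {word for edge in edges for word in edge}
    return nodes & connected

# Hypothetical toy network: four words, one edge.
all_nodes = {"king", "go", "forest", "wolf"}
all_edges = {("go", "king")}
kept = drop_isolated(all_nodes, all_edges)
```

Here only "go" and "king" remain, because "forest" and "wolf" passed the frequency threshold but are not linked to any other word.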

This is a little better. Since we provided the node data, we can set the color and the size of the nodes to word_frequency. This will highlight the most frequent words in the network.

Zooming in, we see that verbs take center stage: they are the binding units of a narrative. In between these verbs is the word "king". This suggests that kings are central actors who drive the narrative in many tales.

For a more in-depth analysis, we could add information on degree centrality from the Network Analysis widget or network clustering from the Network Clustering widget.
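As a reminder of what degree centrality measures, here is a small Python sketch that computes it by hand. It uses the common normalized definition (number of neighbours divided by the number of other nodes), which is an assumption and may not match the exact output of the Network Analysis widget; the function name and the toy network are hypothetical.

```python
from collections import Counter

def degree_centrality(nodes, edges):
    """Normalized degree centrality: degree divided by (number of nodes - 1)."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(nodes)
    if n <= 1:
        return {node: 0.0 for node in nodes}
    return {node: degree[node] / (n - 1) for node in nodes}

# Hypothetical toy word network.
toy_nodes = {"king", "go", "give", "forest"}
toy_edges = {("go", "king"), ("give", "king"), ("forest", "go")}
centrality = degree_centrality(toy_nodes, toy_edges)
```

In the toy network, "king" and "go" each have two neighbours out of three possible, so they get the highest centrality, while "give" and "forest" have one neighbour each.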