Text Mining

Lesson 6: Sentiment Analysis

These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.

The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at University of Ljubljana, Slovenia.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Chapter 1: Sentiment Analysis

Sentiment analysis (or opinion mining) is the task of extracting sentiment from text data. A sentiment expression comprises the opinion holder (e.g. a reviewer), the time of the event, the sentiment target (a product, movie, service...), and the sentiment itself (positive, negative, neutral). Furthermore, we are interested in polarity (+/-/0), intensity (high, medium, low), and/or specific emotion (fear, anger, joy, surprise, disgust).

There are three approaches to sentiment extraction:

  • lexicon-based
  • machine learning
  • hybrid

Lexicon-based approaches

Lexicon-based approaches use scores to relate words to sentiment. A lexicon has words in rows and a score, typically a real number between -1 and 1, in a column. Alternatively, there are two lists, one with positive and one with negative words.

Lexicon-based approaches use rule-based techniques that extract opinion words and classify the document by averaging the polarity of all matched terms. Some approaches, such as the one proposed by Hu and Liu (2004), use only polarity averaging. Later approaches, such as VADER (Hutto and Gilbert, 2014), contain rules that address negation, intensity, and emoticons.
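To make the difference between plain averaging and rule-based scoring concrete, here is a minimal sketch in Python. The lexicon entries, the negation list, and the booster multipliers are all made up for illustration; this is not the Hu and Liu lexicon and not the actual VADER rule set, only a toy version of the idea.

```python
import re

# Toy lexicon: word -> polarity in [-1, 1]. All values are made up.
LEXICON = {
    "good": 0.6, "great": 0.8, "love": 0.9,
    "bad": -0.6, "terrible": -0.9, "hate": -0.8,
}
NEGATIONS = {"not", "no", "never"}          # hypothetical negation words
BOOSTERS = {"very": 1.5, "extremely": 2.0}  # hypothetical intensity multipliers


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def average_polarity(text):
    """Plain averaging: mean score of all tokens found in the lexicon."""
    scores = [LEXICON[t] for t in tokenize(text) if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def rule_based_polarity(text):
    """Averaging with two simple rules: boosters scale, negations flip."""
    tokens = tokenize(text)
    scores = []
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        score = LEXICON[token]
        prev = tokens[i - 1] if i > 0 else ""
        prev2 = tokens[i - 2] if i > 1 else ""
        if prev in BOOSTERS:
            score *= BOOSTERS[prev]
        # Flip on "not good" and also on "not very good".
        if prev in NEGATIONS or (prev in BOOSTERS and prev2 in NEGATIONS):
            score = -score
        # Keep each term's score inside the lexicon range.
        scores.append(max(-1.0, min(1.0, score)))
    return sum(scores) / len(scores) if scores else 0.0


print(average_polarity("The movie was not good"))     # ignores the negation
print(rule_based_polarity("The movie was not good"))  # flips the polarity
```

Plain averaging scores "not good" as positive because it only sees the matched word "good", while the rule-based variant flips the sign. This is the kind of case the later approaches were designed to handle.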

The advantage of lexicon-based approaches is that they need no training data and yield results quickly. The disadvantage is that they are heavily dependent on the vocabulary of the lexicon.

Machine learning approaches

  1. Unsupervised methods: word co-occurrence, word frequencies
  2. Supervised methods: models trained from labelled corpora
  3. Deep learning methods: DNN, CNN, LSTM

The advantage of ML approaches is that they are trained on specific data and can capture medical context better. The disadvantage is that they require a lot of training data and are somewhat of a black box (especially deep learning methods).
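As an illustration of the supervised route, the following sketch trains a small multinomial naive Bayes classifier on a labelled corpus using only the Python standard library. Naive Bayes is our choice for the example, not a method prescribed by these notes, and the four training sentences are made up; a real model would need far more data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(documents, labels):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(documents, labels):
        word_counts[label].update(tokenize(doc))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Laplace (add-one) smoothed word likelihood.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up labelled corpus, for illustration only.
train_docs = [
    "i love this great movie",
    "what a wonderful day",
    "this is terrible and sad",
    "i hate this awful weather",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, "a great and wonderful movie"))
print(predict(model, "awful and sad"))
```

Unlike a lexicon, the model knows only the words it saw during training, which is why the amount and domain of the labelled data matter so much.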

Hybrid approaches

Hybrid approaches are ML approaches supported with lexicon data.

Sentiment analysis is useful in many different situations. It can be used to explore brand sentiment, observe story arcs, recognize hate speech, and so on. Orange uses simple lexicon-based methods to compute the sentiment of the text.

Example with lexicon-based models

For this example, we'll load the election-tweets-2016 data set into the Corpus widget. The data is a collection of tweets by Donald Trump and Hillary Clinton from the 2016 US presidential election.

Sentiment scores are computed in the Sentiment Analysis widget. We will use the default Vader method, but feel free to try the others. In Select Columns, we will keep only the sentiment scores and put all the other features in meta attributes.

A nice visualization for observing several numeric values at once is Heat Map. A heat map shows documents in rows and attributes in columns, with the color of each cell corresponding to its value. In this case, high values are yellow/white and low values are blue.

But our data is completely unorganized and it is difficult to make sense of the visualization. Hence we will use some tricks to make it interpretable. First, we will join similar rows into a single row. We will do this with k-Means and keep only 50 rows.
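The idea of joining similar rows can be sketched outside Orange with a small k-means implementation. This is only an approximation of what the widget does: we assume each merged row is the average of its cluster members, we use the first k rows as initial centroids to keep the example deterministic, and the score rows below are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Lloyd's algorithm; the first k rows serve as initial centroids."""
    centroids = [list(row) for row in rows[:k]]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach every row to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


# Made-up sentiment score rows, one per document.
scores = [
    (0.6, 0.0, 0.4, 0.8),
    (0.0, 0.5, 0.5, -0.7),
    (0.5, 0.0, 0.5, 0.7),
    (0.0, 0.6, 0.4, -0.8),
    (0.7, 0.0, 0.3, 0.9),
]

merged_rows, membership = kmeans(scores, k=2)
print(membership)   # cluster index for every document
print(merged_rows)  # one averaged row per cluster
```

Each centroid stands in for a group of similar documents, so a heat map drawn from the centroids has k rows instead of one row per tweet.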

Now, our visualization is much more compact. But preferably, we would sort the rows in some logical order. For this, we will use the Clustering (opt. ordering) setting. It uses hierarchical clustering to put similar rows together and optimally orders the leaves of the dendrogram.

The visualization is finally interpretable. At the top, we see the positive documents, and at the bottom, the negative ones. Select the most positive documents by selecting the top leaf of the dendrogram. Then, observe the output in Corpus Viewer.

Looks like our selection indeed contains positive tweets!