Text Mining
Lesson 7: Semantic Analysis
These lecture notes are designed as accompanying material for the xAIM master's course. The course provides a gentle introduction to natural language processing, text mining, and text analysis. Students will learn how to accomplish various text-related data mining tasks through visual programming, using Orange as our tool of choice for this course.
The notes were prepared by Ajda Pretnar Žagar and Blaž Zupan. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.
The material is offered under the Creative Commons CC BY-NC-ND licence.
Chapter 1: Keyword Extraction
Keyword extraction refers to finding representative words and phrases in a document or a corpus. There are several keyword extraction techniques; we will discuss TF-IDF, RAKE, and YAKE!
TF-IDF
TF-IDF keyword extraction uses the TF-IDF transform to retrieve relevant words. The idea is based on the IDF transform, which inversely weighs word frequencies by their document frequency. Words that appear in all documents will thus have a lower weight than those that appear frequently in a small subset of documents.
While TF-IDF usually performs well, it is influenced by document length.
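To make the idea concrete, below is a minimal sketch of TF-IDF keyword ranking in Python with scikit-learn, independent of the Orange widgets; the toy documents are made up for illustration.

```python
# A minimal TF-IDF sketch with scikit-learn (not the Orange widget itself);
# the toy documents below are placeholders for a real corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the frog sat by the well",
    "the king rode to the castle",
    "the frog spoke to the princess by the well",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)            # documents x terms matrix
terms = vectorizer.get_feature_names_out()

# Keywords for the first document: terms with the highest TF-IDF weights.
weights = tfidf[0].toarray().ravel()
top = sorted(zip(terms, weights), key=lambda t: t[1], reverse=True)[:3]
print(top)
```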
RAKE
RAKE keyword extraction is based on word frequencies and a co-occurrence matrix. It uses stopwords as delimiters to find candidate phrases, computes a co-occurrence matrix over the words in those phrases, and uses it to determine the keyword ranking.
RAKE is more suitable for finding individual words and does not consider semantic relations. It is relatively fast and efficient.
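For a hands-on impression, here is a sketch using the rake-nltk package, one of several open-source RAKE implementations; Orange's Extract Keywords widget uses its own implementation, so the scores may differ. The example text is made up.

```python
# A RAKE sketch with the rake-nltk package (Orange ships its own
# implementation); NLTK stopword and tokenizer data must be downloaded first.
import nltk
from rake_nltk import Rake

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)      # sentence tokenizer data
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK versions

text = (
    "Keyword extraction finds representative words and phrases in a corpus. "
    "RAKE splits the text on stopwords, scores the candidate phrases with a "
    "word co-occurrence matrix, and ranks them."
)

rake = Rake()                        # English stopwords by default
rake.extract_keywords_from_text(text)
print(rake.get_ranked_phrases_with_scores()[:5])   # (score, phrase) pairs
```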
YAKE!
YAKE! is based on statistical properties of words, their position, and a co-occurrence matrix. It first preprocesses the text, performs feature extraction (considering casing, word position, word frequency, and context words), scores the words, deduplicates the results, and ranks them.
YAKE! is suitable for finding phrases and considers semantics. Compared to RAKE, it is slower to compute.
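The yake package on PyPI provides a reference implementation that can be tried outside Orange. A minimal sketch, assuming that package is installed (the example text is made up; Orange's widget may use different defaults):

```python
# A YAKE! sketch with the yake package; lower scores indicate more
# relevant keywords.
import yake

text = (
    "Semantic analysis uncovers meanings in unstructured text and organizes "
    "documents so that conceptually similar documents lie close together."
)

extractor = yake.KeywordExtractor(lan="en", n=2, top=5)  # up to 2-word phrases
for phrase, score in extractor.extract_keywords(text):
    print(f"{score:.4f}  {phrase}")   # lower YAKE! score = more relevant
```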
Example
We will use the Grimm-tales-selected corpus. Load the data in the Corpus widget and perform preprocessing with Preprocess Text. Then connect Extract Keywords to Preprocess Text.
The figure below shows the results ranked by TF-IDF. Filter by the other two methods and observe how the results change.
Chapter 2: Semantic Analysis
Semantic analysis of text means uncovering meanings in unstructured text and organizing the documents in such a way that conceptually similar documents lie close together. This is something we already explored with document clustering, but here we will go a bit deeper. We will learn how to construct annotated document maps.
We are using PMC-Patients-children, which is a subset including only child patients from the PMC data set of patient notes extracted from PubMed Central. We load the data with the File widget and pass it to Corpus, where we set the "patient" variable as the text variable. We preprocess the data and observe it in a word cloud.
We see that the notes refer to patient(s), days (since admission), normal (results), blood (results), and so on. This is a nice general overview of the data, but it would be better if we could organize the data in a more concise way. For example, in a map where similar documents would lie close to one another.
Thus, we prepare document vectors with bag-of-words and embed the documents in a 2D space with t-SNE. Finally, we pass the data to Annotated Corpus Map, where we once again set the t-SNE projection and use Gaussian mixture models to determine the clusters.
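Outside Orange, the same pipeline can be sketched in Python with scikit-learn; the texts list below is a synthetic stand-in for the preprocessed patient notes, and the number of mixture components is chosen by hand rather than by the widget.

```python
# A rough sketch of the document-map pipeline: bag-of-words vectors, a 2D
# t-SNE projection, and Gaussian mixture clustering. The texts below are
# synthetic placeholders for the PMC-Patients-children notes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

texts = [f"patient note {i} fever blood pressure admission day {i % 7}"
         for i in range(60)]                      # placeholder corpus

bow = CountVectorizer().fit_transform(texts)      # bag-of-words matrix
coords = TSNE(n_components=2, init="random",
              perplexity=20, random_state=0).fit_transform(bow.toarray())
clusters = GaussianMixture(n_components=5,
                           random_state=0).fit_predict(coords)
print(coords[:3], clusters[:10])
```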
It looks like the data contains roughly five clusters, on pulmonary disease, lab results, ageing, eye disease, and cancer. Remember, the data refers only to child patients.
Chapter 3: Semantic Search
Finding key terms in a dataset used to be quite laborious. One would have to read all the documents and find keywords by hand. If the documents were digitised, one could, of course, also search for a given keyword. But then one had to know regular expressions to find all possible occurrences of a word (e.g. work, working, worker). Ideally, one would simply list some key terms and ask the computer to find everything related to these terms in the text. This task is called semantic search.
We are using the mtsamples dataset, a subset of medical transcripts from three domains: Cardiovascular/Pulmonary, Neurology, and Orthopedic. The domains are not important for this task; we'd simply like to find terms of interest and where they appear in the text. We load the data with the Datasets widget and pass it to Corpus, where we set the "transcription" variable as the text variable.
Next, we preprocess the data. We keep the default settings, then add Lemmagen lemmatization for English between tokenization and filtering. Lemmatization converts tokens to their base form. Finally, we remove all the tokens that appear in fewer than 10 documents: set the Absolute value in Document frequency to 10 - max.
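As a rough Python equivalent of this step, the sketch below lemmatizes tokens with NLTK's WordNet lemmatizer (a stand-in for Lemmagen, which is what the widget uses) and drops rare tokens with a document-frequency cutoff; the transcription texts are assumed to be loaded elsewhere.

```python
# A rough stand-in for the widget's preprocessing: lemmatize with NLTK's
# WordNet lemmatizer (Orange uses Lemmagen) and drop tokens that appear in
# fewer than 10 documents via min_df.
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Lowercase, split on whitespace, and reduce each token to its base form.
    return " ".join(lemmatizer.lemmatize(tok) for tok in text.lower().split())

print(lemmatize("the knees and joints were examined on both days"))

# On the full corpus, min_df=10 keeps only tokens occurring in at least
# 10 documents (analogous to the Document frequency filter).
vectorizer = CountVectorizer(min_df=10)
```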
Now, we prepare the list of keywords we're interested in, say "osteoporosis", "knee", and "joint". We write the words in the Word List widget. Connect Preprocess Text to Semantic Viewer, then add the Words output from Word List to Semantic Viewer as well. The widget will use the Words input to compute sentence-level similarity to the keywords based on their embeddings, and return similarity scores and matches.
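Under the hood, this amounts to comparing sentence embeddings with keyword embeddings. A minimal sketch with the sentence-transformers package; the model name and example sentences are assumptions for illustration, and Orange's Semantic Viewer uses its own embedding service rather than this model.

```python
# A sketch of embedding-based keyword matching with sentence-transformers;
# the model name and example sentences are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

keywords = ["osteoporosis", "knee", "joint"]
sentences = [
    "The patient reports chronic knee pain after a fall.",
    "Degenerative joint disease was noted on the X-ray.",
    "Blood pressure and heart rate were within normal limits.",
]

kw_emb = model.encode(keywords, convert_to_tensor=True)
sent_emb = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity of every sentence to every keyword; a sentence counts as
# a match when its similarity to a keyword exceeds a chosen threshold.
scores = util.cos_sim(sent_emb, kw_emb)
print(scores)
```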
Sort the documents in Semantic Viewer by Match count. This way we can observe the documents with the most matches to the provided keywords. Degenerative joint disease? Sounds like this is the kind of document we'd like to inspect further.