
Friday: Text Mining

Text embedding, clustering and classification for finding related words and articles and classifying new newspaper articles

We have designed this course for prospective STEM teachers. These working notes include Orange workflows and visualizations we will construct during the lectures. Throughout our training, you will see how to accomplish various data mining tasks through visual programming and use Orange to build visual data mining workflows. Many similar data mining environments exist, but the lecturers prefer Orange for one simple reason—they are its authors.

These course notes were prepared by Blaž Zupan and Janez Demšar. Special thanks to Ajda Pretnar Žagar for the earlier version of the material. Thanks to Alana Newel, Nancy Moreno, and Gad Shaulsky for all the help with the organization and the venue of the course. We would like to acknowledge all the help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia.

The material is offered under the Creative Commons CC BY-NC-ND licence.

Mining of Text and Words

Data Collection

  1. Create a training data set consisting of about 200 words of your choice; the more the merrier. Get a list of words from ChatGPT or any large language model. Copy the list to Excel, one word per row. If needed, use Excel (or a similar spreadsheet application) to extract only the words (e.g., remove the numbers in front of the words, if there are any). Ask ChatGPT to help you design a formula to do so, or see the scripted alternative after this list. Every column in the data table should start with a header (say, "ids" and "words"). Store the Excel file in some convenient location.

  2. Open Orange and install the Text add-on.

  3. Read your data file in Orange. Use Select Columns to keep only the column with words. Inspect the data in the Data Table. We will call this data our training set.
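The word-list cleanup from step 1 can also be scripted. Below is a minimal Python sketch, assuming the pandas and openpyxl packages are installed; the raw lines and the output file name words.xlsx are our own illustrative choices, not something Orange requires:

```python
import re
import pandas as pd

# A few raw lines as they might be pasted from ChatGPT.
raw_lines = ["1. chair", "2. couch", "3. bed", "4. lamp"]

# Strip any leading numbering ("1.", "2)", ...) and surrounding whitespace.
words = [re.sub(r"^\s*\d+[.)]\s*", "", line).strip() for line in raw_lines]

# One word per row, with headers, in a file Orange's File widget can read.
table = pd.DataFrame({"ids": range(1, len(words) + 1), "words": words})
table.to_excel("words.xlsx", index=False)
```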

Embed the Data Into Vector Space

To proceed, we have to represent our data, that is, our words, with numbers. These vector-based representations will serve us when computing distances between words and finding similar words.
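To see what such a representation buys us, here is a tiny Python sketch with made-up three-dimensional vectors; real word embeddings have hundreds of dimensions, but the distance computation is the same:

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity; it ignores vector length
    # and compares only the directions of the two vectors.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy vectors standing in for word embeddings.
chair = np.array([0.9, 0.1, 0.0])
couch = np.array([0.8, 0.2, 0.1])
galaxy = np.array([0.0, 0.1, 0.9])

print(cosine_distance(chair, couch))   # small distance: related words
print(cosine_distance(chair, galaxy))  # large distance: unrelated words
```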

  1. For some strange reason of software engineering, Orange text mining requires that you push the text data through the Corpus widget. Well, not only software engineering: the Corpus widget lets you choose which of the text-based features will characterize the data instances and be used for text mining. Since we have only one text feature in our data, there is no other choice than to use the only one we have, yet we still need the Corpus widget to mark it. Do this, and check the output in either the Data Table or the Corpus Viewer, or, best, both.

  2. Embed the words in the vector space by feeding the output of the Corpus widget to Document Embedding. Use the fastText embedding, as it is better suited for words. Check the output of Document Embedding in the Data Table. What is the size of the vectors that represent the words?
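Outside of Orange, a similar word embedding can be computed with the fasttext Python package. Orange's Document Embedding widget runs on a server, so the sketch below, which downloads a pre-trained English model (a several-gigabyte download), only approximates what the widget computes:

```python
import fasttext
import fasttext.util

# Fetch and load the pre-trained English fastText model (cc.en.300.bin).
fasttext.util.download_model("en", if_exists="ignore")
model = fasttext.load_model("cc.en.300.bin")

vec = model.get_word_vector("chair")
print(vec.shape)  # (300,): fastText represents each word with 300 numbers
```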

Clustering

  1. Cluster the words. We will use hierarchical clustering and thus have to assess pairwise distances first; the Distances widget performs this operation. Embedding vectors are quite long, and we should use the cosine distance to measure the similarity between our data items (words). Feed the resulting distance matrix into Hierarchical Clustering and there use the Ward linkage. Hierarchical Clustering shows the dendrogram; annotate its leaves with words. Are semantically similar words indeed clustered together? Comment on the resulting word groups. (A scripted equivalent is sketched below.)
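The same pipeline, distances followed by hierarchical clustering, can be scripted with scipy. The sketch below uses random vectors as hypothetical stand-ins for the embeddings; note that Ward linkage formally assumes Euclidean distances, so pairing it with cosine distances, here as in Orange, is a pragmatic choice:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical stand-ins for the embedded words: 10 words, 300 dimensions.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 300))
words = [f"word{i}" for i in range(10)]

# Pairwise cosine distances, as computed by the Distances widget.
distances = pdist(vectors, metric="cosine")

# Ward linkage, mirroring the setting in Hierarchical Clustering.
Z = linkage(distances, method="ward")

dendrogram(Z, labels=words, orientation="right")
plt.tight_layout()
plt.show()
```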

Finding Similar or Semantically Related Words

Here, we would like to start with another list of words, called reference words, each time choosing one word from the reference list and finding semantically related words in the training set of words we have used above. Say, if our training set includes objects from daily life and our reference word is "relax", we would hope to find words like "chair", "couch", or "bed" in our training list of words.

  1. Open Excel (or a similar spreadsheet application), use "words" as the header of the first column, and list the reference words below it. Save this file as "reference.xlsx" to your desktop.

  2. Use the File widget to read "reference.xlsx", check in the Data Table that everything is ok, push the data from the File widget through Corpus, and embed it in the vector space. To make the embedding comparable to the one used for the training set, make sure you choose fastText as the embedding method.

  3. Find words from the training set that are related to a reference word with the Neighbors widget (a scripted sketch of this search follows the list). Make sure you first connect the widget to the embedded training words, so that its Data input signal comes from the training data set. We would like to choose just one data item, that is, one word, from the set of reference words. You can use a Data Table or Corpus Viewer to select the embedded reference data item: click the corresponding row in whichever of the two widgets you have used. The selection should now be sent as the reference to the Neighbors widget.

  4. In the Neighbors widget, make sure you use cosine distance, and set the number of neighbors to some low number, say, 3.

  5. Connect the Neighbors widget to a Corpus Viewer to view the results. Ideally, you would now have both open: the widget where you select the reference word, and the Corpus Viewer with the results from the Neighbors widget.

  6. Select different reference words and check the results. Do they make sense? If you want to include some other words in the reference set, go back to Excel, add the words, and reload the data by opening the File widget and clicking the "Reload" button.
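In code, the Neighbors widget roughly corresponds to a nearest-neighbor search under the cosine metric. Here is a minimal sketch with scikit-learn, again with random vectors as hypothetical stand-ins for the embedded training and reference words:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embeddings: 200 training words and one reference word.
rng = np.random.default_rng(0)
train_vecs = rng.normal(size=(200, 300))
train_words = [f"word{i}" for i in range(200)]
ref_vec = rng.normal(size=(1, 300))

# Three nearest neighbors under the cosine distance, as set in the widget.
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(train_vecs)
dist, idx = nn.kneighbors(ref_vec)

for d, i in zip(dist[0], idx[0]):
    print(f"{train_words[i]}: cosine distance {d:.3f}")
```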

Clustering of News Articles

Reuse parts of the workflow that you have constructed above, but use the data on news articles (provided by your instructor and temporarily available as news-articles.xlsx):

  • Instead of the training set of words, load the file with the articles.
  • In the Corpus widget, you may want to use both standFirst and bodyText as text features (standfirst is the opening line in the articles, bodyText the main text of the article).
  • Embed the data with multilingual SBERT, which works with most common languages and was designed to embed sentences and larger text sequences.
  • Instead of hierarchical clustering, where the dendrogram may become unreadable because of the many articles in this data set, you may consider feeding the data into the t-SNE widget. There, color the dots that represent articles according to sectionName. Explore your selections from the t-SNE visualization in the Corpus Viewer.
  • In place of the file that reads the reference words and its corresponding Corpus widget, you may use Create Corpus. Make sure that the embedding of the reference is also set to multilingual SBERT. In Create Corpus, you can type or paste any text you would like to use as a reference, say, "artificial intelligence", or even longer text sequences. You may want to find a news article on the web and copy and paste a selected paragraph to see if any of the news from our training set is similar in content.
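Outside of Orange, the embedding and the map can be approximated with the sentence-transformers and scikit-learn packages. The model name below is one common, publicly available multilingual SBERT model; Orange's SBERT embedder runs on a server, so the exact model may differ. The three short strings are hypothetical stand-ins for the articles' text:

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical stand-ins for standFirst + bodyText of each article.
articles = [
    "Stock markets rallied today after the central bank's announcement.",
    "The championship final ended in a dramatic penalty shootout.",
    "Astronomers report the discovery of a new exoplanet.",
]
sections = ["business", "sport", "science"]

embeddings = model.encode(articles)

# Two-dimensional t-SNE map; perplexity must be below the number of articles.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

for (x, y), section in zip(coords, sections):
    plt.scatter(x, y)
    plt.annotate(section, (x, y))
plt.show()
```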

Observe the results, experiment with various reference texts, and think about the appropriateness of using SBERT embeddings in finding clusters of articles and related news articles.
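Continuing in the same spirit, ranking the articles by similarity to a pasted reference paragraph takes only a few lines; the reference text and articles below are again hypothetical:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Hypothetical articles and a reference paragraph pasted from the web.
articles = [
    "Stock markets rallied today after the central bank's announcement.",
    "The championship final ended in a dramatic penalty shootout.",
    "Astronomers report the discovery of a new exoplanet.",
]
reference = "Researchers announced the discovery of a distant planet."

article_vecs = model.encode(articles)
ref_vec = model.encode([reference])

# Cosine similarities between the reference and every article, best first.
scores = util.cos_sim(ref_vec, article_vecs)[0]
for score, text in sorted(zip(scores.tolist(), articles), reverse=True):
    print(f"{score:.3f}  {text}")
```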