Survival Analysis with Visual Analytics

A Practical Guide to Survival Analysis with Visual Analytics in Orange

Welcome to the Survival Analysis with Visual Analytics tutorial, where you'll learn how to uncover meaningful patterns in time-to-event data. The tutorial aims to provide an accessible introduction to survival analysis, covering topics such as preprocessing survival data and conducting complex data analysis using visual programming. To facilitate the learning process, we will be using Orange, our data mining tool of choice for this course. If you have not already installed Orange, please visit its download page now.

These course notes were prepared by Ela Praznik, Jaka Kokošar, Blaž Zupan, and Martin Špendl with help from the members of the Bioinformatics Lab at the University of Ljubljana, Slovenia. The material is offered under the Creative Commons CC BY-NC-ND licence.

Chapters

Chapter 1: Survival Data and Survival Curve

Understanding how long something lasts before an event occurs, be it a machine breaking down, a product failing, or a dental filling falling out, is at the heart of survival analysis. Unlike standard statistical methods that expect a complete outcome for every case, survival analysis is uniquely suited to handle incomplete information through the concept of censoring. This chapter introduces survival data, explains how to prepare it for analysis, and demonstrates how to visualize time-to-event outcomes using the survival curve. Through a relatable example involving dental fillings, we’ll explore how to handle censored observations and estimate survival probabilities with tools like the Kaplan-Meier estimator, both manually and using the Orange data mining platform. The following are the main concepts that we will cover:

Survival Analysis: A way to study how long it takes for something to happen.
Event: The outcome we are observing, like a dental filling falling out.
Censoring: When the event hasn’t (yet) happened during the study or its timing is unknown.
Survival Time: The time from the beginning of observation until the event or censoring.
Kaplan-Meier Curve: A graph showing the chance of the event not happening over time.

1.1 Survival data

Survival analysis is a set of techniques used for modeling time to an event of interest.

In survival analysis, the primary outcome we are interested in is the time to an event of interest. The "survival" in survival analysis stems from its use in the study of the survival of patients. However, the event of interest can be many things: a disease relapse, the first technical failure of a car, or something as simple as the time until a newly acquired dental filling falls out. Many analysis methods would apply if the event occurred in all individuals. However, it is usual that at the end of a study, some individuals have yet to have the event of interest, and some might have ended their participation in the study beforehand for reasons other than experiencing the event. This doesn't mean that they won't necessarily experience the event in the future, but that their actual time to the event is unknown. This phenomenon is called censoring, and the data with an unknown true time to event is called censored data. Survival analysis includes a set of methods that can deal with datasets that include censored data.

Data with an unknown true time to event is called censored data.

Let's look at the example from the video lecture. We asked a group of 10 friends when, within the last ten years, they last got a dental filling and when it fell out, if it did. Can we estimate the probability of a new dental filling remaining in place after five years?

Below we plotted the answers as a diagram. The x-axis marks the time from 2012 to 2022, and the lines represent when each person got their dental filling and how long it lasted. For instance, Bert got a dental filling in 2012, which lasted until 2015, and Fay got hers in 2014, which lasted until 2020. We have marked the friends whose dental filling fell out with a cross. However, two of the participants in this small study, Irene and Chloe, had their dental filling in place at the end of our observation window. And Harry and Fay did not lose their dental fillings, but for some reason or another, we do not know what happened to their dental fillings from 2020 to 2022. Perhaps they got them changed before the fillings had the chance to fall out. These four participants represent our censored data points. The time their dental fillings stayed in place still tells us something about how long fillings usually last. So instead of discarding their data, we mark them with a circle.

This data represents an example of right-censoring, but there are also cases of left- and interval censoring. Left-censoring would mean that we observe the presence of a state or condition but do not know when it began. Interval censoring, on the other hand, means that we only know the event happened somewhere within a time interval, for instance because individuals come in and out of observation. This tutorial focuses only on right-censoring since this is how most survival data is censored.

A minimal survival dataset is thus composed of observations with a survival time and event variable. The latter specifies if the event has, in fact, occurred (event=1) or whether it has been censored (event=0). We can transform our dental fillings data plotted as a diagram into a data table suitable for survival analysis. Since we are interested in how much time the dental filling lasted and not exactly what year it fell out, we re-plot the diagram, aligning when each person got their cavity filled to time 0.

We can now easily transform this diagram into a data table. We order the data instances by time so that Anthony - whose time to his filling falling out is the shortest - is first, Bert is second, followed by Chloe, and so on. The third column contains the data on event censoring. On the diagram, we've marked Anthony and Bert with a cross, which means their filling fell out. Under their names in the table, we input a 1. But Chloe is marked with a circle since her filling has yet to fall out by the end of 2022, so we input a 0. We do the same for others. We have successfully prepared the data for the application of survival analysis methods.
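The step from the diagram to the data table can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the Orange workflow: the `Filling` record, its field names, and the `to_survival_row` helper are hypothetical, and Irene's placement year is made up. Bert's and Fay's years come from the text.

```python
from dataclasses import dataclass
from typing import Optional

STUDY_END = 2022  # end of the observation window in the example


@dataclass
class Filling:
    name: str
    placed: int                      # year the filling was placed
    fell_out: Optional[int] = None   # year it fell out, if observed
    last_seen: Optional[int] = None  # last year with information, if follow-up was lost


def to_survival_row(f):
    """Return (name, time, event) with time aligned to 0 at placement."""
    if f.fell_out is not None:
        return (f.name, f.fell_out - f.placed, 1)  # event observed
    end = f.last_seen if f.last_seen is not None else STUDY_END
    return (f.name, end - f.placed, 0)             # censored


records = [
    Filling("Bert", placed=2012, fell_out=2015),  # from the text
    Filling("Fay", placed=2014, last_seen=2020),  # from the text: nothing known after 2020
    Filling("Irene", placed=2016),                # hypothetical placement year
]

# Order the rows by survival time, as in the data table.
rows = sorted((to_survival_row(f) for f in records), key=lambda r: r[1])
for row in rows:
    print(row)
```

Each row carries the two columns a minimal survival dataset needs: the survival time and the event indicator (1 for an observed event, 0 for a censored observation).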

1.2 Survival curve

The survival function gives the probability of surviving past a particular time.

We now have our data nicely organized in a table. Let's return to the question that got us started: can we predict how likely it is that a new dental filling will have fallen out after five years? Intuitively, the probability of this happening increases over time because minor damage to and around the dental filling accumulates as time goes by. This damage can be due to the type of filling and its interaction with the bacterial biofilm, your diet, saliva composition, mechanical forces, etc. However, when we visually represent such data, we want to plot the probability of the event not happening; in this case, the probability of the dental filling not falling out. To put it a bit crudely, we are interested in the probability of it surviving in your mouth as time passes. We can estimate the survival function - the probability of surviving past a particular time - using the Kaplan-Meier estimator. Let's calculate the survival probability and its changes over time by hand.

On the x-axis, we mark the time in years, and on the y-axis, the probability of the dental filling staying in place. Assuming that your dentist has done a good job, the probability of the dental filling staying in place when you leave the clinic is equal to 1. That means that at time 0, you have a 100% chance of the filling staying in your mouth - let's mark that on the graph. No one loses their filling in the first year, so the probability remains 1. However, when we get to the second year, Anthony's filling gives in and falls out. So after two years, 1 filling out of the 10 at risk falls out. The probability of a filling staying in place is thus reduced by 0.1 (1/10), leading to a new probability of 0.9.

The probability remains the same from years 2 to 3. After three years, Bert's filling falls out, and Chloe's data point gets censored. Only nine people are at risk since Anthony's filling has already fallen out. The probability that the filling stays in place at year three is thus 1 minus 1/9, so 0.89. However, because the probability is cumulative, we have to multiply the probability of staying in place at this time by the probability from the previous year. Thus the probability at year 3 is really 0.89 multiplied by 0.9, which is approximately 0.8.

From years 3 to 4, the probability remains the same yet again. After four years, however, two of our friends, David and Elle, lost their dental fillings. How many people are at risk of losing their filling after four years? We started with 10, Anthony and Bert have already lost theirs, and Chloe is no longer at risk because her data point was censored. So there are only seven people at risk. Notice that this is where we took Chloe's censored data point into account. So the survival probability is 1 minus 2/7 multiplied by 0.8, the probability of staying in place from the previous year. The survival curve thus falls to 0.57.
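The hand calculation above can be written as a short Python sketch of the Kaplan-Meier estimator. It is a minimal illustration, not the code Orange runs. The first five observations follow the text (Anthony, Bert, Chloe, David, Elle); the remaining five are hypothetical placeholders with times beyond four years, so only the steps at years 2, 3, and 4 should be compared with the text.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return the steps [(time, survival)] of the Kaplan-Meier estimate."""
    failures = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # events and censored cases both leave the risk set
    at_risk = len(times)
    survival = 1.0
    curve = [(0, 1.0)]
    for t in sorted(leaving):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk fillings that stay in place.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Rows 1-5 follow the text; rows 6-10 are hypothetical placeholders.
times = [2, 3, 3, 4, 4, 6, 7, 8, 9, 10]
events = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 2))
```

Note how Chloe, censored at year 3, never lowers the curve herself but does reduce the number at risk for the step at year 4, which is exactly how the hand calculation treats her.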

The Kaplan-Meier plot is a visual representation of the survival function.

The survival median is the time at which the survival probability drops to 0.5 or 50%.

We make these small calculations for the rest of the time points and draw the steps until we reach the end of our 10-year observation window. The graph we have produced is the Kaplan-Meier plot, which is one of the most used plots in survival analysis. It shows us how the survival probability changes over time. So, for instance, if we want to know the time at which the survival probability drops to 0.5, we can read it on the plot. We can see that the survival probability has fallen to 50% after seven years. In the literature, this time, called the survival median, is often marked on the Kaplan-Meier plot.
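Reading the survival median off the curve amounts to finding the first step at which the survival probability is at or below 0.5. A minimal sketch follows, assuming the curve is given as a list of (time, survival) steps. The first four steps are the values computed by hand above; the value of the step at year 7 is a hypothetical number chosen only so that the curve crosses 0.5 there, as the text reports.

```python
def median_survival(curve):
    """Return the first time at which survival is <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Steps at 0, 2, 3, 4 follow the hand calculation; the step at 7 is hypothetical.
curve = [(0, 1.0), (2, 0.9), (3, 0.8), (4, 0.57), (7, 0.43)]

print(median_survival(curve))
```

If the curve never drops to 0.5 within the observation window, the median is undefined, which the sketch signals by returning `None`.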

We can reproduce our manual analysis with a computer. We will use Orange, a data mining tool that uses visual programming in the form of data-analysis pipelines consisting of interconnected widgets. If you are entirely new to Orange, you are welcome to first watch a few introductory videos available here. We will run Orange, close the welcome screen, and install the Survival Analysis add-on. The list of add-ons is available from the Options menu. We select Survival Analysis and click Ok. Orange will reload automatically.

Let's load our small dataset into Orange. We will use the File widget and load the data from the file we have created on the Desktop. The time and event columns must be marked as meta-features. Orange did so automatically. We can inspect the data in the Data Table. Then we have to tell Orange which columns hold the time and the event, so that it knows it's dealing with survival data. We do this with the As Survival Data widget. Now we have to connect the output from As Survival Data to the Kaplan-Meier widget. On the left of the Kaplan-Meier widget, we can choose to display the median, confidence intervals, and censored data. We have successfully reproduced the previous plot in Orange.

The Kaplan-Meier plot in Orange includes a plot legend where you can see the number of non-censored observations out of all of the data (6/10 in this case) as well as the survival median (7).

Chapter 2: Exploring Survival Features

In survival analysis, we are often interested not just in how long something lasts, but in why it lasts longer in some cases than others. By exploring additional features in our dataset—like material type or brushing habits—we can form meaningful groups and compare their survival outcomes. This helps us uncover which factors might influence longevity. In this chapter, we will learn how to group data, compare survival curves, and identify the most informative features. We will cover the following concepts:

Categorical Feature: A variable with distinct categories, like material type (ceramic or composite), used to form groups for comparing survival.
Continuous Feature: A variable with a range of numeric values, like brushing time, requiring thresholds to split into groups.
Data Grouping: The process of dividing data based on a feature to compare survival curves between subgroups.
Kaplan-Meier Curve: A graph showing survival over time, used here to compare different groups based on selected features.
Log-Rank Test: A statistical test that checks whether the difference between two survival curves is likely due to chance.
Threshold: A chosen value used to split a continuous feature into two groups (e.g., brushing more or less than 6 minutes).
Feature Ranking: Automatically evaluating which features best separate the data into groups with different survival outcomes.
P-Value: A number indicating whether the difference in survival between groups is statistically significant (typically below 0.05).
Discretization: Converting a continuous feature into categorical bins to enable group comparisons.

2.1 Forming and comparing groups

Besides the survival time and event information, observations in a survival dataset are often characterized by a feature or two. We can use the features to form groups and compare their survival curves. How we form groups depends on whether the feature is categorical or continuous.

Returning to our previous example, we have expanded the dental fillings dataset to include 10 more samples and three additional features. The dataset is available in the Datasets widget in Orange. We inspect it in the Data Table.

The first additional feature concerns the type of material the dental filling was made of. It turns out the fillings were either composite or ceramic, so type of material is a categorical feature. Anthony got a composite filling, and so did Bert; Chloe, however, got a ceramic one, and so on. The second additional feature is named Brushing time and denotes the average amount of time in minutes someone spends brushing their teeth daily. Anthony uses an electric toothbrush, so he’s timed his brushing to 4 minutes daily, while Bert says he takes a bit more time. Chloe, on the other hand, has braces and thus spends 15 minutes per day brushing her teeth. Brushing time is a numeric feature, meaning its values are continuous. Lastly, there is another categorical feature, this one denoting whether the subject of the study prefers cats or dogs.

To form groups based on a categorical feature, we split participants based on the category to which they belong.

If we want to form groups based on a categorical feature, we can do so in the Kaplan-Meier widget. We first pass the data through the As Survival Data widget to mark the features that record time and event. We can then inspect the data in the Kaplan-Meier plot. When we have a categorical feature, such as the type of material, it’s easy to form groups of data instances. In our case, one group consists of friends with ceramic fillings and the other of friends with composite fillings. We can group the samples in the Kaplan-Meier widget and draw two survival curves on the same plot, one for each filling type. It seems that ceramic dental fillings have a better prognosis of staying in place than composite ones.

The plot legend now shows information about the formed groups: the number of non-censored observations out of all the participants within a group (e.g., 8/10 for the group with the composite filling) and the median survival time for each group (5 years for the group with the composite filling). Later we will also use

To form groups based on a continuous feature, we have to define a threshold and split participants based on that threshold. There are several ways of defining a threshold in Orange. Here we use the Select Rows widget to specify a threshold with a conditional statement. Later on we will use Distributions and Discretize.

On the other hand, if we want to group by a continuous feature, we have to define a threshold value to form groups. We will form two groups, one whose members brush their teeth more than six minutes a day, and the other whose members brush less than that. We use the Select Rows widget to define the variable and the threshold. Normally, Select Rows outputs just the data that matches the condition, but in our case we need all the data, with an indicator of whether the condition, brushing over six minutes, was matched. When connecting Select Rows to Kaplan-Meier, we therefore have to rewire the connection to indicate that Select Rows is sending out all the data. We can now open the Kaplan-Meier plot and choose Selected under Groups. We plotted two curves, one for people who spend over 6 minutes brushing their teeth, indicated with Yes, and one for those who spend less time than that, indicated with No.
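The difference between passing on only the matching rows and passing on all rows with an indicator can be sketched in plain Python. This mimics the idea of the rewired Select Rows output and is not Orange code; the `mark_selected` helper and the dictionary keys are hypothetical, and Bert's brushing time is a made-up value (the text only says he takes a bit longer than Anthony).

```python
def mark_selected(rows, feature, threshold):
    """Return ALL rows, each with a 'Selected' value of 'Yes' or 'No'."""
    return [
        dict(row, Selected="Yes" if row[feature] > threshold else "No")
        for row in rows
    ]


rows = [
    {"name": "Anthony", "brushing_time": 4},  # from the text
    {"name": "Bert", "brushing_time": 5},     # hypothetical value
    {"name": "Chloe", "brushing_time": 15},   # from the text
]

marked = mark_selected(rows, "brushing_time", threshold=6)
for row in marked:
    print(row["name"], row["Selected"])
```

No row is dropped: the threshold only adds a grouping column, which is what the Kaplan-Meier widget needs in order to draw one curve per group.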

Let’s compare this plot with the previous one using the type of material to form the two groups. Grouping the subjects of our study by material type made a more considerable difference in the survival curve. We gathered this just by visually inspecting the data.

Of course, it doesn’t make sense to use just any feature to form groups. Not all features affect survival, so not all of them will separate the data into groups with different survival curves. For instance, one can assume that whether a person prefers dogs to cats does not affect how long their dental filling lasts. We can check this by forming groups by the last feature which contains information on whether the person prefers cats or dogs. Since this is a categorical feature, we simply open the Kaplan-Meier widget and select to group by this feature. The survival curves are barely separated. Preferring dogs to cats really doesn’t affect the survival of dental fillings.

To evaluate the difference between survival curves we use the log-rank test.

Although visual analytics is a useful way of exploring the data, there is of course a more systematic way of comparing how well a particular feature separates the survival curves, which is called the log-rank test. It computes a p-value: how likely a difference between survival curves at least as large as the observed one would be if it arose by chance alone. The smaller the p-value, the stronger the evidence that our feature actually separates the data into groups with different survival outcomes. We can see this value next to the Kaplan-Meier plot in Orange. Grouping by type of material gives us a smaller p-value.
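For readers curious about what happens behind the p-value, here is a minimal sketch of the standard two-group log-rank test in plain Python. It is an illustration under stated assumptions, not the implementation used by Orange: the function name `logrank_p` is hypothetical, and both example groups are made-up data. At each event time the test compares the number of events observed in one group with the number expected if both groups shared the same survival.

```python
import math


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the p-value (chi-square, 1 degree of freedom)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({u for u, e, _ in data if e == 1}):
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group B
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical groups: fillings in group A fail early, those in group B late.
p_different = logrank_p([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
# Two identical groups should show no evidence of a difference.
p_same = logrank_p([2, 4, 6, 8], [1, 0, 1, 1], [2, 4, 6, 8], [1, 0, 1, 1])
print(p_different, p_same)
```

Censored observations contribute to the at-risk counts up to their censoring time but never count as events, mirroring their role in the Kaplan-Meier estimate.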

2.2 Ranking Survival Features

We previously estimated the difference in survival between two cohorts on a Kaplan-Meier plot and manually identified the feature that led to cohorts with distinct survival characteristics. However, real world survival datasets often include more than just 3 features to choose from. This time we will be working with a larger dataset. We will start by exploring the features manually. Then, we will let Orange do some of the work for us.

We can find the available survival datasets in Orange using the Datasets widget. Just type "survival" into the search bar. For this example let’s use the German Breast Cancer Study Group data. Select the dataset and connect it to the Data Table widget. The first two columns show the time and event. Here, the time is the Recurrence Free Survival Time, the time between the start of the study and the recurrence of cancer. The remaining columns hold other clinical variables. Some of them are categorical, like tumor grade, and others are continuous, like the patient's age. Note that we don't use the As Survival Data widget here because this dataset is already in the correct format for survival analysis.

We already know how to form and compare groups manually. For categorical features, we simply pass the data to the Kaplan-Meier widget. There are three categorical features to choose from: Tumor Grade, Menopausal Status, and Hormonal Therapy.

A p-value below 0.05 indicates that there is a significant difference between the survival curves.

Let's try Menopausal Status and Hormonal therapy. Also, let's plot the median survival time and the confidence intervals to get a better idea of the data. We find that being in menopause doesn't really have much effect on survival; the curves are barely separated, and the p-value is quite large. What about Hormonal therapy? This, on the other hand, is very informative. The patients that did not receive hormonal therapy had a significantly worse prognosis.

Using numeric features to form cohorts takes an extra step; we need to define a threshold to split the data. In the previous section we used Select Rows for this, but this time we do this with the Distributions widget. Say we're interested in whether there is a significant difference in survival between patients above and below the age of 60. In the Distributions widget, change the bin width to a small number. Five will do. Then select the population above the age of 60 by interactively selecting bins.

The Distributions widget allows us to define groups by interactively selecting a part of a feature's distribution.

To see what we’ve done we’ll use the Data Table widget. The default output of the Distributions widget is only the selected data; in our case, this means only the patients above 60. But we want all the patients, so we have to rewire the connection. After rewiring we can see that the Data Table has an extra column called "Selected" that specifies the group.

Remember, Orange widgets can have different outputs. When we select a part of an interactive visualisation only the selected part of the data is passed on. If we wish to compare the selected part of the data to the non-selected one, we have to rewire the connection and pass on all the data.

Now we can send the data from Distributions to Kaplan-Meier widget. It turns out that whether a patient is above or below 60 doesn’t make a big difference in the survival probability over the observed time.

We can also try different thresholds now that we have constructed our workflow. For example, let's select everyone over the age of 40 in Distributions. You can see that Orange automatically reflects this change in the plot. And there is a more significant difference between the survival curves now.

The survival curve of the selected group is always red. Furthermore, the selected group can be above the threshold (e.g., above a certain age) or below it (e.g., below a certain progesterone receptor value).

Next, we can try another continuous variable like the Progesterone receptor. Let's select just the first bin since it already contains more than half the patients. This time we really see a big difference.

To make comparing features easier, we can use the Rank Survival Features widget, which forms cohorts for each variable and evaluates their difference in survival using the log-rank test. So let's send our data to Rank Survival Features. There's a selection of two scoring methods for establishing which feature is most predictive of survival. Orange automatically selects the multivariate log-rank test, and we'll stick with that. Next, we can sort the features by p-value; we find the most informative is the Number of Positive Nodes. Positive nodes refer to lymph nodes in the armpit area where metastatic cancer cells have been found.

The Rank Survival Features widget uses the median value as a threshold for the continuous features.
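The idea of ranking features can be sketched as follows: split each continuous feature at its median, run a log-rank test on the two resulting cohorts, and sort the features by p-value. This is a simplified illustration using a plain two-group log-rank test, not the widget's actual code; the function names and the tiny dataset (with features named `informative` and `noise`) are hypothetical.

```python
import math
from statistics import median


def logrank_p(groups, times, events):
    """Two-group log-rank p-value; `groups` holds True/False per sample."""
    data = list(zip(times, events, groups))
    observed = expected = variance = 0.0
    for t in sorted({u for u, e, _ in data if e == 1}):
        n = sum(1 for u, _, _ in data if u >= t)
        n_a = sum(1 for u, _, g in data if u >= t and g)
        d = sum(1 for u, e, _ in data if u == t and e == 1)
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g)
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    return math.erfc(math.sqrt((observed - expected) ** 2 / variance / 2))


def rank_features(features, times, events):
    """Return [(feature name, p-value)] sorted from most to least informative."""
    scores = []
    for name, values in features.items():
        cut = median(values)               # median used as the threshold
        above = [v > cut for v in values]  # True for the above-median cohort
        scores.append((name, logrank_p(above, times, events)))
    return sorted(scores, key=lambda item: item[1])


# Hypothetical data: ten samples, all with an observed event.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1] * 10
features = {
    "noise": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],        # unrelated to survival time
    "informative": [9, 8, 9, 8, 9, 1, 2, 1, 2, 1],  # high values fail early
}

ranking = rank_features(features, times, events)
for name, p in ranking:
    print(name, round(p, 4))
```

Sorting by p-value puts the feature whose median split separates the survival curves most clearly at the top of the list.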

On the left of the Discretize widget, we select the feature we want to split by, and on the right, we tick the option to split by equal frequency intervals. Notice that the red survival curve corresponds to the group above the median of a given continuous feature.

We can inspect what the survival curve actually looks like in this case. The output of the Rank Survival Features widget is a reduced dataset containing the time and event columns along with the selected feature. So we can choose "Number of Positive Nodes," and then we need to split the data at the median again. We could do this via Distributions, like before, or as an alternative, we could use the Discretize widget. So let's connect them and split our data into two intervals of equal frequency. Now we can connect this to another Kaplan-Meier widget, and there we go. We successfully identified the most informative feature regarding survival just with a few clicks.

Chapter 3: Ranking genes

In the previous chapter, we explored breast cancer data and ranked the eight clinical features according to their effect on survival. Emerging data sets in biomedicine can include many more features though. For example, tissue samples are characterized by the expression of thousands of genes. We'll now learn how to explore such data according to survival, and cover the following concepts:

Gene Expression: A measurement of how active a gene is in a given tissue, often used to study disease outcomes.
Bioinformatics: A field that combines biology and computing, used here to analyze large-scale gene data.
Gene Set: A group of genes linked by a common function or pathway, like the RAS signaling pathway that plays a key role in cell growth, proliferation, differentiation, and survival.
Recurrence-Free Survival: Time from treatment until the cancer returns.
Overall Survival: Time from treatment until death related to the disease.
Median Expression Split: A method of forming groups based on whether a gene's expression is above or below its median value.
Feature Ranking: Ordering genes (or features) by how well they separate survival outcomes.

For our analysis, we will need to install the Bioinformatics Add-On, which is available in the Options menu. Orange needs to restart after installing an add-on. We will explore the METABRIC study, which includes the survival data of 1904 patients with primary breast tumors. Each patient sample is characterized by 35 clinical features and the expression of over 24 000 genes. We can take a quick look in the Data Table widget. Scrolling to the right, we can see the clinical features and the thousands of gene expression values. We also find the data appropriately includes the Time and Event marked as target features. However, there are two more Time and Event columns marked as meta-features. This is because, in this study, they measured the recurrence-free survival time as well as the overall survival. Recurrence-free survival time refers to the time until cancer recurrence, and overall survival refers to the time until breast cancer-related death. We will use the As Survival Data widget to specify which time and event pair we want to explore. Let’s select overall survival.

Note that survival datasets can have more than one Time and Event feature. Use As Survival Data to select the appropriate pair.

If we want to use the bioinformatics widgets on some dataset of genes, we have to pass the data through the Genes widget. This will match the names of genes with the NCBI Gene database and annotate the data with the appropriate gene codes. The first time you use this widget, it will have to load all the data from the server, which may take a few moments. Once it is done, we can see that each gene has an ID and a short description. Of the 24000 genes, the widget matched 18000, so we will be working on a slightly reduced dataset.

With this many features, you might wonder how we can analyze all of this data. Many biomedical researchers are focused on finding genes that could be used for cancer prognosis. For example, breast cancer researchers have identified the proto-oncogene KRAS as an important factor. An article reported that high expression of KRAS is linked to a significantly worse prognosis for patients in the METABRIC study. Let’s try reproducing one of the Kaplan-Meier plots from this report in Orange.

Gene expression values are continuous. We already know that this means defining a threshold to form groups to compare survival. We can again do this with the Distributions widget. On the left, we filter the features to find KRAS and then adjust the bin width. The gene expression values in this data set are normalized, so we’ll select the values above 0. This should split the data more or less in half. We can connect Distributions to the Kaplan-Meier Plot and rewire the connection to pass on all the data. In the Kaplan-Meier widget, we choose Selected for the group indicator and tick the boxes to display the confidence intervals and the median. The p-value below 0.005 indicates that there is indeed a significant difference in the survival curves between patients with KRAS expression above or below 0. The ones we selected on the Distributions graph, those with expression value above 0, are depicted by the red line.

Gene sets are lists of genes associated with a specific biological function. Preloaded gene sets are available in Orange through the Gene Sets widget.

KRAS is a member of the RAS protein family. RAS proteins function as molecular switches for signaling pathways critical to several aspects of normal cell growth. They are mutated in a variety of human tumors. The set of all RAS genes forms a gene set. More generally, gene sets are lists of genes associated with a specific biological function and are widely used in studying expression data. We will compare how well other genes in the RAS pathway separate the survival curves of cohorts.

Pass the information from Genes to the Gene Sets widget. This widget can filter out a specific gene set from our expression data. On the left, we can select the organism and the gene set database we are interested in. Homo sapiens is already correctly marked. The Ras pathway is available through the KEGG pathway database. We choose KEGG on the right and write ‘ras’ in the filter tab. The Ras signaling pathway contains 232 genes, but only 219 have been found in our METABRIC dataset. Click on the pathway to output the filtered genes and their expression values.
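Filtering expression data by a gene set boils down to keeping the genes that appear both in the set and in the dataset, which is why fewer genes come out than the pathway lists. A minimal sketch with made-up data: the `filter_gene_set` helper, the expression values, the three-gene "set", and the placeholder gene `GENE_X` are all hypothetical and stand in for the real KEGG pathway and METABRIC data.

```python
def filter_gene_set(expression, gene_set):
    """Keep only the genes of `expression` that belong to `gene_set`.

    Returns the filtered data and the gene-set members missing from the data.
    """
    matched = [gene for gene in expression if gene in gene_set]
    filtered = {gene: expression[gene] for gene in matched}
    missing = sorted(gene_set - set(matched))
    return filtered, missing


# Hypothetical expression values: one list of per-sample values per gene.
expression = {
    "KRAS": [0.4, -1.2, 0.9],
    "GENE_X": [1.1, 0.3, -0.5],  # placeholder gene outside the set
    "FLT3": [-0.7, 0.8, 0.2],
}
toy_gene_set = {"KRAS", "FLT3", "HRAS"}  # tiny illustrative stand-in for a pathway

filtered, missing = filter_gene_set(expression, toy_gene_set)
print(list(filtered), missing)
```

Genes listed in the set but absent from the dataset are reported separately, just as the widget shows how many pathway genes were found in the data.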

Now, connect the output with Rank Survival Features. Previously, we used this widget for ranking clinical features to figure out which ones best separate cohorts according to survival. This time, we will use it for ranking genes in a gene set! Because gene expression values are numeric, the Rank Survival Features widget forms cohorts by splitting the expression values at the median. Let’s order the genes according to the p-value. We find that KRAS is only in 33rd place! The gene that is best at separating patients by survival is FLT3. Select the gene by clicking on it. If we want to know what FLT3 means, we can pass the information to the Genes widget. The protein product of FLT3 expression functions as a receptor tyrosine kinase that is normally expressed on hematopoietic stem cells. Its mutations are well studied in acute myeloid leukemia; however, upregulated FLT3 expression has also been observed in lymph node metastases of patients with primary breast cancer.

On the left, select FLT3, and on the right, choose to form intervals of equal frequency.

Let’s go ahead and plot the survival curves defined by FLT3. Pass the output from Rank Survival Features to Discretize and then to the Kaplan-Meier Plot. We group by FLT3 and select to display confidence intervals and the median.

We can now compare the two plots side by side, one grouping patients according to the expression of KRAS and the other according to the expression of FLT3. There is a visible difference! Notice that whereas above median expression of KRAS is predictive of poor survival, above median expression of FLT3 is predictive of higher survival probability. Indeed, a quick search through the literature confirms that higher expression of FLT3 indicated a favorable prognosis in patients with breast cancer.

Chapter 4: Ranking gene sets

In this last chapter, we will learn how to find informative gene sets regarding survival. That is, instead of single genes, we will consider groups of genes. The concepts we will cover are:

Gene Set: A collection of genes that share a common biological function or pathway.
Enrichment Score: A value that will here reflect how strongly the genes in a gene set are collectively expressed in a single sample, indicating how active the related biological process is.
Single Sample Scoring: A method to assign an enrichment score for each gene set in each sample individually.
Gene Set Annotation: The process of mapping gene names in the dataset to standardized gene set databases.
Hallmark Collection: A curated set of gene sets representing core biological processes, used for enrichment analysis.
Gene Set Ranking: Evaluating and ordering gene sets by how strongly their enrichment scores relate to patient survival.

First, we will load and preprocess the data in Orange. We will work with a survival dataset from The Cancer Genome Atlas (TCGA) that includes 306 samples from patients with cervical cancer. Let’s inspect the data in the Data Table. Each sample includes the overall survival time, an event column with information on whether or not the patient died of cervical cancer, and the expression values of over 23000 genes.

Next, we must pass the data through the Genes widget to map the names to the gene IDs we will use later in the workflow. This can take a few moments. In the top left corner, we see that about 22000 genes have been successfully matched with their NCBI IDs.

A gene set is a list of genes representing a particular biological function, usually corresponding to a specific molecular pathway. For example, genes involved in glucose degradation make up the glycolysis gene set. We want to know how the over- or under-expression of a particular pathway correlates with the survival of patients. One way of doing this is to first evaluate the expression of a pathway in each sample. A high pathway enrichment score will roughly tell us that the molecular pathway is over-expressed. We will use a version of gene set enrichment analysis (GSEA) modified to evaluate single samples, called single sample GSEA.

In Orange, single sample GSEA is implemented in a widget called Single Sample Scoring. We can see that this widget requires two inputs: the expression data and one or more gene sets.

We will choose a molecular pathway from the Gene Sets widget. Let's look at the collection of Hallmark gene sets from the Molecular Signatures Database. Hallmark gene sets represent specific, well-defined biological processes and are a useful gene set collection to start with. On the right, we can see that this collection includes gene sets corresponding to inflammatory response, hypoxia, heme metabolism, etc. We will select the hypoxia gene set, which contains genes that are up-regulated in response to low oxygen levels. Now we check that single sample GSEA is the method chosen in the Single Sample Scoring widget. We can now connect Gene Sets to the Single Sample Scoring widget.

It immediately starts calculating an enrichment score for each sample. We inspect the output in another Data Table. We find the same gene expression data table as before, except that there is now a new column representing the enrichment score for the hypoxia hallmark gene set. A higher value suggests that hypoxia-related genes are overexpressed in that particular sample.
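
To give a feel for what a single sample enrichment score measures, here is a simplified Python sketch written in the spirit of single sample GSEA. It ranks the genes of one sample by expression and sums the difference between a rank-weighted running fraction of gene set members and the running fraction of all other genes. This is an illustration under our own assumptions, not the code behind Single Sample Scoring: the function name, the weighting exponent alpha = 0.25, the gene names, and the expression values are all hypothetical choices, and Orange may weight or normalize the scores differently.

```python
def single_sample_score(expression, gene_set, alpha=0.25):
    """Simplified ssGSEA-style enrichment score for one sample.

    expression -- dict mapping gene name to its expression in this sample
    gene_set   -- set of gene names that form the gene set
    alpha      -- exponent applied to the rank weights (an assumed default)
    """
    # Order genes from the highest to the lowest expression
    ordered = sorted(expression, key=expression.get, reverse=True)
    n = len(ordered)
    # Gene set members are weighted by their rank (n for the top gene, 1 for the last)
    weights = [(n - i) ** alpha if g in gene_set else 0.0
               for i, g in enumerate(ordered)]
    total_in = sum(weights)
    n_out = sum(1 for g in ordered if g not in gene_set)
    if total_in == 0 or n_out == 0:
        raise ValueError("gene set must cover some, but not all, measured genes")

    p_in = p_out = score = 0.0
    for g, w in zip(ordered, weights):
        if g in gene_set:
            p_in += w / total_in   # weighted fraction of set genes seen so far
        else:
            p_out += 1 / n_out     # fraction of the remaining genes seen so far
        score += p_in - p_out      # accumulate the gap between the two curves
    return score

# Hypothetical gene set and two hypothetical samples
toy_set = {"GENE_A", "GENE_B"}
sample_high = {"GENE_A": 9.0, "GENE_B": 8.0, "GENE_C": 3.0,
               "GENE_D": 2.0, "GENE_E": 1.0, "GENE_F": 0.5}
sample_low = {"GENE_A": 0.5, "GENE_B": 1.0, "GENE_C": 9.0,
              "GENE_D": 8.0, "GENE_E": 3.0, "GENE_F": 2.0}

score_high = single_sample_score(sample_high, toy_set)
score_low = single_sample_score(sample_low, toy_set)
print(score_high, score_low)
```

The sample in which the gene set members are the most highly expressed genes receives a positive score, and the sample in which they are the least expressed receives a negative one, which matches the interpretation of the new column above.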

To find how informative a pathway is in relation to survival, we can treat the enrichment scores for a given gene set as a new continuous variable. As previously with continuous data, we have to split the data at some threshold, e.g., the median. First, reduce the dataset to include only the two survival features and the enrichment scores for the hypoxia gene set. We can do this in Select Columns. Under Features, select all of the genes and move them to the Ignored list. Then choose the hypoxia hallmark gene set from the Metas list and move it to the Features list. The widget status bar shows that its output is now a reduced dataset with only one feature column.

Note that instead of Discretize, we could also use Distributions or Select Rows to define a threshold.

Pass this reduced dataset to Discretize. Select the hypoxia gene set feature, and on the right, choose to split the data into intervals of equal frequency. Next, pass the data on to the Kaplan-Meier widget. Upon opening it, select to group by the gene set of interest and choose to display the confidence intervals and the median. The red line represents patients with the up-regulation of genes that respond to low oxygen levels. We can see that the up-regulation of hypoxia-related genes is associated with a worse survival prognosis. Hypoxia is often observed in larger tumors since the proliferation of cancer cells can cause the tumor to outgrow the network of blood vessels that supplies tumor cells with oxygen.

Instead of choosing a specific gene set, such as the hypoxia one, we might want to compare several gene sets in relation to survival. Previously, we compared how different clinical features or gene expression values correlate with survival by splitting the data according to each feature, calculating the log-rank statistics, and ranking the features based on the p-value. We can do the same with gene sets by first calculating the enrichment scores for each set.
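
The ranking procedure just described (split each feature at its median, compute the log-rank statistic, sort by p-value) can be sketched in a few lines of Python. This is a minimal illustration and not the Rank Survival Features implementation; the function names, the two feature names, and the toy data are hypothetical. The p-value uses the fact that the two-group log-rank statistic approximately follows a chi-square distribution with one degree of freedom.

```python
import math
from statistics import median

def logrank_p(times, events, groups):
    """Two-group log-rank test; groups holds 0 or 1 for each patient."""
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [g for u, g in zip(times, groups) if u >= t]
        n = len(at_risk)          # patients still at risk just before t
        n1 = sum(at_risk)         # of those, patients in group 1
        d = sum(1 for u, e in zip(times, events) if u == t and e)
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e and g == 1)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0  # no information to separate the groups
    chi2 = (observed - expected) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

def rank_features(times, events, features):
    """Split every feature at its median and sort by log-rank p-value."""
    results = []
    for name, values in features.items():
        cutoff = median(values)
        groups = [1 if v > cutoff else 0 for v in values]
        results.append((logrank_p(times, events, groups), name))
    return sorted(results)

# Hypothetical toy data: ten patients, two enrichment-score columns
times = [5, 8, 12, 14, 18, 20, 25, 30, 40, 45]
events = [1, 1, 0, 1, 1, 1, 0, 1, 0, 0]
features = {
    "set_A": [0.2, 0.4, 0.1, 0.3, 0.5, 1.5, 1.1, 1.9, 1.3, 1.7],
    "set_B": [1.0, 0.1, 1.2, 0.2, 1.4, 0.3, 1.1, 0.4, 1.3, 0.5],
}
ranking = rank_features(times, events, features)
for p, name in ranking:
    print(f"{name}: p = {p:.4f}")
```

In the toy data, high values of set_A go with long survival times, so it ranks first with a small p-value, while set_B is unrelated to survival and ranks last.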

Open the Gene Sets widget again, but this time, select all 50 hallmark gene sets and pass them to the Single Sample Scoring widget. Calculating the enrichment scores for each gene set might take a few moments. We inspect the results in a Data Table and see that the table has 50 new columns, each corresponding to a hallmark gene set.

Open Select Columns one more time. The hallmark gene sets are marked as Metas, so move them to the Features window.

Now we can use the Rank Survival Features widget to rank the gene sets according to survival. Let’s order the gene sets according to the p-value. The three best-ranked gene sets are those related to angiogenesis, signaling via the TGF-beta cytokine, and the transition of cells from epithelial to mesenchymal phenotype.

If we want to produce the corresponding Kaplan-Meier plots, we select the top three and pass them on to the Discretize widget. We again mark the three gene sets and choose to split them into intervals of equal frequency.

Finally, we pass the data to the Kaplan-Meier widget, tick the boxes for showing the median and confidence intervals, and group by one of the three gene sets.

For all of the top three ranking gene sets, over-expression of the gene set is associated with a worse prognosis. Angiogenesis is the process of blood vessel formation. Cancer cells, like all cells, need nutrients and oxygen to grow and function properly. So when a tumor mass grows, blood vessels are formed to ensure a fresh blood supply. The literature also indicates that TGF-beta exerts tumor-promoting effects in late-stage cancer, increasing tumor invasiveness and metastasis.

Similarly, epithelial-mesenchymal transition, which refers to the dedifferentiation of epithelial cells, has been shown to be involved in the initiation of metastasis in cancer progression. The correlations we found between gene set overexpression and patient survival are intriguing. They make sense to us because of prior knowledge, but we should always keep in mind that they do not prove causation.

We have reached the conclusion of the Tutorial on Survival Analysis Using Visual Analytics. Throughout the tutorial, we have covered essential concepts and commonly employed methods in survival analysis. With the help of Orange, we have progressed from basic data analysis pipelines for censored data to more intricate ones. Orange offers various interactive data analytics components that facilitate the exploration of survival data.