Survival Analysis Tutorial

A Self-Paced Tutorial on Visual Analytics for Survival Analysis

Welcome to the Survival Analysis Tutorial! In this tutorial, you'll learn the intricacies of analyzing and interpreting survival data, an increasingly important skill in biomedicine. Whether you're investigating the efficacy of new treatments, understanding disease progression at the molecular level, or exploring the lifespan of cells under various conditions, survival analysis is an indispensable tool. This tutorial provides a unique opportunity to challenge yourself and brush up on your data science skills. We'll walk you through the key concepts of survival analysis, and you'll learn about censoring and key visualizations of survival data. By the end of the tutorial, you'll know how to identify important features that can separate your data into groups with different survival probabilities, identify potential gene markers, and find a set of genes that affect survival. Your journey to mastering survival analysis begins here!

We will be using Orange; you can install it by visiting its download page.

The tutorial has four chapters, each including video lectures and links to the lecture notes. There is a short quiz at the end of each chapter.

If you have any questions or need further clarification, please feel free to contact us on our Discord server.

We offer this material under the Creative Commons CC BY-NC-ND licence.

Chapter 1: Survival Data and Survival Curve

What is special about survival data? What is censoring? How do we visually represent changes in survival probabilities over time? In this chapter, we will go through some basic concepts of survival analysis. We will work with a toy example of survival data and show how to plot the survival curve, one of the key visualizations in survival analysis.

Please start by watching the following video:

Warmup Questions

If the video was too fast for you, or if you want to dive deeper into the concepts covered, we have also prepared a supplemental notebook. The topic of the Data & Survival Curve video is covered in Chapter 1. Please note that this is a complementary resource. To answer the warmup questions, going through the introductory video should suffice.

Survival analysis refers to a set of statistical techniques used to analyze data where the outcome variable is the time until an event's occurrence. (1pt.)

You have 2 attempts.

In survival analysis, what is the type of outcome variable? (1pt.)

You have 2 attempts.

The occurrence of an event is typically described by a binary variable with values of 0 or 1. (1pt.)

You have 2 attempts.

If an individual has not experienced the event before the study ends, their survival time is considered censored. (1pt.)

You have 2 attempts.

Survival Data And Estimation Of The Survival Curve

Consider the following data:

In our graph, the subjects (A, N, F, …) are in rows. The start of the blue line tells us when we started observing the subject; this could be the time of the intervention, for example. The end of the blue line tells us when we last checked or observed an event. Observed events are marked with an X; all other observations are censored. For example, we started observing subject K in week 5, and the event occurred in week 8. The total survival time for K is therefore 3 weeks. The survival time for N is 4 weeks, and at the last visit the event had not yet occurred.
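The conversion from observation windows to a survival table can be sketched in a few lines of Python. This is only an illustration: subject K uses the weeks given above, while the start week for N is a made-up value chosen so that its follow-up is 4 weeks. The function name to_survival_rows is hypothetical as well.

```python
# Observation windows: (subject, start_week, last_observed_week, event_observed).
timeline = [
    ("K", 5, 8, True),   # from the text: observed from week 5, event in week 8
    ("N", 2, 6, False),  # hypothetical start week; censored after 4 weeks
]


def to_survival_rows(records):
    """Turn observation windows into rows with a duration and an event flag."""
    rows = []
    for subject, start, end, event_observed in records:
        rows.append({
            "subject": subject,
            "time": end - start,           # weeks under observation
            "event": int(event_observed),  # 1 = event observed, 0 = censored
        })
    return rows


rows = to_survival_rows(timeline)

# Two summary statistics that can be read directly off such a table.
mean_time = sum(r["time"] for r in rows) / len(rows)
censored_fraction = sum(1 for r in rows if r["event"] == 0) / len(rows)

for row in rows:
    print(row)
print("mean time:", mean_time, "censored fraction:", censored_fraction)
```

A spreadsheet built for Orange would follow the same layout: one row per subject, one column for the time, and one column for the event indicator.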

Use the graph above and construct a data table ready for survival analysis. Use Excel or a similar spreadsheet program. Feed the data into Orange and answer the following questions:

What is the average survival time calculated from the constructed data table? (1pt.)


You have 2 attempts.

What is the proportion of censored observations in the data? (1pt.)


You have 2 attempts.

By definition, median survival is the time at which half of the participants in a study population are expected to experience the event of interest. In Orange, you can view its value in the Kaplan-Meier widget.
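To see what the Kaplan-Meier widget computes, here is a minimal sketch of the product-limit estimator in plain Python. The toy times and events are invented for illustration and are not the data from the graph above. The median is taken as the first time at which the estimated survival drops to 0.5 or below, which is a common convention but an assumption about how any particular tool reports it.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time at which an event occurred."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose time equals t (events and censored alike).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def survival_at(curve, t):
    """Estimated survival probability at time t (step function)."""
    s = 1.0
    for time, surv in curve:
        if time > t:
            break
        s = surv
    return s


def median_survival(curve):
    """First time at which S(t) <= 0.5, or None if that never happens."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical toy data: time in weeks, event 1 = observed, 0 = censored.
times = [1, 2, 3, 4, 4, 6]
events = [1, 0, 1, 1, 0, 1]

curve = kaplan_meier(times, events)
print(curve)
print("S(2) =", survival_at(curve, 2))
print("median =", median_survival(curve))  # 4 for this toy data
```

Censored subjects never lower the curve themselves; they only shrink the number at risk for later event times.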

What is the median survival time (in weeks) for our group of patients? (1pt.)


You have 2 attempts.

What is the survival probability of the patients at week two? (1pt.)


You have 2 attempts.

Chapter 2: Exploring Survival Features

In this chapter, you will learn how to create groups of samples (subjects, patients) using additional features from the data set and how to assess whether these groups showcase a difference in survival. You will also learn how to automatically rank and find the data features that best correlate with survival.

To get started, watch the video below to see how to group your data and compare the survival curves of different groups:

Please also watch the video about survival-based ranking of data features:

We also cover the topic from these two videos in Chapter 2 of the accompanying notebook.

These two videos were about the creation and comparison of groups using both categorical and continuous features. We delved into the process of generating new data cohorts, comparing and visualizing survival curves, and implementing Kaplan-Meier plots using visual programming in Orange.

Exploring Survival Differences Between Cohorts Of Patients

Here we take a look at the U.S. Veterans Administration Lung Cancer Trial (from Kalbfleisch JD and Prentice RL, 1980), in which male patients with advanced, inoperable lung cancer received either standard therapy or investigational chemotherapy. The data set includes 137 patients, 9 of whom left the study before death. The study was designed to assess the benefit of test chemotherapy and analyze the effects of other covariates.

Your first task is to load the data into Orange and view the survival curve for the entire cohort of patients.

You can easily access this study's data in Orange using the Datasets widget. Search for "Veterans" and the dataset should appear at the top. This data already includes the information on time and event features (time since diagnosis [months], survival event), and you can feed the data to the Kaplan-Meier plot directly from the Datasets widget; you don't need to use the As Survival widget here, although it doesn't hurt.

What is the median survival time (in days)? (1pt.)


You have 2 attempts.

The next task is to split the data into two cohorts. Our lung cancer trial dataset contains a variable called "Treatment" that indicates whether patients received standard chemotherapy or an alternative, test chemotherapy. We can use this variable to split the patients into two cohorts, the control and the test group. Use the Kaplan-Meier widget to answer the following questions:

While you can calculate the desired summary statistics by combining some of the other widgets in Orange, you can get the answer to the questions here directly with the Kaplan-Meier widget by grouping the data in this widget accordingly. The graph legend reports the counts, where "N" in the legend refers to the size of the cohort, and "n" refers to the number of uncensored patients (those for whom the event was observed).

How many patients received standard chemotherapy treatment? How many censored observations are in this cohort? (1pt.)


You have 2 attempts.

How many patients received test treatment? How many censored observations are in this cohort? (1pt.)


You have 2 attempts.

Choose the correct interpretation of the comparison between the two chemotherapy treatments: (1pt.)


You have 2 attempts.

Using Different Features for Splitting Patients into Cohorts

Let us analyze the effect of other variables in this data set. How does age affect patient survival? How is survival affected by the Karnofsky performance score?

The Karnofsky Performance Score is a scale used to quantify a patient's general well-being and ability to carry out daily activities. Its typical values are 10 to 30 for fully hospitalized patients, 40 to 60 for partially hospitalized patients, and 70 to 90 for patients able to care for themselves. Observe the differences in survival between two cohorts of patients split by the median value for age, and similarly for two cohorts based on the Karnofsky performance score. You can use the Discretize widget and "Equal frequency" discretization for grouping by the median value.
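The idea of splitting by the median and then testing for a difference in survival can be sketched as follows. This is an illustration under stated assumptions, not a description of Orange's internals: the scores, times, and events are invented, the split rule (strictly above the median goes to the high group) is one possible choice, and the comparison uses a standard two-group log-rank test with a chi-square approximation on one degree of freedom.

```python
import math
from statistics import median


def logrank_test(times, events, groups):
    """Two-group log-rank test. groups holds 0/1 labels. Returns (chi2, p)."""
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)  # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e and g == 1)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0  # no information to separate the groups
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Hypothetical toy data: a performance score, survival time, and event flag.
scores = [30, 40, 40, 50, 70, 80, 80, 90]
times = [1, 2, 3, 4, 6, 7, 8, 9]
events = [1, 1, 1, 1, 1, 1, 0, 1]

# Median split: 1 = score above the median, 0 = at or below it.
threshold = median(scores)
groups = [1 if s > threshold else 0 for s in scores]

chi2, p = logrank_test(times, events, groups)
print("threshold:", threshold, "groups:", groups)
print("chi2 = %.3f, p = %.4f" % (chi2, p))
```

In this made-up example the high-score cohort outlives the low-score cohort entirely, so the test reports a small p-value even with only eight patients.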

Survival is different (p<0.05) when patients in the lung cancer dataset are split into two cohorts (using the median as a threshold) by (1pt.)


You have 2 attempts.

Select the correct interpretation of the effect on survival of the two features: (1pt.)


You have 2 attempts.

Chapter 3: Ranking Genes

What if our data contains thousands of features, for example, as is the case in some molecular biology datasets? How do we approach survival analysis in such cases? Here, you will learn how to use gene expression data for survival analysis. We will show you how to identify genes that are most predictive of survival.

Please start by watching the video below:

The topics from our video are also covered in Chapter 3 of the companion notebook.

In the video, we explored Orange's bioinformatics tools for analyzing and interpreting gene expression datasets, focusing on the relationship between gene expression and survival in breast cancer. Using the METABRIC study as an example, we demonstrated how to identify survival marker genes and understand their role in cancer prognosis.

Expression Data In Survival Analysis

Let’s look at the cervical cancer data from The Cancer Genome Atlas (TCGA). The dataset is available in the Datasets widget under “TCGA-CESC”. It includes survival data and gene expression values for 306 patients with cervical cancer.

In the literature, genes associated with regulating the UV response have been researched as potential therapeutic targets for cervical cancer treatment (e.g., Gu et al., 2019).

Your task is to identify a handful of potential marker genes associated with survival. For simplicity, you will look only at genes associated with a down-regulated ultraviolet radiation (UV) response. Oncogenes from the human papillomavirus (HPV), the leading cause of cervical cancer, have a complicated relationship with cellular response to UV.

The gene set associated with a down-regulated UV response can be found in the Gene Sets widget under the Molecular Signatures Database (MSigDB) among the Hallmark Gene Sets under the name "HALLMARK_UV_RESPONSE_DN".

Gene Ranking

Do not forget to install the Bioinformatics add-on first, and use the Gene Sets widget from this add-on to answer the question.

How many genes are in the HALLMARK_UV_RESPONSE_DN gene set? (1pt.)


You have 2 attempts.

Now load the TCGA-CESC data and consider only the genes that make up the down-regulated UV response hallmark gene set.

From the collection of down-regulated UV response genes, which are the top three genes that most influence cervical cancer progression? (1pt.)


You have 2 attempts.

Gene Expression-Based Cohort Analysis

Create two patient groups based on the expression of the highest-ranked gene from the set of down-regulated UV response genes. Then, chart the corresponding survival curves.

To answer the following quiz questions, you will have to assemble the longest data analysis pipeline yet, using widgets such as Genes, Gene Sets, Rank Survival Features, Discretize, and Kaplan-Meier. You will become a survival analysis expert!

What is the difference in the survival curves of the cohorts when the patients are split according to the median expression of the top-ranked gene? (1pt.)


You have 2 attempts.

Is overexpression of the top-ranked gene associated with increased or decreased survival? (1pt.)


You have 2 attempts.

We again assume that, given a gene, cohorts are created by splitting the patients according to the median gene expression.

How many top-ranked genes from the collection of down-regulated UV response genes result in cohorts with significant (p < 0.01) differences in survival curves? (1pt.)


You have 2 attempts.

Chapter 4: Ranking Gene Sets

There may be many, possibly hundreds of genes that affect survival. For interpretation purposes, it may be helpful to consider only groups of genes instead, such as genes from the same pathway. Here, we will learn how to evaluate the effect of a collection of genes on survival and how to rank gene sets according to their association with survival.

Please start by watching the following video:

The topics from the video are also covered in Chapter 4 of the accompanying notebook.

In the video, we learned how to quantify the relationship between gene sets and survival. The tutorial guided us through the process of loading, preprocessing, and evaluating the data to understand how the expression of specific gene sets correlates with patient survival.

Gene Sets Ranking

In 2019, Xia et al. published a paper identifying a new set of proteins that cause endogenous DNA damage when overproduced. The researchers discovered these DNA damage-up proteins (DDPs) in E. coli and identified 284 human homologs that are overrepresented among known cancer drivers.

They constructed three sets of genes: a set of all 284 candidate DDPs, a set of DDPs excluding known cancer drivers, and a set of DDPs excluding validated DDPs. Known cancer drivers are those whose gain- or loss-of-function in driving cancer has been established in the literature. Validated DDPs, on the other hand, refer to a subset of DDPs identified as actual DNA damage initiators in human cells by Xia et al. They evaluated the association of the three DDP gene sets with overall survival by calculating a gene set enrichment score for each sample and comparing two cohorts, one above and one below the top tertiles. Using any of the three DDP gene sets resulted in significant survival differences between the formed cohorts. These results indicate that there are genes among the discovered human DDP candidates that were previously unknown to drive cancer.

Check whether the three DDP gene sets (all, known excluded, validated excluded) are associated with decreased overall survival in the BRCA dataset, even when you split the patients into cohorts by the median of the enrichment scores (Xia and co-authors used tertiles for splitting, see above).

The BRCA dataset is available in the Datasets widget under the name “TCGA-BRCA”. Loading the dataset might take a minute.

The gene sets are available here. Use the workflow with the File, Genes, and Gene Sets widgets (see the image below) to load the gene sets into Orange. Once you load custom gene sets into the Gene Sets widget, continue using the widget like you usually would. Alternatively, you can download a pre-constructed workflow and open it with Orange.

Remember to pass the TCGA-BRCA data through the As Survival Data widget. You will need to use the Single Sample Scoring widget to score the gene sets. After scoring, remove the gene expression information using Select Columns and move the gene set scores to the Features section. You can evaluate each gene set individually using a combination of Discretize and Kaplan-Meier, or just use the Rank Survival Features widget to calculate the p-values for the differences in the survival curves.
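The scoring step can be pictured with a small Python sketch. Note that this is a deliberately simplified stand-in: it scores each sample by the mean of z-scored expression over the genes in the set, which is not necessarily the method used by the Single Sample Scoring widget or by Xia et al. The gene names, expression values, and function names are all hypothetical.

```python
from statistics import mean, median, pstdev

# Hypothetical expression matrix: gene -> expression values for five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.0, 2.0, 4.0, 6.0, 6.0],
    "GENE_C": [9.0, 7.0, 5.0, 3.0, 1.0],  # not in the gene set, ignored below
}
gene_set = ["GENE_A", "GENE_B"]


def zscores(values):
    """Standardize one gene across samples (assumes non-constant expression)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def score_samples(expression, gene_set):
    """One score per sample: the mean z-score over the genes in the set."""
    standardized = [zscores(expression[g]) for g in gene_set if g in expression]
    return [mean(per_sample) for per_sample in zip(*standardized)]


scores = score_samples(expression, gene_set)

# Median split of the scores into two cohorts, as in the exercise.
threshold = median(scores)
cohorts = ["high" if s > threshold else "low" for s in scores]

for s, c in zip(scores, cohorts):
    print("%.3f %s" % (s, c))
```

The resulting cohort labels play the same role as the discretized score column that is passed on to the Kaplan-Meier widget.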

Which of the three gene sets are associated with decreased overall survival when you split by the median enrichment score, given a significance threshold of 0.05? (1pt.)


You have 2 attempts.

We can use the same procedure to check the utility of other gene sets. Your next task is to rank all fifty gene sets from the Hallmark Gene Set collection with regard to overall survival in the BRCA dataset (order by p-value).

Your workflow from the previous question is ready to answer this question as well; just switch to the Hallmark Gene Sets in the Gene Sets widget. Remember to select all of the listed gene sets, as the Gene Sets widget outputs only the selected sets. You will also need to move the computed scores in the Select Columns widget from the Metas panel to the Features.

Which gene set is ranked the highest with regard to survival? (1pt.)


You have 2 attempts.

The 3rd- and 4th-ranked gene sets in the Hallmark Gene Set collection are associated with regulating DNA damage. Form groups based on these two gene sets and plot the corresponding survival curves.

Here, you could stay with the workflow from the previous two questions, select the 3rd and 4th gene sets in the Rank Survival Features widget, and then pass them to Discretize to create an indicator variable for the patient cohorts. Make sure you discretize only the two gene sets, and not the Time variable.

Select the true answer. Decreased overall survival is associated with: (1pt.)


You have 2 attempts.