From this article, we will use R language to perform survival analysis to a data set, in order to demonstrate some syntax and show the procedural of survival analysis using R.
The Data
A very classic data set is used in the purpose of demonstration. Here is a glance of the data set.
> library(survival)
> data(lung)
> head(lung)
inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
1 3 306 2 74 1 1 90 100 1175 NA
2 3 455 2 68 1 0 90 90 1225 15
3 3 1010 1 56 1 0 90 90 NA 15
4 5 210 2 57 1 1 90 60 1150 11
5 1 883 2 60 1 0 100 90 NA 0
6 12 1022 1 74 1 1 50 80 513 0
This data set is about “survival in patients with advanced lung cancer from the North Central Cancer Treatment Group. Performance scores rate how well the patient can perform usual daily activities.” using help()
function to see the description of the data set.
Description of the Data Set
inst: Institution code time: Survival time in days status: censoring status 1=censored, 2=dead age: Age in years sex: Male=1 Female=2 ph.ecog: ECOG performance score as rated by the physician. 0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = bedbound ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician pat.karno: Karnofsky performance score as rated by patient meal.cal: Calories consumed at meals wt.loss: Weight loss in last six months
For more information about the data, here is a reference: Loprinzi CL. Laurie JA. Wieand HS. Krook JE. Novotny PJ. Kugler JW. Bartel J. Law M. Bateman M. Klatt NE. et al. Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. Journal of Clinical Oncology. 12(3):601-7, 1994.
Note that time
and status
are two indicators of “time to event”, while others are covariates in this data set. Among them inst
, sex
, ph.ecog
and ambulatory
are categorical data while age
, ph.karno
, meal.cal
and wt.loss
are quantitative data.
Visualization of the Data Set
Before we started to do survival analysis, first thing is to do an overview of the data. Here, two methods are used. The first one is a heatmap from library ComplexHeatmap
, which provided various tools for heatmaps, documentation is here. And the second one is a correlation matrix, which provides linear relationships for all pairs of variables and distributions of variables, R has many packages that provide good visualization of correlation matrix, e.g. corrplot
, PerformanceAnalytics
and what I used here, GGally
.
### Heat Map ###
> library(ComplexHeatmap)
> lungm <- t(scale(data.matrix(lung)))
> ht = Heatmap(lungm,
name = "feature value",
column_title = "patients",
row_title = "features",
show_column_names = FALSE)
> draw(ht)
### Correlation Matrix ###
> library(GGally)
> ggpairs(lung)
From the heatmap, NA
s are marked with grey. We can see some missing values in pat.karno
or wt.loss
but not much, which is good. From correlation matrix, it is hard to see strong correlations between any two of the variables, and many of them are categorical data. The time
is clearly not gaussian distributed. the age
is skewed to older groups.
After getting some sense about the data, survival analysis can start.
Kaplan-Meier Analysis and Log-rank Test
Suppose we want to first see the overall survival condition (basic KM curve) and then see is there any significant varied patterns of patients who have different sex
(KM analysis curve & log-rank test).
> library(survminer)
### KM Analysis ###
# basic curve
fit1 <- survfit(Surv(time, status)~1, data = lung)
ggsurvplot(fit1, data = lung, censor=T)
# KM analysis curve
fit2 <- survfit(Surv(time, status)~sex, data = lung)
ggsurvplot(fit2, data = lung, pval = T, pval.method = T, conf.int = T, censor=T)
Here, survminer
package is used to plot KM curves by ggsurvplot()
function, here are some parameters that I found useful:
conf.int: logical value. If TRUE, plots confidence interval.
pval: logical value, a numeric or a string. If logical and TRUE, the p-value is added on the plot. If numeric, than the computet p-value is substituted with the one passed with this parameter. If character, then the customized string appears on the plot. See examples - Example 3.
pval.method: whether to add a text with the test name used for calculating the pvalue, that corresponds to survival curves’ comparison - used only when
pval=TRUE
.
In the second plot, log-rank test is automatically performed and p-value is specified on the plot.
Now, we want to see other covariates’ influences to the survival time.
Cox Regression
Cox proportional hazard model and its analysis is performed using survival
as well as survminer
.
coxph()
is used to perform general Cox PH model, its documentation can be found here.cox.zph()
is used to test the proportional hazards assumption forcoxph()
with documentation here.ggforest()
: in order to get a visualization of the influences of covariates,ggforest()
fromsurvminer
is used to create a plot, doc.
### Cox Regression ###
> library(survminer)
> # Cox PH Model
> cox <- coxph(Surv(time, status) ~ age + sex + ph.karno + pat.karno + meal.cal + wt.loss, data = lung)
> cox
Call:
coxph(formula = Surv(time, status) ~ age + sex + ph.karno + pat.karno +
meal.cal + wt.loss, data = lung)
coef exp(coef) se(coef) z p
age 0.0090807 1.0091220 0.0117503 0.773 0.4396
sex -0.4859823 0.6150927 0.1995728 -2.435 0.0149
ph.karno -0.0023393 0.9976635 0.0079466 -0.294 0.7685
pat.karno -0.0193962 0.9807907 0.0077533 -2.502 0.0124
meal.cal 0.0000126 1.0000126 0.0002460 0.051 0.9592
wt.loss -0.0080307 0.9920014 0.0073014 -1.100 0.2714
Likelihood ratio test=17.53 on 6 df, p=0.007508
n= 169, number of events= 122
(59 observations deleted due to missingness)
> # Test Proportional Hazards Assumption
> czph <- cox.zph(cox)
> czph
chisq df p
age 0.5502 1 0.4583
sex 1.4804 1 0.2237
ph.karno 7.9155 1 0.0049
pat.karno 3.8774 1 0.0489
meal.cal 5.1873 1 0.0228
wt.loss 0.0143 1 0.9050
GLOBAL 14.5493 6 0.0241
> # Visualize Results
> ggforest(cox, data = lung)