Survival Analysis Exercises in R: 18 Real-World Practice Problems
Eighteen practice problems on survival analysis in R: build Surv objects, fit Kaplan-Meier curves, run log-rank tests, fit and diagnose Cox proportional hazards models, fit parametric AFT models, and predict patient-level survival probabilities. Every solution is hidden until you click; the lung dataset from the survival package is the backbone.
Section 1. Surv objects and Kaplan-Meier basics (3 problems)
Exercise 1.1: Build a Surv object from time and event columns
Task: Construct a survival response object from the lung dataset, treating time as the follow-up duration in days and status as the event indicator (1 = censored, 2 = died). Print the first ten entries so a + clearly marks censored observations. Save the response object to ex_1_1.
Expected result:
#> [1] 306 455 1010+ 210 883 1022+ 310 361 218 166
Difficulty: Beginner
Click to reveal solution
Explanation: Surv() packages two columns into a censored response that downstream functions like survfit() and coxph() understand. The + suffix marks right-censoring (the subject was still alive when last observed). The status column in lung is coded 1/2 (instead of 0/1), and Surv() auto-detects that convention: any non-1 value is treated as the event. If you ever load data where censoring is coded 0/1 with 1 = event, you can pass it directly without recoding.
Exercise 1.2: Fit and summarize an overall Kaplan-Meier curve
Task: An oncology team treating advanced lung cancer patients wants a single overall survival curve for the cohort, ignoring all covariates. Fit a Kaplan-Meier estimator on the full lung cohort and print the survfit summary so the team can read off the number at risk, events, and median survival time. Save the model to ex_1_2.
Expected result:
#> Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
#>
#> n events median 0.95LCL 0.95UCL
#> [1,] 228 165 310 285 363
Difficulty: Beginner
Click to reveal solution
Explanation: The right-hand side ~ 1 is the intercept-only formula, telling survfit() to fit one curve across the whole cohort. Median survival here means the smallest time at which the estimated survival probability falls to 0.5 or below. Of 228 patients, 165 died during follow-up; the rest are censored, mostly because the study ended before they died. The bounds 285 and 363 are the 95% confidence interval on that median.
Exercise 1.3: Read off survival probabilities at fixed time points
Task: A trial statistician needs the estimated 6-month and 1-year survival probabilities (with 95% confidence intervals) from the overall lung Kaplan-Meier curve. Use summary() on the fitted survfit object at times = c(182, 365) and store the resulting summary object as ex_1_3.
Expected result:
#> Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
#>
#> time n.risk n.event survival std.err lower 95% CI upper 95% CI
#> 182 151 63 0.708 0.0303 0.651 0.770
#> 365 65 71 0.409 0.0358 0.345 0.486
Difficulty: Intermediate
Click to reveal solution
Explanation: summary(fit, times = ...) interpolates the step-function estimate at user-specified horizons, which is exactly the format clinical reports want. Roughly 71% of patients are alive at 6 months; that drops to 41% by 12 months. The "events" column is cumulative events up to that time, while "n.risk" is the number still at risk just after the time. The standard errors come from Greenwood's formula and are wider as the curve drops.
Section 2. Group comparisons and log-rank tests (3 problems)
Exercise 2.1: Fit Kaplan-Meier curves stratified by sex
Task: The same oncology team now wants to compare male versus female survival in lung to decide whether sex should enter the regulatory submission table as a stratification factor. Fit Kaplan-Meier curves by sex and print the summary so median survival, events, and risk counts appear side by side. Save the fit to ex_2_1.
Expected result:
#> Call: survfit(formula = Surv(time, status) ~ sex, data = lung)
#>
#> n events median 0.95LCL 0.95UCL
#> sex=male 138 112 270 212 310
#> sex=female 90 53 426 348 550
Difficulty: Intermediate
Click to reveal solution
Explanation: Placing a factor on the right-hand side fits one curve per level. The gap between median survival of 270 days for males and 426 days for females is large enough to suggest a real prognostic signal, but the confidence intervals overlap, which is exactly why you follow up with a log-rank test (next exercise) rather than declaring victory on the medians alone. Plotting these curves with plot(ex_2_1) is the next visual step.
Exercise 2.2: Run a log-rank test comparing male vs female survival
Task: Run the log-rank test on the lung cohort to test the null that the two sex-specific survival curves are identical. Use survdiff() with the default rho = 0 so it is a true log-rank (not Peto). Save the test object to ex_2_2 and inspect the chi-square statistic and p-value.
Expected result:
#> Call: survdiff(formula = Surv(time, status) ~ sex, data = lung)
#>
#> N Observed Expected (O-E)^2/E (O-E)^2/V
#> sex=male 138 112 91.6 4.55 10.3
#> sex=female 90 53 73.4 5.68 10.3
#>
#> Chisq= 10.3 on 1 degrees of freedom, p= 0.001
Difficulty: Intermediate
Click to reveal solution
Explanation: survdiff() computes the log-rank statistic by comparing observed versus expected event counts in each group, summed over event times. With p = 0.001 we reject equality of curves, supporting the visual gap from the previous exercise. The rho argument switches weighting: rho = 0 is the standard log-rank (equal weight across follow-up); rho = 1 is the Peto test (weights early events). Use rho = 0 unless you have a domain reason to upweight early differences.
Exercise 2.3: Stratified log-rank test adjusting for ECOG performance status
Task: A regulatory reviewer worries that the sex effect in lung is confounded by ECOG performance status (ph.ecog, where higher means sicker). Run a log-rank test of sex within strata of ph.ecog so the comparison is event-times pooled across patients with equal performance status. Drop rows with missing ph.ecog first. Save the stratified test as ex_2_3.
Expected result:
#> Call: survdiff(formula = Surv(time, status) ~ sex + strata(ph.ecog),
#> data = lung_complete)
#>
#> N Observed Expected (O-E)^2/E (O-E)^2/V
#> sex=male 137 111 91.7 4.07 9.13
#> sex=female 90 53 72.3 5.16 9.13
#>
#> Chisq= 9.1 on 1 degrees of freedom, p= 0.003
Difficulty: Advanced
Click to reveal solution
Explanation: Wrapping a covariate in strata() tells survdiff() to compute the log-rank within each level of that covariate and pool the resulting statistics. The chi-square barely moved from 10.3 (unstratified) to 9.1 (stratified by ECOG), which means the sex effect is not explained away by performance status: it is a genuine prognostic signal. If the chi-square had collapsed to near zero, you would conclude the apparent sex difference was driven by ECOG imbalance between the groups.
Section 3. Cox proportional hazards modeling (4 problems)
Exercise 3.1: Fit a univariate Cox model for the sex effect
Task: Quantify the sex effect from Section 2 on the hazard scale. Fit a Cox proportional hazards model with sex as the only covariate on lung. Use coxph() and save the fitted model as ex_3_1, then print it so the coefficient, exponentiated hazard ratio, and Wald p-value appear.
Expected result:
#> Call:
#> coxph(formula = Surv(time, status) ~ sex, data = lung)
#>
#> coef exp(coef) se(coef) z p
#> sexfemale -0.5310 0.5880 0.1672 -3.176 0.00149
#>
#> Likelihood ratio test=10.63 on 1 df, p=0.001113
#> n= 228, number of events= 165
Difficulty: Intermediate
Click to reveal solution
Explanation: The Cox model expresses the hazard as $h(t \mid x) = h_0(t) \exp(\beta x)$, leaving the baseline hazard $h_0(t)$ unspecified. The exp(coef) of 0.588 is the hazard ratio: at any time $t$, a female patient's instantaneous risk of death is about 41% lower than a male patient's. The likelihood-ratio test agrees with the log-rank statistic from Exercise 2.2 because, in the univariate case, the two are asymptotically equivalent.
Exercise 3.2: Multivariable Cox model adjusting for age and performance status
Task: A clinical researcher writing a survival section for a paper wants to know whether the sex effect persists after adjusting for age and ph.ecog (drop rows with ph.ecog == 3 so each category has enough patients). Fit a multivariable Cox model with sex, age, and ECOG as predictors. Save the model to ex_3_2.
Expected result:
#> coef exp(coef) se(coef) z p
#> sexfemale -0.554 0.575 0.168 -3.30 0.00097
#> age 0.011 1.011 0.009 1.20 0.22942
#> ph.ecog1 0.412 1.510 0.199 2.07 0.03857
#> ph.ecog2 0.844 2.325 0.226 3.74 0.00018
#>
#> n= 226, number of events= 163 (1 obs deleted due to ph.ecog==3, 1 NA)
Difficulty: Intermediate
Click to reveal solution
Explanation: The sex hazard ratio of 0.575 is almost identical to the univariate 0.588, so the female-advantage signal is robust to age and ECOG adjustment. Age contributes very little once ECOG is in the model (p = 0.23): ECOG already captures most of the "frailty" you would expect age to proxy. ECOG itself is a stepwise prognostic factor: each level beyond fully active doubles or more than doubles the hazard.
Exercise 3.3: Extract hazard ratios with confidence intervals via broom::tidy
Task: Pull the hazard ratios and 95% confidence intervals from ex_3_2 into a tidy data frame so they slot directly into a manuscript table. Use broom::tidy() with exponentiate = TRUE and conf.int = TRUE, keeping only term, estimate, conf.low, conf.high, and p.value. Save to ex_3_3.
Expected result:
#> # A tibble: 4 x 5
#> term estimate conf.low conf.high p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 sexfemale 0.575 0.413 0.799 0.000968
#> 2 age 1.011 0.993 1.029 0.229
#> 3 ph.ecog1 1.510 1.023 2.229 0.0386
#> 4 ph.ecog2 2.325 1.494 3.620 0.000183
Difficulty: Intermediate
Click to reveal solution
Explanation: tidy() returns log-hazard coefficients by default; exponentiate = TRUE converts them to hazard ratios in one step. The conf.int = TRUE flag triggers profile-likelihood-style confidence intervals (Wald-based here for Cox). Working in tibble form lets you gt::gt(), flextable(), or kable() straight into the manuscript without manual string formatting. Whenever a confidence interval excludes 1.0, the variable is significant at the corresponding alpha.
Exercise 3.4: Stratified Cox model treating ECOG as nuisance
Task: A biostatistician suspects ECOG violates the proportional hazards assumption and wants to absorb it through stratification rather than as a covariate. Refit the Cox model on lung_3 with sex and age as covariates and ph.ecog inside strata(). Save the stratified model to ex_3_4 and print the coefficient table.
Expected result:
#> coef exp(coef) se(coef) z p
#> sexfemale -0.546 0.579 0.168 -3.24 0.00118
#> age 0.011 1.012 0.009 1.30 0.19475
Difficulty: Advanced
Click to reveal solution
Explanation: Stratification fits a separate baseline hazard $h_0(t)$ per ECOG level but assumes the sex and age effects are common across strata. You lose the ECOG hazard ratios (they are no longer estimated) but gain robustness if ECOG's effect varies over time. The sex coefficient barely changed from -0.554 to -0.546, so the conclusions hold; stratification is a clean fallback when a covariate fails the PH check.
Section 4. Diagnostics and the PH assumption (3 problems)
Exercise 4.1: Test the proportional hazards assumption with Schoenfeld residuals
Task: Apply cox.zph() to the multivariable model ex_3_2 to test whether each covariate's hazard ratio is constant over time. The PH assumption is the single biggest threat to Cox inference; a small p-value flags a violation. Save the result to ex_4_1 and print it so the reviewer can see per-variable and global p-values.
Expected result:
#> chisq df p
#> sex 2.61e+00 1 0.11
#> age 2.04e-01 1 0.65
#> ph.ecog 6.93e+00 2 0.031
#> GLOBAL 9.51e+00 4 0.050
Difficulty: Advanced
Click to reveal solution
Explanation: cox.zph() correlates scaled Schoenfeld residuals against a transform of follow-up time (default is g(t) = (rank-1)/(n-1)). A small p-value means the residuals trend with time, which means the hazard ratio is changing with time, which means the PH assumption is failing. Here, ph.ecog shows a borderline violation (p = 0.031), which justifies the stratified refit in Exercise 3.4. Sex passes comfortably.
Exercise 4.2: Plot Schoenfeld residuals to visualize PH violation
Task: Use ggcoxzph() on ex_4_1 to plot the scaled Schoenfeld residuals against transformed time for every covariate, with a smoothed reference line. A flat smooth means PH holds; a sloped or curved smooth flags violation. Save the gg list returned by the call as ex_4_2.
Expected result:
# A list of ggplot objects, one per covariate panel; each panel shows
# scaled Schoenfeld residuals against transformed time, with a LOESS
# smooth and 95% confidence band. The ph.ecog panels have a visible
# downward slope, matching its p = 0.031 in the cox.zph table.
Difficulty: Intermediate
Click to reveal solution
Explanation: ggcoxzph() is the visual companion to cox.zph(). The smoothed curve through the residuals should be approximately horizontal under PH; a slope means the log hazard ratio is drifting over follow-up time. For ECOG, the smoothed line drops, meaning the hazard penalty of high-ECOG patients is largest early in follow-up and attenuates later. This pattern is common for frailty-like markers, where the sickest subjects either die quickly or stabilize.
Exercise 4.3: Check functional form of age with martingale residuals
Task: An epidemiologist worries that the age effect in the Cox model is non-linear. Fit a Cox model on lung_3 with no covariates (null model) and plot the martingale residuals against age with a smoother. A roughly straight smooth supports linearity; curvature suggests a non-linear transformation is needed. Save the martingale residual vector to ex_4_3.
Expected result:
#> # First 6 values:
#> [1] 0.4825 -0.0921 -0.1644 0.5147 0.2916 -0.7053
#> # The smoothed plot is approximately linear across age 40-80, with
#> # mild downward curvature past 75, hinting at slightly steeper risk
#> # at the oldest ages but not enough to justify a spline.
Difficulty: Advanced
Click to reveal solution
Explanation: Martingale residuals from a null Cox model equal event - expected events, where expected events come only from the Nelson-Aalen baseline. Plotting them against a candidate covariate reveals the optimal functional form: a straight smoother supports entering the variable linearly, a curved smoother suggests a transformation (log, splines via pspline(), or a categorical bucket). Since this plot is approximately linear, leaving age as-is in the multivariable Cox model was reasonable.
Section 5. Parametric and accelerated failure time models (2 problems)
Exercise 5.1: Fit a Weibull AFT model with flexsurv
Task: A pharmacology team needs a parametric survival model so they can extrapolate the survival curve beyond the observed lung follow-up window (Cox cannot extrapolate without the baseline hazard). Fit a Weibull accelerated failure time model with sex as the only covariate using flexsurvreg(). Save the fitted model to ex_5_1.
Expected result:
#> Call:
#> flexsurvreg(formula = Surv(time, status) ~ sex, data = lung, dist = "weibull")
#>
#> Estimates:
#> est L95% U95% se
#> shape 1.32 1.18 1.48 0.0759
#> scale 388.94 321.51 470.50 37.83
#> sexfemale 1.50 1.16 1.95 0.20
#>
#> N = 228, Events: 165, Censored: 63
#> Total time at risk: 69683
#> Log-likelihood = -1153.85, df = 3
#> AIC = 2313.69
Difficulty: Intermediate
Click to reveal solution
Explanation: A Weibull AFT models log survival time as a linear function of covariates: log(T) = mu + betax + sigmaepsilon. The shape parameter > 1 means hazard is increasing with time (typical for terminal disease). The sexfemale coefficient of 1.50 means females survive 1.5 times longer than males on the time scale, which mirrors the hazard-scale finding (HR = 0.588 ≈ 1/1.70 in Exercise 3.1). Use flexsurvreg() when you need to simulate, extrapolate, or quote median survival as a closed-form quantity.
Exercise 5.2: Compare exponential vs Weibull fits by AIC
Task: Fit both an exponential AFT and a Weibull AFT model to lung with sex as the only covariate. Build a tibble named ex_5_2 with columns dist, aic, and loglik for both fits so the team can see whether the extra Weibull shape parameter is worth its degree of freedom.
Expected result:
#> # A tibble: 2 x 3
#> dist aic loglik
#> <chr> <dbl> <dbl>
#> 1 exponential 2326. -1161.
#> 2 weibull 2314. -1154.
Difficulty: Advanced
Click to reveal solution
Explanation: AIC drops by about 12 when moving from exponential to Weibull, a clearly preferred fit. The exponential is just Weibull with shape fixed at 1 (constant hazard), so the comparison answers "is the hazard rising or falling over time?". A shape estimate of 1.32 with AIC strongly favoring Weibull means the hazard is meaningfully increasing, and assuming a constant hazard would understate late risk. Use AIC over likelihood-ratio when comparing non-nested or differently-parameterized models.
Section 6. End-to-end clinical workflows (3 problems)
Exercise 6.1: Build a risk table from a Kaplan-Meier fit
Task: A hospital reporting analyst building a quarterly dashboard wants a risk table for the lung cohort showing the number at risk and number of events at days 0, 100, 200, 400, and 600. Use summary(fit, times = ...) and assemble a tibble with columns time, n.risk, n.event, surv, surv_lower, surv_upper. Save to ex_6_1.
Expected result:
#> # A tibble: 5 x 6
#> time n.risk n.event surv surv_lower surv_upper
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0 228 0 1 1 1
#> 2 100 196 31 0.864 0.821 0.910
#> 3 200 144 77 0.664 0.605 0.728
#> 4 400 55 124 0.341 0.282 0.413
#> 5 600 22 153 0.176 0.124 0.250
Difficulty: Intermediate
Click to reveal solution
Explanation: The summary object from survfit exposes per-time vectors that you can lift directly into a tibble for reporting tools. The n.event column is cumulative events up to the time point; subtracting consecutive rows gives interval events if needed. A risk table like this anchors a Kaplan-Meier plot in the appendix of a clinical paper, satisfying CONSORT-style reporting requirements that every survival figure must declare risk counts.
Exercise 6.2: Render a publication-grade KM plot with confidence bands and risk table
Task: Reproduce the canonical clinical-paper survival figure for lung stratified by sex: Kaplan-Meier curves with shaded 95% confidence bands, a p-value annotation from the log-rank test, a censoring tick mark layer, and a risk table beneath the plot. Save the survminer object to ex_6_2.
Expected result:
# A ggsurvplot object containing:
# - Two KM step curves (sex=male, sex=female) with shaded 95% CI bands
# - "+" tick marks at censoring times
# - Annotated log-rank p-value (p = 0.0013)
# - "Number at risk" table beneath the main panel at times 0, 250, 500, 750, 1000
Difficulty: Intermediate
Click to reveal solution
Explanation: ggsurvplot() wraps a survfit object into a ggplot with all the trimmings clinical journals require: confidence bands, censoring marks, embedded p-value, and the risk table below. Setting pval = TRUE auto-runs a log-rank test and prints the result; pass pval.method = TRUE to also label which test was used. The risk table is rendered as a separate ggplot below the main panel and aligns by x-axis. Avoid conf.int = TRUE when overlaying many groups, as the shading collides.
Exercise 6.3: Predict 1-year survival probability for new patient profiles
Task: Predict 1-year survival probability for two hypothetical lung cancer patients using ex_3_2: patient A is a 60-year-old male with ECOG 1, and patient B is a 70-year-old female with ECOG 0. Use survfit() on the Cox model with newdata to obtain survival curves, then read off the survival probability at t = 365. Save the predicted probabilities as ex_6_3.
Expected result:
#> # A tibble: 2 x 2
#> patient prob_1yr
#> <chr> <dbl>
#> 1 A 0.343
#> 2 B 0.620
Difficulty: Advanced
Click to reveal solution
Explanation: survfit(coxph_model, newdata = ...) returns the Breslow-estimated survival curve for each new covariate profile, combining the baseline cumulative hazard with the linear predictor. summary(sf, times = 365) evaluates the curve at one year. Despite patient B being older, her female sex and lower ECOG drive a much higher predicted 1-year survival (62% vs 34%). This is exactly the calculation a clinician runs to quote individualized prognosis to a patient at consultation.
What to do next
Now that you have a structural map of survival analysis in R, pick the next hub that fits your workflow:
- Logistic Regression Exercises for binary outcomes that survival models cannot handle.
- ggplot2 Exercises to sharpen the figures you produce around any survival fit.
- dplyr Exercises for the data-shaping steps that always precede a
coxph()call. - ANOVA Exercises for hypothesis testing on continuous outcomes alongside the log-rank.
r-statistics.co · Verifiable credential · Public URL
This document certifies mastery of
Survival Analysis Mastery
Every certificate has a public verification URL that proves the holder passed the assessment. Anyone with the link can confirm the recipient and date.
164 learners have earned this certificate