Repeated Measures Exercises in R: 8 Within-Subjects Design Problems, Solved Step-by-Step
Eight repeated measures ANOVA exercises that put every within-subjects tool in base R through its paces. You simulate within-subjects data, fit one- and two-way RM models with aov() and Error(), run Mauchly's test, apply the Greenhouse-Geisser correction by hand, run paired post-hoc comparisons, compute partial eta squared, and finish on a mixed design using the built-in ChickWeight dataset.
What changes when the same subject is measured multiple times?
Ordinary one-way ANOVA assumes independent observations. The moment the same subject contributes to every condition, that assumption breaks: a subject's baseline follows them through every measurement, and the between-subject noise hides real effects. Repeated measures ANOVA separates that subject-level variability from the error, giving a sharper test of the within-subjects factor. The base R tool for this is aov() paired with an Error() term, and the fastest way to see the payoff is to fit both models on the same data.
Both models agree that time matters, but the correct RM ANOVA produces F = 55.1 versus the naive ANOVA's F = 15.9 on the same 30 observations. That is a 3.5-fold jump in the F-statistic, because the RM model strips out the 158.8 sum-of-squares of between-subject variation (people have different baselines) before computing the F-test. That between-subject variation is not error in a within-subjects question, and leaving it in the denominator wastes power. The p-values tell the same story: 3e-05 (naive) versus 4.5e-08 (RM).
The Error(subject/time) term tells aov() to partition variability into between-subject (the Error: subject stratum) and within-subject (the Error: subject:time stratum). The within-subjects F-test is calculated only against within-subject variability, which is why it is more powerful for the same n.
Try it: From rm_fit, extract the F-value for the time effect into ex_f.
Click to reveal solution
Explanation: The within-subjects stratum is named using a colon between the grouping variables. Inside it, the first list element is the ANOVA table for the time term. Indexing "F value" and taking the first row picks the term's F; the second row is NA (residuals have no F). This pattern generalises: swap the stratum name and the term index to pull any F from any split-plot design.
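As a runnable sketch of that extraction pattern — the data below are a simulated stand-in (10 subjects × 3 times, assumed seed and effect sizes), so the F-value will not reproduce the tutorial's 55.1:

```r
## Stand-in data: 10 subjects, 3 time points, per-subject baselines
set.seed(1)
dat <- expand.grid(subject = factor(1:10), time = factor(paste0("T", 1:3)))
dat$score <- 50 + rep(rnorm(10, sd = 5), times = 3) +     # subject baselines
  c(0, 3, 6)[as.integer(dat$time)] +                      # growing time means
  rnorm(nrow(dat), sd = 3)                                # within-subject noise

rm_fit <- aov(score ~ time + Error(subject/time), data = dat)

# The within stratum is named "Error: subject:time"; its first element is the
# ANOVA table. "F value"[1] is the time term's F; row 2 (Residuals) is NA.
tab  <- summary(rm_fit)[["Error: subject:time"]][[1]]
ex_f <- tab[["F value"]][1]
ex_f
```

The same two-step indexing (stratum name, then `[["F value"]][1]`) works for any stratum of any split-plot fit.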
How do you check the sphericity assumption?
Repeated measures ANOVA assumes sphericity: the variances of the differences between every pair of within-subject levels are equal. When it fails, the within-subject F inflates and you chase false positives. Mauchly's test is the classical check, and it runs on a multivariate linear model fit to the wide-format data. When it rejects, the Greenhouse-Geisser correction shrinks the degrees of freedom by a factor ε (epsilon) to recover a valid p-value, and you can compute ε by hand from the covariance matrix.
Mauchly's W sits at 0.86 with p = 0.56. The test fails to reject the sphericity null, so there is no evidence the variance-of-differences structure is lopsided. In practice this means the uncorrected within-subjects F from rm_fit is already valid, and no Greenhouse-Geisser correction is needed. W close to 1 means "spherical"; W close to 0 means "badly non-spherical." The p-value uses a chi-squared approximation that is underpowered on small samples, so the W itself is often more informative than the p.
Try it: Extract Mauchly's W statistic from mauchly_res into ex_w.
Click to reveal solution
Explanation: mauchly.test() returns an htest object with the same structure as t.test() and friends: statistic holds W, p.value holds the sphericity p, and parameter holds the Mauchly df (not the ANOVA df). The statistic is a named numeric, so unname(ex_w) gives a bare number if needed.
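A minimal sketch of the whole Mauchly pipeline, using a simulated stand-in wide matrix (the tutorial's wide data would slot in directly; W and p here are illustrative):

```r
## Stand-in wide data: 10 subjects x 3 within-subject levels
set.seed(2)
wide <- matrix(rnorm(10 * 3, mean = rep(c(50, 53, 56), each = 10), sd = 4),
               nrow = 10, dimnames = list(NULL, c("T1", "T2", "T3")))

mlm   <- lm(wide ~ 1)                          # multivariate intercept-only fit
idata <- data.frame(time = factor(colnames(wide)))

mauchly_res <- mauchly.test(mlm, M = ~ time, X = ~ 1, idata = idata)

ex_w <- unname(mauchly_res$statistic)          # Mauchly's W
ex_p <- mauchly_res$p.value                    # sphericity p-value
c(W = ex_w, p = ex_p)
```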
Practice Exercises
Eight problems, progressive difficulty. Each starter block is runnable as-is so you can iterate; solutions are collapsed. Use distinct variable names (my_*, two_*, cw_*) so your work does not overwrite tutorial objects like rm_fit or mlm.
Exercise 1: Simulate a balanced within-subjects dataset
Build a within-subjects dataset with 8 subjects × 4 time points, using set.seed(7) for reproducibility. Each subject should have their own random baseline (SD = 5), and the time means should grow: 50, 53, 56, 60 at T1, T2, T3, T4. Add within-subject noise with SD = 3. Return my_long (long format with columns subject, time, score) and my_wide (wide matrix with one column per time).
Click to reveal solution
Explanation: The baseline draw stays constant across a subject's four rows, so subjects keep their individual offset. The time means grow across T1 to T4, and independent noise at each time point keeps the within-subject variability realistic. Wide format feeds lm(cbind(T1, T2, T3, T4) ~ 1) for Mauchly's test; long format feeds aov(score ~ time + Error(subject/time)). You use both in the next exercises.
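One possible solution following the spec above (seed 7; any construction that keeps the baseline constant within subject is equally valid):

```r
set.seed(7)
n_subj     <- 8
time_means <- c(T1 = 50, T2 = 53, T3 = 56, T4 = 60)
baseline   <- rnorm(n_subj, mean = 0, sd = 5)        # one offset per subject

my_long <- expand.grid(subject = factor(1:n_subj),
                       time = factor(names(time_means),
                                     levels = names(time_means)))
my_long$score <- time_means[as.character(my_long$time)] +
  baseline[as.integer(my_long$subject)] +            # constant across a subject's rows
  rnorm(nrow(my_long), sd = 3)                       # within-subject noise

# Wide matrix: one row per subject, one column per time point
my_wide <- sapply(names(time_means),
                  function(t) my_long$score[my_long$time == t])
```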
Exercise 2: Fit a one-way repeated measures ANOVA
Fit my_fit <- aov(score ~ time + Error(subject/time), data = my_long) and save summary(my_fit) to my_summary. Print it and pull out the within-subject F-value for time into my_F and the p-value into my_p.
Click to reveal solution
Explanation: Time has a strong effect: F(3, 21) = 18.3, p = 4.6e-06. The Error: subject stratum absorbed 535.2 SS of between-subject variability, which a naive one-way would have left in the denominator. Exercise 3 checks whether this within-subject F is valid, i.e. whether sphericity holds on this 4-level factor.
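A sketch of the fit and extraction, with a compact stand-in for my_long included so the block runs alone (the reported F(3, 21) = 18.3 depends on the exact simulation order, so treat the numbers here as illustrative):

```r
## Compact stand-in for Exercise 1's my_long
set.seed(7)
base <- rnorm(8, sd = 5)
my_long <- expand.grid(subject = factor(1:8), time = factor(paste0("T", 1:4)))
my_long$score <- c(50, 53, 56, 60)[as.integer(my_long$time)] +
  base[as.integer(my_long$subject)] + rnorm(32, sd = 3)

my_fit     <- aov(score ~ time + Error(subject/time), data = my_long)
my_summary <- summary(my_fit)
my_summary

within_tab <- my_summary[["Error: subject:time"]][[1]]
my_F <- within_tab[["F value"]][1]
my_p <- within_tab[["Pr(>F)"]][1]
```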
Exercise 3: Test sphericity with Mauchly
Fit a multivariate linear model on my_wide and run Mauchly's test against ~time. Save the output to my_mauchly and extract W into my_W and the p-value into my_mauchly_p. Interpret: does sphericity hold?
Click to reveal solution
Explanation: W = 0.41 looks low, but the Mauchly p-value is 0.52, so with only 8 subjects the test fails to reject the sphericity null. That is the classic small-sample behaviour: W wanders far from 1 by chance, but the chi-squared approximation is too conservative to flag it. On balance the uncorrected F from Exercise 2 is plausible, but Exercise 4 computes the Greenhouse-Geisser ε as a second line of defence.
Exercise 4: Greenhouse-Geisser correction by hand
Compute the Greenhouse-Geisser ε manually from the covariance matrix of my_wide, apply it to the df of the within-subject F-test from Exercise 2, and re-compute the GG-adjusted p-value. Save ε to eps_gg and the corrected p to gg_p.
The formula uses a centred covariance. Let $S$ be the $k \times k$ sample covariance of the $k$ within-subject levels, and let $C$ be a $k \times (k-1)$ orthonormal contrast matrix (for example, the normalised Helmert contrasts). Then
$$\hat{\varepsilon}_{GG} = \frac{\left(\operatorname{tr}(C^{\prime} S C)\right)^2}{(k-1) \cdot \operatorname{tr}((C^{\prime} S C)^2)}$$
Where $\operatorname{tr}(M)$ is the trace of matrix $M$. Correct the F-test by multiplying both df by $\hat{\varepsilon}_{GG}$ and re-evaluating pf().
Click to reveal solution
Explanation: ε = 0.70 is a moderate shrinkage. After multiplying both numerator and denominator df by ε, the F stays the same but the corrected p-value rises from 4.6e-06 to 1.0e-04. Still overwhelmingly significant here, but on a borderline effect, correcting or not correcting can flip the verdict. The formula via Helmert contrasts is one of several equivalent choices; any orthonormal basis for the within-subject contrasts gives the same ε.
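The by-hand computation can be sketched as follows. The wide matrix below is a simulated stand-in (substitute my_wide from Exercise 1), and F_obs reuses Exercise 2's reported F of 18.3 purely as an example value, so ε and the corrected p will differ from the numbers above:

```r
## Stand-in wide data: 8 subjects x 4 levels
set.seed(7)
wide <- matrix(rnorm(8 * 4, mean = rep(c(50, 53, 56, 60), each = 8), sd = 4),
               nrow = 8)
k <- ncol(wide)

S <- cov(wide)                            # k x k sample covariance
C <- qr.Q(qr(contr.helmert(k)))           # orthonormal k x (k-1) contrast matrix
D <- t(C) %*% S %*% C                     # contrast-transformed covariance

# GG epsilon: tr(D)^2 / ((k-1) * tr(D^2)); tr(D^2) = sum(D * D) since D is symmetric
eps_gg <- sum(diag(D))^2 / ((k - 1) * sum(D * D))

# Shrink both df of the within-subject F-test and re-evaluate
df1 <- k - 1
df2 <- (k - 1) * (nrow(wide) - 1)
F_obs <- 18.3                             # example F (Exercise 2's reported value)
gg_p  <- pf(F_obs, df1 * eps_gg, df2 * eps_gg, lower.tail = FALSE)
c(eps = eps_gg, gg_p = gg_p)
```

ε is bounded between 1/(k − 1) (worst-case non-sphericity) and 1 (perfect sphericity), which is a quick sanity check on the arithmetic.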
Exercise 5: Pairwise paired comparisons with Holm adjustment
Run pairwise.t.test() on my_long$score and my_long$time with paired = TRUE and p.adj = "holm". Save to my_pht. Which pairs differ at α = 0.05?
Click to reveal solution
Explanation: T4 is significantly higher than T1, T2, and T3 (all adjusted p < 0.002). T3 differs from T1 (p = 0.012). Adjacent early pairs (T1 vs T2 and T2 vs T3) do not clear the 0.05 line after Holm. The pattern matches the simulated means (50, 53, 56, 60): the gaps that exceed noise are the big ones. Holm is uniformly more powerful than Bonferroni at the same familywise error rate, which is why it is the recommended default for paired post-hoc.
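A sketch of the call, with a compact stand-in for my_long so the block runs alone (adjusted p-values here are illustrative, not the exercise's exact values):

```r
## Compact stand-in for Exercise 1's my_long
set.seed(7)
base <- rnorm(8, sd = 5)
my_long <- expand.grid(subject = factor(1:8), time = factor(paste0("T", 1:4)))
my_long$score <- c(50, 53, 56, 60)[as.integer(my_long$time)] +
  base[as.integer(my_long$subject)] + rnorm(32, sd = 3)

# paired = TRUE matches observations by position within each level, so the
# data must be sorted consistently; expand.grid's subject-within-time
# ordering already satisfies this.
my_pht <- pairwise.t.test(my_long$score, my_long$time,
                          paired = TRUE, p.adjust.method = "holm")
my_pht$p.value      # lower-triangular matrix of Holm-adjusted p-values
```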
Exercise 6: Two-way within-subjects ANOVA
Simulate a 2 × 3 within-subjects design: 8 subjects crossed with drug (placebo/active) and dose (low/med/high), where the drug effect is 3, the dose effect grows linearly (0, 2, 4), and the drug-by-dose interaction adds an extra (0, 1, 2) to the active drug's doses. Use set.seed(11). Save the long data to two_long and the fit to two_fit <- aov(y ~ drug*dose + Error(subject/(drug*dose)), data = two_long).
Click to reveal solution
Explanation: Drug and dose each have their own within-subject stratum (subject:drug and subject:dose), and the interaction has a third (subject:drug:dose). Drug is highly significant (F(1,7) = 23.9, p = 0.002), dose even more so (F(2,14) = 18.5, p = 0.0001), and the drug-by-dose interaction is borderline (p = 0.10). The small interaction is the simulated (0, 1, 2) extra pushing active-drug scores up a bit more at high dose, but it is swamped by noise at n = 8. Exercise 7 reports effect sizes to quantify each effect's magnitude.
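One possible simulation for this 2 × 3 design (the baseline mean and noise SD below are assumptions; exact F-values depend on the draw order, so treat the figures above as illustrative):

```r
set.seed(11)
two_long <- expand.grid(subject = factor(1:8),
                        drug = factor(c("placebo", "active")),
                        dose = factor(c("low", "med", "high"),
                                      levels = c("low", "med", "high")))
base      <- rnorm(8, sd = 4)
drug_eff  <- c(placebo = 0, active = 3)
dose_eff  <- c(low = 0, med = 2, high = 4)
inter_eff <- c(low = 0, med = 1, high = 2)       # extra boost, active drug only

two_long$y <- 20 + base[as.integer(two_long$subject)] +
  drug_eff[as.character(two_long$drug)] +
  dose_eff[as.character(two_long$dose)] +
  ifelse(two_long$drug == "active",
         inter_eff[as.character(two_long$dose)], 0) +
  rnorm(nrow(two_long), sd = 2)

two_fit <- aov(y ~ drug * dose + Error(subject/(drug * dose)),
               data = two_long)
summary(two_fit)
```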
Exercise 7: Partial eta squared from the two-way fit
Extract the sum-of-squares columns from two_fit and compute partial eta squared for drug, dose, and drug:dose. Save a tidy data frame eta_df with columns effect, ss_effect, ss_error, eta2_p. The formula is $\eta^2_p = \mathrm{SS}_{\text{effect}} / (\mathrm{SS}_{\text{effect}} + \mathrm{SS}_{\text{error}})$, where ss_error is the residual SS of the stratum containing that effect.
Click to reveal solution
Explanation: The drug main effect's partial η² = 0.77 is very large (conventional thresholds: 0.01 small, 0.06 medium, 0.14 large). Dose is similarly large at 0.73. The borderline interaction has partial η² = 0.28, which is medium-to-large in magnitude but was not flagged at p < 0.05 because the stratum has only 14 residual degrees of freedom. Reporting effect sizes alongside p-values tells the full story: a non-significant interaction with partial η² = 0.28 likely reflects low power, not no effect.
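A sketch of the extraction, with a compact stand-in for two_fit so the block runs alone (the simulated effects here are arbitrary; the key pattern is pairing each effect with the residual SS of its own stratum):

```r
## Compact stand-in for Exercise 6's two_fit
set.seed(11)
two_long <- expand.grid(subject = factor(1:8),
                        drug = factor(c("placebo", "active")),
                        dose = factor(c("low", "med", "high")))
two_long$y <- rnorm(48, mean = 20, sd = 3) +
  2 * (two_long$drug == "active") + 2 * as.integer(two_long$dose)

two_fit <- aov(y ~ drug * dose + Error(subject/(drug * dose)),
               data = two_long)
s <- summary(two_fit)

# Each effect's SS_error is the residual of ITS OWN stratum
get_eta <- function(stratum) {
  ss <- s[[stratum]][[1]][["Sum Sq"]]
  ss_eff <- ss[1]                      # first row: the effect
  ss_err <- ss[length(ss)]             # last row: that stratum's residuals
  c(ss_effect = ss_eff, ss_error = ss_err,
    eta2_p = ss_eff / (ss_eff + ss_err))
}

eta_df <- data.frame(effect = c("drug", "dose", "drug:dose"),
                     t(sapply(paste0("Error: subject:",
                                     c("drug", "dose", "drug:dose")),
                              get_eta)),
                     row.names = NULL)
eta_df
```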
Packages such as rstatix report generalised η² by default because it is more comparable across studies with different designs. Partial η² is the base-R standby and is easier to compute from aov() output. Report whichever convention your journal expects, and state which one you used.
Exercise 8: Mixed design on ChickWeight
Use the built-in ChickWeight dataset, which measures chick weight in grams across Time = 0, 2, ..., 21 days, Diet ∈ {1, 2, 3, 4}, and one row per chick-time combination. Subset to Time in c(0, 10, 20) and keep only chicks with complete triplets. Fit a mixed design: Diet as between-subjects, Time as within-subjects. Save the subset to cw_sub and the fit to cw_fit. Report the Diet F, the Time F, and the Diet:Time interaction F.
Click to reveal solution
Explanation: The between-subjects effect of Diet (main effect averaged across Time) is borderline (F(3,41) = 2.4, p = 0.085). The within-subjects effect of Time is massive (F(2,82) = 399, p < 2e-16): chicks grow from roughly 40 g at day 0 to 170+ g at day 20 regardless of diet. The significant Diet:Time interaction (F(6,82) = 10.7, p = 1.8e-09) says growth curves DIFFER by diet, which is the scientifically interesting finding: diet effects emerge over time, not at day 0. A follow-up would plot the four growth curves and run simple effects tests at day 20.
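The subset-and-fit pipeline can be sketched as follows (one reasonable reading of "complete triplets"; the Fs quoted above come from this kind of fit):

```r
## Subset ChickWeight to three time points
cw_sub <- subset(ChickWeight, Time %in% c(0, 10, 20))
cw_sub$Chick <- factor(cw_sub$Chick)       # plain factor for the Error() term
cw_sub$Time  <- factor(cw_sub$Time)        # treat day as a within factor

# Keep only chicks observed at all three time points
complete <- names(which(table(cw_sub$Chick) == 3))
cw_sub   <- droplevels(subset(cw_sub, Chick %in% complete))

# Mixed design: Diet between subjects, Time within subjects
cw_fit <- aov(weight ~ Diet * Time + Error(Chick/Time), data = cw_sub)
summary(cw_fit)
```

Diet lands in the Error: Chick stratum; Time and Diet:Time land in Error: Chick:Time.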
Complete Example: caffeine memory study (end-to-end pipeline)
End-to-end RM ANOVA on simulated memory-recall data. Twelve subjects each drink placebo, 100mg caffeine, and 200mg caffeine (within), with recall tested before and after (within). The design is 3 × 2 within-subjects. The walkthrough covers: simulate, reshape, fit, check sphericity, correct, post-hoc, effect size, report.
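The simulate-and-fit steps of that pipeline can be sketched compactly (seed, baseline, and effect sizes below are assumptions, so the resulting Fs will only roughly echo the report that follows):

```r
## 3 (dose) x 2 (phase) within-subjects simulation, 12 subjects
set.seed(42)
caf <- expand.grid(subject = factor(1:12),
                   dose  = factor(c("placebo", "100mg", "200mg"),
                                  levels = c("placebo", "100mg", "200mg")),
                   phase = factor(c("pre", "post"), levels = c("pre", "post")))
base <- rnorm(12, sd = 4)

caf$recall <- 20 + base[as.integer(caf$subject)] +
  c(0, 1, 3)[as.integer(caf$dose)] +                              # dose effect
  3 * (caf$phase == "post") +                                     # phase effect
  c(0, 0.5, 1.5)[as.integer(caf$dose)] * (caf$phase == "post") +  # interaction
  rnorm(nrow(caf), sd = 2)

caf_fit <- aov(recall ~ dose * phase + Error(subject/(dose * phase)),
               data = caf)
summary(caf_fit)
```

From here the sphericity check, GG correction, Holm post-hocs, and partial η² follow exactly the patterns in Exercises 3 through 7.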
A one-paragraph report writes itself: "A 3 (dose: placebo, 100mg, 200mg) × 2 (phase: pre, post) within-subjects ANOVA on recall scores (n = 12) revealed significant main effects of dose (F(2, 22) = 10.8, p < 0.001, partial η² = 0.50) and phase (F(1, 11) = 78.7, p < 0.001, partial η² = 0.88), with a significant dose-by-phase interaction (F(2, 22) = 5.5, p = 0.010, partial η² = 0.50). Mauchly's test on the dose factor was non-significant (W = 0.94, p = 0.78), so no Greenhouse-Geisser correction was applied. Holm-adjusted paired comparisons confirmed that 200mg boosted recall over both placebo (p < 0.001) and 100mg (p = 0.003); the placebo-versus-100mg contrast was marginal (p = 0.07)."
Summary
- Use aov(y ~ within + Error(subject/within)) for one-way within-subjects designs. The Error() term partitions between-subject variability out of the denominator, boosting the F-test's power.
- For two-way within-subjects, use Error(subject/(A*B)). Each effect gets its own within-subject stratum.
- For mixed designs, use Error(subject/within). Between-subjects effects land in the Error: subject stratum; within-subjects effects land in Error: subject:within.
- Check sphericity with mauchly.test() on a multivariate lm of the wide-format data. Mauchly is underpowered on small n and overpowered on large n, so treat its p-value as advisory, not definitive.
- When sphericity is in doubt, apply Greenhouse-Geisser by multiplying both F-test df by ε (formula: $\hat{\varepsilon}_{GG} = (\operatorname{tr}(C^{\prime} S C))^2 / ((k-1) \operatorname{tr}((C^{\prime} S C)^2))$). Re-evaluate with pf() on the corrected df.
- Post-hoc pairs: pairwise.t.test(..., paired = TRUE, p.adj = "holm"). Holm dominates Bonferroni on power at the same error-rate guarantee.
- Effect size: partial η² = SS_effect / (SS_effect + SS_error), where SS_error is the residual of the SAME stratum as the effect. Always report alongside the F and p.
| Task | Base-R tool |
|---|---|
| Fit one-way within-subjects RM | aov(y ~ A + Error(subject/A)) |
| Fit two-way within-subjects | aov(y ~ A*B + Error(subject/(A*B))) |
| Fit mixed (between × within) | aov(y ~ B*W + Error(subject/W)) |
| Check sphericity | mauchly.test(mlm, M = ~W, X = ~1, idata = ...) |
| Correct for sphericity | GG ε from contrast-transformed cov, multiply df by ε |
| Post-hoc paired | pairwise.t.test(..., paired = TRUE, p.adj = "holm") |
| Effect size | $\eta^2_p = \mathrm{SS}_{\text{effect}} / (\mathrm{SS}_{\text{effect}} + \mathrm{SS}_{\text{error}})$ |
Get the Error(subject/within) term right and the rest falls into place.
References
- R Core Team. aov() reference manual. stat.ethz.ch/R-manual/R-devel/library/stats/html/aov.html
- R Core Team. mauchly.test() reference manual. stat.ethz.ch/R-manual/R-devel/library/stats/html/mauchly.test.html
- Greenhouse, S. W., & Geisser, S. (1959). On methods in the analysis of profile data. Psychometrika 24:95-112. doi.org/10.1007/BF02289823
- Huynh, H., & Feldt, L. S. (1976). Estimation of the Box correction for degrees of freedom from sample data in randomized block and split-plot designs. Journal of Educational Statistics 1(1):69-82. doi.org/10.3102/10769986001001069
- Maxwell, S. E., Delaney, H. D., & Kelley, K. (2018). Designing Experiments and Analyzing Data, 3rd ed. Routledge. Chapter 11: One-Way Within-Subjects Designs.
- Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. SAGE. Chapter 13: Repeated-Measures Designs.
- Bakeman, R. (2005). Recommended effect size statistics for repeated measures designs. Behavior Research Methods 37(3):379-384. doi.org/10.3758/BF03192707
Continue Learning
- Repeated Measures ANOVA in R: the conceptual parent post covering sphericity, Mauchly's test, and Greenhouse-Geisser with worked examples before you hit the exercises.
- ANOVA Exercises in R: sister exercise set for between-subjects one-way and factorial designs, so you can compare within- and between-subjects F-tests side by side.
- Post-Hoc Tests Exercises in R: Tukey, Bonferroni, Holm, and Wilcoxon follow-ups for when the ANOVA fires and you need to identify which groups differ.