Post-Hoc Tests Exercises in R: 8 Tukey & Bonferroni Problems, Solved Step-by-Step
Eight post-hoc exercises in R that turn a significant ANOVA into specific pairwise verdicts. You run Tukey HSD, Bonferroni, Holm, and nonparametric alternatives on built-in datasets, read the adjusted p-values and confidence intervals, and see exactly when one correction is sharper than another.
Which post-hoc test fits your ANOVA, Tukey or Bonferroni?
A significant ANOVA tells you some group differs, but not which one. The two standard follow-ups in base R are TukeyHSD() and pairwise.t.test() with a Bonferroni correction. Tukey is tailored to all-pairwise comparisons after ANOVA; Bonferroni is more generic and applies anywhere you want to test a family of hypotheses. Both adjust p-values so the overall (familywise) false-positive rate stays near 5% instead of ballooning with each new comparison. Running both on the same data side by side is the quickest way to see how their verdicts differ before you start the exercises.
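A minimal version of that side-by-side run on PlantGrowth looks like this (the tukey_pg object used in the Try it prompts comes from this step):

```r
# One-way ANOVA on the built-in PlantGrowth data: weight by treatment group
fit_pg <- aov(weight ~ group, data = PlantGrowth)
summary(fit_pg)                         # overall F-test, p ~ 0.016

# Tukey HSD: all pairwise comparisons with family-wise 95% CIs
tukey_pg <- TukeyHSD(fit_pg)
tukey_pg

# Bonferroni: generic pairwise t-tests, p-values multiplied by the number of pairs
pairwise.t.test(PlantGrowth$weight, PlantGrowth$group, p.adj = "bonferroni")
```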
The ANOVA rejects the null at p = 0.016, confirming at least one pair differs. Tukey narrows it down: only trt2-trt1 clears the 0.05 line with p = 0.012; the other two comparisons sit well above. Bonferroni agrees the trt2-trt1 pair is significant (p = 0.013) and keeps the other two non-significant. The two methods produce slightly different adjusted p-values, but here they reach the same conclusion. That alignment breaks down once you have more groups, which is where picking the right method starts to matter.
Here is the one-line rule for choosing between them:
| Situation | Use |
|---|---|
| Standard all-pairs post-hoc after one-way or factorial ANOVA | TukeyHSD() |
| You only care about a subset of planned pairs, or want a method that works beyond ANOVA (nonparametric, chi-square follow-ups) | pairwise.t.test() with p.adj = "bonferroni" |
| You want Bonferroni's safety but more power | pairwise.t.test() with p.adj = "holm" (Exercise 5) |
| Data is skewed or non-normal | pairwise.wilcox.test() with p.adj = "bonferroni" (Exercise 8) |
Try it: Pull out the adjusted p-value for the trt2-trt1 pair from tukey_pg. Save it to ex_p.
Click to reveal solution
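One way to write it (rebuilding tukey_pg first so the snippet runs on its own):

```r
# tukey_pg is the TukeyHSD() result on the PlantGrowth ANOVA
tukey_pg <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
# $group is a numeric matrix; index it by rowname (the pair) and column name
ex_p <- tukey_pg$group["trt2-trt1", "p adj"]
ex_p  # ~ 0.012
```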
Explanation: TukeyHSD() returns a list with one entry per factor. That entry is a numeric matrix whose rownames are the pairs and whose columns are diff, lwr, upr, and p adj. Indexing [rowname, colname] is the clean way to pull out a single value. This pattern is the basis of Exercise 2.
How do you read TukeyHSD output and confidence intervals?
The Tukey object has four columns worth knowing cold. diff is the mean difference for the pair (second group minus first, based on the rowname). lwr and upr are the lower and upper bounds of the 95% family-wise confidence interval for that difference. p adj is the adjusted p-value. The quickest read: if the CI covers zero, the pair is non-significant at the 0.05 family-wise level; if it lies entirely above or below zero, the pair is significant. Calling plot() on the Tukey object turns that into a visual scan.
The printed Tukey table for PlantGrowth tells the same story: trt1-ctrl (-1.06 to 0.32) and trt2-ctrl (-0.20 to 1.19) both straddle zero, and both have p adj > 0.05. Only trt2-trt1 (0.17 to 1.56) clears zero entirely, matching its p adj = 0.012. The plot is handy when you have many pairs and want to triage which ones are worth writing up before reading p-values.
Try it: From tukey_pg$group, identify the one pair whose CI does NOT cross zero. Save its row name (a string) to ex_nonsig.
Click to reveal solution
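A sketch of the sign-based filter described below, rebuilt from scratch so it runs standalone:

```r
tukey_pg <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))
ci <- tukey_pg$group
# An interval avoids zero exactly when lwr and upr share the same sign
ex_nonsig <- rownames(ci)[sign(ci[, "lwr"]) == sign(ci[, "upr"])]
ex_nonsig  # "trt2-trt1"
```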
Explanation: A CI crosses zero when lwr and upr have opposite signs. Filtering to pairs where sign(lwr) == sign(upr) returns only the intervals that stay on one side of zero, which is equivalent to p adj < 0.05 at the 95% family-wise level.
Practice Exercises
Eight problems, progressive difficulty. Each starter block is runnable as-is so you can iterate; solutions are collapsed. Use distinct variable names (ex1_*, ex2_*, ...) so your work does not overwrite tutorial objects like tukey_pg.
Exercise 1: TukeyHSD on InsectSprays
Fit a one-way ANOVA on the built-in InsectSprays dataset (count ~ spray), save the fit to ex1_fit, and run TukeyHSD on it into ex1_tukey. Print ex1_tukey. InsectSprays has six groups, so you will get 15 pairwise rows.
Click to reveal solution
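A straightforward solution:

```r
# One-way ANOVA: insect counts by spray type (6 sprays, 72 observations)
ex1_fit <- aov(count ~ spray, data = InsectSprays)
summary(ex1_fit)               # F ~ 34.7 on 5 and 66 df, p < 2e-16
ex1_tukey <- TukeyHSD(ex1_fit)
ex1_tukey                      # 15 pairwise rows for 6 groups
```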
Explanation: The ANOVA is massively significant (F = 34.7, p < 2e-16) because sprays C, D, and E kill far more insects than A, B, and F. The Tukey table confirms the structure: every pair that crosses the "weak vs strong" boundary is significant, while within-group pairs (A-B, A-F, C-D, C-E, D-E, etc.) are not. This 15-row object powers Exercises 2, 4, and 7.
Exercise 2: Extract the adjusted p-value for a specific pair
From ex1_tukey, pull out the adjusted p-value for the C-A comparison into ex2_pval. This is the skill you use every time a reviewer asks "what is the exact p for spray C vs spray A?"
Click to reveal solution
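Same indexing pattern as before, on the spray factor (ex1_tukey rebuilt here so the snippet is self-contained):

```r
ex1_tukey <- TukeyHSD(aov(count ~ spray, data = InsectSprays))
ex2_pval <- ex1_tukey$spray["C-A", "p adj"]
ex2_pval  # prints as 0.0000000 in the full table, but is a tiny positive number
```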
Explanation: Same matrix-indexing pattern as the first Try it, but with factor name spray and rowname C-A. When a p-value prints as 0.0000000 in the full table, pulling it out as a numeric shows the real magnitude, here roughly 1.5e-08. Rounded display hides the strength of the evidence.
Exercise 3: Pairwise t-tests with Bonferroni adjustment
Run pairwise.t.test() on InsectSprays$count and InsectSprays$spray with p.adj = "bonferroni", save to ex3_bonf, and print it.
Click to reveal solution
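The one-liner, plus a single-value extraction to show the matrix structure:

```r
ex3_bonf <- pairwise.t.test(InsectSprays$count, InsectSprays$spray,
                            p.adj = "bonferroni")
ex3_bonf                      # lower-triangular matrix of adjusted p-values
ex3_bonf$p.value["B", "A"]    # a weak-vs-weak pair, capped at 1
```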
Explanation: pairwise.t.test() returns a lower-triangular matrix of adjusted p-values. The structure mirrors Tukey: A/B/F cluster together as "weak sprays" (all mutual p-values = 1.00), C/D/E cluster as "strong sprays" (all mutual p-values = 1.00), and every weak-strong comparison is highly significant. Bonferroni's conservatism is visible in the 1.00 values: these are p-values that were multiplied by the number of comparisons (15) and capped at 1.
Exercise 4: Compare Tukey and Bonferroni side-by-side
Build a data frame with three columns, pair, tukey_p, and bonf_p, showing the adjusted p-value from ex1_tukey and from ex3_bonf for every pair. Save to ex4_tbl. This tells you which method is being more conservative on this dataset.
Click to reveal solution
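One way to line the two methods up (both objects rebuilt so the snippet is self-contained). The key observation: Tukey rownames are "later-minus-earlier" pairs like C-A, so the first label indexes the rows of the pairwise.t.test() matrix and the second indexes its columns:

```r
ex1_tukey <- TukeyHSD(aov(count ~ spray, data = InsectSprays))
ex3_bonf  <- pairwise.t.test(InsectSprays$count, InsectSprays$spray,
                             p.adj = "bonferroni")
tk <- ex1_tukey$spray
parts <- strsplit(rownames(tk), "-")   # "C-A" -> c("C", "A")
ex4_tbl <- data.frame(
  pair    = rownames(tk),
  tukey_p = tk[, "p adj"],
  # first label = later factor level = row of the Bonferroni matrix
  bonf_p  = sapply(parts, function(p) ex3_bonf$p.value[p[1], p[2]]),
  row.names = NULL
)
ex4_tbl
```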
Explanation: Bonferroni is slightly more conservative on non-significant pairs (every Tukey value above about 0.5 gets pushed to 1.00 by Bonferroni's cap). On the pairs that actually matter, both methods agree, and both agree with the eye. The practical takeaway: Tukey and Bonferroni rarely disagree about significance on ANOVA data. Use Tukey when you want slightly more power for the borderline pairs.
Exercise 5: Holm adjustment, better power with the same guarantee
Holm's step-down procedure controls the same familywise error rate as Bonferroni but is uniformly less conservative. Run pairwise.t.test() again with p.adj = "holm" on InsectSprays, save to ex5_holm, and confirm every Holm p-value is less than or equal to the matching Bonferroni p-value.
Click to reveal solution
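A solution that runs the Holm version and checks the dominance claim directly:

```r
ex5_holm <- pairwise.t.test(InsectSprays$count, InsectSprays$spray,
                            p.adj = "holm")
ex3_bonf <- pairwise.t.test(InsectSprays$count, InsectSprays$spray,
                            p.adj = "bonferroni")
# Holm is never more conservative than Bonferroni on the same family of tests
all(ex5_holm$p.value <= ex3_bonf$p.value, na.rm = TRUE)  # TRUE
```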
Explanation: Every Holm p-value is less than or equal to its Bonferroni counterpart, so Holm never loses power but controls the same familywise error rate. The B-A pair drops from 1.0000 (Bonferroni) to 0.82 (Holm), the small-but-nonzero D-A tightens from 4.5e-06 to 1.5e-06, and so on. If you were ever tempted to use Bonferroni "to be safe," use Holm instead, same guarantee, more power.
Exercise 6: Tukey HSD on a two-way ANOVA
Fit a factorial ANOVA len ~ supp * factor(dose) on ToothGrowth, save to ex6_fit, and run TukeyHSD on it into ex6_tukey. Then pull out the OJ:0.5 vs VC:0.5 comparison from the interaction term. The scientific question is the supplement effect at the lowest dose.
Click to reveal solution
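The factorial fit, plus extraction of the single interaction row. Note the list entry name contains special characters, so it needs backticks:

```r
ex6_fit <- aov(len ~ supp * factor(dose), data = ToothGrowth)
ex6_tukey <- TukeyHSD(ex6_fit)
# Interaction rows are labelled cell-vs-cell, e.g. "VC:0.5-OJ:0.5"
ex6_tukey$`supp:factor(dose)`["VC:0.5-OJ:0.5", ]
```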
Explanation: At dose 0.5, VC delivers 5.25 units LESS tooth growth than OJ, with a 95% family-wise CI of [-10.05, -0.45] and p adj = 0.024. That is a significant supplement effect at the lowest dose. TukeyHSD() on a factorial fit returns one block per term (supp, factor(dose), supp:factor(dose)), and the interaction block contains all 15 cross-combinations. You rarely care about all 15, so filter to the specific cell comparison that answers your question.
Exercise 7: Plot the TukeyHSD confidence intervals
Call plot(ex1_tukey, las = 1) to visualise the 15 InsectSprays pairs. las = 1 rotates the axis labels so the rownames are readable. Eyeball the chart: count how many intervals cross zero (non-significant) versus how many sit entirely on one side (significant).
Click to reveal solution
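The plotting call (ex1_tukey rebuilt so the snippet runs standalone). Widening the left margin via par() is an optional convenience so the longer pair labels are not clipped:

```r
ex1_tukey <- TukeyHSD(aov(count ~ spray, data = InsectSprays))
op <- par(mar = c(5, 6, 4, 2))  # extra left margin for the rowname labels
plot(ex1_tukey, las = 1)        # las = 1: horizontal axis labels
par(op)                          # restore previous graphics settings
```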
Explanation: Six pairs (B-A, F-A, F-B, D-C, E-C, E-D) have intervals straddling zero, so they are non-significant. The other nine sit entirely on one side of zero and are significant. This matches the p adj column from Exercise 1 perfectly. On a 6-group design you can verify the post-hoc story visually in one glance, which is why plot(TukeyHSD()) is often faster than scanning a 15-row printout.
Exercise 8: Nonparametric post-hoc after Kruskal-Wallis
When the ANOVA assumption of normality fails, the Kruskal-Wallis test replaces aov(), and pairwise.wilcox.test() with p.adj = "bonferroni" replaces Tukey. Run both on PlantGrowth (yes, it is normal enough for ANOVA, but the same pipeline applies to any skewed outcome). Save the pairwise result to ex8_wilcox.
Click to reveal solution
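The nonparametric pipeline in two calls:

```r
# Global nonparametric test (Kruskal-Wallis replaces the ANOVA F-test)
kruskal.test(weight ~ group, data = PlantGrowth)   # chi-squared ~ 7.99, p ~ 0.018

# Pairwise follow-up; ties in the data may trigger an exact-p-value warning
ex8_wilcox <- pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                                   p.adj = "bonferroni")
ex8_wilcox
```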
Explanation: Kruskal-Wallis gives an overall chi-squared of 7.99 (p = 0.018), close to what ANOVA produced. The pairwise Wilcoxon with Bonferroni flags only trt2-trt1 at p = 0.028, matching Tukey and Bonferroni t-tests. This is the go-to pipeline when your outcome is ordinal, heavy-tailed, or otherwise non-normal: Kruskal-Wallis for the global test, pairwise.wilcox.test() for the post-hoc, Bonferroni or Holm for the adjustment.
Complete Example: chick weights by feed type
End-to-end pipeline on the built-in chickwts dataset, which measures chick weights across six feed types. The workflow below is the same one you would run in a report: ANOVA, Tukey, Bonferroni, Holm, then a visual.
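A compact version of that pipeline (object names like fit_cw are mine, not tutorial objects):

```r
# ANOVA: chick weight by feed type (6 feeds, 71 chicks)
fit_cw <- aov(weight ~ feed, data = chickwts)
summary(fit_cw)                  # F(5, 65) ~ 15.4, p ~ 5.9e-10

# Three corrections on the same 15 pairs
tukey_cw <- TukeyHSD(fit_cw)
bonf_cw  <- pairwise.t.test(chickwts$weight, chickwts$feed, p.adj = "bonferroni")
holm_cw  <- pairwise.t.test(chickwts$weight, chickwts$feed, p.adj = "holm")

tukey_cw$feed[, "p adj"]         # adjusted p for each feed pair
plot(tukey_cw, las = 1)          # visual triage: intervals crossing zero
```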
The ANOVA rejects strongly (F = 15.36, p = 5.9e-10). Casein and sunflower are the heavyweight feeds; horsebean is the lightweight. Tukey, Bonferroni, and Holm all agree on the direction of every significant pair, but they trade power and conservatism at the margins: Tukey flags soybean-casein (p = 0.008), Bonferroni matches (p = 0.009), and Holm keeps it (p = 0.006). A borderline pair like meatmeal-casein sits at Tukey p = 0.33 (non-sig) and Bonferroni p = 0.39 (non-sig) with Holm at 0.08, all above 0.05. A one-paragraph report writes itself: "A one-way ANOVA (F(5, 65) = 15.4, p < 0.001) found feed type significantly affected chick weight. Tukey HSD showed casein, sunflower, and meatmeal all produced heavier chicks than horsebean (adjusted p ≤ 0.0001), while casein also beat linseed (p = 0.0002) and soybean (p = 0.008)."
Summary
- After a significant ANOVA, TukeyHSD() is the default for all pairwise comparisons, and pairwise.t.test() with p.adj = "bonferroni" is the generic alternative that works beyond ANOVA.
- The adjusted p-values from Tukey and Bonferroni rarely disagree on ANOVA data, but Bonferroni is slightly more conservative, especially on borderline pairs.
- Holm (p.adj = "holm") dominates Bonferroni: same familywise error guarantee, more power. Prefer Holm unless you have a specific reason to use Bonferroni.
- plot(TukeyHSD(fit)) turns a 15-row adjusted p-value table into a one-screen visual: any interval crossing zero is non-significant.
- For non-normal outcomes, the nonparametric pipeline is kruskal.test() followed by pairwise.wilcox.test() with Bonferroni or Holm.
- On factorial ANOVAs, TukeyHSD() returns one block per term. The interaction block contains every cell-vs-cell comparison, so filter to the scientific question you actually care about.
| Method | Power | When to use |
|---|---|---|
| TukeyHSD() | Best for all-pairs after ANOVA | One-way or factorial ANOVA, all pairwise comparisons |
| pairwise.t.test(..., p.adj = "bonferroni") | Conservative | Works outside ANOVA (nonparametric, chi-square follow-ups); simplest default |
| pairwise.t.test(..., p.adj = "holm") | Bonferroni's replacement | Same FWER guarantee, uniformly more power |
| pairwise.wilcox.test(..., p.adj = "bonferroni") | Nonparametric | Skewed or ordinal outcome; Kruskal-Wallis follow-up |
References
- R Core Team. TukeyHSD() reference manual. stat.ethz.ch/R-manual/R-devel/library/stats/html/TukeyHSD.html
- R Core Team. pairwise.t.test() reference manual. stat.ethz.ch/R-manual/R-devel/library/stats/html/pairwise.t.test.html
- R Core Team. p.adjust() reference manual (Bonferroni, Holm, Hochberg, BH). stat.ethz.ch/R-manual/R-devel/library/stats/html/p.adjust.html
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6:65-70.
- Hothorn, T., Bretz, F., Westfall, P. (2008). Simultaneous Inference in General Parametric Models. Biometrical Journal 50(3):346-363.
- UCLA OARC. How can I do post-hoc pairwise comparisons in R? stats.oarc.ucla.edu/r/faq/how-can-i-do-post-hoc-pairwise-comparisons-in-r/
- Greenwood, M. (2022). Intermediate Statistics with R, Chapter 3: Multiple pair-wise comparisons using Tukey's HSD. stats.libretexts.org/03%3A_One-Way_ANOVA/3.06%3A_Multiple_(pair-wise)_comparisons_using_Tukeys_HSD_and_the_compact_letter_display)
Continue Learning
- Post-Hoc Tests After ANOVA in R: the conceptual parent post covering when and why Tukey, Bonferroni, and Scheffé apply, with decision rules.
- One-Way ANOVA in R: fit and interpret the ANOVA that these post-hoc tests follow up on.
- ANOVA Exercises in R: sister exercise set focused on fitting one-way and factorial ANOVAs before the post-hoc stage.