Multiple Testing Exercises in R: 8 FDR & Bonferroni Problems, Solved Step-by-Step
These 8 multiple testing exercises in R walk you through Bonferroni, Holm, and Benjamini-Hochberg FDR corrections using base R's p.adjust(). They start from hand calculations on tiny vectors and build up to a genomics-style workflow with hundreds of p-values; every problem is solved step-by-step and runnable in the browser.
How does p.adjust() transform a vector of p-values?
Every correction method in these exercises funnels through one function: p.adjust(). You give it a vector of raw p-values, pick a method, and it returns an adjusted vector of the same length. The decision rule after that never changes: reject any hypothesis whose adjusted p-value falls below your threshold (alpha for FWER methods, your chosen FDR target for BH). One function, six methods, one decision rule is the whole toolkit these exercises drill. The fastest way to see the differences is to run all six methods on the same input at once.
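Since the original starter code is not shown on this page, here is a sketch of that side-by-side run. The vector raw_p is a hypothetical ten-value input chosen to match the counts discussed next (a raw 0.04 that Bonferroni inflates to 0.40, two Bonferroni/Holm rejections, five BH rejections):

```r
# Hypothetical input: 10 raw p-values (the article's original vector is not shown)
raw_p <- c(0.0001, 0.001, 0.008, 0.012, 0.02, 0.04, 0.15, 0.3, 0.5, 0.8)

# Every correction method funnels through the same call
methods <- c("bonferroni", "holm", "hochberg", "hommel", "BH", "BY")
adj <- sapply(methods, function(m) p.adjust(raw_p, method = m))
round(cbind(raw = raw_p, adj), 3)

# One decision rule for all six: count adjusted p-values below the threshold
colSums(adj < 0.05)
```

Each column of adj is one method applied to the identical input, which is what makes the philosophies directly comparable.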
Read the table left to right and the philosophy of each method jumps out. Bonferroni is the strictest: it multiplies every raw p-value by m (here 10), so a raw p of 0.04 becomes 0.40 and loses significance. Holm is uniformly less conservative than Bonferroni but still controls FWER. BH is visibly looser because it controls a different error rate, the expected proportion of false rejections among discoveries. At alpha = 0.05 the Bonferroni column rejects 2 hypotheses, Holm rejects 2, BH rejects 5. Same raw data, different philosophies, different counts.
The pattern is always p_adj <- p.adjust(raw_p, method = "..."), then sum(p_adj < threshold). The only thing that changes is which method matches your error-rate goal and what threshold you compare against.

Try it: Compute Bonferroni-adjusted p-values for c(0.001, 0.01, 0.04, 0.20, 0.50) by hand using the rule min(1, p * m), then verify with p.adjust(..., method = "bonferroni"). Save both vectors and check they match.
Click to reveal solution
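The revealed solution code is not included in this page; a sketch along the lines the explanation describes (variable names are my own):

```r
p <- c(0.001, 0.01, 0.04, 0.20, 0.50)
m <- length(p)

# (a) By hand: multiply every raw p-value by m, cap at 1 (pmin is the vectorised cap)
by_hand <- pmin(1, p * m)

# (b) Via p.adjust()
via_adjust <- p.adjust(p, method = "bonferroni")

by_hand                    # 0.005 0.05 0.20 1.00 1.00
all(by_hand == via_adjust) # TRUE
```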
Explanation: Bonferroni is the simplest of all corrections. Multiply every raw p-value by m, the number of tests, and cap at 1. pmin(x, 1) is the vectorised cap. Running this by hand cements the idea that "corrected p-value" is just a rescaled raw p-value, not some mysterious transformation.
How do you count discoveries after correction?
Once you have a vector of adjusted p-values, the question every analyst asks is "how many did I find?" The answer is a one-liner: sum(p_adj < threshold). For FWER methods the threshold is alpha (usually 0.05). For BH the threshold is your FDR target (often 0.05 or 0.10, depending on tolerance for false discoveries). A quick simulation shows the difference between raw and corrected counts.
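The original simulation code is not shown; the deterministic stand-in below reproduces the counts discussed next (90 evenly spread "null" p-values, 5 of which dip under 0.05, plus 10 tiny "alternative" p-values):

```r
# Deterministic stand-in for the simulation (the original code is not shown):
# 90 "null" p-values spread over (0, 1) and 10 genuinely small "alternative" ones
p_null <- seq(0.006, 0.95, length.out = 90)    # exactly 5 of these fall below 0.05
p_alt  <- seq(0.0004, 0.004, length.out = 10)  # all genuinely small
pv     <- c(p_alt, p_null)

sum(pv < 0.05)                           # raw count: 15 (10 real + 5 by chance)
sum(p.adjust(pv, method = "BH") < 0.05)  # BH count: 10, matching the real signals
```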
The raw count of 15 is inflated: 10 of those came from the genuinely small-p alternative block, and 5 came from the 90 null p-values that happened to dip below 0.05 by chance. BH correctly pulls the count down to 10, matching the number of real signals in the simulation. The mental model: raw p-values answer "could this one test be significant?" while adjusted p-values answer "could this test be significant given I ran m of them?"
The counting idiom is sum(p_adj < target), and it works for every correction method. Swap "BH" for "bonferroni" or "holm" and the counting line does not change, which is why it recurs in every exercise below.

Try it: The vector ex_pv below simulates 50 p-values. Apply BH and count how many survive at an FDR target of 0.10. Save the count to ex_bh_count.
Click to reveal solution
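The starter that defines ex_pv is not included in this page; the deterministic stand-in below matches its description (10 alternatives below 0.02, 40 nulls well above it), so the count comes out as the explanation states:

```r
# Stand-in for the missing starter: 10 small "alternative" p-values and 40 nulls
ex_pv <- c(seq(0.0005, 0.0185, length.out = 10),
           seq(0.06, 0.99, length.out = 40))

ex_bh <- p.adjust(ex_pv, method = "BH")
ex_bh_count <- sum(ex_bh < 0.10)
ex_bh_count   # 10
```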
Explanation: The 10 "true alternatives" were drawn from Uniform(0, 0.02), so all 10 raw p-values are tiny and survive BH easily. None of the 40 null p-values dip low enough to pass the BH step-up threshold at FDR = 0.10, so the final count exactly matches the number of real signals.
Practice Exercises
Eight capstone problems, ordered from hand-calculation drills to multi-method comparisons to applied scenarios. Each exercise uses a distinct ex1_ to ex8_ prefix so your solution variables do not clobber earlier state.
Exercise 1: Bonferroni by hand on six p-values
You are told a colleague ran 6 tests and the raw p-values are c(0.001, 0.008, 0.039, 0.041, 0.06, 0.2). Compute the Bonferroni-adjusted p-values two ways: (a) by hand using p * 6 capped at 1, and (b) with p.adjust(..., method = "bonferroni"). Count how many hypotheses survive at alpha = 0.05.
Click to reveal solution
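A solution sketch (the revealed code is not included in this page, so variable names beyond ex1_p are assumptions):

```r
ex1_p <- c(0.001, 0.008, 0.039, 0.041, 0.06, 0.2)
ex1_m <- length(ex1_p)

# (a) By hand: multiply by m = 6, cap at 1
ex1_hand <- pmin(1, ex1_p * ex1_m)

# (b) With p.adjust()
ex1_bonf <- p.adjust(ex1_p, method = "bonferroni")

cbind(raw = ex1_p, hand = ex1_hand, p.adjust = ex1_bonf)
sum(ex1_bonf < 0.05)   # 2 survivors: raw 0.001 and 0.008
```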
Explanation: Only two hypotheses survive Bonferroni at alpha = 0.05: the raw p-values of 0.001 and 0.008. The test at p = 0.039 had a plausible raw signal on its own but loses significance once multiplied by 6. Bonferroni is conservative precisely because it treats every test as if it might be the one false alarm you care about.
Exercise 2: Holm step-down on the same vector
Using the same vector ex1_p, apply p.adjust(..., method = "holm"). How many hypotheses survive at alpha = 0.05? Compare the Holm output element-by-element with Bonferroni and identify any p-value where the two methods differ.
Click to reveal solution
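A solution sketch (ex1_p is redefined so the chunk stands alone; other names are my own):

```r
ex1_p <- c(0.001, 0.008, 0.039, 0.041, 0.06, 0.2)

ex2_holm <- p.adjust(ex1_p, method = "holm")
ex2_bonf <- p.adjust(ex1_p, method = "bonferroni")

round(cbind(raw = ex1_p, holm = ex2_holm, bonferroni = ex2_bonf), 3)
sum(ex2_holm < 0.05)        # 2, the same rejections as Bonferroni here
which(ex2_holm < ex2_bonf)  # positions where Holm is strictly less conservative
```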
Explanation: Holm and Bonferroni agree on the smallest p-value (both give 0.006) but Holm uses a smaller multiplier on the second smallest (m - 1 = 5 instead of m = 6), giving 0.040 instead of 0.048. At alpha = 0.05 both methods reject the same two hypotheses, but notice Holm's adjusted p-values never exceed Bonferroni's. Holm is uniformly less conservative while still controlling FWER, which is why standard practice recommends Holm over plain Bonferroni for FWER control.
Exercise 3: Benjamini-Hochberg FDR, by hand and with p.adjust
Apply BH correction to ex1_p with FDR target q = 0.10. Verify the result manually: sort the p-values, compute the critical line (k/m) * q for k = 1..m, and find the largest k where p(k) <= critical(k). Reject all p-values with rank less than or equal to that k.
Click to reveal solution
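A solution sketch implementing the manual step-up alongside p.adjust() (variable names are my own):

```r
ex1_p <- c(0.001, 0.008, 0.039, 0.041, 0.06, 0.2)
ex3_q <- 0.10
ex3_m <- length(ex1_p)

# Manual BH step-up: sort, compare p(k) to (k/m) * q, take the largest passing k
ex3_sorted <- sort(ex1_p)
ex3_crit   <- (seq_len(ex3_m) / ex3_m) * ex3_q
ex3_k      <- max(which(ex3_sorted <= ex3_crit))
ex3_k                                          # 5: reject the 5 smallest p-values

# Confirm with p.adjust()
sum(p.adjust(ex1_p, method = "BH") < ex3_q)    # 5
```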
Explanation: Walking up the sorted p-values, every one up to rank 5 sits at or below its critical line (k/m) * 0.10. The largest such k is 5, so BH rejects the 5 smallest p-values. p.adjust() gets the same answer, confirming our manual calculation. Compare to Exercise 1 where Bonferroni only rejected 2 at alpha = 0.05: BH trades a stricter error rate for higher power.
p.adjust() enforces that the sequence of sorted adjusted p-values is non-decreasing, which prevents rank reversals that would otherwise allow a larger raw p-value to look "more significant" than a smaller one after adjustment.

Exercise 4: Compare Bonferroni, Holm, and BH on 20 p-values
Given the 20 p-values in the starter, build a data frame with columns raw, bonf, holm, BH, sorted by raw p-value. Show the first 10 rows. Report the number of discoveries at alpha = 0.05 for each method.
Click to reveal solution
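The 20-value starter is not included in this page; the stand-in below plants 5 small "signal" p-values (all under 0.01, as the explanation describes) and 15 larger nulls, chosen so the discovery counts match the explanation:

```r
# Stand-in starter: 5 planted signals (< 0.01) and 15 nulls
ex4_p <- c(0.0005, 0.0012, 0.0026, 0.004, 0.008,
           seq(0.08, 0.95, length.out = 15))

ex4_df <- data.frame(
  raw  = ex4_p,
  bonf = p.adjust(ex4_p, method = "bonferroni"),
  holm = p.adjust(ex4_p, method = "holm"),
  BH   = p.adjust(ex4_p, method = "BH")
)
ex4_df <- ex4_df[order(ex4_df$raw), ]
head(ex4_df, 10)

# Discoveries per method at alpha = 0.05
colSums(ex4_df[, c("bonf", "holm", "BH")] < 0.05)   # 2, 3, 5
```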
Explanation: BH finds all 5 true signals that we planted in the runif(5, 0, 0.01) block. Holm catches 3 of them, Bonferroni catches only 2. The ranking BH >= Holm >= Bonferroni in discovery count almost always holds, because BH controls a looser error rate (FDR) while the other two control FWER. Which method you use depends on the cost of a false discovery versus the cost of missing a real one.
Exercise 5: Genomic-style simulation with 1000 p-values
Simulate a "high-throughput" scenario: 950 null p-values from Uniform(0, 1) and 50 true alternatives from Beta(0.5, 50), which pushes mass toward zero. Apply BH at FDR = 0.05. Report the total number of discoveries and, using the true label vector, how many are true positives.
Click to reveal solution
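A sketch with an arbitrary seed; the original starter's seed is not shown, so the exact counts (48 discoveries with 47 true positives in the write-up) will vary from run to run:

```r
set.seed(42)  # arbitrary seed; the original starter's seed is not shown
ex5_is_alt <- c(rep(FALSE, 950), rep(TRUE, 50))
ex5_p <- c(runif(950),           # 950 nulls from Uniform(0, 1)
           rbeta(50, 0.5, 50))   # 50 alternatives with mass pushed toward zero

ex5_bh   <- p.adjust(ex5_p, method = "BH")
ex5_disc <- ex5_bh < 0.05

sum(ex5_disc)                # total discoveries at FDR = 0.05
sum(ex5_disc & ex5_is_alt)   # how many of them are true positives
```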
Explanation: BH flags 48 hypotheses as significant: 47 of them are true alternatives and 1 is a null that got caught in the net. The realised false discovery proportion is 1/48, roughly 2.1%, comfortably below the 5% FDR target. This is exactly what BH promises: the expected proportion of false rejections among discoveries is at most q = 0.05.
Exercise 6: FDR vs FWER tradeoff table
Using the Ex5 vector ex5_p and the ground truth ex5_is_alt, build a 3-row table comparing Bonferroni, Holm, and BH. Columns: method, discoveries, true_positives, false_positives, power. Define power as true_positives / 50 (the number of real alternatives).
Click to reveal solution
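A sketch building on the Exercise 5 setup, regenerated here with an arbitrary seed so the chunk stands alone (exact numbers vary with the seed):

```r
set.seed(42)  # arbitrary seed; exact numbers will vary
ex5_is_alt <- c(rep(FALSE, 950), rep(TRUE, 50))
ex5_p <- c(runif(950), rbeta(50, 0.5, 50))

# One row per method: discoveries, error breakdown, and power
ex6_methods <- c("bonferroni", "holm", "BH")
ex6_tab <- do.call(rbind, lapply(ex6_methods, function(m) {
  rej <- p.adjust(ex5_p, method = m) < 0.05
  data.frame(method          = m,
             discoveries     = sum(rej),
             true_positives  = sum(rej & ex5_is_alt),
             false_positives = sum(rej & !ex5_is_alt),
             power           = sum(rej & ex5_is_alt) / 50)
}))
ex6_tab
```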
Explanation: Bonferroni and Holm each produce zero false positives because they control the family-wise error rate, but they pay for it: power drops from 96% (BH) to around 50%. BH doubles the discovery count while letting through only one false positive, exactly the tradeoff its name ("false discovery rate") advertises. Which method you want depends on the cost of missing a real effect compared to the cost of chasing a false one.
Exercise 7: Pairwise comparisons with pairwise.t.test
Using the iris dataset, run pairwise.t.test() comparing Sepal.Length across Species two ways: once with p.adjust.method = "holm" (the default) and once with p.adjust.method = "none". Extract the $p.value matrix from each. How do the adjusted and unadjusted matrices differ?
Click to reveal solution
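A solution sketch (variable names are my own; iris ships with base R):

```r
# Holm-adjusted (the default) and unadjusted pairwise comparisons
ex7_holm <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                            p.adjust.method = "holm")
ex7_none <- pairwise.t.test(iris$Sepal.Length, iris$Species,
                            p.adjust.method = "none")

ex7_holm$p.value   # 2x2 lower-triangular matrix, NA above the diagonal
ex7_none$p.value

# Elementwise ratio reveals the Holm multiplier applied to each comparison
ex7_holm$p.value / ex7_none$p.value
```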
Explanation: There are m = 3 pairwise comparisons here, all with extremely small raw p-values, so even a strict Holm correction barely changes them. The only visible change is in the setosa-vs-virginica cell: raw 1.1e-25 becomes 2.2e-25 under Holm (multiplied by m - k + 1 = 2 as the second-smallest p-value in the step-down ordering). When raw p-values are this small the choice of correction method matters little, but with borderline p-values around 0.01-0.05 the adjustment makes or breaks the decision.
The $p.value matrix is lower-triangular, with the cells above the diagonal set to NA. Feed the matrix to as.dist() if you need a compact vector for further processing.

Exercise 8: A/B test across 15 variants
You ran 15 simultaneous A/B tests against a control with baseline conversion 8%. Fourteen variants perform at baseline (n = 2000 each, null); one variant has a real +2 percentage-point lift (n = 2000). Generate the data, run a two-proportion test per variant, collect p-values, and apply BH at FDR = 0.05. Report which variants are declared winners.
Click to reveal solution
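A sketch with an arbitrary seed (the original starter is not shown, so which variants cross the threshold will vary from run to run; the write-up's variant 7 is used as the true lift per the solution text):

```r
set.seed(7)   # arbitrary seed; the original starter is not shown
ex8_n <- 2000
ex8_rates <- rep(0.08, 15)
ex8_rates[7] <- 0.10   # the solution text identifies variant 7 as the real +2pp lift

ex8_control  <- rbinom(1, ex8_n, 0.08)
ex8_variants <- rbinom(15, ex8_n, ex8_rates)

# Two-proportion test of each variant against the shared control
ex8_p <- sapply(seq_len(15), function(i)
  prop.test(c(ex8_variants[i], ex8_control), c(ex8_n, ex8_n))$p.value)

ex8_bh <- p.adjust(ex8_p, method = "BH")
which(ex8_bh < 0.05)   # variants declared winners at FDR = 0.05
```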
Explanation: BH correctly identifies variant 7 as the single winner. Its raw p-value of 0.0015 becomes a BH-adjusted 0.0225, safely under the 0.05 FDR cutoff. None of the 14 null variants comes close: the best-performing null, variant 6, has a raw p of 0.196, nowhere near significant even before correction. This exercise mirrors the "15 experiments, one winner" problem real A/B platforms solve every day: run every test, correct once at the end, and only ship the variants whose adjusted p-values survive.
Complete Example: Gene-expression-style workflow
A realistic application ties everything together. Simulate 500 p-values from a mixture of 450 nulls and 50 alternatives, visualise the raw distribution, apply BH, and plot the decision against the step-up critical line. This is the standard workflow in genomics, neuroimaging, and any high-dimensional testing problem.
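The workflow code is not shown on this page; a sketch with an arbitrary seed (counts such as "42 of 50 found" vary with it) covering simulation, histogram, BH decision, and the step-up plot:

```r
set.seed(99)  # arbitrary seed; the counts quoted in the text vary with it
m <- 500
is_alt <- c(rep(FALSE, 450), rep(TRUE, 50))
pv <- c(runif(450), rbeta(50, 0.5, 50))

# Diagnostic histogram: flat null pedestal plus a spike near zero
hist(pv, breaks = 40, main = "Raw p-values", xlab = "p")

# BH decision at FDR q = 0.05
q <- 0.05
bh <- p.adjust(pv, method = "BH")
n_found <- sum(bh < q)

# Step-up geometry over the first 100 ranks:
# sorted p-values against the critical line (k/m) * q
k <- seq_len(m)
plot(k[1:100], sort(pv)[1:100], pch = 16, cex = 0.6,
     xlab = "rank k", ylab = "sorted p-value")
abline(0, q / m, col = "red")   # BH critical line, slope q/m
abline(v = n_found, lty = 2)    # rightmost rejected rank
```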
The histogram shows the diagnostic signature of a multiple-testing problem: a flat pedestal across [0, 1] from the 450 null p-values, plus a spike near zero from the 50 true alternatives. BH correctly finds 42 of the 50 signals at an FDR of 5%. The remaining 8 alternatives have raw p-values that blend into the null distribution, so no correction method could separate them from the 450 nulls with certainty.
Zooming into the first 100 ranks shows the geometry that makes BH tick. The sorted p-values (points) climb slowly near the origin, then shoot up once you run out of real signal. The BH critical line (k/m)*q is a straight line from the origin with slope q/m. BH rejects every rank up to the rightmost point that sits at or below the line: here that is rank 42, marked by the dashed vertical. It is a fundamentally different decision rule from Bonferroni, which would draw a horizontal line at alpha/m = 0.0001 and reject only the handful of points below it.
Summary
p.adjust() is the one function that powers every correction method in R. Swap the method argument, keep the decision rule, and you move between FWER and FDR control with a single line change.
| Method | Type | Strictness | When to use |
|---|---|---|---|
| bonferroni | FWER | Strictest | Few tests, very low false-positive tolerance |
| holm | FWER | Uniformly better than Bonferroni | Default FWER choice |
| hochberg | FWER | Assumes independent or positively correlated tests | When tests are clearly independent |
| hommel | FWER | Most powerful FWER, slow | Small m, independence |
| BH | FDR | Standard FDR | High-throughput (genomics, A/B farms) |
| BY | FDR | Conservative FDR under dependence | Correlated tests where BH is too loose |
References
- Benjamini, Y. & Hochberg, Y., Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing, JRSS-B (1995).
- Holm, S., A Simple Sequentially Rejective Multiple Test Procedure, Scandinavian Journal of Statistics (1979).
- R Core Team, p.adjust() reference documentation, stats package.
- Goeman, J. & Solari, A., Multiple Hypothesis Testing in Genomics, Statistics in Medicine (2014).
- Efron, B., Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Cambridge University Press (2010).
- Storey, J. D. & Tibshirani, R., Statistical Significance for Genomewide Studies, PNAS (2003).
Continue Learning
- Multiple Testing in R, the parent tutorial on why FWER inflates, the theory behind BH, and every p.adjust() method explained in depth.
- Hypothesis Testing Exercises in R, 12 drills on single-test p-values, the building blocks of multiple testing.
- t-Test Exercises in R, paired, two-sample, and one-sample t-test practice, a natural feeder into pairwise comparison problems.