r-statistics.co

Multiple Testing Correction

When you run many statistical tests at once, some will look significant by chance alone. Bonferroni, Holm, BH (FDR), and BY adjust the p-values to control this. Paste your p-value vector, pick a method, and see which results survive correction, with the math worked out step by step.

New to multiple comparisons? Read the 4-min primer

What it is. Run one statistical test at α = 0.05 and you have a 5% chance of a false alarm by luck alone. Run twenty tests and you'd expect about one; run a thousand, about fifty. Multiple testing correction tightens the rule for “significant” so the chance of being fooled stays under control no matter how many tests are in the family.
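The arithmetic behind those numbers takes only a few lines. A quick sketch (Python, purely illustrative - the tool itself runs in the browser): the expected number of false positives is m·α, and the chance of at least one, for independent tests, is 1 − (1 − α)^m.

```python
# Expected false positives and family-wise error rate for m independent
# null tests at level alpha. Illustrative arithmetic only.
alpha = 0.05
for m in (1, 20, 1000):
    expected_fp = m * alpha              # expected count of false alarms
    fwer = 1 - (1 - alpha) ** m          # P(at least one false alarm)
    print(f"m={m:5d}  expected false positives={expected_fp:6.1f}  FWER={fwer:.3f}")
```

At m = 20 the expected count is exactly one and the FWER is already about 64%; by m = 1000 a false alarm is essentially certain.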

How to read it. Two error rates. Family-wise error rate (FWER): the probability of at least one false positive anywhere in the family - controlled by Bonferroni and Holm. False discovery rate (FDR): the expected fraction of false positives among the calls you declare significant - controlled by BH and BY. FWER is strict; FDR is more powerful for screens.

The recipe. Sort your m p-values: p(1) ≤ p(2) ≤ … ≤ p(m). Bonferroni: reject test i if p(i) ≤ α/m. Holm: walk up the sorted list, rejecting while p(i) ≤ α/(m−i+1); stop at the first failure. BH: walk down from the largest, find the largest i with p(i) ≤ (i/m)·α, and reject that test and every smaller one. BY: same as BH but with the threshold divided by an extra factor c(m) = Σ 1/k (k = 1..m) to cover dependent tests.
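The three decision rules can be sketched in a few lines each (Python here for illustration; the tool's own code and R's p.adjust() are the references). Each function returns the set of indices, into the original vector, rejected at level α:

```python
# Sketch of the Bonferroni, Holm (step-down) and BH (step-up) decision rules.

def bonferroni(p, alpha=0.05):
    m = len(p)
    return {i for i, pi in enumerate(p) if pi <= alpha / m}

def holm(p, alpha=0.05):
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    rejected = set()
    for rank, i in enumerate(order):        # walk up the sorted list
        if p[i] <= alpha / (m - rank):      # 0-based rank: m - rank == m - i + 1 for 1-based i
            rejected.add(i)
        else:
            break                           # stop at the first failure
    return rejected

def bh(p, alpha=0.05):
    m = len(p)
    order = sorted(range(m), key=lambda i: p[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p[i] <= rank / m * alpha:        # largest rank satisfying p(i) <= (i/m)*alpha
            k = rank
    return set(order[:k])                   # reject that test and every smaller one
```

Note the structural difference: Holm stops at the first failure walking up, while BH keeps the largest rank that ever passes, so a passing test can rescue smaller p-values above a local failure.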

Picking the method. Pre-registered confirmatory tests where any false positive matters? Bonferroni or Holm. Genomics, brain imaging, large screens? BH. Tests are correlated and you still want FDR? BY. Decide before peeking at the corrected results - choosing the method to maximise survivors is p-hacking.

4 methods · one tool · Bonferroni · Holm · BH · BY · Runs in your browser

Pick a real-world example to load.

🧬 RNA-seq 20 genes

P-values for twenty genes from an RNA-seq differential-expression screen. Most are null; we want FDR-controlled discoveries.


Anatomy of multiple-testing correction
Bonferroni: adj_p(i) = min(1, m · p(i)); reject if adj_p(i) ≤ α.
Bonferroni. Multiply every p-value by the family size m, cap at 1. Controls the family-wise error rate strictly. The simplest correction; also the most conservative - it leaves real findings on the table when m is large.
Holm (step-down): sort p-values p(1) ≤ … ≤ p(m); adj_p(i) = max over j ≤ i of (m − j + 1) · p(j), capped at 1 and kept monotone.
Holm. Walk up the sorted p-values, scaling each by a smaller factor than Bonferroni. Same FWER guarantee, uniformly more powerful - rejects every test Bonferroni rejects, and sometimes more.
BH (Benjamini–Hochberg, FDR): adj_p(i) = min over j ≥ i of (m / j) · p(j); reject if adj_p(i) ≤ α.
BH. The default for genomics, brain imaging, screens. Controls the expected false discovery rate (the fraction of significant calls that are false). Far more powerful than FWER methods when m is large and many true alternatives exist.
BY (Benjamini–Yekutieli): c(m) = Σ 1/k for k = 1..m; adj_p(i) = min over j ≥ i of (m · c(m) / j) · p(j).
BY. The robust cousin of BH. The harmonic factor c(m) bakes in worst-case dependence between tests - useful when test statistics are correlated (genes in the same pathway, voxels in the same region). Slightly less powerful than BH; valid under arbitrary dependence.
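The four adjusted-p-value formulas above fit in one function. A minimal sketch (Python, illustrative only - R's p.adjust() is the reference implementation the tool is checked against):

```python
import math

# Adjusted p-values per the formulas above. Returns values in the
# original input order; reject where adj <= alpha.
def adjust(p, method="BH"):
    m = len(p)
    if method == "bonferroni":
        return [min(1.0, m * pi) for pi in p]
    order = sorted(range(m), key=lambda i: p[i])
    adj = [0.0] * m
    if method == "holm":
        running = 0.0                      # max over j <= i of (m - j + 1) * p(j)
        for rank, i in enumerate(order, start=1):
            running = max(running, (m - rank + 1) * p[i])
            adj[i] = min(1.0, running)
        return adj
    # BH, or BY with the harmonic factor c(m) = sum 1/k
    c = sum(1.0 / k for k in range(1, m + 1)) if method == "BY" else 1.0
    running = math.inf                     # min over j >= i of (m * c / j) * p(j)
    for rank, i in reversed(list(enumerate(order, start=1))):
        running = min(running, m * c / rank * p[i])
        adj[i] = min(1.0, running)
    return adj
```

The running max (Holm) and running min (BH/BY) are what make the adjusted values monotone in the sorted order, so a single threshold on adj_p reproduces the step-down and step-up walks.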
Caveats: when this is the wrong tool
If you have… → use instead:
One primary endpoint and many secondary tests → a fixed-sequence or hierarchical procedure (Maurer–Bretz). A blanket correction over the whole family wastes power.
Hierarchical / structured tests (e.g., genomic regions, brain ROIs) → group-FDR or hierarchical FDR (Yekutieli, Heller) - better power for structured families than flat BH.
Highly dependent tests (linkage, time series, repeated measures) → BY (FDR under dependence) or a permutation-based correction. Plain BH is guaranteed only under independence or positive dependence (PRDS).
Sequential / interim looks at a trial → group-sequential or alpha-spending designs (O'Brien–Fleming, Pocock). A single-look correction is the wrong model.
Storey q-value with bootstrap π̂0 smoothing → this tool ships Bonferroni / Holm / BH / BY. For full smoothing or bootstrap π0, use qvalue::qvalue() in R.
You're picking the method after seeing the corrected p-values → stop. Pick the method first, then run the data through it. The other order is p-hacking.

Numerical accuracy: adjusted p-values match R's p.adjust() to machine precision; cross-checked against p.adjust() with each of method = "bonferroni", "holm", "BH", "BY" over ≥ 30 input vectors.