Multiple Testing Correction
When you run many statistical tests at once, some will look significant by chance alone. Bonferroni, Holm, BH (FDR), and BY adjust the p-values to control this. Paste your p-value vector, pick a method, and see which results survive correction, with the math worked out step by step.
New to multiple comparisons? Read the 4-minute primer below.
What it is. Run one statistical test at the usual α = 0.05 and you have a 5% chance of a false alarm by luck alone. Run twenty tests and you'd expect about one; run a thousand and expect fifty. Multiple testing correction tightens the rule for “significant” so the chance of being fooled stays under control no matter how many tests are in the family.
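To make the arithmetic concrete, a quick check in R (assuming independent tests at α = 0.05):

```r
# Under the global null with independent tests at alpha = 0.05:
m <- c(1, 20, 1000)
data.frame(tests = m,
           expected_false_positives = 0.05 * m,          # 0.05, 1, 50
           prob_at_least_one        = 1 - (1 - 0.05)^m)  # 0.05, ~0.64, ~1
```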
How to read it. Two error rates. Family-wise error rate (FWER): the probability of even one false positive across the family - controlled by Bonferroni and Holm. False discovery rate (FDR): the expected fraction of false positives among the calls you make significant - controlled by BH and BY. FWER is strict; FDR is more powerful for screens.
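In symbols, with V the number of false positives and R the total number of rejections: FWER = P(V ≥ 1), while FDR = E[V / max(R, 1)]. Bonferroni and Holm keep the first at or below α; BH and BY keep the second at or below α.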
The recipe. Sort your m p-values: p(1) ≤ p(2) ≤ … ≤ p(m). Bonferroni: reject any test with p ≤ α/m. Holm: walk up the sorted list, rejecting while p(i) ≤ α/(m−i+1); stop at the first failure. BH: walk down from the largest; the first i where p(i) ≤ (i/m)·α sets the cutoff, and every test at or below it is rejected, even ones that fail their own threshold. BY: same as BH but with each threshold divided by c(m) = 1 + 1/2 + … + 1/m to allow for dependent tests.
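The same recipes, expressed as adjusted p-values rather than thresholds (a minimal R sketch; the function name `adjust` is ours, and R's built-in `p.adjust()` is the reference implementation):

```r
adjust <- function(p, method = c("bonferroni", "holm", "BH", "BY")) {
  method <- match.arg(method)
  m <- length(p)
  switch(method,
    bonferroni = pmin(1, m * p),
    holm = {                                  # step-down: multipliers m, m-1, ..., 1
      o <- order(p); ro <- order(o)
      pmin(1, cummax((m - seq_len(m) + 1) * p[o]))[ro]
    },
    BH = {                                    # step-up: scan from the largest p down
      o <- order(p, decreasing = TRUE); ro <- order(o)
      pmin(1, cummin(m / (m:1) * p[o]))[ro]
    },
    BY = {                                    # BH with the c(m) dependence penalty
      cm <- sum(1 / seq_len(m))
      o <- order(p, decreasing = TRUE); ro <- order(o)
      pmin(1, cummin(cm * m / (m:1) * p[o]))[ro]
    }
  )
}
```

The cummax/cummin calls are what make Holm step-down and BH step-up: each enforces the monotonicity that “reject everything at or below the cutoff” implies. `adjust(p, "BH")` should agree with `p.adjust(p, method = "BH")` elementwise.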
Picking the method. Pre-registered confirmatory tests where any false positive matters? Bonferroni or Holm. Genomics, brain imaging, large screens? BH. Tests are correlated and you still want FDR? BY. Decide before peeking at the corrected results - choosing the method to maximise survivors is p-hacking.
Try a real-world example: twenty candidate genes from an RNA-seq differential-expression screen. Most are null; we want FDR-controlled discoveries.
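What that looks like in R (the p-values below are made up for illustration):

```r
# Hypothetical RNA-seq screen: 20 p-values, a few real signals among nulls
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.041, 0.060, 0.074,
       0.205, 0.212, 0.216, 0.222, 0.251, 0.310, 0.444, 0.480,
       0.610, 0.740, 0.880, 0.965)
q <- p.adjust(p, method = "BH")   # BH-adjusted p-values
which(q <= 0.05)                  # FDR-controlled discoveries at the 5% level
```

With these numbers the four smallest p-values survive at the 5% FDR level; the fifth (0.020, nominally “significant”) does not.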
Read more: anatomy of multiple-testing correction
c(m) bakes in worst-case dependence between tests - useful when test statistics are correlated (genes in the same pathway, voxels in the same region). Slightly less powerful than BH; valid under arbitrary dependence.
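To see the size of that penalty: c(m) is the m-th harmonic number, which grows like log m.

```r
# BY thresholds are c(m) times stricter than BH
m  <- c(10, 20, 100, 1000)
cm <- sapply(m, function(n) sum(1 / seq_len(n)))
round(cm, 2)   # ~2.93, 3.60, 5.19, 7.49
```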
Caveats: when this is the wrong tool

| If you have… | Use instead |
| --- | --- |
| One primary endpoint and many secondary tests | A fixed-sequence or hierarchical procedure (Maurer–Bretz). A blanket correction over the whole family wastes power. |
| Hierarchical / structured tests (e.g., genomic regions, brain ROIs) | Group-FDR or hierarchical FDR (Yekutieli, Heller) - better power for structured families than flat BH. |
| Highly dependent tests (linkage, time series, repeated measures) | BY (FDR under dependence) or a permutation-based correction. BH assumes weak dependence. |
| Sequential / interim looks at a trial | Group-sequential or alpha-spending designs (O'Brien–Fleming, Pocock). Single-look correction is the wrong model. |
| A need for Storey q-values with bootstrap π̂0 smoothing | This tool ships Bonferroni / Holm / BH / BY. For full smoothing or bootstrap π̂0, use qvalue::qvalue() in R. |
| You're picking the method after seeing the corrected p-values | Stop. Pick the method first; the data go through it. The other order is p-hacking. |
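For the dependent-tests row above, a permutation correction can be sketched in a few lines (Westfall–Young-style single-step max-T; the data and group labels below are simulated for illustration):

```r
set.seed(1)
X   <- matrix(rnorm(20 * 40), nrow = 20)          # 20 features x 40 samples, simulated
grp <- rep(c(0, 1), each = 20)                    # two hypothetical groups of 20
tstat <- function(g) apply(X, 1, function(x) abs(t.test(x[g == 0], x[g == 1])$statistic))
obs   <- tstat(grp)                               # observed |t| per feature
maxT  <- replicate(999, max(tstat(sample(grp))))  # max |t| under label permutations
p_adj <- sapply(obs, function(t0) (1 + sum(maxT >= t0)) / (1 + length(maxT)))
```

Permuting labels preserves whatever correlation structure the features have, so this controls FWER without the independence assumptions baked into the analytic formulas.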
Related reading
- The multiple-testing problem, explained - why “did you control for multiplicity?” ends careers.
- False discovery rate, intuitively - what BH actually does, and why it's the genomics default.
- What a p-value really means - especially relevant when you have many of them.
- Confidence interval calculator - for the inverse problem: how precise is each individual estimate?
Numerical accuracy: adjusted p-values match R's p.adjust() to machine precision, cross-checked with each of method = "bonferroni", "holm", "BH", and "BY" over ≥ 30 input vectors.