Multiple Testing Correction
When you run many statistical tests at once, some will look significant by chance alone. Bonferroni, Holm, BH (FDR), and BY adjust the p-values to control this. Paste your p-value vector, pick a method, and see which results survive correction, with the math worked out step by step.
New to multiple comparisons? Read the 4-minute primer below.
What it is. Run one statistical test at the usual α = 0.05 and you have a 5% chance of a false alarm by luck alone. Run twenty tests and you'd expect about one; run a thousand and expect fifty. Multiple testing correction tightens the rule for “significant” so the chance of being fooled stays under control no matter how many tests are in the family.
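To make the arithmetic concrete, a quick check in R (assuming independent tests at α = 0.05):

```r
# Under the global null with independent tests at alpha = 0.05:
m <- c(1, 20, 1000)
data.frame(tests = m,
           expected_false_positives = 0.05 * m,          # 0.05, 1, 50
           prob_at_least_one        = 1 - (1 - 0.05)^m)  # 0.05, ~0.64, ~1
```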
How to read it. Two error rates. Family-wise error rate (FWER): the probability of even one false positive across the family - controlled by Bonferroni and Holm. False discovery rate (FDR): the expected fraction of false positives among the calls you make significant - controlled by BH and BY. FWER is strict; FDR is more powerful for screens.
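In symbols, with V the number of false positives and R the total number of rejections: FWER = P(V ≥ 1), while FDR = E[V / max(R, 1)]. Bonferroni and Holm keep the first at or below α; BH and BY keep the second at or below α.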
The recipe. Sort your m p-values: p(1) ≤ p(2) ≤ … ≤ p(m). Bonferroni: reject any test with p ≤ α/m. Holm: walk up the sorted list, rejecting while p(i) ≤ α/(m−i+1); stop at the first failure. BH: walk down from the largest; the first i where p(i) ≤ (i/m)·α sets the cutoff, and every test at or below it is rejected, even ones that fail their own threshold. BY: same as BH but with each threshold divided by c(m) = 1 + 1/2 + … + 1/m to allow for dependent tests.
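The same recipes, expressed as adjusted p-values rather than thresholds (a minimal R sketch; the function name `adjust` is ours, and R's built-in `p.adjust()` is the reference implementation):

```r
adjust <- function(p, method = c("bonferroni", "holm", "BH", "BY")) {
  method <- match.arg(method)
  m <- length(p)
  switch(method,
    bonferroni = pmin(1, m * p),
    holm = {                                  # step-down: multipliers m, m-1, ..., 1
      o <- order(p); ro <- order(o)
      pmin(1, cummax((m - seq_len(m) + 1) * p[o]))[ro]
    },
    BH = {                                    # step-up: scan from the largest p down
      o <- order(p, decreasing = TRUE); ro <- order(o)
      pmin(1, cummin(m / (m:1) * p[o]))[ro]
    },
    BY = {                                    # BH with the c(m) dependence penalty
      cm <- sum(1 / seq_len(m))
      o <- order(p, decreasing = TRUE); ro <- order(o)
      pmin(1, cummin(cm * m / (m:1) * p[o]))[ro]
    }
  )
}
```

The cummax/cummin calls are what make Holm step-down and BH step-up: each enforces the monotonicity that “reject everything at or below the cutoff” implies. `adjust(p, "BH")` should agree with `p.adjust(p, method = "BH")` elementwise.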
Picking the method. Pre-registered confirmatory tests where any false positive matters? Bonferroni or Holm. Genomics, brain imaging, large screens? BH. Tests are correlated and you still want FDR? BY. Decide before peeking at the corrected results - choosing the method to maximise survivors is p-hacking.
Try a real-world example: twenty candidate genes from an RNA-seq differential-expression screen. Most are null; we want FDR-controlled discoveries.
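What that looks like in R (the p-values below are made up for illustration):

```r
# Hypothetical RNA-seq screen: 20 p-values, a few real signals among nulls
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.020, 0.041, 0.060, 0.074,
       0.205, 0.212, 0.216, 0.222, 0.251, 0.310, 0.444, 0.480,
       0.610, 0.740, 0.880, 0.965)
q <- p.adjust(p, method = "BH")   # BH-adjusted p-values
which(q <= 0.05)                  # FDR-controlled discoveries at the 5% level
```

With these numbers the four smallest p-values survive at the 5% FDR level; the fifth (0.020, nominally “significant”) does not.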
Read more: anatomy of multiple-testing correction
c(m) bakes in worst-case dependence between tests - useful when test statistics are correlated (genes in the same pathway, voxels in the same region). Slightly less powerful than BH; valid under arbitrary dependence.
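To see the size of that penalty: c(m) is the m-th harmonic number, which grows like log m.

```r
# BY thresholds are c(m) times stricter than BH
m  <- c(10, 20, 100, 1000)
cm <- sapply(m, function(n) sum(1 / seq_len(n)))
round(cm, 2)   # ~2.93, 3.60, 5.19, 7.49
```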
Caveats: when this is the wrong tool

| If you have… | Use instead |
| --- | --- |
| One primary endpoint and many secondary tests | A fixed-sequence or hierarchical procedure (Maurer–Bretz). A blanket correction over the whole family wastes power. |
| Hierarchical / structured tests (e.g., genomic regions, brain ROIs) | Group-FDR or hierarchical FDR (Yekutieli, Heller) - better power for structured families than flat BH. |
| Highly dependent tests (linkage, time series, repeated measures) | BY (FDR under dependence) or a permutation-based correction. BH assumes weak dependence. |
| Sequential / interim looks at a trial | Group-sequential or alpha-spending designs (O'Brien–Fleming, Pocock). Single-look correction is the wrong model. |
| A need for Storey q-values with bootstrap π̂0 smoothing | This tool ships Bonferroni / Holm / BH / BY. For full smoothing or bootstrap π̂0, use qvalue::qvalue() in R. |
| You're picking the method after seeing the corrected p-values | Stop. Pick the method first; the data go through it. The other order is p-hacking. |
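For the dependent-tests row above, a permutation correction can be sketched in a few lines (Westfall–Young-style single-step max-T; the data and group labels below are simulated for illustration):

```r
set.seed(1)
X   <- matrix(rnorm(20 * 40), nrow = 20)          # 20 features x 40 samples, simulated
grp <- rep(c(0, 1), each = 20)                    # two hypothetical groups of 20
tstat <- function(g) apply(X, 1, function(x) abs(t.test(x[g == 0], x[g == 1])$statistic))
obs   <- tstat(grp)                               # observed |t| per feature
maxT  <- replicate(999, max(tstat(sample(grp))))  # max |t| under label permutations
p_adj <- sapply(obs, function(t0) (1 + sum(maxT >= t0)) / (1 + length(maxT)))
```

Permuting labels preserves whatever correlation structure the features have, so this controls FWER without the independence assumptions baked into the analytic formulas.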
Related reading
- The multiple-testing problem, explained - why “did you control for multiplicity?” ends careers.
- False discovery rate, intuitively - what BH actually does, and why it's the genomics default.
- What a p-value really means - especially relevant when you have many of them.
- Confidence interval calculator - for the inverse problem: how precise is each individual estimate?
Numerical accuracy: adjusted p-values match R's p.adjust() to machine precision, cross-checked with each of method = "bonferroni", "holm", "BH", and "BY" over ≥ 30 input vectors.