Equivalence / Non-Inferiority Calculator
A non-significant p-value is not proof that two groups are equivalent. The TOST procedure (two one-sided tests) flips the burden of proof so you can actually claim 'similar enough'. Plan a sample size or analyze finished data against equivalence bounds, including non-inferiority and asymmetric margins.
What is equivalence testing?
A non-significant p-value is not evidence of equivalence. The standard t-test asks whether two groups differ; failing to reject the null only means you didn't see a difference, not that none exists. To argue two treatments are practically the same, you flip the question and ask: is the difference small enough that anyone would care?
Equivalence testing answers that with two one-sided tests (TOST): you pre-declare a smallest effect size of interest, expressed as a lower and an upper bound around zero, and you run two one-sided tests, one against each bound. If both reject at level α, the data are inconsistent with a difference at or beyond either bound, and you may conclude equivalence. This is the same decision as checking that the 1 − 2α confidence interval for the difference sits entirely inside the equivalence interval; at α = 0.05 that is the 90% interval, not the 95% one.
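The two one-sided tests can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: it works from summary statistics, swaps the t distribution for a large-sample normal approximation (so it is slightly optimistic for small samples), and the function name `tost_z` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def tost_z(mean1, sd1, n1, mean2, sd2, n2, low, high, alpha=0.05):
    """Large-sample TOST on the raw difference mean1 - mean2.

    Assumption: a normal approximation stands in for the t distribution.
    `low` and `high` are the pre-declared equivalence bounds (low < high).
    """
    if not low < high:
        raise ValueError("low must be smaller than high")
    diff = mean1 - mean2
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # unpooled standard error
    z = NormalDist()

    # Test 1: H0 diff <= low, H1 diff > low (reject for large statistics).
    p_lower = 1 - z.cdf((diff - low) / se)
    # Test 2: H0 diff >= high, H1 diff < high (reject for small statistics).
    p_upper = z.cdf((diff - high) / se)
    # Both tests must reject, so the larger p-value decides.
    p_tost = max(p_lower, p_upper)

    # The matching (1 - 2*alpha) confidence interval, e.g. 90% at alpha = 0.05.
    crit = z.inv_cdf(1 - alpha)
    ci = (diff - crit * se, diff + crit * se)

    return {
        "diff": diff,
        "p_lower": p_lower,
        "p_upper": p_upper,
        "p_tost": p_tost,
        "ci": ci,
        "equivalent": p_tost < alpha,
    }


# Made-up summary statistics, purely for illustration.
result = tost_z(10.2, 2.0, 100, 10.0, 2.0, 100, low=-1.0, high=1.0)
print(result["equivalent"], result["ci"])
```

In this sketch, `p_tost < alpha` holds exactly when the returned interval lies strictly inside `(low, high)`, which is why the 90% interval pairs with α = 0.05.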
Non-inferiority is a one-sided cousin: you only care that the new treatment is not much worse than the active control. You set a single lower bound (the non-inferiority margin) and run one one-sided t-test against it. Super-superiority mirrors this on the upper side: it asks whether the new treatment is better by more than a meaningful margin.
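A non-inferiority check is the lower half of TOST on its own. The sketch below assumes that higher scores are better, that the margin is given as a positive number in raw units, and that a normal approximation is acceptable in place of the t distribution; `noninferiority_z` and its parameters are hypothetical names, not part of any package.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_z(mean_new, sd_new, n_new, mean_ctrl, sd_ctrl, n_ctrl,
                     margin, alpha=0.05):
    """One-sided large-sample test of H0: (new - control) <= -margin.

    Assumptions: higher values are better, `margin` is positive, and a
    normal approximation replaces the t distribution.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    diff = mean_new - mean_ctrl
    se = sqrt(sd_new**2 / n_new + sd_ctrl**2 / n_ctrl)
    z = NormalDist()

    # Reject H0 when the difference sits far enough above -margin.
    p_value = 1 - z.cdf((diff + margin) / se)
    # One-sided lower confidence bound at level 1 - alpha.
    lower_bound = diff - z.inv_cdf(1 - alpha) * se

    return {
        "diff": diff,
        "p_value": p_value,
        "lower_bound": lower_bound,
        "non_inferior": p_value < alpha,
    }


# Made-up numbers: the new treatment scores slightly lower than the control.
res = noninferiority_z(9.8, 2.0, 100, 10.0, 2.0, 100, margin=1.0)
print(res["non_inferior"], res["lower_bound"])
```

Non-inferiority is declared when the lower confidence bound stays above `-margin`; the upper side of the interval plays no role.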
The hard part is not the math; it is picking the bound. The bound has to be set before seeing the data and should be defended on clinical, regulatory, or business grounds, not by what your sample happened to deliver. Standardized choices like d = 0.4 are starting points; the right number is whichever effect a reasonable person would consider negligible.
Read more Anatomy of TOST and the equivalence interval
TOSTER::powerTOSTtwo to the integer.
Caveats When the verdict can mislead
- Pitfall: The bound was chosen after seeing the data.
  What to do: This is the cardinal sin of equivalence testing. The bound encodes "what counts as practically the same" and must be defensible without reference to the data. Pre-register it.
- Pitfall: Bound is ±1.0 standard deviations on a clinically meaningful endpoint.
  What to do: That is enormous. A wide bound makes equivalence trivially easy to declare. Tighter is more honest; aim for what regulators or clinicians would accept.
- Pitfall: Sample size was set for power against a difference of zero.
  What to do: If the true difference is non-zero (even if within the bound), you need a bigger sample. The plan-mode formula assumes μ = 0; bump it up if you expect a small but real difference.
- Pitfall: You ran a t-test, got p > 0.05, and concluded "no difference."
  What to do: Absence of evidence is not evidence of absence. Use TOST to make the equivalence claim positive; otherwise the result is inconclusive.
- Pitfall: Non-inferiority + active control with no placebo arm.
  What to do: Constancy assumption: you assume the active control is still as effective as in earlier trials. If that fails, "non-inferior to control" can mean "both useless."
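The sample-size caveat above can be made concrete with a rough power calculation. The sketch below uses the usual normal approximation to TOST power for two equal groups and symmetric bounds, then searches for the smallest per-group n that reaches the target. It is an approximation under those stated assumptions, not the calculator's plan-mode formula, and `tost_power` and `n_per_group` are hypothetical names; exact t-based results can differ slightly.

```python
from math import sqrt
from statistics import NormalDist


def tost_power(n, bound, true_diff=0.0, sd=1.0, alpha=0.05):
    """Approximate TOST power for two groups of size n, bounds at +/- bound.

    Assumption: normal approximation with a common, known SD.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n)
    crit = z.inv_cdf(1 - alpha)
    power = (z.cdf((bound - true_diff) / se - crit)
             + z.cdf((bound + true_diff) / se - crit) - 1)
    return max(0.0, power)  # the approximation can dip below zero


def n_per_group(bound, true_diff=0.0, sd=1.0, alpha=0.05, power=0.80,
                n_max=1_000_000):
    """Smallest per-group n whose approximate power reaches the target."""
    if abs(true_diff) >= bound:
        raise ValueError("true difference must lie strictly inside the bound")
    n = 2
    while tost_power(n, bound, true_diff, sd, alpha) < power:
        n += 1
        if n > n_max:
            raise RuntimeError("no sample size found below n_max")
    return n


# Standardized bound of 0.5 SD: zero true difference versus a 0.2 SD shift.
print(n_per_group(0.5, true_diff=0.0))
print(n_per_group(0.5, true_diff=0.2))
```

The second call returns a noticeably larger n than the first, because a true difference that sits closer to one bound leaves less room for sampling error on that side.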
Further reading Where to learn more
- Schuirmann (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. The paper that named TOST.
- Lakens (2017). Equivalence tests: a practical primer for t-tests, correlations, and meta-analyses. The accessible companion to the TOSTER R package.
- FDA: Non-inferiority clinical trials to establish effectiveness (Guidance for Industry). The regulatory frame for non-inferiority margins.
- TOSTER R package documentation by Aaron Caldwell & Daniel Lakens. The functions this tool mirrors.
- Companion: t-test calculator for the standard difference test, when equivalence is not what you need.