r-statistics.co

Equivalence / Non-Inferiority Calculator

A non-significant p-value is not proof that two groups are equivalent. The TOST procedure (two one-sided tests) flips the burden of proof so you can actually claim 'similar enough'. Plan a sample size or analyze finished data against equivalence bounds, including non-inferiority and asymmetric margins.

TOST · two one-sided tests · continuous & proportions · Plan or Analyze
What is equivalence testing?

A non-significant p-value is not evidence of equivalence. The standard t-test asks whether two groups differ; failing to reject the null only means you didn't see a difference, not that none exists. To argue two treatments are practically the same, you flip the question and ask: is the difference small enough that anyone would care?

Equivalence testing answers that with two one-sided tests (TOST): you pre-declare a smallest effect size of interest, expressed as a lower and an upper bound around zero, and run one test against each bound. If both reject at level α, the data are inconsistent with a difference as large as either bound, and you may conclude equivalence. Equivalently, at α = 0.05 the 90% confidence interval of the difference (not the 95%) sits entirely inside the equivalence interval.

Non-inferiority is a one-sided cousin: you only care that the new treatment is not much worse than the active control. You set a single lower bound (the non-inferiority margin) and run one one-sided test against it. Super-superiority mirrors this on the upper side: you argue that the new treatment is better by more than a meaningful amount.
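As a rough illustration of the non-inferiority case, the sketch below runs a single one-sided test of the difference against a lower margin. It assumes a large-sample normal approximation (a z-test rather than a t-test), defines the difference as new minus control with higher values being better, and uses the hypothetical names non_inferiority_z, diff, se and margin, none of which come from the tool itself.

```python
from statistics import NormalDist

def non_inferiority_z(diff, se, margin, alpha=0.05):
    """One-sided non-inferiority test (normal approximation, an assumption).

    diff   : observed difference, new minus control (higher is better)
    se     : standard error of the difference
    margin : positive non-inferiority margin; the lower bound is -margin
    """
    # H0: true difference <= -margin, H1: true difference > -margin
    z_stat = (diff + margin) / se
    p_value = 1 - NormalDist().cdf(z_stat)
    return p_value, p_value < alpha

# Illustrative numbers only, not real data
p_value, non_inferior = non_inferiority_z(diff=-0.2, se=0.4, margin=1.0)
```

With small samples the t-distribution would be the more appropriate reference, so treat the p-value here as approximate.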

The hard part is not the math; it is picking the bound. The bound has to be set before seeing the data and should be defended on clinical, regulatory, or business grounds, not by what your sample happened to deliver. Standardized choices like d = 0.4 are starting points; the right number is whichever effect a reasonable person would consider negligible.

Equivalence chart (interactive): CI of the difference vs. the equivalence bounds.

Read more: Anatomy of TOST and the equivalence interval
The shape of the question
Equivalence flips the standard test on its head. The null is "the effect is large" and the alternative is "the effect is small." If both one-sided tests reject, you have evidence the effect lies between the two pre-specified bounds.
t₁ = (diff − Δ_L) / SE
t₂ = (Δ_U − diff) / SE
both p < α ⇒ equivalent
Two one-sided tests (TOST). The lower test asks "is the difference reliably above the lower bound?" and the upper test asks "is the difference reliably below the upper bound?" Both must reject at level α to conclude equivalence. The procedure is operationally identical to checking whether the (1−2α) confidence interval of the difference sits entirely inside the equivalence interval, which is why a 90% CI, not a 95% CI, is the standard for α = 0.05.
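The link between the two p-values and the (1−2α) interval can be sketched in a few lines. This is a minimal illustration that swaps the t-distribution for a large-sample normal approximation so it runs on the Python standard library alone; the function name tost_z and its arguments are hypothetical, and the inputs are made-up numbers.

```python
from statistics import NormalDist

def tost_z(diff, se, low, high, alpha=0.05):
    """TOST on a difference, using z in place of t (an assumption)."""
    z = NormalDist()
    t1 = (diff - low) / se    # lower test: is diff reliably above low?
    t2 = (high - diff) / se   # upper test: is diff reliably below high?
    p_lower = 1 - z.cdf(t1)
    p_upper = 1 - z.cdf(t2)
    # (1 - 2*alpha) confidence interval, i.e. 90% for alpha = 0.05
    crit = z.inv_cdf(1 - alpha)
    ci = (diff - crit * se, diff + crit * se)
    p_tost = max(p_lower, p_upper)  # both tests must reject
    return {"p_lower": p_lower, "p_upper": p_upper,
            "p_tost": p_tost, "ci": ci, "equivalent": p_tost < alpha}

# Illustrative numbers only, not real data
result = tost_z(diff=0.3, se=0.5, low=-1.5, high=1.5)
ci_inside = -1.5 < result["ci"][0] and result["ci"][1] < 1.5
```

Under these assumptions, ci_inside and result["equivalent"] always agree, because each one-sided rejection is the same inequality as one end of the interval clearing its bound.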
SE = s_p · √(1/n₁ + 1/n₂)
df = n₁ + n₂ − 2
s_p = √( ((n₁−1)s₁² + (n₂−1)s₂²) / df )
Two-sample SE (continuous). This is the pooled-variance (equal-variance) form, which TOSTER applies when var.equal = TRUE. Welch is fine too; the SE and df shift slightly but TOST's logic is unchanged. For the proportion variant, SE is √(p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂) and the test becomes a z-test on the difference.
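Both standard errors are short enough to compute by hand. The sketch below transcribes the pooled-variance and proportion formulas directly; the function names pooled_se and proportion_se are hypothetical, and the inputs are illustrative rather than taken from any dataset.

```python
import math

def pooled_se(s1, n1, s2, n2):
    """Pooled-variance SE of a difference in means, plus its df."""
    df = n1 + n2 - 2
    s_p = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    se = s_p * math.sqrt(1 / n1 + 1 / n2)
    return se, df

def proportion_se(p1, n1, p2, n2):
    """Unpooled SE of a difference in two proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Illustrative numbers only
se, df = pooled_se(s1=2.0, n1=50, s2=2.0, n2=50)
se_prop = proportion_se(p1=0.5, n1=100, p2=0.5, n2=100)
```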
n ≈ 2σ² (z_{1−α} + z_{1−β/2})² / (Δ − |μ|)²
Sample-size formula (TOST). Chow et al. (2017). Here n is the sample size per group. For symmetric bounds and an assumed true difference μ = 0, this collapses to a clean expression in Δ/σ. For non-inferiority, replace z_{1−β/2} with z_{1−β}: with only one bound to test, the type II error β is no longer split across two tests. The tool refines this with a noncentral-t iteration so values match TOSTER::powerTOSTtwo to the integer.
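The closed-form approximation is easy to evaluate directly. The sketch below implements only the z-based formula, not the noncentral-t refinement the tool applies, so its answers may differ slightly from the tool's; the function name n_per_group and its parameters are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(sigma, delta, mu=0.0, alpha=0.05, power=0.80,
                non_inferiority=False):
    """Approximate per-group n from the z-based formula (no noncentral-t step)."""
    if delta <= abs(mu):
        raise ValueError("the bound delta must exceed the assumed difference |mu|")
    z = NormalDist()
    beta = 1 - power
    z_alpha = z.inv_cdf(1 - alpha)
    # TOST uses z_{1-beta/2}; non-inferiority uses z_{1-beta}
    z_beta = z.inv_cdf(1 - beta) if non_inferiority else z.inv_cdf(1 - beta / 2)
    n = 2 * sigma**2 * (z_alpha + z_beta)**2 / (delta - abs(mu))**2
    return math.ceil(n)

# Illustrative inputs: bound of half a standard deviation
n_equiv = n_per_group(sigma=1.0, delta=0.5)
n_noninf = n_per_group(sigma=1.0, delta=0.5, non_inferiority=True)
n_shifted = n_per_group(sigma=1.0, delta=0.5, mu=0.1)
```

Because the denominator is (Δ − |μ|)², even a small assumed difference inside the bound raises the required n noticeably, which n_shifted illustrates relative to n_equiv.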
verdict = equivalent if 90% CI ⊂ [Δ_L, Δ_U]
          not equivalent if CI lies wholly outside
          inconclusive otherwise
Reading the chart. The horizontal bar is the 90% CI of the observed difference; the shaded band is the equivalence interval. Bar fully inside band = equivalence demonstrated. Bar fully outside on either side = a meaningful difference. Bar straddling a bound = inconclusive: the data do not rule out a difference larger than the bound, but they also do not establish one.
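The three-way reading of the chart amounts to a small classification rule. The sketch below is one way to write it, assuming the interval passed in is already the (1−2α) CI of the difference; the function name verdict and its argument names are hypothetical.

```python
def verdict(ci_low, ci_high, bound_low, bound_high):
    """Classify a CI of the difference against the equivalence interval."""
    # Bar fully inside the band: equivalence demonstrated
    if bound_low < ci_low and ci_high < bound_high:
        return "equivalent"
    # Bar fully outside the band on either side: a meaningful difference
    if ci_high < bound_low or ci_low > bound_high:
        return "not equivalent"
    # Bar straddles a bound
    return "inconclusive"

# Illustrative intervals against bounds of -0.5 and 0.5
inside = verdict(-0.2, 0.3, -0.5, 0.5)
outside = verdict(0.6, 1.2, -0.5, 0.5)
straddling = verdict(0.1, 0.8, -0.5, 0.5)
```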
Caveats: When the verdict can mislead
Pitfall: The bound was chosen after seeing the data.
What to do: This is the cardinal sin of equivalence testing. The bound encodes "what counts as practically the same" and must be defensible without reference to the data. Pre-register it.

Pitfall: Bound is ±1.0 standard deviations on a clinically meaningful endpoint.
What to do: That is enormous. A wide bound makes equivalence trivially easy to declare. Tighter is more honest; aim for what regulators or clinicians would accept.

Pitfall: Sample size was set for power against a difference of zero.
What to do: If the true difference is non-zero (even if within the bound), you need a bigger sample. The plan-mode formula assumes μ = 0; bump it up if you expect a small but real difference.

Pitfall: You ran a t-test, got p > 0.05, and concluded "no difference."
What to do: Absence of evidence is not evidence of absence. Use TOST to make the equivalence claim positive; otherwise the result is inconclusive.

Pitfall: Non-inferiority + active control with no placebo arm.
What to do: Constancy assumption: you assume the active control is still as effective as in earlier trials. If that fails, "non-inferior to control" can mean "both useless."
Further reading: Where to learn more