What is the difference between equivalence and non-inferiority testing?

Equivalence tests show that two treatments are similar enough: the difference lies within +/- delta. Non-inferiority tests show that the new treatment is not worse than the comparator by more than delta, in one direction only. Equivalence uses TOST (two one-sided tests); non-inferiority runs only one of those one-sided tests, the one against the "worse" bound. Choose non-inferiority when the new treatment's advantage is cost or safety rather than efficacy.

How do I choose the equivalence margin?

The margin should be the largest difference that would still be considered clinically or practically irrelevant. For drug trials, regulators often specify it (commonly 10-20% of the active control effect). Do not tune it to make a borderline result come out positive; pre-register the margin before unblinding.

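Why does TOST use alpha = 0.05, not 0.025?

TOST runs two one-sided tests, each at alpha = 0.05, yet the overall Type I error is at most 0.05, not 0.10. Equivalence is declared only when both tests reject, and the true difference can violate at most one bound at a time, so the chance of wrongly declaring equivalence is capped by the single test on the violated side. The decision is the same as checking whether the 90% confidence interval falls entirely within +/- delta. The calculator shows both views.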
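The error-rate claim can be checked with a small simulation. The sketch below is illustrative only and is not the calculator's code: it assumes a known standard error (so each one-sided test is a z-test), and the function names and the numbers in the example are made up for the demonstration. The true difference is placed exactly on the upper bound, which is the worst case for falsely declaring equivalence.

```python
import random
from statistics import NormalDist

def tost_rejects(diff_hat, se, delta, alpha=0.05):
    """Return True when both one-sided z-tests reject at level alpha."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    lower_rejects = (diff_hat + delta) / se > z_crit  # H0: diff <= -delta
    upper_rejects = (delta - diff_hat) / se > z_crit  # H0: diff >= +delta
    return lower_rejects and upper_rejects

def false_equivalence_rate(true_diff, se, delta, n_sims=20000, seed=1):
    """Share of simulated studies that declare equivalence."""
    rng = random.Random(seed)
    hits = sum(
        tost_rejects(rng.gauss(true_diff, se), se, delta)
        for _ in range(n_sims)
    )
    return hits / n_sims

# The true difference sits on the upper bound, so any "equivalent"
# verdict here is a Type I error.
rate = false_equivalence_rate(true_diff=1.0, se=0.2, delta=1.0)
print(f"false-equivalence rate at the boundary: {rate:.3f}")
```

The printed rate should land close to 0.05 rather than 0.10, because the lower test almost always rejects in this setup and only the upper test controls the error.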

Equivalence / Non-Inferiority Calculator

A non-significant p-value is not proof that two groups are equivalent. The TOST procedure (two one-sided tests) flips the burden of proof so you can actually claim 'similar enough'. Plan a sample size or analyze finished data against equivalence bounds, including non-inferiority and asymmetric margins.

Inference

We ran two one-sided tests (TOST) to check whether your difference falls within the equivalence margin you set.

Anatomy of TOST and the equivalence interval

The shape of the question

Equivalence flips the standard test on its head. The null is "the effect is large" and the alternative is "the effect is small." If both one-sided tests reject, you have evidence the effect lies between the two pre-specified bounds.

t₁ = (diff − Δ_L) / SE
t₂ = (Δ_U − diff) / SE
both p < α ⇒ equivalent

Two one-sided tests (TOST). The lower test asks "is the difference reliably above the lower bound?" and the upper test asks "is the difference reliably below the upper bound?" Both must reject at level α to conclude equivalence. The procedure is operationally identical to checking whether the (1−2α) confidence interval of the difference sits entirely inside the equivalence interval, which is why a 90% CI, not a 95% CI, is the standard for α = 0.05.
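The two views of the decision can be put side by side in a few lines. This is a minimal sketch under a large-sample assumption: it uses the normal distribution for both the p-values and the interval, whereas the continuous case described on this page uses a t-distribution. The function name `tost_z` and the example inputs are hypothetical.

```python
from statistics import NormalDist

def tost_z(diff, se, low, high, alpha=0.05):
    """TOST with a normal approximation, plus the matching (1 - 2*alpha) CI."""
    norm = NormalDist()
    t1 = (diff - low) / se    # lower test: is diff above the lower bound?
    t2 = (high - diff) / se   # upper test: is diff below the upper bound?
    p_lower = 1 - norm.cdf(t1)
    p_upper = 1 - norm.cdf(t2)
    z_crit = norm.inv_cdf(1 - alpha)
    ci = (diff - z_crit * se, diff + z_crit * se)  # 90% CI when alpha = 0.05
    return {
        "p_lower": p_lower,
        "p_upper": p_upper,
        "ci": ci,
        "equivalent_by_tests": max(p_lower, p_upper) < alpha,
        "equivalent_by_ci": low < ci[0] and ci[1] < high,
    }

res = tost_z(diff=0.3, se=0.25, low=-1.0, high=1.0)
print(res)
```

Whatever inputs are used, `equivalent_by_tests` and `equivalent_by_ci` agree, because "p < α" for each one-sided test is the same inequality as the corresponding CI limit clearing its bound.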

SE = s_p · √(1/n₁ + 1/n₂)
df = n₁ + n₂ − 2
s_p = √( ((n₁−1)s₁² + (n₂−1)s₂²) / df )

Two-sample SE (continuous). This is the pooled-variance (equal-variance) form; in TOSTER it corresponds to var.equal = TRUE, whereas the package's two-sample functions use Welch unless told otherwise. Welch is fine too; the t-distribution and df shift slightly but TOST's logic is unchanged. For the proportion variant, SE is √(p₁(1−p₁)/n₁ + p₂(1−p₂)/n₂) and the test becomes a z-test on the difference.
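The standard-error formulas above translate directly into code. The sketch below only does the arithmetic; the function names and the summary statistics in the example are assumptions for illustration. Turning the t statistics into p-values needs a t-distribution CDF with the stated df, which Python's standard library does not provide, so that step is left to a statistics package.

```python
import math

def pooled_se(s1, n1, s2, n2):
    """Pooled-variance standard error of the mean difference, and its df."""
    df = n1 + n2 - 2
    s_p = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    return s_p * math.sqrt(1 / n1 + 1 / n2), df

def proportion_se(p1, n1, p2, n2):
    """Unpooled standard error of a difference in proportions (z-test)."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

def tost_statistics(diff, se, low, high):
    """The two one-sided statistics t1 and t2 from the formulas above."""
    return (diff - low) / se, (high - diff) / se

# Hypothetical summary data: equal SDs of 2.0 and 50 observations per group.
se, df = pooled_se(s1=2.0, n1=50, s2=2.0, n2=50)
t1, t2 = tost_statistics(diff=0.2, se=se, low=-1.0, high=1.0)
print(f"SE = {se:.3f}, df = {df}, t1 = {t1:.2f}, t2 = {t2:.2f}")
```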

n ≈ 2σ² (z_{1−α} + z_{1−β/2})² / (Δ − |μ|)²

Sample-size formula (TOST). From Chow et al. (2017); n is the sample size per group. For symmetric bounds ±Δ and an assumed true difference μ = 0, this collapses to n ≈ 2 (z_{1−α} + z_{1−β/2})² / (Δ/σ)². For non-inferiority, replace z_{1−β/2} with z_{1−β}: with only one test, the Type II error β is not split across two bounds. The tool refines this with a noncentral-t iteration so values match TOSTER::powerTOSTtwo to the integer.
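As a rough illustration, the closed-form approximation can be evaluated with the standard library alone. This sketch implements only the normal-approximation formula as written above, not the noncentral-t refinement the tool applies, so its answers may differ slightly from the calculator's. The function name, the defaults, and the choice to reuse the same Δ − |μ| denominator for non-inferiority are assumptions.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80, mu=0.0,
                non_inferiority=False):
    """Approximate per-group n from the normal-approximation formula."""
    if abs(mu) >= delta:
        raise ValueError("assumed difference must lie strictly inside the margin")
    beta = 1 - power
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)
    # Equivalence splits beta across two tests; non-inferiority does not.
    z_beta = z(1 - beta) if non_inferiority else z(1 - beta / 2)
    n = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / (delta - abs(mu)) ** 2
    return math.ceil(n)

# Margin of half a standard deviation, alpha = 0.05, 80% power.
print(n_per_group(delta=0.5, sigma=1.0))                        # equivalence, mu = 0
print(n_per_group(delta=0.5, sigma=1.0, mu=0.1))                # small true difference
print(n_per_group(delta=0.5, sigma=1.0, non_inferiority=True))  # one-sided
```

Comparing the first two calls shows how quickly the required sample grows once the assumed true difference moves away from zero, since the denominator shrinks from Δ² to (Δ − |μ|)².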

verdict = equivalent if 90% CI ⊂ [Δ_L, Δ_U]
          not equivalent if CI lies wholly outside
          inconclusive otherwise

Reading the chart. The horizontal bar is the 90% CI of the observed difference; the shaded band is the equivalence interval. Bar fully inside band = equivalence demonstrated. Bar fully outside on either side = a meaningful difference. Bar straddling a bound = inconclusive: the data do not rule out a difference larger than the bound, but they also do not establish one.
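The three-way reading of the chart amounts to a simple interval comparison. The following is a minimal sketch of that rule, assuming the 90% CI limits have already been computed elsewhere; the function name and the example intervals are hypothetical.

```python
def verdict(ci_low, ci_high, bound_low, bound_high):
    """Classify a CI of the difference against the equivalence interval."""
    if bound_low < ci_low and ci_high < bound_high:
        return "equivalent"        # bar fully inside the band
    if ci_high < bound_low or ci_low > bound_high:
        return "not equivalent"    # bar fully outside on one side
    return "inconclusive"          # bar straddles a bound

print(verdict(-0.1, 0.7, -1.0, 1.0))  # bar inside the band
print(verdict(1.2, 1.9, -1.0, 1.0))   # bar wholly above the upper bound
print(verdict(0.4, 1.3, -1.0, 1.0))   # bar crosses the upper bound
```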

Caveats: when the verdict can mislead

Pitfall: What to do
The bound was chosen after seeing the data: This is the cardinal sin of equivalence testing. The bound encodes "what counts as practically the same" and must be defensible without reference to the data. Pre-register it.
Bound is ±1.0 standard deviations on a clinically meaningful endpoint: That is enormous. A wide bound makes equivalence trivially easy to declare. Tighter is more honest; aim for what regulators or clinicians would accept.
Sample size was set for power against a difference of zero: If the true difference is non-zero (even if within the bound), you need a bigger sample. The plan-mode formula assumes μ = 0; increase n if you expect a small but real difference.
You ran a t-test, got p > 0.05, and concluded "no difference": Absence of evidence is not evidence of absence. Use TOST to make the equivalence claim positive; otherwise the result is inconclusive.
Non-inferiority + active control with no placebo arm: The conclusion rests on the constancy assumption, that is, you assume the active control is still as effective as in earlier trials. If that fails, "non-inferior to control" can mean "both useless."


How we got there