Equivalence Testing in R: TOST for Non-Inferiority & Bioequivalence
Equivalence testing flips the usual question: instead of asking "is there a difference?", it asks "are these two groups close enough that the difference doesn't matter?" The TOST procedure (Two One-Sided Tests) answers that with a pair of t-tests, supporting non-inferiority trials, generic-vs-brand drug comparisons, and any claim that two methods give practically the same result.
How do you prove two means are practically equivalent in R?
A standard t-test can only reject sameness, never confirm it. TOST flips the logic. You pre-specify a range of differences too small to care about, then run two one-sided t-tests against the edges of that range. When both one-sided tests reject their null, the true difference must sit inside the range, and the groups are declared statistically equivalent. Here is that idea running on two blood pressure monitors that should read the same patient within 2 mmHg.
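The original data block wasn't preserved, so the sketch below simulates it. The seed, the 30 readings per monitor, and the 2-mmHg spread are illustrative choices; scale() pins each sample's mean and SD exactly, so the 0.05-mmHg observed difference holds by construction.

```r
set.seed(1)
n <- 30
# scale() forces each sample to have exactly the stated mean and sd,
# so the observed difference is 0.05 mmHg by construction
monitor_a <- 120.00 + as.numeric(scale(rnorm(n))) * 2
monitor_b <- 119.95 + as.numeric(scale(rnorm(n))) * 2

bound <- 2  # equivalence margin in mmHg

# Test 1 -- H0: mean(a) - mean(b) <= -2; rejecting shows the difference > -2
lower <- t.test(monitor_a, monitor_b, mu = -bound, alternative = "greater")
# Test 2 -- H0: mean(a) - mean(b) >= +2; rejecting shows the difference < +2
upper <- t.test(monitor_a, monitor_b, mu = bound, alternative = "less")

c(p_lower = lower$p.value, p_upper = upper$p.value)
```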
Both one-sided tests reject their null, with p-values around 0.0001 and 0.0002. Since the larger of the two is well below 0.05, the true mean difference sits inside the ±2 mmHg window and the monitors are declared equivalent. A classical two-sided t.test(monitor_a, monitor_b) on the same data returns p = 0.92, which tells you only that you failed to detect a difference, not that the monitors agree.
Try it: Tighten the bounds to ±0.1 mmHg (a much stricter claim). What do the two p-values look like, and why does the conclusion flip?
Click to reveal solution
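One way to run the stricter test, reusing the same reconstructed monitor data:

```r
# Same simulated monitor data as in the example above
set.seed(1)
monitor_a <- 120.00 + as.numeric(scale(rnorm(30))) * 2
monitor_b <- 119.95 + as.numeric(scale(rnorm(30))) * 2

bound <- 0.1  # a much stricter equivalence claim
p_lower <- t.test(monitor_a, monitor_b, mu = -bound, alternative = "greater")$p.value
p_upper <- t.test(monitor_a, monitor_b, mu =  bound, alternative = "less")$p.value
c(p_lower, p_upper)  # both large (~0.39 and ~0.46): equivalence is NOT shown
```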
Explanation: Observed mean difference is only 0.05, which is well inside ±0.1. But the sample isn't precise enough to prove the difference is smaller than 0.1 mmHg. The test is conservative: you need evidence, not just a favourable point estimate.
Why can't a standard t-test prove equivalence?
A t-test asks one question: "Is the observed difference bigger than noise?" A non-significant p-value means you failed to reject the null of no difference. It never means the null is true. With a small sample, you can miss very real differences and still get a large p-value, so treating "p > 0.05" as proof of equivalence is a recipe for false claims.
Watch this happen: two groups whose means actually differ by over a full unit, tested with just five observations per group.
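A reconstruction of that scenario (the sample size of 5 comes from the text; the means and SD are pinned with scale() so the result reproduces):

```r
set.seed(7)
# Tiny samples from populations whose means truly differ by 1.3 units;
# scale() pins the sample statistics so the p-value is reproducible
group_1 <- 10.0 + as.numeric(scale(rnorm(5))) * 1.05
group_2 <- 11.3 + as.numeric(scale(rnorm(5))) * 1.05

t.test(group_1, group_2)  # p ~ 0.087: not significant despite a real difference
```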
The t-test returns p = 0.087, which fails to reject the null of equal means. A reader who confuses "fails to reject" with "no difference" will happily claim the groups are the same. But the means really do differ by 1.3 units; the small sample just couldn't prove it. The high p-value reflects weak evidence, not equivalence.
Try it: Simulate the same underlying scenario with N = 500 per group, using rnorm(). Does the t-test now detect the difference?
Click to reveal solution
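One possible solution, keeping the same underlying population parameters but with 500 observations per group:

```r
set.seed(7)
# Same true means (10 vs 11.3) and sd as before, but N = 500 per group
big_1 <- rnorm(500, mean = 10.0, sd = 1.05)
big_2 <- rnorm(500, mean = 11.3, sd = 1.05)

t.test(big_1, big_2)$p.value  # essentially zero: the difference is now obvious
```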
Explanation: Same true difference, much larger N, and the p-value collapses to essentially zero. The "non-significance" at N=5 was a power problem, not a truth about the population.
How does TOST combine two one-sided tests?
TOST formalises the flip. You pick a lower bound $L$ and an upper bound $U$ around zero that represent the largest difference you'd call "practically nothing." Then you run two one-sided tests:
$$H_{0,\text{lower}}: \mu_A - \mu_B \le L \qquad H_{0,\text{upper}}: \mu_A - \mu_B \ge U$$
Reject both at significance level $\alpha$ and you've shown the true difference lies strictly inside $(L, U)$. In a two-sample setting, the two test statistics are:
$$t_{\text{lower}} = \frac{\bar{x}_A - \bar{x}_B - L}{SE} \qquad t_{\text{upper}} = \frac{\bar{x}_A - \bar{x}_B - U}{SE}$$
Where:
- $\bar{x}_A, \bar{x}_B$ = sample means
- $SE$ = standard error of the mean difference
- $L, U$ = lower and upper equivalence bounds
The tost() helper below wraps base R's t.test() into a tidy summary.
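A plausible implementation of that helper (the return-field names are this sketch's own convention, not a package API), demonstrated on the reconstructed monitor data:

```r
# TOST helper wrapping two one-sided t.test() calls
tost <- function(x, y, low, high, alpha = 0.05) {
  lower <- t.test(x, y, mu = low,  alternative = "greater")  # H0: diff <= low
  upper <- t.test(x, y, mu = high, alternative = "less")     # H0: diff >= high
  p_tost <- max(lower$p.value, upper$p.value)  # the TOST p-value you'd publish
  list(diff       = mean(x) - mean(y),
       p_lower    = lower$p.value,
       p_upper    = upper$p.value,
       p_tost     = p_tost,
       equivalent = p_tost < alpha)
}

# Reconstructed monitor data (mean difference pinned at 0.05 mmHg)
set.seed(1)
monitor_a <- 120.00 + as.numeric(scale(rnorm(30))) * 2
monitor_b <- 119.95 + as.numeric(scale(rnorm(30))) * 2
tost(monitor_a, monitor_b, low = -2, high = 2)
```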
One call, one verdict. The helper reports the observed difference (0.05), both one-sided p-values, the larger of the two (this is the TOST p-value you'd publish), and a boolean decision. An equivalent and very common way to read the result is the 90% confidence interval: if the (1 − 2α) CI for the mean difference lies entirely inside $[L, U]$, equivalence holds. This is the approach FDA bioequivalence reviews prefer because the CI is easier to interpret than two p-values.
Try it: Extend tost() to also return the 90% CI on the mean difference, so a reader can check whether it sits inside the bounds.
Click to reveal solution
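One way to extend the helper (field names are again this sketch's convention): add a third t.test() call at confidence level 1 − 2α, which is 90% at the default α = 0.05.

```r
tost_ci <- function(x, y, low, high, alpha = 0.05) {
  lower <- t.test(x, y, mu = low,  alternative = "greater")
  upper <- t.test(x, y, mu = high, alternative = "less")
  # A (1 - 2*alpha) interval: 90% when alpha = 0.05
  ci <- as.numeric(t.test(x, y, conf.level = 1 - 2 * alpha)$conf.int)
  p_tost <- max(lower$p.value, upper$p.value)
  list(diff       = mean(x) - mean(y),
       ci         = ci,
       p_tost     = p_tost,
       equivalent = p_tost < alpha,
       ci_inside  = ci[1] > low && ci[2] < high)  # CI restatement of TOST
}

# Reconstructed monitor data as before
set.seed(1)
monitor_a <- 120.00 + as.numeric(scale(rnorm(30))) * 2
monitor_b <- 119.95 + as.numeric(scale(rnorm(30))) * 2
tost_ci(monitor_a, monitor_b, low = -2, high = 2)
```

The `equivalent` and `ci_inside` flags always agree: they are two phrasings of the same decision rule.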
Explanation: The 90% CI falls strictly inside [-2, 2], which is the confidence-interval restatement of the same TOST result. Two equivalent ways of writing one conclusion.
How do you run a non-inferiority test in R?
Non-inferiority testing is TOST with only one side. You don't care if the new treatment is better, only that it isn't meaningfully worse. The margin is how much worse you'd tolerate. That turns into a single one-sided t-test: reject the hypothesis that the new treatment is at least margin worse, and you've shown non-inferiority.
Suppose drug B is a cheaper generic version of drug A, and you're willing to accept B being up to 0.5 seconds slower. If the data rule out "B is 0.5 seconds or more slower," you can market B with a straight face.
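A simulated version of that comparison (onset times in seconds, smaller is better; the 40 subjects per arm and the 0.3-second SD are illustrative, with means pinned by scale() so B averages exactly 0.12 s slower):

```r
set.seed(3)
n <- 40
# Onset-of-relief times in seconds; scale() pins the sample means exactly
drug_a <- 2.12 + as.numeric(scale(rnorm(n))) * 0.3  # brand
drug_b <- 2.24 + as.numeric(scale(rnorm(n))) * 0.3  # generic: 0.12 s slower

margin <- 0.5  # worst slowdown we would tolerate, in seconds
# H0: mean(B) - mean(A) >= margin, i.e. B is at least 0.5 s slower
t.test(drug_b, drug_a, mu = margin, alternative = "less")
```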
The p-value is essentially zero, so you reject "B is 0.5+ seconds slower than A." Drug B is non-inferior to drug A at the 0.5-second margin. Note what you have not shown: that B is equivalent (you'd need a second test against a lower bound), or that B is superior (you'd need a test against mu = 0). Each of those claims requires its own hypothesis.
Try it: Using the same drug_a and drug_b, test whether drug B is superior (strictly faster than A, not just non-inferior). Expect to fail to reject, because B is actually slower.
Click to reveal solution
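One possible solution, regenerating the same simulated onset data:

```r
set.seed(3)
n <- 40
drug_a <- 2.12 + as.numeric(scale(rnorm(n))) * 0.3
drug_b <- 2.24 + as.numeric(scale(rnorm(n))) * 0.3

# Superiority: H0 is mean(B) - mean(A) >= 0 (B is no faster than A)
t.test(drug_b, drug_a, mu = 0, alternative = "less")$p.value  # large: claim fails
```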
Explanation: Superiority is the same structure with mu = 0. B's mean is 2.24 vs A's 2.12, so the one-sided test in the "B < A" direction fails completely. You can't claim B is faster; you can only claim B is close enough.
How do you test bioequivalence with the FDA 80-125% rule?
Bioequivalence is TOST with three twists: the data are paired (each subject takes both formulations), the analysis is on the log scale, and the FDA bounds are [80%, 125%]. The log scale matters because pharmacokinetic data are multiplicative, and 80% and 125% are reciprocal ratios ($1/0.8 = 1.25$) that become symmetric bounds of $\pm\ln(1.25) \approx \pm 0.223$ once you log-transform. This is what makes the asymmetric-looking rule a clean two-sided TOST.
The target quantity is the geometric mean ratio (GMR) of the test drug to the reference drug. If the 90% CI on the GMR lies entirely inside [80%, 125%], the drugs are bioequivalent.
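A simulated crossover study illustrating the mechanics (24 subjects and the variability parameters are assumptions; the within-subject log ratios are pinned so the GMR lands near 101%, matching the narrative below, though CI endpoints depend on the assumed spread):

```r
set.seed(11)
n <- 24
# Each of 24 subjects contributes one AUC per formulation (paired design)
auc_ref  <- exp(rnorm(n, mean = log(100), sd = 0.20))
# Within-subject log ratios pinned so the geometric mean ratio is ~101%
auc_test <- auc_ref * exp(log(1.01) + as.numeric(scale(rnorm(n))) * 0.045)

be_bound  <- log(1.25)                     # [80%, 125%] -> +-0.223 on the log scale
log_ratio <- log(auc_test) - log(auc_ref)  # paired analysis on the log scale

lower <- t.test(log_ratio, mu = -be_bound, alternative = "greater")
upper <- t.test(log_ratio, mu =  be_bound, alternative = "less")
ci90  <- t.test(log_ratio, conf.level = 0.90)$conf.int

max(lower$p.value, upper$p.value)                    # TOST p-value: tiny
round(exp(c(gmr = mean(log_ratio), ci90)) * 100, 1)  # GMR and 90% CI in percent
```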
Both one-sided tests reject their bounds with tiny p-values, and the 90% CI on the geometric mean ratio is roughly [99.5%, 102.7%]. That interval sits comfortably inside [80%, 125%], so the test drug is declared bioequivalent to the reference. The point estimate (exp of the mean log ratio) is about 101%, which means the test drug produces ~1% more AUC on average, well inside the acceptable window.
Try it: For narrow-therapeutic-index drugs, regulators sometimes tighten the bounds to [90%, 111.11%] (a ±10% band). Re-run TOST with these tighter bounds on the same auc_test and auc_ref.
Click to reveal solution
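One way to re-run the test with the narrow-therapeutic-index bounds, regenerating the same simulated AUC data:

```r
# Same simulated AUC data as the worked example
set.seed(11)
n <- 24
auc_ref   <- exp(rnorm(n, mean = log(100), sd = 0.20))
auc_test  <- auc_ref * exp(log(1.01) + as.numeric(scale(rnorm(n))) * 0.045)
log_ratio <- log(auc_test) - log(auc_ref)

nti_bound <- log(1 / 0.9)  # [90%, 111.11%] -> +-0.105 on the log scale
p_lo <- t.test(log_ratio, mu = -nti_bound, alternative = "greater")$p.value
p_hi <- t.test(log_ratio, mu =  nti_bound, alternative = "less")$p.value
max(p_lo, p_hi)  # still far below 0.05: passes even the tighter NTI window
```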
Explanation: Both p-values rise but remain well below 0.05. Because the 90% CI sits near 100%, even the tighter NTI window is comfortably wider than the uncertainty in the data. A test drug with more variable PK or a larger geometric mean shift could easily fail NTI while passing the standard 80-125% rule.
How do you choose equivalence bounds that make sense?
The bound is the most consequential choice in a TOST. A bound of ±5 units will pass almost anything; a bound of ±0.01 will fail almost everything. Three legitimate strategies exist:
- Clinical or practical minimum. The smallest effect a domain expert would act on. A 2 mmHg difference between BP monitors is clinically noise; a 10 mmHg difference is not.
- Empirical reference. Set the bound at the effect size a prior study had 33% power to detect; anything smaller is an effect "the original study couldn't have seen anyway" (Simonsohn's small-telescopes approach).
- Regulatory. FDA bioequivalence uses [80%, 125%] on the log scale; EMA guidelines follow similarly; clinical trial protocols lock the margin in advance.
The one rule: pick the bound before you look at the data.
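To see how much the bound matters, the sweep below runs TOST at four widths on simulated data (the sample size and SD are illustrative; scale() pins the observed difference at exactly -0.23 to match the narrative):

```r
set.seed(5)
n <- 50
# Reconstructed data with the observed difference pinned at exactly -0.23
sample_d_x <- 5.00 + as.numeric(scale(rnorm(n))) * 0.42
sample_d_y <- 5.23 + as.numeric(scale(rnorm(n))) * 0.42

tost_p <- function(x, y, bound) {
  p_lo <- t.test(x, y, mu = -bound, alternative = "greater")$p.value
  p_hi <- t.test(x, y, mu =  bound, alternative = "less")$p.value
  max(p_lo, p_hi)
}

# Same data, four different bounds: the verdict depends entirely on the bound
for (b in c(0.5, 0.3, 0.2, 0.1)) {
  cat(sprintf("bound +-%.1f: TOST p = %.3f\n", b, tost_p(sample_d_x, sample_d_y, b)))
}
```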
Observed mean difference is about -0.23, so any bound of ±0.5 or more passes easily. At ±0.3 the test is inconclusive, with the larger p-value landing above the alpha line, because the bound is barely wider than the difference itself. At ±0.2 and ±0.1, the data outright contradict equivalence. The "right" bound lives outside R: it is whatever width the decision-maker considers too small to act on.
Try it: Using sample_d_x and sample_d_y, find the minimum bound (to two decimal places) at which equivalence is just achieved. A small search loop works.
Click to reveal solution
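A grid search over candidate bounds, regenerating the same pinned data:

```r
# Same reconstructed data and helper as in the bound-sweep example
set.seed(5)
n <- 50
sample_d_x <- 5.00 + as.numeric(scale(rnorm(n))) * 0.42
sample_d_y <- 5.23 + as.numeric(scale(rnorm(n))) * 0.42

tost_p <- function(x, y, bound) {
  max(t.test(x, y, mu = -bound, alternative = "greater")$p.value,
      t.test(x, y, mu =  bound, alternative = "less")$p.value)
}

bounds <- seq(0.01, 1.00, by = 0.01)
pass   <- sapply(bounds, function(b) tost_p(sample_d_x, sample_d_y, b) < 0.05)
min(bounds[pass])  # smallest symmetric bound that achieves equivalence: 0.37
```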
Explanation: The smallest symmetric bound that passes TOST on this data is roughly ±0.37. Note this procedure is only valid as a diagnostic. Using this search to choose the bound you report would be cherry-picking.
Practice Exercises
Exercise 1: Generic vs brand-name ibuprofen
You're testing whether a generic ibuprofen delivers pain relief equivalent to the brand-name version. Equivalence is defined as the mean time to relief being within ±5 minutes of the brand. Run TOST on the data below and save the result to my_tost1.
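The original data block wasn't preserved; the reconstruction below uses 30 subjects per arm and a 6-minute SD (both illustrative), with means pinned so the generic averages exactly 1.55 minutes slower.

```r
set.seed(21)
n <- 30
# Minutes to pain relief; scale() pins the generic at 1.55 min slower on average
brand   <- 25.00 + as.numeric(scale(rnorm(n))) * 6
generic <- 26.55 + as.numeric(scale(rnorm(n))) * 6
```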
Click to reveal solution
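One possible solution (regenerating the reconstructed data, then running both one-sided tests against the ±5-minute bounds):

```r
set.seed(21)
n <- 30
brand   <- 25.00 + as.numeric(scale(rnorm(n))) * 6
generic <- 26.55 + as.numeric(scale(rnorm(n))) * 6

p_lower <- t.test(generic, brand, mu = -5, alternative = "greater")$p.value
p_upper <- t.test(generic, brand, mu =  5, alternative = "less")$p.value
my_tost1 <- list(diff       = mean(generic) - mean(brand),
                 p_tost     = max(p_lower, p_upper),
                 equivalent = max(p_lower, p_upper) < 0.05)
my_tost1
```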
Explanation: Observed difference is 1.55 minutes, well inside ±5. Both one-sided tests reject their bounds, so the generic is declared equivalent to the brand.
Exercise 2: Non-inferiority from summary statistics
Write a function non_inferiority_from_summary(m1, s1, n1, m2, s2, n2, margin) that runs a non-inferiority test from summary statistics alone (no raw data). Use group 1 as the new treatment and group 2 as the standard. A larger value is better. Return the p-value and a pass/fail decision. Apply it to: new = (45, 8, 120), standard = (47, 9, 115), margin = 5 units worse acceptable.
Click to reveal solution
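One implementation, using the standard-error and Welch degrees-of-freedom formulas from summary statistics (the return-field names are this sketch's convention):

```r
non_inferiority_from_summary <- function(m1, s1, n1, m2, s2, n2, margin, alpha = 0.05) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  # Welch-Satterthwaite degrees of freedom
  df <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
  # Larger is better, so H0: mu1 - mu2 <= -margin (new at least `margin` worse)
  t_stat <- (m1 - m2 + margin) / se
  p <- pt(t_stat, df, lower.tail = FALSE)
  list(p_value = p, non_inferior = p < alpha)
}

# New treatment: mean 45, sd 8, n 120; standard: mean 47, sd 9, n 115; margin 5
non_inferiority_from_summary(45, 8, 120, 47, 9, 115, margin = 5)
```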
Explanation: The new treatment's mean is 2 units lower than standard, but the 5-unit margin easily absorbs that gap. The one-sided p-value is about 0.004, so non-inferiority is established.
Exercise 3: Paired bioequivalence on AUC
Twenty-four subjects each took a test drug and a reference drug. Paired AUC values are supplied below. Run a paired TOST on the log ratios with FDA bounds ±log(1.25), and report the 90% CI back on the percent scale. Save the log ratios to my_log_ratio.
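The original data block wasn't preserved; the reconstruction below simulates paired AUCs for 24 subjects, with the within-subject ratios pinned so the geometric mean ratio is 104% (the reference-AUC scale and spreads are illustrative).

```r
set.seed(33)
n <- 24
# Reference AUCs, log-normal spread around 90
auc_ref  <- exp(rnorm(n, mean = log(90), sd = 0.25))
# Within-subject ratios pinned so the geometric mean ratio is 104%
auc_test <- auc_ref * exp(log(1.04) + as.numeric(scale(rnorm(n))) * 0.12)
```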
Click to reveal solution
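One possible solution, regenerating the simulated AUC data and running the paired TOST on the log ratios:

```r
set.seed(33)
n <- 24
auc_ref  <- exp(rnorm(n, mean = log(90), sd = 0.25))
auc_test <- auc_ref * exp(log(1.04) + as.numeric(scale(rnorm(n))) * 0.12)

my_log_ratio <- log(auc_test) - log(auc_ref)
be_bound <- log(1.25)  # FDA bounds on the log scale

p_lo <- t.test(my_log_ratio, mu = -be_bound, alternative = "greater")$p.value
p_hi <- t.test(my_log_ratio, mu =  be_bound, alternative = "less")$p.value
ci90 <- t.test(my_log_ratio, conf.level = 0.90)$conf.int

max(p_lo, p_hi)                                   # TOST p-value: tiny
round(exp(c(mean(my_log_ratio), ci90)) * 100, 1)  # GMR ~104% and 90% CI in percent
```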
Explanation: Geometric mean ratio is ~104%, and the 90% CI is narrowly inside the FDA window. Both one-sided tests reject with tiny p-values. Bioequivalence is declared.
Complete Example
A real bioequivalence dossier evaluates at least two PK parameters. The drug passes only if both pass. Here we run the full decision on Cmax (peak concentration) and AUC (total exposure) using a helper, then build a one-line summary.
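A sketch of that dual-parameter decision. The data are simulated with GMRs pinned near the values quoted below (105% for Cmax, 103% for AUC), and be_tost() is this article's helper convention rather than a package function; exact CI endpoints depend on the assumed spread.

```r
set.seed(99)
n <- 24

be_tost <- function(log_ratio, alpha = 0.05) {
  bound <- log(1.25)
  p_lo <- t.test(log_ratio, mu = -bound, alternative = "greater")$p.value
  p_hi <- t.test(log_ratio, mu =  bound, alternative = "less")$p.value
  ci   <- as.numeric(t.test(log_ratio, conf.level = 1 - 2 * alpha)$conf.int)
  list(gmr_pct = exp(mean(log_ratio)) * 100,
       ci_pct  = exp(ci) * 100,
       pass    = max(p_lo, p_hi) < alpha)
}

# Paired log ratios for each PK parameter, geometric mean ratios pinned
cmax_lr <- log(1.05) + as.numeric(scale(rnorm(n))) * 0.042
auc_lr  <- log(1.03) + as.numeric(scale(rnorm(n))) * 0.042

cmax_res <- be_tost(cmax_lr)
auc_res  <- be_tost(auc_lr)

# The submission passes only if every parameter passes
verdict <- if (cmax_res$pass && auc_res$pass) "BIOEQUIVALENT" else "NOT bioequivalent"
cat(sprintf("Cmax GMR %.0f%% | AUC GMR %.0f%% | verdict: %s\n",
            cmax_res$gmr_pct, auc_res$gmr_pct, verdict))
```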
Both parameters pass: Cmax GMR is 105% with a 90% CI of [103%, 107%], and AUC GMR is 103% with 90% CI [102%, 105%]. Both CIs sit well inside [80%, 125%], so the test drug is declared bioequivalent to the reference for regulatory submission. If either parameter's CI had touched or crossed a bound, the entire submission would fail.
Summary
| Use case | Bounds convention | R call pattern |
|---|---|---|
| Two-sided equivalence | ±L, set from domain knowledge | Two t.test() calls with alternative = "greater" and "less" at mu = -L and +L |
| Non-inferiority | One-sided, margin = worst-acceptable worsening | One t.test() with alternative = "less" and mu = margin |
| Bioequivalence (FDA) | [80%, 125%] on log scale → ±log(1.25) | Paired one-sample TOST on log(test) - log(ref) |
Key takeaways:
- A non-significant t-test never proves equivalence. Run TOST if you want to support that claim.
- TOST combines two one-sided tests and decides equivalence when both reject at alpha.
- The 90% CI restatement is equivalent to the two-p-value version and is easier to communicate.
- Non-inferiority is TOST with only the lower bound active.
- Bioequivalence applies TOST on the log scale with regulator-defined [80%, 125%] bounds.
- The bound is a policy decision made before data collection, not a number to tune after the fact.
References
- Lakens, D. (2017). Equivalence Tests: A Practical Primer for t-Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4). Link
- Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence Testing for Psychological Research: A Tutorial. Advances in Methods and Practices in Psychological Science, 1(2). Link
- Schuirmann, D. J. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. Journal of Pharmacokinetics and Biopharmaceutics, 15(6). Link
- U.S. Food and Drug Administration. Statistical Approaches to Establishing Bioequivalence (Guidance for Industry). Link
- Caldwell, A. TOSTER R package vignettes. CRAN. Link
- Walker, E., & Nowacki, A. S. (2011). Understanding Equivalence and Noninferiority Testing. Journal of General Internal Medicine, 26(2). Link
- European Medicines Agency. Guideline on the Investigation of Bioequivalence. Link
Continue Learning
- Hypothesis Testing in R, the null-hypothesis framework that TOST inverts, with the p-value mechanics you'll need behind every one-sided test.
- Sample Size in R, power analysis to size a TOST study before you collect data, so your bounds are realistic for your sample.
- Confidence Intervals in R, the CI interpretation that anchors the 90% interval approach regulators use for bioequivalence decisions.