rr-statistics.co

A/B Test Calculator

An A/B test compares two versions of something, like two button colors on a website, to tell if one really performs better or if the gap is just random luck. Drop in your numbers to plan how many visitors you'll need, or to check whether the result you already have is real.

New to A/B testing? Read the 4-min primer

What it is. An A/B test routes traffic to two variants of the same thing (page, email, button), measures a binary outcome (converted? clicked?), and asks whether B's true conversion rate differs from A's, or whether the gap you saw is the kind of jitter you would expect from random sampling.

How to read it. Four numbers do most of the work. Lift is B minus A, in percentage points (or as a relative percent). The p-value is the probability of seeing a gap this big when there is truly no difference: a small p means the gap is hard to dismiss as noise. The 95% CI for the lift is the range of differences the data are compatible with: if it crosses zero, you cannot rule out "no effect". BF10 is a Bayes factor comparing "the rates differ" against "no difference": values above 10 are strong evidence that the rates differ, values below 1/10 are strong evidence of no difference.

Frequentist vs Bayesian. Frequentist: pre-set alpha and power, run to a fixed sample size, decide on p < alpha. Bayesian: start from a prior (uniform Beta(1,1) is fine), update each arm to a Beta posterior with the data, report P(B > A) and the credible interval for the lift. Both answer the same business question; the math and the assumptions differ.

Picking the mode. Have not run yet? Use Plan to size the experiment from a baseline rate, a minimum lift you would care about, and your desired power. Already done? Use Analyze with the per-arm visitors and conversions. The framing toggle (frequentist, Bayesian, both) controls which numbers show up in column C.

Plan or analyze, frequentist + Bayesian, sequential view, runs in your browser

Load a real-world example to try:

📊 Plan a 2-arm test

Baseline conversion is 10%. We want 80% power to detect a 1pp absolute lift at alpha = 0.05.

[Interactive output appears here: result readout, runnable R code ("Reproduce in R"), the A/B distributions chart, and the inference summary.]

Anatomy of an A/B test
Two-proportion z-test (frequentist):

    p_pool = (c_a + c_b) / (n_a + n_b)
    SE_0   = sqrt( p_pool (1 - p_pool) (1/n_a + 1/n_b) )
    z      = (p_b - p_a) / SE_0
    p      = 2 (1 - Phi(|z|))
Two-proportion z-test. Pool the two arms under H0 to estimate one shared rate, compute the standard error from that pooled rate, and read off a z-statistic. The 95% CI for the lift uses the unpooled SE (each arm's own variance) so the interval centers on the observed gap. No continuity correction by default. Matches prop.test(c(c_a, c_b), c(n_a, n_b), correct = FALSE) exactly.
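The recipe above translates directly into a few lines of base R; this is a sketch (the helper name `ab_ztest` and the example counts are ours, not the calculator's):

```r
# Pooled two-proportion z-test, as described above.
ab_ztest <- function(c_a, n_a, c_b, n_b) {
  p_a <- c_a / n_a
  p_b <- c_b / n_b
  p_pool <- (c_a + c_b) / (n_a + n_b)                          # shared rate under H0
  se0 <- sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))     # pooled SE for the test
  z <- (p_b - p_a) / se0
  p_value <- 2 * (1 - pnorm(abs(z)))
  se_un <- sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) # unpooled SE for the CI
  ci <- (p_b - p_a) + c(-1, 1) * qnorm(0.975) * se_un
  list(z = z, p = p_value, ci = ci)
}

res <- ab_ztest(500, 5000, 600, 5000)  # 10% vs 12% on 5,000 visitors per arm
# res$p matches prop.test(c(500, 600), c(5000, 5000), correct = FALSE)$p.value
```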
Bayesian beta-binomial conjugate:

    prior:     p_a, p_b ~ Beta(alpha_0, beta_0)
    posterior: p_a ~ Beta(alpha_0 + c_a, beta_0 + n_a - c_a)
               p_b ~ Beta(alpha_0 + c_b, beta_0 + n_b - c_b)
    output:    P(p_b > p_a), 95% credible interval for p_b - p_a
Beta-Binomial conjugate. A Beta prior is conjugate to the Binomial likelihood, so the posterior is Beta in closed form. We sample 10,000 draws from each posterior, count the fraction where p_b exceeds p_a, and take the 2.5th and 97.5th percentiles of the lift draws. With a Beta(1,1) (uniform) prior the posterior mean equals (c+1)/(n+2). Replicates rbeta() simulation in R.
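A minimal base-R sketch of that simulation, with the uniform Beta(1, 1) prior and 10,000 draws described above (the counts are illustrative, not from the calculator):

```r
# Beta-Binomial posterior comparison with a uniform Beta(1, 1) prior.
set.seed(1)
c_a <- 500; n_a <- 5000   # arm A: conversions / visitors (illustrative)
c_b <- 600; n_b <- 5000   # arm B
draws_a <- rbeta(10000, 1 + c_a, 1 + n_a - c_a)          # posterior draws for p_a
draws_b <- rbeta(10000, 1 + c_b, 1 + n_b - c_b)          # posterior draws for p_b
prob_b_beats_a <- mean(draws_b > draws_a)                # P(p_b > p_a)
lift_ci <- quantile(draws_b - draws_a, c(0.025, 0.975))  # 95% credible interval for the lift
```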
Sequential / alpha-spending sanity:

    each interim look at fraction t of N spends alpha_t, with sum over looks alpha_t = alpha
    Pocock:          spend alpha roughly evenly across looks
    O'Brien-Fleming: alpha_t = 2 - 2 Phi(z_alpha / sqrt(t))
Sequential alpha-spending. If you peek at the data K times and stop early on a single significant look, your real false-positive rate is far above alpha. Group-sequential designs (Pocock, O'Brien-Fleming) spread alpha across the looks so the family-wise error stays under control. The sequential view in column C uses the Pocock-style boundary for K looks at equal information fractions.
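To see the inflation concretely, here is a small simulation with illustrative parameters of our own choosing (not the calculator's internals): both arms share the same true rate, yet stopping at the first of five "significant" looks rejects far more often than the nominal 5%.

```r
# Under H0 (identical 10% rates), peek 5 times and stop at the first |z| > 1.96.
set.seed(2)
K <- 5; n_look <- 1000; p0 <- 0.10; trials <- 2000
false_positive <- replicate(trials, {
  ca <- cumsum(rbinom(K, n_look, p0))   # cumulative conversions, arm A
  cb <- cumsum(rbinom(K, n_look, p0))   # cumulative conversions, arm B
  n  <- n_look * (1:K)                  # cumulative visitors per arm
  pp <- (ca + cb) / (2 * n)             # pooled rate at each look
  z  <- (cb / n - ca / n) / sqrt(pp * (1 - pp) * 2 / n)
  any(abs(z) > qnorm(0.975))            # rejected at any look?
})
mean(false_positive)  # well above the nominal 0.05
```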
Sample-size formula (two-prop, Cohen's h):

    h         = 2 (arcsin sqrt(p2) - arcsin sqrt(p1))
    n_per_arm = (z_alpha + z_beta)^2 / h^2
    total     = n_per_arm * 2 / (4 k (1 - k))   (unequal allocation, B's share = k)
Sample size for proportions. The arcsine transform stabilizes variance so both arms contribute equally; that's Cohen's h. Plug your baseline p1, your minimum-detectable rate p2, alpha, and 1-beta (power), and the formula returns n per arm at 50/50 allocation. For unequal splits, scale the total by 1 / (4 k (1-k)). Matches pwr::pwr.2p.test().
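The formula is short enough to sketch directly (the function name is ours; the example is the Plan scenario above: 10% baseline, +1pp absolute lift, 80% power):

```r
# Per-arm sample size via Cohen's h (two-sided alpha, 50/50 allocation).
n_per_arm <- function(p1, p2, alpha = 0.05, power = 0.80) {
  h <- 2 * (asin(sqrt(p2)) - asin(sqrt(p1)))  # Cohen's h on the arcsine scale
  ceiling((qnorm(1 - alpha / 2) + qnorm(power))^2 / h^2)
}
n_per_arm(0.10, 0.11)  # agrees with ceiling of pwr::pwr.2p.test(h, power = 0.80)$n
```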
Lift, RR, and absolute difference:

    absolute lift = p_b - p_a   (percentage points)
    relative lift = (p_b - p_a) / p_a
    risk ratio    = p_b / p_a
Lift vs RR vs absolute difference. Be explicit. Saying "B is 20% better" is ambiguous: a baseline of 10% rising to 12% is +2pp absolute, +20% relative, RR = 1.20. Confidence intervals and BF10 are computed on the absolute scale here; the recap line shows the relative figure for stakeholder summaries.
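The three scales for that example, spelled out:

```r
p_a <- 0.10; p_b <- 0.12
p_b - p_a          # absolute lift: 0.02, i.e. +2 percentage points
(p_b - p_a) / p_a  # relative lift: 0.20, i.e. "B is 20% better"
p_b / p_a          # risk ratio: 1.20
```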
Caveats: when this is the wrong tool

If you have… → use instead:

Three or more variants (multi-armed test) → a multi-armed bandit (Thompson sampling) for adaptive allocation, or a Bonferroni / Holm correction over pairwise z-tests for fixed exposure.

Continuous outcome (revenue per visitor, time-on-page) → Welch's two-sample t-test (continuous A/B); the two-proportion test forces a binary outcome.

Survival or time-to-event outcome → Kaplan-Meier with a log-rank test, or a Cox proportional hazards model; censoring breaks the binomial assumption.

Repeated measures on the same user (within-subject) → a paired test (McNemar for binary, paired t-test for continuous), or a mixed-effects model with a user random intercept.

Ratio metrics (clicks-per-visit, conversions-per-session) → a delta-method or bootstrap CI for the ratio; treating each session as an IID conversion is wrong because the denominator is itself random per user.

You peeked early and stopped on a significant look → the reported p is wrong; use a group-sequential design (alpha-spending) chosen before the test, or always-valid sequential intervals (mSPRT).
Numerical accuracy: pnorm uses Hart's approximation (abs error < 7.5e-8); qnorm via Wichura AS 241; Beta sampling via Marsaglia-Tsang gamma. Cross-checked against R's prop.test and pwr.2p.test on > 30 input vectors.