A/B Test Calculator
An A/B test compares two versions of something, like two button colors on a website, to tell if one really performs better or if the gap is just random luck. Drop in your numbers to plan how many visitors you'll need, or to check whether the result you already have is real.
New to A/B testing? Read the 4-minute primer below.
What it is. An A/B test routes traffic to two variants of the same thing (page, email, button), measures a binary outcome (converted? clicked?), and asks whether B's true conversion rate differs from A's, or whether the gap you saw is the kind of jitter you would expect from random sampling.
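To see what that sampling jitter looks like, here is a minimal base-R sketch of an A/A test: both arms share the same true conversion rate, so any gap between them is pure noise. The 10% rate and n = 1,000 per arm are illustrative, not values from the tool.

```r
# Simulate an A/A test: both arms have the same true 10% conversion rate,
# so any gap between them is pure sampling noise.
set.seed(42)
n      <- 1000                      # visitors per arm (illustrative)
conv_a <- rbinom(1, n, 0.10)        # conversions in arm A
conv_b <- rbinom(1, n, 0.10)        # conversions in arm B, same true rate
cat(sprintf("A: %.1f%%  B: %.1f%%  gap: %.1f pp\n",
            100 * conv_a / n, 100 * conv_b / n, 100 * (conv_b - conv_a) / n))
```

Rerun without the seed and the gap jumps around from draw to draw; the test's job is to tell this jitter apart from a real difference.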
How to read it. Four numbers do most of the work. Lift is B minus A in percentage points (or relative percent). p-value is the chance of seeing a gap this big when there really is no difference: small p means the gap is hard to dismiss as noise. 95% CI for the lift is the range of differences the data are compatible with: if it crosses zero, you cannot rule out "no effect". BF10 is a Bayes factor: values above 10 are strong evidence for B, below 1/10 are strong evidence for A.
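As a concrete illustration, here is a base-R sketch of the first three numbers on made-up counts (200/2,000 vs 240/2,000). The z-test and Wald interval mirror what prop.test(..., correct = FALSE) reports; BF10 requires a model comparison and is omitted here.

```r
n_a <- 2000; c_a <- 200                 # arm A: visitors, conversions (made up)
n_b <- 2000; c_b <- 240                 # arm B: visitors, conversions (made up)
p_a <- c_a / n_a; p_b <- c_b / n_b
lift <- p_b - p_a                       # absolute lift, proportion units

# Two-proportion z-test (pooled SE under H0: no difference)
p_pool <- (c_a + c_b) / (n_a + n_b)
se0    <- sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z      <- lift / se0
p_val  <- 2 * pnorm(-abs(z))

# 95% Wald CI for the lift (unpooled SE)
se1 <- sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci  <- lift + c(-1, 1) * qnorm(0.975) * se1

cat(sprintf("lift = %.1f pp, p = %.3f, 95%% CI [%.1f, %.1f] pp\n",
            100 * lift, p_val, 100 * ci[1], 100 * ci[2]))
```

For these counts the lift is 2.0 pp with p ≈ 0.043 and a CI of roughly [0.1, 3.9] pp: the interval stays just above zero, so "no effect" is barely excluded.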
Frequentist vs Bayesian. Frequentist: pre-set alpha and power, run to a fixed sample size, decide on p < alpha. Bayesian: start from a prior (uniform Beta(1,1) is fine), update each arm to a Beta posterior with the data, report P(B > A) and the credible interval for the lift. Both answer the same business question; the math and the assumptions differ.
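A minimal sketch of that Bayesian update, reusing the made-up counts from above and the uniform Beta(1,1) prior:

```r
# Conjugate update: Beta(1,1) prior + binomial data -> Beta posterior per arm
set.seed(1)
n_a <- 2000; c_a <- 200
n_b <- 2000; c_b <- 240

draws_a <- rbeta(1e5, 1 + c_a, 1 + n_a - c_a)   # posterior draws for arm A's rate
draws_b <- rbeta(1e5, 1 + c_b, 1 + n_b - c_b)   # posterior draws for arm B's rate

p_b_beats_a <- mean(draws_b > draws_a)                    # P(B > A)
lift_ci <- quantile(draws_b - draws_a, c(0.025, 0.975))   # 95% credible interval

cat(sprintf("P(B > A) = %.3f, 95%% credible interval for lift: [%.1f, %.1f] pp\n",
            p_b_beats_a, 100 * lift_ci[1], 100 * lift_ci[2]))
```

With 2,000 visitors per arm the flat prior is quickly swamped by the data, which is why Beta(1,1) is usually a safe default.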
Picking the mode. Have not run yet? Use Plan to size the experiment from a baseline rate, a minimum lift you would care about, and your desired power. Already done? Use Analyze with the per-arm visitors and conversions. The framing toggle (frequentist, Bayesian, both) controls which numbers show up in column C.
Try a real-world example: baseline conversion is 10%, and we want 80% power to detect a 1 pp absolute lift at alpha = 0.05.
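This Plan scenario can be reproduced with the pwr package (the one the accuracy note at the bottom cross-checks against). pwr measures the effect as Cohen's h, so the 10% vs 11% pair is converted first:

```r
library(pwr)                          # install.packages("pwr") if needed

h <- ES.h(p1 = 0.11, p2 = 0.10)       # Cohen's h for baseline 10% + 1 pp lift
pwr.2p.test(h = h, sig.level = 0.05, power = 0.80)
# n in the output is the required sample size per arm (~14,700 here)
```

Small absolute lifts on small baselines are expensive: halving the minimum detectable lift roughly quadruples the required n.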
Read more: Anatomy of an A/B test.
Under the hood, the frequentist results match prop.test(c(c_a, c_b), c(n_a, n_b), correct = FALSE) exactly, the Bayesian results match an rbeta() posterior simulation in R, and the power calculations match pwr::pwr.2p.test().
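For instance, running that call on the made-up counts from the sketches above reproduces the manual z-test (the reported chi-squared statistic is z²):

```r
# Same counts as the sketches above; correct = FALSE disables Yates'
# continuity correction so the result matches the plain z-test.
prop.test(x = c(200, 240), n = c(2000, 2000), correct = FALSE)
# X-squared ≈ 4.09 (= z^2), p ≈ 0.043; note the CI is for p_A - p_B,
# so its sign is flipped relative to the lift B - A.
```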
Caveats: when this is the wrong tool

| If you have… | Use instead |
| --- | --- |
| Three or more variants (multi-armed test) | A multi-armed bandit (Thompson sampling) for adaptive allocation, or a Bonferroni / Holm correction over pairwise z-tests for fixed exposure (see the sketch after this table). |
| A continuous outcome (revenue per visitor, time-on-page) | Welch's two-sample t-test (continuous A/B); the two-proportion test forces a binary outcome. |
| A survival or time-to-event outcome | Kaplan-Meier with a log-rank test, or a Cox proportional hazards model; censoring breaks the binomial assumption. |
| Repeated measures on the same user (within-subject) | A paired test (McNemar for binary, a paired t-test for continuous), or a mixed-effects model with a user random intercept. |
| Ratio metrics (clicks-per-visit, conversions-per-session) | A delta-method or bootstrap CI for the ratio; treating each session as an IID conversion is wrong because the per-user denominator is itself random. |
| You peeked early and stopped on a significant look | The reported p-value is wrong; use a group-sequential design (alpha-spending) chosen before the test, or always-valid sequential intervals (mSPRT). |
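For the multi-arm row above, a hedged base-R sketch of the fixed-exposure route (the three arms and their counts are made up): run every pairwise two-proportion z-test, then Holm-adjust the p-values.

```r
# Three arms with fixed exposure; Holm correction controls the
# family-wise error rate across all pairwise comparisons.
n <- c(A = 2000, B = 2000, C = 2000)   # visitors per arm (made up)
x <- c(A = 200,  B = 240,  C = 230)    # conversions per arm (made up)

combos <- combn(names(n), 2)           # all pairs: AB, AC, BC
p_raw  <- apply(combos, 2, function(pr) {
  prop.test(x[pr], n[pr], correct = FALSE)$p.value
})
names(p_raw) <- apply(combos, 2, paste, collapse = " vs ")
p.adjust(p_raw, method = "holm")       # adjusted p-values, FWER <= alpha
```

Holm is uniformly more powerful than plain Bonferroni and needs no extra assumptions, which makes it a reasonable default when you are not running a bandit.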
Related reading and tools

- A/B testing in R, end-to-end: design, analysis, and reporting in one tutorial.
- Hypothesis testing in R: the conceptual underpinnings of frequentist inference.
- Power analysis in R: how the MDE, alpha, and power trade off.
- Proportion tests in R: prop.test, binom.test, and when to pick which.
- Confidence interval calculator: if you only need the CI for the lift.
- Power analysis tool: sample size for means, proportions, and correlations.
Numerical accuracy: pnorm uses Hart's rational approximation (absolute error < 7.5e-8); qnorm uses Wichura's AS 241 algorithm; Beta sampling uses the Marsaglia-Tsang gamma method. Results are cross-checked against R's prop.test and pwr::pwr.2p.test on more than 30 input vectors.