rr-statistics.co

A/B Test Calculator

An A/B test compares two versions of something, like two button colors on a website, to tell if one really performs better or if the gap is just random luck. Drop in your numbers to plan how many visitors you'll need, or to check whether the result you already have is real.

New to A/B testing? Read the 4-min primer

What it is. An A/B test routes traffic to two variants of the same thing (page, email, button), measures a binary outcome (converted? clicked?), and asks whether B's true conversion rate differs from A's, or whether the gap you saw is the kind of jitter you would expect from random sampling.

How to read it. Four numbers do most of the work. Lift is B minus A, in percentage points (or as a relative percent). The p-value is the probability of seeing a gap this big when there is truly no difference: a small p means the gap is hard to dismiss as noise. The 95% CI for the lift is the range of differences the data are compatible with: if it crosses zero, you cannot rule out "no effect". BF10 is a Bayes factor comparing "the rates differ" against "no difference": values above 10 are strong evidence that the rates differ, values below 1/10 are strong evidence of no difference.

Frequentist vs Bayesian. Frequentist: pre-set alpha and power, run to a fixed sample size, decide on p < alpha. Bayesian: start from a prior (uniform Beta(1,1) is fine), update each arm to a Beta posterior with the data, report P(B > A) and the credible interval for the lift. Both answer the same business question; the math and the assumptions differ.

Picking the mode. Have not run yet? Use Plan to size the experiment from a baseline rate, a minimum lift you would care about, and your desired power. Already done? Use Analyze with the per-arm visitors and conversions. The framing toggle (frequentist, Bayesian, both) controls which numbers show up in column C.

Plan or analyze, frequentist + Bayesian, sequential view, runs in your browser

Load a real-world example to try:

📊 Plan a 2-arm test

Baseline conversion is 10%. We want 80% power to detect a 1pp absolute lift at alpha = 0.05.

[Interactive output appears here: result readout, runnable R code ("Reproduce in R"), the A/B distributions chart, and the inference summary.]

Anatomy of an A/B test
Two-proportion z-test (frequentist):

    p_pool = (c_a + c_b) / (n_a + n_b)
    SE_0   = sqrt( p_pool (1 - p_pool) (1/n_a + 1/n_b) )
    z      = (p_b - p_a) / SE_0
    p      = 2 (1 - Phi(|z|))
Two-proportion z-test. Pool the two arms under H0 to estimate one shared rate, compute the standard error from that pooled rate, and read off a z-statistic. The 95% CI for the lift uses the unpooled SE (each arm's own variance) so the interval centers on the observed gap. No continuity correction by default. Matches prop.test(c(c_a, c_b), c(n_a, n_b), correct = FALSE) exactly.
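The recipe above translates directly into a few lines of base R; this is a sketch (the helper name `ab_ztest` and the example counts are ours, not the calculator's):

```r
# Pooled two-proportion z-test, as described above.
ab_ztest <- function(c_a, n_a, c_b, n_b) {
  p_a <- c_a / n_a
  p_b <- c_b / n_b
  p_pool <- (c_a + c_b) / (n_a + n_b)                          # shared rate under H0
  se0 <- sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))     # pooled SE for the test
  z <- (p_b - p_a) / se0
  p_value <- 2 * (1 - pnorm(abs(z)))
  se_un <- sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) # unpooled SE for the CI
  ci <- (p_b - p_a) + c(-1, 1) * qnorm(0.975) * se_un
  list(z = z, p = p_value, ci = ci)
}

res <- ab_ztest(500, 5000, 600, 5000)  # 10% vs 12% on 5,000 visitors per arm
# res$p matches prop.test(c(500, 600), c(5000, 5000), correct = FALSE)$p.value
```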
Bayesian beta-binomial conjugate:

    prior:     p_a, p_b ~ Beta(alpha_0, beta_0)
    posterior: p_a ~ Beta(alpha_0 + c_a, beta_0 + n_a - c_a)
               p_b ~ Beta(alpha_0 + c_b, beta_0 + n_b - c_b)
    output:    P(p_b > p_a), 95% credible interval for p_b - p_a
Beta-Binomial conjugate. A Beta prior is conjugate to the Binomial likelihood, so the posterior is Beta in closed form. We sample 10,000 draws from each posterior, count the fraction where p_b exceeds p_a, and take the 2.5th and 97.5th percentiles of the lift draws. With a Beta(1,1) (uniform) prior the posterior mean equals (c+1)/(n+2). Replicates rbeta() simulation in R.
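A minimal base-R sketch of that simulation, with the uniform Beta(1, 1) prior and 10,000 draws described above (the counts are illustrative, not from the calculator):

```r
# Beta-Binomial posterior comparison with a uniform Beta(1, 1) prior.
set.seed(1)
c_a <- 500; n_a <- 5000   # arm A: conversions / visitors (illustrative)
c_b <- 600; n_b <- 5000   # arm B
draws_a <- rbeta(10000, 1 + c_a, 1 + n_a - c_a)          # posterior draws for p_a
draws_b <- rbeta(10000, 1 + c_b, 1 + n_b - c_b)          # posterior draws for p_b
prob_b_beats_a <- mean(draws_b > draws_a)                # P(p_b > p_a)
lift_ci <- quantile(draws_b - draws_a, c(0.025, 0.975))  # 95% credible interval for the lift
```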
Sequential / alpha-spending sanity:

    each interim look at fraction t of N spends alpha_t, with sum over looks alpha_t = alpha
    Pocock:          spend alpha roughly evenly across looks
    O'Brien-Fleming: alpha_t = 2 - 2 Phi(z_alpha / sqrt(t))
Sequential alpha-spending. If you peek at the data K times and stop early on a single significant look, your real false-positive rate is far above alpha. Group-sequential designs (Pocock, O'Brien-Fleming) spread alpha across the looks so the family-wise error stays under control. The sequential view in column C uses the Pocock-style boundary for K looks at equal information fractions.
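To see the inflation concretely, here is a small simulation with illustrative parameters of our own choosing (not the calculator's internals): both arms share the same true rate, yet stopping at the first of five "significant" looks rejects far more often than the nominal 5%.

```r
# Under H0 (identical 10% rates), peek 5 times and stop at the first |z| > 1.96.
set.seed(2)
K <- 5; n_look <- 1000; p0 <- 0.10; trials <- 2000
false_positive <- replicate(trials, {
  ca <- cumsum(rbinom(K, n_look, p0))   # cumulative conversions, arm A
  cb <- cumsum(rbinom(K, n_look, p0))   # cumulative conversions, arm B
  n  <- n_look * (1:K)                  # cumulative visitors per arm
  pp <- (ca + cb) / (2 * n)             # pooled rate at each look
  z  <- (cb / n - ca / n) / sqrt(pp * (1 - pp) * 2 / n)
  any(abs(z) > qnorm(0.975))            # rejected at any look?
})
mean(false_positive)  # well above the nominal 0.05
```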
Sample-size formula (two-prop, Cohen's h):

    h         = 2 (arcsin sqrt(p2) - arcsin sqrt(p1))
    n_per_arm = (z_alpha + z_beta)^2 / h^2
    total     = n_per_arm * 2 / (4 k (1 - k))   (unequal allocation, B's share = k)
Sample size for proportions. The arcsine transform stabilizes variance so both arms contribute equally; that's Cohen's h. Plug your baseline p1, your minimum-detectable rate p2, alpha, and 1-beta (power), and the formula returns n per arm at 50/50 allocation. For unequal splits, scale the total by 1 / (4 k (1-k)). Matches pwr::pwr.2p.test().
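The formula is short enough to sketch directly (the function name is ours; the example is the Plan scenario above: 10% baseline, +1pp absolute lift, 80% power):

```r
# Per-arm sample size via Cohen's h (two-sided alpha, 50/50 allocation).
n_per_arm <- function(p1, p2, alpha = 0.05, power = 0.80) {
  h <- 2 * (asin(sqrt(p2)) - asin(sqrt(p1)))  # Cohen's h on the arcsine scale
  ceiling((qnorm(1 - alpha / 2) + qnorm(power))^2 / h^2)
}
n_per_arm(0.10, 0.11)  # agrees with ceiling of pwr::pwr.2p.test(h, power = 0.80)$n
```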
Lift, RR, and absolute difference:

    absolute lift = p_b - p_a   (percentage points)
    relative lift = (p_b - p_a) / p_a
    risk ratio    = p_b / p_a
Lift vs RR vs absolute difference. Be explicit. Saying "B is 20% better" is ambiguous: a baseline of 10% rising to 12% is +2pp absolute, +20% relative, RR = 1.20. Confidence intervals and BF10 are computed on the absolute scale here; the recap line shows the relative figure for stakeholder summaries.
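The three scales for that example, spelled out:

```r
p_a <- 0.10; p_b <- 0.12
p_b - p_a          # absolute lift: 0.02, i.e. +2 percentage points
(p_b - p_a) / p_a  # relative lift: 0.20, i.e. "B is 20% better"
p_b / p_a          # risk ratio: 1.20
```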
Caveats: when this is the wrong tool

If you have… → use instead:

Three or more variants (multi-armed test) → a multi-armed bandit (Thompson sampling) for adaptive allocation, or a Bonferroni / Holm correction over pairwise z-tests for fixed exposure.

Continuous outcome (revenue per visitor, time-on-page) → Welch's two-sample t-test (continuous A/B); the two-proportion test forces a binary outcome.

Survival or time-to-event outcome → Kaplan-Meier with a log-rank test, or a Cox proportional hazards model; censoring breaks the binomial assumption.

Repeated measures on the same user (within-subject) → a paired test (McNemar for binary, paired t-test for continuous), or a mixed-effects model with a user random intercept.

Ratio metrics (clicks-per-visit, conversions-per-session) → a delta-method or bootstrap CI for the ratio; treating each session as an IID conversion is wrong because the denominator is itself random per user.

You peeked early and stopped on a significant look → the reported p is wrong; use a group-sequential design (alpha-spending) chosen before the test, or always-valid sequential intervals (mSPRT).
Numerical accuracy: pnorm uses Hart's approximation (abs error < 7.5e-8); qnorm via Wichura AS 241; Beta sampling via Marsaglia-Tsang gamma. Cross-checked against R's prop.test and pwr.2p.test on > 30 input vectors.