Exact Binomial Test in R: binom.test() for Small Samples
binom.test() in R performs an exact test of a proportion using the binomial distribution directly, returning honest p-values and Clopper-Pearson confidence intervals even when the sample is too small for the normal approximation. Reach for it when total trials are below about thirty, or whenever np or n(1-p) dips under 5.
Why do small samples need an exact binomial test?
The normal approximation behind classic proportion tests assumes the sampling distribution of the sample proportion is roughly bell-shaped. At small sample sizes it simply is not. The discrete binomial has gaps, skew, and sharp edges that a smooth normal curve glosses over, and a rough rule of thumb is that you need n*p >= 5 and n*(1-p) >= 5 before trusting the approximation. binom.test() sidesteps the approximation entirely and computes the p-value from the exact binomial probability mass function, so the answer stays correct no matter how small n is. Here is the payoff on a 12-flip coin experiment.
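A sketch of that experiment in code; the 10-heads-in-12-flips split is an assumption consistent with the p-value quoted below.

```r
# 10 heads in 12 flips: too lopsided for a fair coin?
res <- binom.test(x = 10, n = 12, p = 0.5)
res

# Two-sided exact p-value: 2 * P(X >= 10) for X ~ Binomial(12, 0.5)
res$p.value  # 0.0386
```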
With only 12 flips, the p-value of 0.0386 is small enough to reject the fair-coin hypothesis at the 5% level. Notice that the 95% confidence interval, [0.52, 0.98], sits entirely above 0.5. Both decisions point the same way, and both come from exact binomial probabilities rather than a questionable bell-curve approximation.
Whether n is 8 or 80, binom.test() returns the true probability of seeing a result at least as extreme as yours under the null.

Try it: A coin is flipped 8 times and lands heads 7 times. Is this consistent with a fair coin? Use binom.test() with the two-sided default.
Click to reveal solution
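One way to write the solution:

```r
# H0: p = 0.5 (fair coin), two-sided alternative by default
binom.test(x = 7, n = 8)
# p-value = 0.0703: not small enough to reject fairness at the 5% level
```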
Explanation: Two-sided p-value is 0.070, which does not cross the 5% threshold. The confidence interval [0.47, 1.00] just barely includes 0.5, confirming the same non-rejection decision.
How do you call binom.test() in R?
Now that you have seen the output, let's look at the function signature in full so you can configure every run deliberately. binom.test() takes five arguments, most with sensible defaults.
| Argument | Default | Purpose |
|---|---|---|
| `x` | (required) | Number of successes, or a two-element vector `c(successes, failures)` |
| `n` | required if `x` is a scalar | Total number of trials |
| `p` | `0.5` | Hypothesized probability of success under H0 |
| `alternative` | `"two.sided"` | One of `"two.sided"`, `"less"`, `"greater"` |
| `conf.level` | `0.95` | Confidence level for the Clopper-Pearson interval |
Let's run the canonical textbook example. A fair six-sided die should land on three once every six rolls, so the null proportion is 1/6. You roll the die 24 times and land a three 9 times, much more than the expected four. Is the die loaded?
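In code:

```r
# 9 threes in 24 rolls, tested against the fair-die proportion of 1/6
res <- binom.test(x = 9, n = 24, p = 1/6)
res
res$estimate  # 0.375, the observed proportion of threes
```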
With p-value = 0.012 and the 95% CI [0.19, 0.59] sitting wholly above 1/6 ≈ 0.167, the evidence against a fair die is strong. The sample estimate of 0.375 tells you the observed proportion of threes was more than double the null.
p = 0.5 is only the right choice when you are testing a fair coin. Every other scenario needs an explicit null proportion. Forgetting this is a silent way to test the wrong hypothesis and still get a plausible-looking p-value.

Try it: Your QA lab inspects 20 widgets and finds 3 defective. The factory's claim is a 5% defect rate. Use binom.test() to evaluate the claim two-sided.
Click to reveal solution
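A solution sketch:

```r
# H0: defect rate is 0.05, two-sided alternative
binom.test(x = 3, n = 20, p = 0.05)
# p-value = 0.0755: the factory's 5% claim survives, barely
```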
Explanation: The two-sided p-value of 0.075 does not quite clear the 5% bar, even though the sample defect rate (0.15) is three times the claimed rate. Small samples have low power.
How do you run one-sided tests with binom.test()?
When prior knowledge or study design points in one direction, a two-sided test wastes power on the direction you do not care about. Set alternative = "greater" to test whether the true proportion exceeds the null, and alternative = "less" to test the opposite.
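A sketch of a right-tailed call; the 14-heads-in-18-flips count is an assumption consistent with the output discussed next.

```r
# H1: the coin is biased toward heads (p > 0.5)
binom.test(x = 14, n = 18, p = 0.5, alternative = "greater")
```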
Because the alternative is "greater", R concentrates the entire 5% risk in the right tail. The p-value of 0.015 now reflects only P(X >= 14) under a fair-coin null, not both tails. The 95% confidence interval has 1 as its upper bound because, by construction, a right-tailed CI does not bound the upper side.
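The left-tailed counterpart, using assumed shipment numbers (2 late out of 25, an 8% sample rate) against a 20% benchmark:

```r
# H1: the true late-shipment rate is below the 20% benchmark
binom.test(x = 2, n = 25, p = 0.20, alternative = "less")
```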
The left-tailed p-value is 0.098, so at the 5% level you cannot claim the true late-shipment rate is below the 20% benchmark yet, even though the sample rate is just 8%. Small-sample left-tailed tests are often underpowered like this.
To change the confidence level of a one-sided interval, set conf.level in binom.test() or compute the bound directly with qbeta().

Try it: A coin is flipped 10 times and lands heads 9 times. Test the one-sided hypothesis that the coin is biased toward heads (H1: p > 0.5).
Click to reveal solution
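One way to write the solution:

```r
binom.test(x = 9, n = 10, alternative = "greater")
# p-value = P(X >= 9) = 11/1024 = 0.0107: reject the fair-coin null
```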
Explanation: The right-tailed p-value of 0.011 is small enough to reject the fair-coin null in favor of a heads-biased coin.
How do you interpret the Clopper-Pearson confidence interval?
The interval printed by binom.test() is not a normal approximation with a wider margin of safety. It is the Clopper-Pearson exact interval, built by inverting the binomial cumulative distribution. Because the binomial is discrete, the interval guarantees actual coverage of at least conf.level, which makes it slightly conservative (wider than strictly needed), but never misleadingly narrow.
Under the hood, each bound is a beta-distribution quantile. For x successes in n trials at confidence level 1 - alpha:
$$CI_L \;=\; B\!\left(\tfrac{\alpha}{2};\ x,\ n - x + 1\right), \qquad CI_U \;=\; B\!\left(1 - \tfrac{\alpha}{2};\ x + 1,\ n - x\right)$$
Where:

- $B(q;\ a,\ b)$ is the inverse CDF (quantile function) of the Beta(a, b) distribution, called `qbeta(q, a, b)` in R
- $\alpha = 1 - \text{conf.level}$
- $x$ is the number of successes and $n$ the total trials
If you are not interested in the math, skip to the code below; the computed interval is all you need for day-to-day inference.
You can reproduce R's confidence interval by hand, which is the cleanest way to convince yourself Clopper-Pearson is not a black box.
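A sketch of the hand computation, reusing the die example (9 threes in 24 rolls) for concreteness:

```r
x <- 9; n <- 24; alpha <- 0.05

ci_lower <- qbeta(alpha / 2,     x,     n - x + 1)
ci_upper <- qbeta(1 - alpha / 2, x + 1, n - x)
c(ci_lower, ci_upper)

# Identical to what binom.test() prints for the same data
binom.test(x, n)$conf.int
```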
The hand-computed bounds match R's output to the last digit. This is the exact Clopper-Pearson formula, nothing more. That also tells you how to get a Clopper-Pearson CI for any custom confidence level without going through binom.test() at all: you just call qbeta() directly.
Approximate intervals such as Wilson or mid-p (see binom::binom.confint()) are usually narrower at the same confidence level.

Try it: Compute a 99% confidence interval for 4 successes in 20 trials using binom.test() with conf.level = 0.99.
Click to reveal solution
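A solution sketch:

```r
res <- binom.test(x = 4, n = 20, conf.level = 0.99)
res$conf.int
# Compare with the default 95% interval: the 99% bounds are wider on both sides
binom.test(4, 20)$conf.int
```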
Explanation: Bumping the confidence level from 95% to 99% widens the interval because you demand higher coverage. The sample proportion is still 0.20, but the Clopper-Pearson bounds now span almost half the 0-to-1 scale.
How do you extract p-value and CI programmatically?
Printed output is fine for a single test, but real analyses batch tests across many groups and record the numbers in a tibble or dashboard. The binom.test() return value is an htest list with a fixed set of named components, so you can pull out exactly what you need.
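For example:

```r
res <- binom.test(x = 9, n = 24, p = 1/6)

names(res)      # all components of the htest list
res$p.value     # the p-value as a plain number
res$conf.int    # two-element vector: lower and upper bounds
res$estimate    # observed proportion of successes
res$statistic   # number of successes
```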
Once you know the component names, pipelines are easy. Here is a batch test across three experimental groups, returning a tidy table that you can feed into a report or plot.
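A sketch of such a batch, with hypothetical group counts invented for illustration; a base R data.frame keeps it dependency-free, and swapping in tibble() is a one-line change.

```r
# Hypothetical counts per group, each tested against a null proportion of 0.5
dat <- data.frame(group = c("A", "B", "C"),
                  x     = c(17, 15, 13),
                  n     = c(20, 20, 20))

results <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  res <- binom.test(dat$x[i], dat$n[i], p = 0.5)
  data.frame(group    = dat$group[i],
             estimate = unname(res$estimate),
             p_value  = res$p.value,
             ci_lower = res$conf.int[1],
             ci_upper = res$conf.int[2])
}))
results
```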
Group A rejects the null at 1%, group B at 5%, and group C is not significant. Having the bounds and estimate as numeric columns makes it trivial to filter, plot error bars, or export to a spreadsheet.
Pull $p.value and $conf.int straight out of the htest list and let the rest of the tidyverse handle filtering, joining, and plotting.

Try it: Write a small function ex_ci_width() that takes a binom.test() result and returns the width of its confidence interval.
Click to reveal solution
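A solution sketch:

```r
ex_ci_width <- function(test_result) {
  # conf.int is c(lower, upper); diff() returns upper - lower
  as.numeric(diff(test_result$conf.int))
}

ex_ci_width(binom.test(7, 8))    # wide: only 8 trials
ex_ci_width(binom.test(70, 80))  # narrower: ten times the data
```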
Explanation: diff() on a two-element vector returns the single-element difference, which equals upper - lower. Wrapping it in a function makes CI widths comparable across many tests in a pipeline.
Practice Exercises
Exercise 1: Rare-event quality test
A stamping machine is rated to produce at most 2% defective parts. You inspect 40 parts and find 3 defective. Test whether the true defect rate exceeds the 2% rating, at the 5% level. Report the p-value, the 95% confidence interval, and your decision.
Click to reveal solution
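A solution sketch:

```r
# H1: the true defect rate exceeds the 2% rating
binom.test(x = 3, n = 40, p = 0.02, alternative = "greater")
```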
Explanation: The p-value of 0.046 comes in just under the 5% cutoff, and the one-sided 95% lower bound of 0.021 sits a hair above the rated 2%. The evidence says the machine is out of spec, but only barely, so a larger inspection sample is warranted before taking action.
Exercise 2: Compare two quality tiers
Tier A: 4 defects in 15 items. Tier B: 3 defects in 18 items. For each tier, run binom.test() against a 10% benchmark defect rate with a two-sided alternative, extract the CIs, and return a tibble with columns tier, estimate, ci_lower, ci_upper, and a logical column overlaps_ten_pct that is TRUE when the CI contains 0.10.
Click to reveal solution
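A base-R solution sketch (swap data.frame for tibble if you prefer):

```r
tiers <- data.frame(tier = c("A", "B"), x = c(4, 3), n = c(15, 18))

results <- do.call(rbind, lapply(seq_len(nrow(tiers)), function(i) {
  res <- binom.test(tiers$x[i], tiers$n[i], p = 0.10)
  data.frame(tier             = tiers$tier[i],
             estimate         = unname(res$estimate),
             ci_lower         = res$conf.int[1],
             ci_upper         = res$conf.int[2],
             overlaps_ten_pct = res$conf.int[1] <= 0.10 & 0.10 <= res$conf.int[2])
}))
results
```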
Explanation: Both tiers have small samples, so both CIs are wide enough to include the 10% benchmark. Neither tier differs significantly from the benchmark on its own. To directly compare tier A against tier B, you would need prop.test() or fisher.test(), not two one-sample binomial tests.
Complete Example: A/B pilot with 30 users per arm
A product team runs a tiny 30-per-arm pilot before committing to a full-scale A/B test. The 20% conversion rate from last quarter is the benchmark. Arm A (control) saw 8 of 30 convert; arm B (variant) saw 14 of 30. You want to know which arms differ from the 20% baseline.
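In code:

```r
baseline <- 0.20
arm_a <- binom.test(x = 8,  n = 30, p = baseline)   # control
arm_b <- binom.test(x = 14, n = 30, p = baseline)   # variant

arm_a$p.value   # large: arm A is indistinguishable from the baseline
arm_b$p.value   # tiny: arm B converts well above the baseline
arm_b$conf.int  # sits entirely above 0.20
```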
Arm A is indistinguishable from the 20% baseline at this sample size. Arm B clearly beats it, with a 95% CI of [0.28, 0.66] sitting entirely above 0.20 and a p-value of about 0.0009. The next step would be to run prop.test() or fisher.test() to directly compare arm A against arm B, because two one-sample tests against a shared baseline do not establish a difference between the arms.
Summary
- `binom.test(x, n, p, alternative, conf.level)` performs an exact test of a proportion from the binomial pmf, no normal approximation needed.
- Use it whenever `n` is small, roughly under 30, or when `n*p` or `n*(1-p)` is below 5.
- Default `p = 0.5` is only appropriate for fair-coin tests; set it explicitly for every other scenario.
- `alternative = "greater"` and `"less"` give one-sided tests with one-sided confidence intervals.
- The printed CI is Clopper-Pearson, computed from `qbeta()`, and is guaranteed conservative.
- You can reproduce the CI by hand as `c(qbeta(alpha/2, x, n-x+1), qbeta(1-alpha/2, x+1, n-x))`.
- The result is an `htest` list with `$p.value`, `$conf.int`, `$estimate`, and `$statistic` slots, ready for programmatic extraction.
- For direct two-sample comparisons, switch to `prop.test()` for large samples or `fisher.test()` for 2x2 tables.
References
- R Core Team. `?binom.test` manual page. Link
- Clopper, C.J. & Pearson, E.S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial." Biometrika 26:404-413. Link
- Agresti, A. (2019). Statistical Methods for the Social Sciences, 5th ed., Chapter 5. Pearson.
- Brown, L.D., Cai, T.T., DasGupta, A. (2001). "Interval Estimation for a Binomial Proportion." Statistical Science 16(2):101-133. Link
- Dorai-Raj, S. `binom` package documentation (Wilson, mid-p, and other CIs). Link
- Wikipedia. Binomial test. Link
Continue Learning
- Proportion Tests in R, the parent tutorial. When to reach for `prop.test()`, `binom.test()`, or `fisher.test()`, with a side-by-side decision flow.
- Chi-Squared Test in R. Generalises proportion testing to two or more categorical groups.
- Fisher's Exact Test in R. The exact counterpart for small-sample 2x2 contingency tables.