Exact Logistic Regression in R: When Separation Causes Problems
Exact logistic regression is a small-sample method that gives reliable p-values and confidence intervals when standard glm() breaks down because of complete or quasi-complete separation, where one predictor pattern perfectly classifies the outcome.
What goes wrong when glm() meets separation?
Standard glm() fits logistic regression by maximum likelihood, and that algorithm only works when no predictor pattern perfectly classifies the outcome. The moment one does, the likelihood keeps climbing as a coefficient marches toward infinity, and glm() either fails to converge or returns silly numbers with vast standard errors. Let's force this to happen on a tiny clinical dataset, watch R complain, and use that as our entry point.
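Here is a minimal sketch of that kind of dataset and fit: twelve patients, six per arm, every treated patient responding and no control patient responding. The names trial and m_glm match the objects referenced later; the column names treatment and outcome are illustrative.

```r
# A 12-patient trial with complete separation: treatment perfectly predicts response.
trial <- data.frame(
  treatment = rep(c(0, 1), each = 6),  # 0 = control, 1 = treated
  outcome   = rep(c(0, 1), each = 6)   # 0 = non-responder, 1 = responder
)

# Standard maximum-likelihood logistic regression -- this is the fit that breaks.
m_glm <- glm(outcome ~ treatment, data = trial, family = binomial)
summary(m_glm)
```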
Look at those numbers. The treatment effect is estimated at ~53 on the log-odds scale, with a standard error of ~110,000 and a p-value of 0.9996. R also raised the warning glm.fit: fitted probabilities numerically 0 or 1 occurred. The model didn't fail loudly. It fell over quietly and still printed a coefficient table that looks like real output. The huge standard error is the giveaway: nothing here is reliable.
The summary() table will look normal at a glance. Always scan for that warning and check whether any standard error is implausibly large. Try it: Flip one outcome so separation is broken (change one treated patient to a non-responder). Refit glm() and confirm the warning disappears and the standard error becomes finite.
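One way to do it, reusing the trial object sketched above:

```r
# Break the separation: turn one treated patient into a non-responder.
trial_overlap <- trial
trial_overlap$outcome[7] <- 0   # row 7 is the first treated patient

m_overlap <- glm(outcome ~ treatment, data = trial_overlap, family = binomial)
summary(m_overlap)              # no warning; the standard error is finite
```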
Explanation: Once a single overlap exists, the likelihood has a finite maximum and glm() returns sensible standard errors.
What is separation and why does it break maximum likelihood?
There are two kinds of separation. Complete separation happens when a predictor (or combination of predictors) perfectly classifies every outcome, like our 12-patient trial above. Quasi-complete separation is milder: a predictor perfectly classifies for some covariate combinations but not others, often because a small subgroup has zero events.
Either way, the trouble shows up in the same place. The log-likelihood for logistic regression is
$$\ell(\beta) = \sum_{i=1}^{n} \big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \big]$$
where $p_i = 1 / (1 + e^{-x_i^\top \beta})$ is the predicted probability. Under separation, the data are consistent with $p_i = y_i$ exactly, which requires $\beta \to \pm\infty$. The optimizer keeps climbing the likelihood and never lands. R's IRLS routine gives up after a fixed number of iterations and reports whatever huge value it reached. Let's see this directly in the predicted probabilities.
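With the m_glm sketch from earlier, the boundary behaviour is easy to display:

```r
# Fitted probabilities pile up at the edges of [0, 1] under separation.
p_hat <- predict(m_glm, type = "response")
round(p_hat, 12)   # controls essentially 0, treated essentially 1
range(p_hat)       # minimum ~0, maximum ~1
```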
The fitted probabilities for non-treated patients are essentially zero (2.6e-12), and for treated patients they will be essentially one. That is the smoking gun. If predicted probabilities pile up at the boundary, separation is the cause and any reported coefficient or p-value is meaningless.
Try it: Use predict(m_glm, type = "response") on trial and confirm the maximum is essentially 1 and the minimum is essentially 0.
Explanation: Both endpoints sit at the floor and ceiling of the [0, 1] range, which is the diagnostic for separation.
How do you fit an exact logistic regression with elrm()?
Exact logistic regression sidesteps the infinite-coefficient problem by changing the question. Instead of solving for $\beta$, it conditions on the sufficient statistics for the nuisance parameters and computes the exact discrete distribution of the test statistic for the parameter of interest. The conditional likelihood is
$$L(\beta_1 \mid t_0) = \frac{c(t_1) e^{\beta_1 t_1}}{\sum_{u} c(u) e^{\beta_1 u}}$$
where $c(u)$ counts the number of tables consistent with the conditioning statistics. The sum is finite, the maximum is finite, and separation is no longer fatal.
Enumerating every consistent table quickly becomes computationally expensive, so elrm instead samples the conditional distribution with a Markov chain Monte Carlo (MCMC) algorithm, hence the "exact-like" label. For small problems the answer is effectively exact; for larger ones it is an arbitrarily good approximation. elrm() expects data in aggregated form: one row per unique covariate pattern, with two outcome columns, the number of trials and the number of successes. We aggregate our 12-patient trial into two rows, one per treatment group, then fit.
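A sketch of that aggregation and fit, assuming the trial data frame from earlier. The successes/trials formula, interest, iter, burnIn, and dataset arguments follow elrm's documented interface; the burn-in value and the use of dplyr for the aggregation are illustrative choices.

```r
library(elrm)
library(dplyr)

# One row per covariate pattern (here: per arm), with trials and successes.
trial_agg <- trial %>%
  group_by(treatment) %>%
  summarise(trials = n(), responders = sum(outcome), .groups = "drop") %>%
  as.data.frame()
trial_agg
#   treatment trials responders
#           0      6          0
#           1      6          6

# Exact-like conditional inference for the treatment term via MCMC.
set.seed(42)
m_elrm <- elrm(responders/trials ~ treatment,
               interest = ~ treatment,
               iter     = 22000,    # the run size discussed below
               burnIn   = 2000,     # illustrative burn-in
               dataset  = trial_agg)
summary(m_elrm)
```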
The aggregated table is the structure elrm needs: 6 trials with 0 successes in the control arm, 6 trials with 6 successes in the treatment arm. That is the same separated data as before, just folded down. Now look at the fitted summary.
The exact estimate of the treatment log-odds is around 3.47 (an odds ratio of about 32), with a 95% confidence interval of [1.07, ∞). The upper bound is infinite because separation still means there is no upper limit consistent with the data, but the lower bound is finite and informative. That is the punchline: exact logistic regression cannot manufacture a finite point estimate when one does not exist, but it can give you a defensible lower bound and a small, honest p-value.
Push iter upward when the MCMC standard error of the p-value is large relative to the p-value itself. A rule of thumb from the elrm authors is to make sure each parameter chain has at least 1,000 usable samples after burn-in. Try it: Re-run the exact fit with iter = 50000 and compare the treatment p-value's standard error to the original 22,000-iteration run.
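A possible solution, reusing the trial_agg sketch from above:

```r
set.seed(42)
m_elrm_50k <- elrm(responders/trials ~ treatment,
                   interest = ~ treatment,
                   iter     = 50000,
                   burnIn   = 5000,
                   dataset  = trial_agg)

# Monte Carlo standard errors of the p-values: short run vs long run.
m_elrm$p.values.se
m_elrm_50k$p.values.se
```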
Explanation: MCMC error decays like $1/\sqrt{N}$, so roughly doubling the iterations cuts the standard error by about 30 percent.
How does Firth's penalized regression compare?
Firth's method takes a different route. It modifies the score function with a Jeffreys-prior penalty that shrinks the estimate away from infinity, so the optimizer always lands on a finite value. The penalized log-likelihood is
$$\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2} \log \big| I(\beta) \big|$$
where $I(\beta)$ is the Fisher information matrix. The extra term acts like a soft regularizer that vanishes for large samples and prevents the runaway in small ones. The logistf package implements this in a familiar formula interface.
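A sketch of the Firth fit on the same separated trial data:

```r
library(logistf)

# Firth's penalized-likelihood logistic regression always returns finite estimates.
m_firth <- logistf(outcome ~ treatment, data = trial)
summary(m_firth)   # finite log-odds estimate, profile penalized-likelihood CI, p-value
```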
Firth gives a finite log-odds estimate of about 5.13 with a 95% confidence interval of [1.34, 12.52] and a p-value of 0.003. The point estimate is larger than elrm's because the two methods optimize different objectives, but they agree on direction, magnitude scale, and significance. Firth runs in milliseconds and works at any sample size, which is why it has become the practical default in many fields.
Try it: Extract just the coefficient vector from m_firth and compare it side by side with m_elrm$coeffs.
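A quick way to put the two estimates side by side, using the objects sketched earlier:

```r
coef(m_firth)    # penalized MLE of the treatment log-odds
m_elrm$coeffs    # conditional estimate reported by elrm
```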
Explanation: Both methods give finite, positive estimates of the treatment effect. The numerical disagreement reflects different inference targets (penalized MLE vs conditional median unbiased estimate), not a contradiction.
Which method should you choose: exact, Firth, or Bayesian?
In practice you will not pick a method by coin flip. Each has a sweet spot, summarised below.
| Method | Best when | Trade-off |
|---|---|---|
| glm() | No separation, moderate-to-large sample | Fails silently under separation; never trust it without checking warnings and predicted probabilities |
| Firth (logistf) | Default for separated or sparse data; continuous predictors | Confidence intervals are still asymptotic; can be conservative for tiny samples |
| Exact (elrm) | Very small samples, rare events, regulatory submissions | Slow MCMC; categorical predictors only; CIs may have one-sided infinite bounds under complete separation |
| Bayesian (rstanarm, brms) | Larger samples with weak separation, want full posterior | Requires choosing priors; runtime is seconds to minutes |
The decision is mostly about who you need to convince. Statisticians and regulators in clinical trials often expect exact p-values for small studies. Working data scientists usually prefer Firth because it slots into existing workflows. Bayesian methods shine when you also want a posterior distribution for downstream decision-making.
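One way to build the side-by-side comparison discussed next; compare_df and its column names are illustrative, and the numbers are pulled straight from the three fitted objects sketched earlier.

```r
# Treatment log-odds estimates and p-values from the three approaches.
compare_df <- data.frame(
  method   = c("glm (MLE)", "Firth (logistf)", "Exact-like (elrm)"),
  estimate = c(unname(coef(m_glm)["treatment"]),
               unname(coef(m_firth)["treatment"]),
               unname(m_elrm$coeffs)),
  p_value  = c(coef(summary(m_glm))["treatment", "Pr(>|z|)"],
               unname(m_firth$prob["treatment"]),
               unname(m_elrm$p.values))
)
compare_df
```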
The contrast is stark. glm()'s estimate is off by an order of magnitude and its p-value is essentially 1, while Firth and exact agree that treatment has a strongly positive, statistically significant effect. The effectively infinite upper bounds from glm and elrm reflect the same underlying issue: nothing in the data caps how large the treatment effect could be. Firth's penalty pulls the upper bound down to a finite, useful number.
glm() gives a misleading answer. When your default method's output looks "weird" (huge SEs, infinite confidence bounds, p-values near 1), separation is the most likely explanation, not a real null effect. Try it: Add a row to compare_df for an artificial "glm rounded" method that replaces the absurd glm estimate with a censored marker like "diverged".
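A possible version, again using the illustrative compare_df from above:

```r
# Format the estimates as text so a censored marker can live in the same column.
compare_chr <- transform(compare_df, estimate = sprintf("%.2f", estimate))

compare_chr <- rbind(
  compare_chr,
  data.frame(method = "glm (censored)", estimate = "diverged", p_value = NA)
)
compare_chr
```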
Explanation: Reporting "diverged" instead of the meaningless 53 is the honest summary for a regulator or audit log.
Practice Exercises
Exercise 1: Detect separation programmatically
Write a function my_separation(model) that takes a fitted glm() object and returns TRUE if separation appears to have occurred. Use two heuristics: any standard error larger than 10 on the log-odds scale, or any predicted probability outside [1e-6, 1 - 1e-6]. Test it on m_glm (separated) and on a non-separated model fit on mtcars.
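A possible solution using the two stated heuristics; the non-separated comparison model on mtcars (am ~ mpg) is just one convenient choice.

```r
my_separation <- function(model) {
  # Heuristic 1: any coefficient with an implausibly large standard error.
  big_se <- any(coef(summary(model))[, "Std. Error"] > 10)

  # Heuristic 2: any fitted probability pinned at the boundary of [0, 1].
  p_hat    <- predict(model, type = "response")
  boundary <- any(p_hat < 1e-6 | p_hat > 1 - 1e-6)

  big_se || boundary
}

my_separation(m_glm)                                    # TRUE: the separated fit
m_ok <- glm(am ~ mpg, data = mtcars, family = binomial)
my_separation(m_ok)                                     # FALSE: outcomes overlap
```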
Explanation: Both checks are needed, because quasi-complete separation can produce extreme predicted probabilities without enormous standard errors, and vice versa.
Exercise 2: Quasi-complete separation across three methods
Build a 30-row dataset where outcome is perfectly predicted by risk only inside group == "A" (quasi-complete separation), but is overlapping inside group == "B". Fit glm(), logistf(), and elrm() for outcome ~ risk + group, and report the three p-values for risk in a single tibble.
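One possible construction, with risk coded as a binary indicator so the data can also be aggregated for elrm. The counts are chosen so that risk perfectly predicts the outcome inside group A, while group B's high-risk patients have a mix of outcomes; all object names are illustrative.

```r
# Group A: risk perfectly predicts the outcome (quasi-complete separation).
# Group B: high-risk patients have mixed outcomes, so risk is informative but not perfect.
ex2 <- rbind(
  data.frame(group = "A", risk = rep(c(1, 0), c(8, 7)),
             outcome = rep(c(1, 0), c(8, 7))),
  data.frame(group = "B", risk = rep(c(1, 0), c(8, 7)),
             outcome = c(rep(1, 5), rep(0, 3), rep(0, 7)))
)

# glm and Firth on the 30 individual rows.
m2_glm   <- glm(outcome ~ risk + group, data = ex2, family = binomial)
m2_firth <- logistf(outcome ~ risk + group, data = ex2)

# elrm needs one row per covariate pattern with successes and trials.
ex2_agg <- ex2 %>%
  group_by(risk, group) %>%
  summarise(trials = n(), successes = sum(outcome), .groups = "drop") %>%
  as.data.frame()

set.seed(1)
m2_elrm <- elrm(successes/trials ~ risk + group,
                interest = ~ risk,
                iter = 22000, burnIn = 2000, dataset = ex2_agg)

# The three p-values for risk in one tibble.
tibble(
  method = c("glm", "Firth (logistf)", "Exact-like (elrm)"),
  p_risk = c(coef(summary(m2_glm))["risk", "Pr(>|z|)"],
             unname(m2_firth$prob["risk"]),
             unname(m2_elrm$p.values))
)
```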
Explanation: glm's broken p-value would let a quick reader miss the genuine effect of risk; Firth and exact both flag it as significant.
Putting It All Together
A practical workflow looks like this. Simulate data with a hidden separation issue, run glm() first as your default, diagnose, then escalate to Firth (or exact, if needed). Below is the whole loop on a fresh 30-row dataset with two predictors, one cleanly informative and one that triggers separation.
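A sketch of that loop; the names m_diag, age, and exposure follow the description below, while the data-frame name dat and the simulation details are illustrative.

```r
set.seed(2024)

# Fresh 30-row dataset: age is pure noise, exposure triggers quasi-complete
# separation because every exposed subject has the event.
dat <- data.frame(
  age      = round(rnorm(30, mean = 50, sd = 8)),
  exposure = rep(c(0, 1), times = c(20, 10))
)
dat$outcome <- ifelse(dat$exposure == 1, 1L, rbinom(30, 1, 0.25))

# Step 1: run glm() first, because it is fast and usually fine.
m_diag <- glm(outcome ~ age + exposure, data = dat, family = binomial)

# Step 2: diagnose -- runaway standard errors and boundary fitted probabilities.
coef(summary(m_diag))[, "Std. Error"]
range(predict(m_diag, type = "response"))

# Step 3: escalate to Firth's penalized likelihood for the reported model.
m_firth_diag <- logistf(outcome ~ age + exposure, data = dat)
summary(m_firth_diag)
```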
The diagnosis step caught the runaway standard error in m_diag. Switching to Firth produced a clean fit: age is small and uncertain (no real effect), while exposure is large, positive, and statistically meaningful with a finite standard error. This is the pattern to internalize. Run glm() first for speed, diagnose with summary() and predict(), then escalate to logistf() (or elrm() for regulatory work) only when the diagnostics flag a problem.
Summary
- Separation (complete or quasi-complete) breaks glm() for logistic regression. R warns with "fitted probabilities numerically 0 or 1 occurred" but still prints output, so warnings must be checked.
- Symptoms: absurdly large standard errors, predicted probabilities pinned at 0 or 1, p-values near 1 instead of near 0.
- Exact logistic regression conditions on sufficient statistics and computes the conditional p-value distribution; the elrm package implements this with MCMC.
- Firth's penalized regression (logistf) is the practical default: fast, finite estimates, works for continuous predictors, and matches the exact method's direction.
- Choose by audience: Firth for everyday work, exact for regulatory submissions and very small samples, Bayesian when you also need a posterior.
- Always diagnose first. Run glm(), inspect summary() for huge SEs, and inspect predict(type = "response") for boundary values before trusting any logistic regression output.
References
- Heinze, G. & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21(16), 2409-2419.
- Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27-38.
- Mehta, C. R. & Patel, N. R. (1995). Exact logistic regression: theory and examples. Statistics in Medicine, 14(19), 2143-2160.
- Zamar, D., McNeney, B. & Graham, J. (2007). elrm: Software implementing exact-like inference for logistic regression. Journal of Statistical Software, 21(3).
- UCLA OARC. R Data Analysis Examples: Exact Logistic Regression.
- Heinze, G., Ploner, M. & Jiricka, L. logistf: Firth's Bias-Reduced Logistic Regression. CRAN reference manual.
- Zamar, D., McNeney, B., Graham, J. & Le Cessie, S. elrm vignette. CRAN.
Continue Learning
- Logistic Regression in R: the parent guide to fitting, interpreting, and diagnosing standard logistic regression with glm().
- Multinomial Logistic Regression in R: extending logistic regression to outcomes with more than two categories.
- Ordinal Logistic Regression in R: the right tool when categories have a natural ordering, with its own separation pitfalls.