GLM Exercises in R: 10 Poisson, Binomial & Gamma Practice Problems, Solved Step-by-Step
These 10 GLM exercises in R walk you through fitting glm() across the three workhorse families: Poisson for counts, Binomial for yes/no or proportion data, and Gamma for positive, right-skewed continuous responses. Every problem ships with a runnable starter, a hint, and a click-to-reveal solution so you can practise, self-check, and see the reasoning.
How do you fit your first GLM in R?
The glm() function is base R's one-stop generalized linear model fitter. Hand it a formula, pick a family, and you get a model you can summarise, predict from, and test. The coefficients land on the link scale, which is usually log for Poisson and Gamma, and logit for Binomial. We start with a Poisson fit on the built-in warpbreaks dataset so every number below is reproducible in your browser.
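A minimal sketch of that first fit follows. The formula breaks ~ wool + tension is inferred from the coefficients discussed in the next paragraph, and the object names wb and pois_fit follow the ones used later in this tutorial.

```r
# warpbreaks ships with base R (datasets package)
wb <- warpbreaks

# Poisson GLM; the log link is the default for this family
pois_fit <- glm(breaks ~ wool + tension, family = poisson, data = wb)

# Estimates, standard errors, and z-tests, all on the log scale
summary(pois_fit)
round(coef(pois_fit), 3)
```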
Every coefficient is on the log scale. The -0.206 on woolB means wool B has a log-count that is 0.206 lower than wool A, which turns into a rate ratio of exp(-0.206) = 0.814, so wool B breaks about 19% less often than wool A at the same tension. The two tension coefficients say medium and high tension reduce breaks below the low-tension baseline, and both are comfortably significant.
For Poisson with a log link, exp(beta) is a multiplicative rate ratio. For Binomial with a logit link, exp(beta) is an odds ratio. Always label which scale you are reading.
Try it: Compute the rate ratio for woolB from pois_fit. Confirm it matches the 0.814 mentioned above.
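One way to do it, refitting the model first so the snippet runs on its own (rr_woolB is just an illustrative name):

```r
pois_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Exponentiate the log-scale coefficient to get the rate ratio
rr_woolB <- exp(coef(pois_fit)[["woolB"]])
round(rr_woolB, 3)  # 0.814
```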
Explanation: Exponentiating a log-link coefficient converts it to a rate ratio. Wool B produces 81.4% of the breaks that wool A does at the same tension, all else equal.
How do you pick the right family and link?
Picking the family is the one decision that matters most in a GLM. Get it wrong and every coefficient, p-value, and prediction is built on the wrong distribution. The table below is the short version; the code right after shows a minimal fit for each of the three families on data you can inspect.
| Family | Typical response | Default link | Example |
|---|---|---|---|
| Poisson | Non-negative counts | log | breaks per loom, calls per hour |
| Binomial | Yes/no or proportion | logit | admitted vs not, click vs no-click |
| Gamma | Positive continuous, right-skewed | inverse (often log in practice) | claim sizes, time-to-event |
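The three fits can be sketched as below. The simulation is an assumption: the true coefficients (-6, 1.5, 0.004 for admissions; 5, 0.02, 0.4 with shape 4 for claims) come from this tutorial, but the seed, sample size, and predictor ranges are not given, so the values here are placeholders and your estimates will differ a little from any quoted numbers.

```r
set.seed(42)  # assumed seed; the original is not shown
n <- 500      # assumed sample size

# Simulated admissions: true logit = -6 + 1.5 * gpa + 0.004 * gre
admits <- data.frame(gpa = runif(n, 2, 4),             # assumed range
                     gre = round(runif(n, 400, 800)))  # assumed range
admits$y <- rbinom(n, 1, plogis(-6 + 1.5 * admits$gpa + 0.004 * admits$gre))

# Simulated claims: true log-mean = 5 + 0.02 * age + 0.4 * urban, shape = 4
claims <- data.frame(
  age    = round(runif(n, 20, 70)),                    # assumed range
  region = factor(sample(c("rural", "urban"), n, replace = TRUE))
)
mu <- exp(5 + 0.02 * claims$age + 0.4 * (claims$region == "urban"))
claims$amount <- rgamma(n, shape = 4, rate = 4 / mu)   # mean = shape / rate = mu

# Same function, three families
pois_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
bin_fit  <- glm(y ~ gpa + gre, family = binomial, data = admits)
gam_fit  <- glm(amount ~ age + region, family = Gamma(link = "log"), data = claims)

round(coef(bin_fit), 3)
round(coef(gam_fit), 3)
```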
The Binomial fit recovered coefficients close to the true -6, 1.5, 0.004, and the Gamma-log fit pulled back the true 5, 0.02, 0.4. Those three glm() calls are all you need; only the family argument changes.
Think of family = ?(link = ?) as two independent choices. Poisson-log and Gamma-log share the log link, so their coefficients read the same way even though the response distributions differ.
Try it: Match each response to a family: (a) number of earthquakes per month in a region, (b) whether a customer renewed their subscription, (c) the dollar amount of an accepted insurance claim.
Explanation: Counts per unit time go to Poisson. A binary outcome goes to Binomial. A positive, right-skewed amount goes to Gamma. If you hesitated on (c), imagine the histogram: money never goes below zero and the right tail is fat, which is the Gamma signature.
Practice Exercises
The ten exercises below use the three objects already in your session: wb (warpbreaks), admits (simulated binary admissions), and claims (simulated insurance claim amounts). Each solution uses a distinct ex{N}_ prefix so your tutorial fits (pois_fit, bin_fit, gam_fit) stay intact.
Exercise 1: Fit a Poisson GLM on warpbreaks and report the tension coefficients
Refit the Poisson model but drop wool, predicting breaks ~ tension alone. Report the intercept, the coefficient on tensionM, and the coefficient on tensionH. Which tension level has the highest baseline break rate?
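A possible solution, using the ex1_ prefix described above:

```r
# Tension-only Poisson model; tension L is the reference level
ex1_fit <- glm(breaks ~ tension, family = poisson, data = warpbreaks)
round(coef(ex1_fit), 3)

# Baseline expected breaks per loom at tension L (response scale)
exp(coef(ex1_fit)[["(Intercept)"]])
```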
Explanation: The intercept is the baseline log-rate for tension L. Both M and H have negative coefficients, so tension L has the highest break rate. exp(3.594) = 36.4 is the expected breaks per loom at low tension, averaged across wool types.
Exercise 2: Turn the Poisson coefficients into rate ratios with 95% CIs
Using pois_fit (the original two-predictor model), produce a tidy table of rate ratios and their 95% confidence intervals. A rate ratio's CI crosses 1 when the effect is not statistically significant.
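A sketch of one solution. It uses Wald intervals from confint.default(); the matrix name ex2_rr is illustrative.

```r
pois_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Bind estimates to their Wald limits, then exponentiate everything at once
ex2_rr <- exp(cbind(rate_ratio = coef(pois_fit), confint.default(pois_fit)))
round(ex2_rr, 3)
```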
Explanation: None of the CIs contain 1, so every predictor moves the break rate significantly. Wool B has 81% of wool A's break rate; tension H has 60% of tension L's, roughly a 40% reduction. confint.default() uses Wald intervals, which are fast and closely match the profile-likelihood intervals from confint() here.
Exercise 3: Check for overdispersion and refit with quasipoisson if needed
Poisson regression assumes the mean equals the variance. When that fails, standard errors are wrong. Compute the Pearson dispersion (sum of squared Pearson residuals over residual degrees of freedom). If it exceeds ~1.2, refit with family = quasipoisson and compare the standard errors.
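One way to run the check and the refit (object names follow the ex3_ convention):

```r
pois_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Pearson dispersion: sum of squared Pearson residuals / residual df
ex3_disp <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
ex3_disp

# Same mean model, but the dispersion is estimated instead of fixed at 1
ex3_fit <- update(pois_fit, family = quasipoisson)

# Standard errors side by side
cbind(poisson      = summary(pois_fit)$coefficients[, "Std. Error"],
      quasipoisson = summary(ex3_fit)$coefficients[, "Std. Error"])
```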
Explanation: The dispersion of 4.26 is far above 1, so the Poisson standard errors were too small. Refitting with quasipoisson inflates every SE by roughly sqrt(4.26) ~ 2.06. The coefficients do not change, only their uncertainty. Several p-values that looked tiny under Poisson will now be borderline.
Exercise 4: Predict expected break count on the response scale
Using pois_fit, predict the expected breaks per loom when wool = "A" and tension = "M". Use type = "response" so the answer is in break counts, not log-counts.
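A short sketch of the prediction (ex4_new and ex4_pred are illustrative names):

```r
pois_fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# One-row data frame describing the loom we want a prediction for
ex4_new <- data.frame(wool = "A", tension = "M")

# type = "response" returns expected counts rather than log-counts
ex4_pred <- predict(pois_fit, newdata = ex4_new, type = "response")
round(ex4_pred, 1)
```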
Explanation: predict() with type = "response" applies the inverse link (exponential here) automatically, so the number is directly interpretable. You can double-check by hand: exp(3.692 + 0 + (-0.321)) = 29.1.
Exercise 5: Fit a Binomial GLM and report the gpa log-odds coefficient
Fit y ~ gpa + gre on admits with a logit link. Report the gpa coefficient (a log-odds ratio) and whether it is significantly different from 0.
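A possible solution. The if block is an assumption: it rebuilds a stand-in admits (seed, sample size, and predictor ranges are guesses) only when the object is missing from your session, so the estimates will not match the quoted numbers exactly.

```r
if (!exists("admits")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  admits <- data.frame(gpa = runif(n, 2, 4), gre = round(runif(n, 400, 800)))
  admits$y <- rbinom(n, 1, plogis(-6 + 1.5 * admits$gpa + 0.004 * admits$gre))
}

# Logistic regression; logit is the default link for binomial
ex5_fit <- glm(y ~ gpa + gre, family = binomial(link = "logit"), data = admits)

# Estimate, standard error, z value, and p-value for gpa
summary(ex5_fit)$coefficients["gpa", ]
```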
Explanation: Both gpa and gre are highly significant. A one-point increase in gpa raises the log-odds of admission by 1.513, holding gre fixed. Log-odds are awkward to interpret directly, which is why Exercise 6 converts them to odds ratios.
Exercise 6: Convert Binomial coefficients to odds ratios
Exponentiate the ex5_fit coefficients to produce odds ratios with 95% confidence intervals. Report the gpa odds ratio and sanity-check it (should be clearly above 1).
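One way to build the table, assuming Wald intervals are acceptable. The simulated admits in the if block is a stand-in with assumed seed, size, and ranges, used only if the object is not already in your session.

```r
if (!exists("admits")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  admits <- data.frame(gpa = runif(n, 2, 4), gre = round(runif(n, 400, 800)))
  admits$y <- rbinom(n, 1, plogis(-6 + 1.5 * admits$gpa + 0.004 * admits$gre))
}
ex5_fit <- glm(y ~ gpa + gre, family = binomial, data = admits)

# Odds ratios with 95% Wald limits
ex6_or <- exp(cbind(odds_ratio = coef(ex5_fit), confint.default(ex5_fit)))
round(ex6_or, 3)

# Per-100-point gre effect: scale the coefficient, then exponentiate
exp(100 * coef(ex5_fit)[["gre"]])
```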
Explanation: A one-unit rise in gpa multiplies the odds of admission by 4.54, holding gre fixed. The CI (3.07, 6.72) is nowhere near 1, confirming the effect. The gre OR looks small (1.004) because gre moves in units of one test point; multiply the coefficient by 100 before exponentiating for a more interpretable "per 100 points" odds ratio, roughly exp(0.4) = 1.49.
Exercise 7: Predict admission probability for a specific applicant and threshold it
Predict the admission probability for an applicant with gpa = 3.5 and gre = 700. Then threshold at 0.5 to produce a predicted class.
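A sketch of the prediction and thresholding step. As before, the if block only simulates a stand-in admits (assumed seed, size, and ranges) when none exists, so the probability it produces will differ from the one quoted in the explanation.

```r
if (!exists("admits")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  admits <- data.frame(gpa = runif(n, 2, 4), gre = round(runif(n, 400, 800)))
  admits$y <- rbinom(n, 1, plogis(-6 + 1.5 * admits$gpa + 0.004 * admits$gre))
}
ex5_fit <- glm(y ~ gpa + gre, family = binomial, data = admits)

# Probability of admission for one applicant
ex7_new  <- data.frame(gpa = 3.5, gre = 700)
ex7_prob <- predict(ex5_fit, newdata = ex7_new, type = "response")
ex7_prob

# Hard classification at the 0.5 threshold (1 = admitted)
ex7_class <- as.integer(ex7_prob > 0.5)
ex7_class
```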
Explanation: The applicant has an estimated 71.6% chance of admission, so at a 0.5 threshold they are classified as admitted. Thresholds other than 0.5 make sense when false positives and false negatives have different costs; we stick with 0.5 here for simplicity.
Exercise 8: Fit a Gamma GLM with a log link on claim amounts
Fit amount ~ age + region on claims using family = Gamma(link = "log"). Report the regionurban coefficient and translate it into a multiplicative effect on expected claim size.
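A possible solution. The if block is an assumption: it simulates a stand-in claims from the true values stated in this tutorial (5, 0.02, 0.4, shape 4) with a guessed seed, sample size, and age range, and only runs when claims is not already in your session.

```r
if (!exists("claims")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  claims <- data.frame(age = round(runif(n, 20, 70)),
                       region = factor(sample(c("rural", "urban"), n, replace = TRUE)))
  mu <- exp(5 + 0.02 * claims$age + 0.4 * (claims$region == "urban"))
  claims$amount <- rgamma(n, shape = 4, rate = 4 / mu)
}

# Gamma GLM with a log link, so coefficients act multiplicatively on the mean
ex8_fit <- glm(amount ~ age + region, family = Gamma(link = "log"), data = claims)
round(coef(ex8_fit), 3)

# Multiplicative effect of urban vs rural on expected claim size
exp(coef(ex8_fit)[["regionurban"]])
```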
Explanation: Urban claims are 46.8% larger than rural claims at the same age. The age coefficient of 0.020 on the log scale means each extra year of age multiplies expected claim size by exp(0.020) = 1.02, a 2% increase per year.
A zero or negative amount throws Error: non-positive values not allowed for the 'Gamma' family. Filter or shift your response before fitting.
Exercise 9: Extract the Gamma dispersion parameter and explain what it measures
summary() of a Gamma GLM prints a dispersion parameter. Extract it programmatically and explain what a value near 0.25 implies about the shape of the residual distribution.
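One way to pull the value out of the summary object. The simulated claims is a stand-in (assumed seed, size, and age range) used only when the object is missing, so the estimate will be near, not equal to, the value quoted below.

```r
if (!exists("claims")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  claims <- data.frame(age = round(runif(n, 20, 70)),
                       region = factor(sample(c("rural", "urban"), n, replace = TRUE)))
  mu <- exp(5 + 0.02 * claims$age + 0.4 * (claims$region == "urban"))
  claims$amount <- rgamma(n, shape = 4, rate = 4 / mu)
}
ex8_fit <- glm(amount ~ age + region, family = Gamma(link = "log"), data = claims)

# Dispersion as reported by summary() (a Pearson-based estimate)
ex9_disp <- summary(ex8_fit)$dispersion
ex9_disp

# Implied Gamma shape parameter
1 / ex9_disp
```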
Explanation: For a Gamma GLM, the dispersion equals 1 / shape. Our simulation used shape = 4, so the true dispersion is 1/4 = 0.25 and the fit recovered 0.263. The smaller the dispersion, the tighter the Gamma distribution around its mean. A dispersion of 1 collapses Gamma to an exponential distribution, which has a much heavier right tail.
Exercise 10: Compare Gamma-log to Gaussian-log via AIC
Fit the same formula (amount ~ age + region) once with family = Gamma(link = "log") and once with family = gaussian(link = "log"). Compare their AICs; which family fits better?
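A sketch of the comparison. The stand-in claims in the if block (assumed seed, size, and age range) is only created when no claims object exists, and the AIC gap it produces will differ from the one quoted in the explanation.

```r
if (!exists("claims")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  claims <- data.frame(age = round(runif(n, 20, 70)),
                       region = factor(sample(c("rural", "urban"), n, replace = TRUE)))
  mu <- exp(5 + 0.02 * claims$age + 0.4 * (claims$region == "urban"))
  claims$amount <- rgamma(n, shape = 4, rate = 4 / mu)
}

# Same formula and link; only the error distribution differs
ex10_gamma <- glm(amount ~ age + region, family = Gamma(link = "log"),    data = claims)
ex10_gauss <- glm(amount ~ age + region, family = gaussian(link = "log"), data = claims)

# Lower AIC indicates the better-supported family
ex10_aic <- AIC(ex10_gamma, ex10_gauss)
ex10_aic
```

Comparing AIC across families is reasonable here because both models are fitted to the same untransformed response.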
Explanation: The Gamma fit has a lower AIC by roughly 197 points, a decisive win. Both models use the same linear predictor and link, but Gamma's right-skewed error distribution matches simulated claim data better than Gaussian's symmetric bell curve. Gaussian with a log link is still useful as a sanity check but the Gamma is the right story here.
Complete Example: end-to-end Gamma workflow on claims
This walks through the full arc for a Gamma GLM: peek at the data, fit, check dispersion, predict, and interpret. Running it gives you a template you can paste into your next claims, time-on-task, or blood-concentration analysis.
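The workflow can be sketched as below. The ce_ names are illustrative, and the simulated claims is a stand-in (assumed seed, sample size, and age range) that is only built when the object is not already in your session, so the prediction will be close to, not identical with, the figure quoted afterwards.

```r
if (!exists("claims")) {
  set.seed(42); n <- 500  # assumed seed and sample size
  claims <- data.frame(age = round(runif(n, 20, 70)),
                       region = factor(sample(c("rural", "urban"), n, replace = TRUE)))
  mu <- exp(5 + 0.02 * claims$age + 0.4 * (claims$region == "urban"))
  claims$amount <- rgamma(n, shape = 4, rate = 4 / mu)
}

# 1. Peek at the data: the response must be strictly positive
str(claims)
summary(claims$amount)
stopifnot(all(claims$amount > 0))

# 2. Fit the Gamma GLM with a log link
ce_fit <- glm(amount ~ age + region, family = Gamma(link = "log"), data = claims)
summary(ce_fit)$coefficients

# 3. Check the dispersion (roughly 1 / shape)
ce_disp <- summary(ce_fit)$dispersion
ce_disp

# 4. Predict the expected claim for a 45-year-old urban customer
ce_new  <- data.frame(age = 45, region = "urban")
ce_pred <- predict(ce_fit, newdata = ce_new, type = "response")
ce_pred

# 5. Interpret: multiplicative effects on expected claim size
round(exp(coef(ce_fit)), 3)
```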
The fit recovered the signal we simulated, the dispersion matches the true shape, and a 45-year-old urban customer is predicted to have an expected claim near 535. Each step maps one-for-one to what you would run on a real claims dataset; only the data import changes.
Check the summary() output before trusting predictions. A non-positive-values error or a dispersion near 0 usually means the response is mis-scaled or contains zeros you forgot to filter. Fix those first and the fit becomes trustworthy.
Summary
One table, three families, one glm().
| Family | Typical response | Default link | glm() call | When to use |
|---|---|---|---|---|
| poisson | Non-negative counts | log | glm(y ~ x, family = poisson) | Counts per unit time or exposure |
| quasipoisson | Overdispersed counts | log | glm(y ~ x, family = quasipoisson) | Counts with variance > mean |
| binomial | 0/1 outcome or proportion | logit | glm(y ~ x, family = binomial) | Yes/no, click/no-click, cure/no-cure |
| Gamma | Positive continuous, right-skewed | inverse (log in practice) | glm(y ~ x, family = Gamma(link = "log")) | Claim sizes, durations, concentrations |
| gaussian | Continuous unbounded | identity | glm(y ~ x, family = gaussian) | Equivalent to lm(), useful for link comparisons |
Four habits that separate a careful GLM user from a careless one:
- Always label which scale a coefficient is on before interpreting it.
- Check dispersion for Poisson and Binomial; refit as quasi* if inflated.
- Use type = "response" for predictions you plan to show anyone else.
- Compare competing families with AIC, not just by eyeballing coefficients.
Continue Learning
- Poisson Regression in R, the parent tutorial on count regression, including offsets and zero-inflation.
- Logistic Regression in R, the Binomial GLM in full, with classification thresholds, ROC curves, and decision analysis.
- Regression Diagnostics in R, residual checks, influence, and leverage diagnostics that apply to GLMs too.