Log-Linear Models in R: Model Tables of Counts with glm(poisson)
Log-linear models treat each cell of a contingency table as a Poisson count, then use glm(family = poisson) to explain those counts from the categorical variables that index the table. The same machinery answers three questions in one go: are the variables independent, are they only partially associated, or do they fully interact?
What does a log-linear model do?
You already know how to compare two categorical variables with a chi-squared test. A log-linear model is the regression-shaped sibling of that test. It writes every cell count as a Poisson outcome and lets you switch independence, partial association, or full interaction on and off as model terms. That makes it easy to compare hypotheses, pick the simplest one that fits, and read the strength of associations off the coefficients.
Here is the payoff in one block. We collapse the built-in HairEyeColor 3-way table over Sex, turn the resulting 2-way table into a long data frame, and fit the independence model Freq ~ Hair + Eye with glm(family = poisson). The fitted counts and residual deviance tell us whether hair colour and eye colour are independent in this data.
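A minimal version of that block (the object names are ours):

```r
# Collapse the 4 x 4 x 2 HairEyeColor table over Sex (keep dims 1 and 2)
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# One row per cell: Hair, Eye, Freq
hair_eye_df <- as.data.frame(hair_eye)

# Independence model: log(mu_ij) = lambda + lambda_i^Hair + lambda_j^Eye
m_indep <- glm(Freq ~ Hair + Eye, family = poisson, data = hair_eye_df)
deviance(m_indep)     # 146.4
df.residual(m_indep)  # 9
```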
A residual deviance of 146.4 on 9 degrees of freedom is enormous. Under the null hypothesis of independence, the deviance should be roughly chi-squared with 9 df, so we expect a value near 9. We are seeing 146 instead, which is overwhelming evidence that hair and eye colour are not independent. The model is too small.
glm() fits both models; you only change the formula. Try it: Collapse HairEyeColor over Hair instead, and fit the independence model Freq ~ Eye + Sex. Print the residual deviance and df, and decide whether eye colour and sex are independent.
Click to reveal solution
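One way to write the solution (object names are ours):

```r
# Collapse over Hair (dimension 1), keeping Eye (2) and Sex (3)
eye_sex_df <- as.data.frame(margin.table(HairEyeColor, c(2, 3)))

ex_m <- glm(Freq ~ Eye + Sex, family = poisson, data = eye_sex_df)
deviance(ex_m)     # 2.45
df.residual(ex_m)  # 3
```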
Explanation: A deviance of 2.45 on 3 df is right around its expected value (a chi-squared statistic with 3 df has mean 3), so eye colour and sex are statistically independent in this data. Hair and eye colour were not, but eye colour and sex are.
How do you turn a contingency table into a data frame for glm()?
glm() does not consume tables; it consumes data frames with one row per observation, or in our case one row per cell. R already has the right tool: calling as.data.frame() on a table object returns a long data frame with one column per dimension and a final Freq column holding the counts.
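For example, converting the full 3-way table (hec_df is the name the later sections use):

```r
hec_df <- as.data.frame(HairEyeColor)
str(hec_df)
#> 'data.frame': 32 obs. of 4 variables:
#>  $ Hair: Factor w/ 4 levels "Black","Brown",..
#>  $ Eye : Factor w/ 4 levels "Brown","Blue",..
#>  $ Sex : Factor w/ 2 levels "Male","Female",..
#>  $ Freq: num ..
```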
A 4 x 4 x 2 table becomes a 32-row data frame, exactly one row per cell, with the count in Freq. The categorical dimensions arrive as factors, which is what glm() expects.
Freq column name. as.data.frame.table() names the count column Freq by default, and every log-linear example you read online uses that name. Don't rename it unless you have a strong reason; the formula Freq ~ ... is shorter and easier to share. Try it: Convert UCBAdmissions (a 2 x 2 x 6 table of Berkeley graduate admissions) to a long data frame, name it ex_ucb_df, and confirm it has 24 rows.
Click to reveal solution
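One possible solution:

```r
ex_ucb_df <- as.data.frame(UCBAdmissions)
nrow(ex_ucb_df)  # 24
head(ex_ucb_df, 3)
```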
Explanation: The 2 (Admit) x 2 (Gender) x 6 (Dept) table flattens to 24 rows, one per cell. This is the exact input shape glm(family = poisson) needs.
What's the difference between independence, partial association, and the saturated model?
For a 3-way table indexed by hair $i$, eye $j$, and sex $k$, the log-linear model writes the expected cell count as
$$\log \mu_{ijk} = \lambda + \lambda^H_i + \lambda^E_j + \lambda^S_k + \lambda^{HE}_{ij} + \lambda^{HS}_{ik} + \lambda^{ES}_{jk} + \lambda^{HES}_{ijk}.$$
Where:
- $\mu_{ijk}$ is the expected count in cell $(i, j, k)$
- $\lambda$ is an overall intercept
- $\lambda^H_i, \lambda^E_j, \lambda^S_k$ are main effects (the marginal totals)
- $\lambda^{HE}_{ij}, \lambda^{HS}_{ik}, \lambda^{ES}_{jk}$ are pairwise associations
- $\lambda^{HES}_{ijk}$ is the three-way association
You build a hierarchy of models by switching some of those interaction terms off. The four models below (independence, one-association, homogeneous-association, and saturated) span the typical workflow.
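A sketch of the four fits on hec_df, using the model names (m1 through m4) referenced in the next section:

```r
m1 <- glm(Freq ~ Hair + Eye + Sex,     family = poisson, data = hec_df)  # independence
m2 <- glm(Freq ~ Hair * Eye + Sex,     family = poisson, data = hec_df)  # Hair-Eye only
m3 <- glm(Freq ~ (Hair + Eye + Sex)^2, family = poisson, data = hec_df)  # homogeneous
m4 <- glm(Freq ~ Hair * Eye * Sex,     family = poisson, data = hec_df)  # saturated

# Residual deviance for each model: roughly 166, 19, 6.8, 0 (see below)
sapply(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4), deviance)
```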
The deviance drops from 166 (independence) to 19 (only Hair-Eye associated) to 6.8 (all 2-way interactions), and finally hits 0 for the saturated model. The pattern tells you that adding the Hair-Eye interaction explains most of the misfit; the rest of the 2-way terms add a little more; the 3-way is unnecessary.
Try it: Fit a partial-association model where Hair and Sex are associated, but Eye is independent of both. Use hec_df. Save it as ex_partial, then print its residual deviance and df.
Click to reveal solution
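A sketch of the solution:

```r
ex_partial <- glm(Freq ~ Hair * Sex + Eye, family = poisson, data = hec_df)
deviance(ex_partial)     # about 156
df.residual(ex_partial)  # 21
```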
Explanation: Compared to the mutual-independence model (deviance 166), this one only saved 10 deviance points by adding the Hair-Sex interaction. That confirms Hair-Sex is a weak association in this data; Hair-Eye is where the action is.
How do you choose between log-linear models?
Two nested models can be compared with a likelihood-ratio test. In R that is anova(small, big, test = "Chisq"). The reported deviance drop is approximately chi-squared, and a small p-value means the larger model fits significantly better. AIC offers a complementary view that does not require nesting.
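A sketch of both comparisons, using m1 through m4 from the previous section:

```r
# Sequential likelihood-ratio tests along the nested ladder
anova(m1, m2, m3, m4, test = "Chisq")

# AIC does not require nesting
AIC(m1, m2, m3, m4)
```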
Read the anova() output as three sequential tests. Going from independence to HairEye saves 147 deviance points on 9 df, extremely significant, so we keep the HairEye term. Going from m2 to all 2-way interactions saves only 12 more points on 6 df, with p = 0.058: borderline, but not significant. Going from m3 to fully saturated saves nothing meaningful. AIC agrees: m3 is best by a hair, but m2 is essentially tied and far simpler. Pick m2 unless the extra 2-way terms have a substantive interpretation you care about.
Try it: Use anova() to compare just m1 (independence) against m2 (Hair*Eye). Report the deviance drop and the p-value, and state whether the Hair-Eye interaction is needed.
Click to reveal solution
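One possible solution:

```r
# Deviance drop of about 147 on 9 df; p is effectively zero
anova(m1, m2, test = "Chisq")
```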
Explanation: A deviance drop of 147 on 9 df with p effectively zero says clearly that Hair and Eye are not independent. The Hair*Eye term must stay in the model.
How do you interpret log-linear model coefficients?
In a Poisson GLM with log link, each coefficient is a difference on the log scale. For a log-linear model on a contingency table, that gives the coefficients two clean readings: main effects describe marginal frequencies, and two-way interaction coefficients are log odds ratios for the corresponding 2 x 2 sub-table.
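For example, pulling the headline coefficient out of m2 (R's default treatment contrasts make Black hair and Brown eyes the reference levels):

```r
coef(m2)["HairBlond:EyeBlue"]       # log odds ratio
exp(coef(m2)["HairBlond:EyeBlue"])  # odds ratio, about 45.7
```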
The Hair*Eye interaction HairBlond:EyeBlue exponentiates to about 45.7. That number is the odds ratio comparing Blond vs. Black hair across Blue vs. Brown eyes: blond people's odds of having blue rather than brown eyes are roughly 46 times those of black-haired people. You can cross-check this directly from the 2 x 2 sub-table of the original data, (94 x 68) / (7 x 20) ≈ 45.7; the model has just packaged that ratio as a coefficient.
Try it: Take the coefficient HairRed:EyeBlue from m2, exponentiate it, and explain what the resulting odds ratio means.
Click to reveal solution
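One possible solution:

```r
exp(coef(m2)["HairRed:EyeBlue"])  # about 2.2
# Cross-check from the marginal 2 x 2 sub-table:
# (17 * 68) / (26 * 20) = 2.22
```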
Explanation: The exponentiated coefficient is about 2.2: for blue versus brown eyes, the odds ratio of red versus black hair is about 2.2. In plain English: red-haired people's odds of having blue rather than brown eyes are roughly twice those of black-haired people, a real but far weaker association than the Blond-Blue one.
Practice Exercises
Exercise 1: Female-only Hair x Eye
Subset HairEyeColor to females only with HairEyeColor[, , "Female"], convert to a data frame f_df, fit the independence model and the saturated 2-way model, and use anova() to decide which one fits better.
Click to reveal solution
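A sketch of the solution (model names are ours):

```r
f_df <- as.data.frame(HairEyeColor[, , "Female"])

f_indep <- glm(Freq ~ Hair + Eye, family = poisson, data = f_df)
f_sat   <- glm(Freq ~ Hair * Eye, family = poisson, data = f_df)

anova(f_indep, f_sat, test = "Chisq")
```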
Explanation: Among females only, hair and eye colour are still strongly associated. The saturated model wins decisively; independence is rejected.
Exercise 2: Titanic 4-way table
Convert Titanic to a data frame titanic_df and fit a homogeneous-association model (all 2-way interactions, no 3-way or 4-way). Compare it to the saturated model with anova().
Click to reveal solution
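A sketch of the solution (model names are ours):

```r
titanic_df <- as.data.frame(Titanic)

t_homog <- glm(Freq ~ (Class + Sex + Age + Survived)^2,
               family = poisson, data = titanic_df)
t_sat   <- glm(Freq ~ Class * Sex * Age * Survived,
               family = poisson, data = titanic_df)

anova(t_homog, t_sat, test = "Chisq")
```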
Explanation: Homogeneous association is rejected; there are higher-order interactions in the Titanic data. That matches the famous Class-Sex-Survived three-way pattern (women in third class fared worse than the marginal Sex-Survived odds ratio would predict).
Exercise 3: Conditional independence in UCBAdmissions
Test whether admission and gender are conditionally independent given department. The model Freq ~ Admit*Dept + Gender*Dept says "Admit and Gender each depend on Dept, but they are independent given Dept." Compare it against the saturated model.
Click to reveal solution
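A sketch of the solution (model names are ours):

```r
ucb_df <- as.data.frame(UCBAdmissions)

ci_model <- glm(Freq ~ Admit * Dept + Gender * Dept,
                family = poisson, data = ucb_df)
sat      <- glm(Freq ~ Admit * Gender * Dept,
                family = poisson, data = ucb_df)

anova(ci_model, sat, test = "Chisq")  # deviance ~21.7 on 6 df, p ~ 0.001
```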
Explanation: Conditional independence is rejected at p = 0.001: even after accounting for department, there is residual association between admission and gender. That is mostly driven by department A, where female applicants had a higher admission rate. The famous Berkeley "Simpson's paradox" is mostly explained by department mix, but a residual department-A effect remains.
Complete Example
A start-to-finish workflow on the Titanic data: convert the table, fit a ladder of nested log-linear models, compare them with anova() and AIC, and interpret one notable interaction.
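A sketch of that workflow (the model names t1 through t4 are ours):

```r
titanic_df <- as.data.frame(Titanic)

# Ladder: independence -> all 2-way -> all 3-way -> saturated
t1 <- glm(Freq ~ Class + Sex + Age + Survived,     family = poisson, data = titanic_df)
t2 <- glm(Freq ~ (Class + Sex + Age + Survived)^2, family = poisson, data = titanic_df)
t3 <- glm(Freq ~ (Class + Sex + Age + Survived)^3, family = poisson, data = titanic_df)
t4 <- glm(Freq ~ Class * Sex * Age * Survived,     family = poisson, data = titanic_df)

anova(t1, t2, t3, t4, test = "Chisq")
AIC(t1, t2, t3, t4)

# One notable interaction: Sex-Survived, holding the other 2-way terms fixed
exp(coef(t2)["SexFemale:SurvivedYes"])  # about 11.3
```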
anova() justifies the complexity. For Titanic, you almost always want at least all 2-way interactions, because Class, Sex, Age, and Survival genuinely interact. The deviance drops are huge at every step, and the saturated model still wins on AIC, meaning some 3-way or 4-way structure is real. The Sex-Survived odds ratio of 11.3 says women's odds of surviving were roughly 11 times those of men, holding the other 2-way associations fixed. That single coefficient summarises "women and children first."
Summary
| Model | glm formula | What it says | When to use |
|---|---|---|---|
| Mutual independence | Freq ~ A + B + C | All variables independent | Baseline / null hypothesis |
| Joint / partial | Freq ~ A*B + C | One pair associated, rest independent | When subject-matter says only one association exists |
| Homogeneous | Freq ~ (A + B + C)^2 | All 2-way associations, no 3-way | Default reasonable model for 3-way tables |
| Conditional indep. | Freq ~ A*C + B*C | A and B independent given C | Testing whether C "explains" the A-B association |
| Saturated | Freq ~ A*B*C | Reproduces the table exactly | Reference baseline for anova() only |
Workflow: build the data frame with as.data.frame(), fit a ladder of nested models with glm(family = poisson), compare with anova(..., test = "Chisq") and AIC(), and exponentiate two-way interaction coefficients to read odds ratios.
References
- Agresti, A., Categorical Data Analysis, 3rd ed. Wiley (2013). Chapter 9: Loglinear Models for Contingency Tables.
- Faraway, J.J., Extending the Linear Model with R, 2nd ed. Chapter 4: Models for Counts. CRC Press (2016).
- R Core Team, ?glm reference page. Link
- Venables, W.N. & Ripley, B.D., Modern Applied Statistics with S, 4th ed. Chapter 7. Springer (2002).
- UVa Library, An Introduction to Loglinear Models. Link
- Crawley, M.J., The R Book, 2nd ed., Chapter 15. Wiley (2013).
- Penn State STAT 504, Loglinear Models lesson notes. Link
Continue Learning
- Poisson and Negative Binomial Regression in R, the parent post; log-linear is the contingency-table specialisation.
- Chi-Squared Test in R, the simpler, hypothesis-test cousin for 2-way tables.
- Multinomial and Ordinal Logistic Regression in R: when one of the variables is a clear response, switch from joint frequencies to conditional probabilities.