Chi-Square Test Exercises in R: 10 Independence & Goodness-of-Fit Problems, Solved Step-by-Step
These 10 chi-square test exercises in R cover both goodness-of-fit and independence problems with runnable solutions, so you practise picking the right variant, reading the output, inspecting residuals, computing Cramér's V, and falling back to Fisher's exact when expected counts are small.
Which chi-square test matches your data layout?
One chi-square function in R, two very different questions. chisq.test() runs a goodness-of-fit test when you hand it a single count vector, and a test of independence when you hand it a two-way table. The decision hinges on whether you have one categorical variable or two. Here is the same function used both ways on two tiny datasets so you can see both calls before drilling into the 10 exercises.
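A minimal sketch of both calls. The coin counts and the 2×2 treatment table are made-up numbers, chosen so the p-values line up with the ones discussed next:

```r
# Goodness of fit: one count vector plus expected probabilities.
# Hypothetical coin: 45 heads, 55 tails in 100 flips.
coin <- c(heads = 45, tails = 55)
gof <- chisq.test(coin, p = c(0.5, 0.5))
gof$p.value      # ~0.32: no evidence against a fair coin

# Independence: one two-way table.
# Hypothetical drug trial; rows = group, cols = outcome.
trial <- matrix(c(26, 24, 13, 37), nrow = 2, byrow = TRUE,
                dimnames = list(group   = c("drug", "placebo"),
                                outcome = c("improved", "not_improved")))
ind <- chisq.test(trial)   # Yates' correction applied automatically for 2x2
ind$p.value      # ~0.014: drug and outcome look associated
```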
Both calls use the exact same function, and both return an object with the same fields ($statistic, $parameter, $p.value, $expected, $residuals, $stdres). The coin-flip p-value of 0.32 fails to reject the fair-coin hypothesis, while the 2×2 treatment table p-value of 0.014 flags a real association between drug and outcome. One function, one workflow, one result shape. The shape of the input is what switches between the two tests.
Here is the one-line decision rule:
| You have... | Question | R call |
|---|---|---|
| One categorical variable + an expected distribution | Goodness of fit | chisq.test(counts, p = expected) |
| Two categorical variables in a two-way table | Independence / homogeneity | chisq.test(two_way_table) |
For a 2×2 table, chisq.test() without Yates' correction matches prop.test(..., correct = FALSE) exactly. Pick whichever framing communicates the result best for your audience.

Try it: A bag of M&Ms is supposed to contain Brown/Yellow/Red/Blue/Orange/Green in equal proportions. You open a bag and count the colours. Do you want a goodness-of-fit test or a test of independence? Set ex_choice to "gof" or "independence".
Click to reveal solution
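One line, using the ex_choice name from the prompt:

```r
ex_choice <- "gof"
```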
Explanation: You have one categorical variable (colour) with 6 levels, and an expected distribution (equal proportions). That's textbook goodness-of-fit. If you instead compared colour counts across two different bags or two factories, that would be independence/homogeneity.
How do you read chi-square output in R?
chisq.test() returns a list with everything you need: the test statistic, degrees of freedom, p-value, observed and expected counts, Pearson residuals, and standardised residuals. Most of the interesting reporting lives inside those fields, not in the printed block. Pulling them out by name gives one-line access for write-ups, assumption checks, and residual plots.
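As a sketch, using the built-in HairEyeColor data collapsed over Sex (the he_res name is reused by the broom tip further down):

```r
# Collapse the 4 x 4 x 2 HairEyeColor array over Sex to a 4 x 4 Hair x Eye table
he <- margin.table(HairEyeColor, margin = c(1, 2))
he_res <- chisq.test(he)

he_res$statistic                    # X-squared ~ 138.3
he_res$parameter                    # df = 9
he_res$p.value                      # ~2.3e-25
he_res$expected["Black", "Brown"]   # ~40.1 expected under independence
he_res$stdres["Black", "Brown"]     # large and positive: over-represented
```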
A statistic of 138.3 on 9 df ((rows − 1) × (cols − 1) = 3 × 3) gives a p-value of about $2.3 \times 10^{-25}$: hair and eye colour are not independent. The expected-count matrix shows what the null hypothesis would predict. If rows and columns were independent, Black/Brown-eyed would sit near 40 instead of the observed 68. The standardised residuals pin the effect to specific cells: Black/Brown is about +6 standard deviations above expectation, and Blond/Blue (not shown in this slice) is the other driver. You will reproduce that in Exercise 8.
Here are the five fields you will reach for most often:
| Field | What it is | When to use |
|---|---|---|
| $statistic | The chi-square statistic $X^2$ | Report alongside df and p-value |
| $parameter | Degrees of freedom | Needed to compute or verify p manually |
| $p.value | Two-sided tail probability | The decision number |
| $expected | Expected counts under H₀ | Assumption check (all ≥ 5) |
| $stdres | Standardised residuals | Find cells that drive the effect, abs(z) > 2 |
broom::tidy() turns any R test result into a one-row data frame. broom::tidy(he_res) gives a neat statistic / p.value / parameter / method row that slots straight into reports, markdown tables, or a bigger grid of models. Great when you are running the same test across many subsets.

If chisq.test() prints "Chi-squared approximation may be incorrect", stop and check $expected. That warning appears when any expected cell is below 5 and the large-sample approximation is unreliable. Switch to simulate.p.value = TRUE (any size) or fisher.test() (2×2) instead of reporting the printed number.

Try it: Extract just the p-value from a goodness-of-fit test on mtcars$cyl against the uniform probabilities c(1/3, 1/3, 1/3). Save it to ex_p.
Click to reveal solution
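One way to do it, chaining the extraction onto the call:

```r
# table() builds the counts (11, 7, 14); $p.value pulls out just the decision number
ex_p <- chisq.test(table(mtcars$cyl), p = c(1/3, 1/3, 1/3))$p.value
ex_p   # ~0.31
```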
Explanation: table(mtcars$cyl) gives counts 11, 7, 14 for 4/6/8 cylinders across 32 cars. Expected uniform = 10.67 each. The chi-square statistic is about 2.31 on 2 df, and $p = 0.31$, so we fail to reject uniform. The 32-car sample is simply too small to rule out uniformity even though the observed counts look uneven.
Practice Exercises
Ten capstone problems, ordered roughly easier → harder. Every exercise uses a distinct ex<N>_ prefix so solutions do not clobber earlier tutorial state.
Exercise 1: Goodness-of-fit with equal expected counts
Test whether the 4/6/8 cylinder distribution in mtcars is uniform. Save the test object to ex1_res and print it. State the decision against α = 0.05.
Click to reveal solution
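One possible solution. With no p argument, chisq.test() defaults to equal expected probabilities, which is exactly the uniform null here:

```r
ex1_res <- chisq.test(table(mtcars$cyl))   # default: equal expected counts
ex1_res   # X-squared ~ 2.31 on 2 df, p ~ 0.31: fail to reject uniformity
```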
Explanation: Expected counts are 10.67 each. The observed 11/7/14 are within about ±3 of expected, so the statistic is modest (2.31) and p = 0.31. We fail to reject uniformity. The lesson: an uneven-looking sample is not automatically evidence against uniformity, small samples demand more disagreement before rejecting.
Exercise 2: Goodness-of-fit with unequal expected (Mendelian 9:3:3:1)
Classic pea-plant data: in a dihybrid cross, Mendelian theory predicts the four phenotypes in ratio 9:3:3:1. Observed counts are c(315, 108, 101, 32). Test whether the observed counts are consistent with Mendel's ratio.
Click to reveal solution
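A sketch: pass the theoretical ratio as probabilities via p, normalised to sum to 1:

```r
ex2_obs <- c(315, 108, 101, 32)              # the four phenotype counts
ex2_res <- chisq.test(ex2_obs, p = c(9, 3, 3, 1) / 16)
ex2_res   # X-squared ~ 0.47 on 3 df, p ~ 0.93: consistent with Mendel
```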
Explanation: Total n = 556. Expected counts under 9:3:3:1 are 312.75, 104.25, 104.25, 34.75, each within a few counts of the observed values. The p-value of 0.93 says the data are fully consistent with Mendel's ratio. This dataset is famous because Fisher later argued the fit was too good, but for our purposes it's a textbook pass.
Exercise 3: Goodness-of-fit with a Monte Carlo p-value
You have 7 day-of-week preference counts from a small survey: c(3, 4, 2, 5, 3, 2, 1) for Monday through Sunday. With such small expected counts (average ≈ 2.9), the chi-square approximation is unreliable. Use a Monte Carlo p-value via simulate.p.value = TRUE and B = 10000.
Click to reveal solution
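One way to do it; the set.seed() call is an addition so the simulated p-value is reproducible:

```r
ex3_counts <- c(3, 4, 2, 5, 3, 2, 1)   # Mon..Sun
set.seed(1)                            # reproducible Monte Carlo draw
ex3_res <- chisq.test(ex3_counts, simulate.p.value = TRUE, B = 10000)
ex3_res   # simulated p around 0.7; df is reported as NA
```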
Explanation: Under uniform preference each expected count is 2.86, below the usual ≥5 threshold. The simulated p-value (~0.74) says observed counts at least this uneven would happen about 74% of the time under uniform, no evidence for a weekday preference.
The B argument sets the number of Monte Carlo replicates. With B = 10000 the Monte Carlo error is roughly $1/\sqrt{10000} = 0.01$; raise B for tighter estimates. df prints as NA because the simulated test does not use the asymptotic chi-square distribution.

Exercise 4: 2×2 independence with Yates' correction (default)
From HairEyeColor, build a 2×2 table of Hair (Black vs Blond) × Sex (Male vs Female). Run chisq.test() with default settings (Yates' continuity correction is applied automatically for 2×2). Report the statistic, df, and p-value.
Click to reveal solution
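A sketch: collapse over Eye with margin.table(), then subset the two hair colours by name:

```r
hair_sex <- margin.table(HairEyeColor, margin = c(1, 3))  # Hair x Sex
ex4_tab  <- hair_sex[c("Black", "Blond"), ]               # keep the 2x2 slice
ex4_res  <- chisq.test(ex4_tab)   # Yates' correction on by default for 2x2
ex4_res
```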
Explanation: Among Black-haired people the sample is slightly male-leaning (56/108), while Blonds skew female (81/127). The Yates-corrected p-value of about 0.023 rejects independence at α = 0.05: there is a Hair × Sex association in this classic dataset.
Exercise 5: Same 2×2 without Yates' correction
Re-run the test from Exercise 4 with correct = FALSE. Compare the statistic and p-value, which direction does the correction move them, and by how much?
Click to reveal solution
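One way to do it, rebuilding the Exercise 4 table so the block stands on its own:

```r
hair_sex <- margin.table(HairEyeColor, margin = c(1, 3))  # Hair x Sex
ex5_tab  <- hair_sex[c("Black", "Blond"), ]
ex5_res  <- chisq.test(ex5_tab, correct = FALSE)          # no Yates' correction
ex5_res
```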
Explanation: Yates' correction shrinks each observed-minus-expected difference by 0.5 before squaring, which lowers the statistic (from 5.81 to 5.19) and raises the p-value (from 0.016 to 0.023). The correction makes the 2×2 test more conservative: easier to miss real effects, harder to raise false alarms. For small 2×2 tables the correction is standard; for large ones it barely matters.
Exercise 6: R×C independence on mtcars (cyl × gear)
Build the table of cylinders × gears from mtcars using table(), and test independence. Expect a small-count warning, this exercise sets up Exercise 7.
Click to reveal solution
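One possible solution. Naming the arguments to table() labels the dimensions of the result:

```r
ex6_tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
ex6_res <- chisq.test(ex6_tab)   # warns: some expected counts are small
ex6_res   # X-squared ~ 18.0 on 4 df, p ~ 0.0012
```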
Explanation: The 3×3 table has several small cells (the 8 cyl / 4 gear cell is 0), so R warns "Chi-squared approximation may be incorrect". The point estimate (p = 0.0012) still suggests strong dependence, eight-cylinder engines pair with 3-speed gearboxes, four-cylinders with 4-speeds, but treat the exact p-value with caution. A Monte Carlo variant would confirm whether the effect is robust to the small-cell issue.
Exercise 7: Extract observed, expected, residuals into a tidy data frame
Take the result object from Exercise 6 (ex6_res) and pull observed, expected, residuals, and stdres into a single tidy data frame with row and column labels. The goal is a report-ready per-cell breakdown.
Click to reveal solution
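A sketch. as.vector() flattens a matrix column by column, so the row labels cycle fastest and the column labels repeat in blocks (the ex6_res object is rebuilt here so the block runs on its own):

```r
ex6_tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
ex6_res <- suppressWarnings(chisq.test(ex6_tab))  # warning covered in Exercise 6

ex7_df <- data.frame(
  cyl      = rep(rownames(ex6_tab), times = ncol(ex6_tab)),  # rows cycle fastest
  gear     = rep(colnames(ex6_tab), each  = nrow(ex6_tab)),
  observed = as.vector(ex6_res$observed),
  expected = as.vector(ex6_res$expected),
  residual = as.vector(ex6_res$residuals),
  stdres   = as.vector(ex6_res$stdres)
)
ex7_df[order(-abs(ex7_df$stdres)), ]   # biggest drivers first
```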
Explanation: Flattening the four matrices into one data frame gives you every diagnostic per cell at a glance. The largest positive stdres (about +3.9 for 8 cyl / 3 gear) and the largest negative (about −3.9 for the empty 8 cyl / 4 gear cell) are the cells that produce the significant overall p-value. That's the information you cannot recover from print(ex6_res) alone.
Exercise 8: Find the residual hotspots in HairEyeColor
Build the 4×4 Hair × Eye table from HairEyeColor collapsed over Sex, run an independence test, and return the hair-eye combinations where the standardised residual exceeds 2 in absolute value. Save the hotspots to ex8_hot.
Click to reveal solution
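One way to do it: which(..., arr.ind = TRUE) returns the row/column indices of the flagged cells, which then look up the hair and eye labels:

```r
he <- margin.table(HairEyeColor, margin = c(1, 2))   # collapse over Sex
ex8_res <- chisq.test(he)

idx <- which(abs(ex8_res$stdres) > 2, arr.ind = TRUE)
ex8_hot <- data.frame(
  Hair   = rownames(he)[idx[, "row"]],
  Eye    = colnames(he)[idx[, "col"]],
  stdres = round(ex8_res$stdres[idx], 1)
)
ex8_hot[order(-abs(ex8_hot$stdres)), ]   # hotspots, biggest first
```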
Explanation: The biggest hotspot, Blond+Blue with stdres = +11.2, says that combination appears about 11 standard deviations more often than independence would predict. Reporting the three or four biggest residuals alongside the test statistic is good practice, it turns a number into an interpretable finding.
Exercise 9: Effect size, Cramér's V on HairEyeColor
The chi-square statistic grows with sample size, so "significant" ≠ "meaningful". Cramér's V rescales it to the range [0, 1], interpretable the way a correlation is. Compute V for the HairEyeColor Hair × Eye table using the formula below and interpret against Cohen's thresholds.
The formula is:
$$V = \sqrt{\dfrac{\chi^2}{n \cdot (k - 1)}}$$
Where:
- $\chi^2$ is the test statistic from chisq.test()
- $n$ is the total sample size, $\sum n_{ij}$
- $k$ is $\min(\text{rows}, \text{cols})$
Click to reveal solution
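A sketch, translating the formula line by line:

```r
he <- margin.table(HairEyeColor, margin = c(1, 2))
ex9_res <- chisq.test(he)

n <- sum(he)        # total sample size: 592
k <- min(dim(he))   # min(rows, cols): 4
ex9_v <- sqrt(as.numeric(ex9_res$statistic) / (n * (k - 1)))
ex9_v   # ~0.28
```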
Explanation: With $\chi^2 = 138.3$, $n = 592$, and $k = 4$, the formula gives $V = \sqrt{138.3 / (592 \cdot 3)} = 0.279$. Reporting V alongside the p-value prevents the common trap of mistaking tiny, significant effects in huge samples for practically important ones.
Exercise 10: Raw data → xtabs() → chi-square
Until now every exercise started from a pre-built table. Real data usually arrives one row per observation. Build the table with xtabs(), then run the independence test on it.
Click to reveal solution
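A sketch with a hypothetical long-format cohort. The smoker/disease names and the counts (60/200 smokers diseased vs 20/200 non-smokers) are made up, fixed rather than randomly drawn so the result is reproducible:

```r
# Hypothetical cohort, one row per person (names and counts are illustrative)
ex10_raw <- data.frame(
  smoker  = rep(c("yes", "yes", "no", "no"), times = c(60, 140, 20, 180)),
  disease = rep(c("yes", "no", "yes", "no"), times = c(60, 140, 20, 180))
)

ex10_tab <- xtabs(~ smoker + disease, data = ex10_raw)  # long data -> 2x2 table
ex10_res <- chisq.test(ex10_tab)                        # Yates applied: 2x2
ex10_res$p.value   # well below 0.001
```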
Explanation: xtabs() takes a formula and a data frame and returns a contingency table in exactly the shape chisq.test() wants. This is the pipeline for any real-world categorical analysis: tidy long data → xtabs() → chisq.test(). The synthetic cohort here gives a Yates-corrected p-value below 0.001, strong evidence that smoking and disease are associated.
Complete Example: Hair × Eye Colour Survey
Let's stitch everything together: build the table, check assumptions, run the test, quantify the effect size, and localise the drivers.
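As one runnable sketch of that pipeline:

```r
# 1. Build the table: collapse HairEyeColor over Sex
he <- margin.table(HairEyeColor, margin = c(1, 2))

# 2. Check assumptions: every expected count at least 5?
res <- chisq.test(he)
all(res$expected >= 5)       # TRUE for this table

# 3. The test itself
res$statistic; res$parameter; res$p.value   # ~138.3 on 9 df, p ~ 2.3e-25

# 4. Effect size: Cramér's V
v <- sqrt(as.numeric(res$statistic) / (sum(he) * (min(dim(he)) - 1)))
round(v, 2)                  # ~0.28

# 5. Localise the drivers with standardised residuals
which(abs(res$stdres) > 2, arr.ind = TRUE)
```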
A one-paragraph APA-style write-up: A chi-square test of independence examined whether hair colour and eye colour were associated in the 592-person HairEyeColor dataset. The relationship was highly significant, $\chi^2(9, N = 592) = 138.3, p < .001$, with a large effect size, $V = 0.28$. Examination of standardised residuals indicated that the Blond/Blue combination was strongly over-represented (stdres = +11.2) while Blond/Brown was under-represented (stdres = −9.4). That sentence answers three questions at once: is the effect real, is it big, and where does it live.
Summary
| Ask | R call | Key field |
|---|---|---|
| Goodness of fit | chisq.test(x, p = expected) | $p.value |
| Independence / homogeneity | chisq.test(two_way_table) | $p.value |
| 2×2 without Yates | chisq.test(tab, correct = FALSE) | $statistic |
| Small expected counts | chisq.test(..., simulate.p.value = TRUE, B = 10000) | $p.value |
| Effect size | sqrt(chisq / (n * (min(dim) - 1))) | Cramér's V |
| Cell-level pattern | $stdres, flag abs(stdres) > 2 | standardised residuals |
References
- R Core Team. chisq.test, R Stats Reference.
- Agresti, A. Categorical Data Analysis, 3rd edition. Wiley (2013).
- Sheskin, D. J. Handbook of Parametric and Nonparametric Statistical Procedures, 5th edition. Chapman & Hall/CRC (2011).
- Mendel, G. Experiments in Plant Hybridisation (1866), source of the classic 9:3:3:1 phenotype data.
- Yates, F. "Contingency Tables Involving Small Numbers and the χ² Test." Journal of the Royal Statistical Society Supplement 1(2), 1934.
- Cramér, H. Mathematical Methods of Statistics. Princeton University Press (1946).
- Fisher, R. A. Statistical Methods for Research Workers, 14th edition. Oliver & Boyd (1970).
Continue Learning
- Chi-Square Tests in R: Independence, Goodness-of-Fit & Effect Size covers the parent theory, assumption derivations, residual analysis, and Cramér's V in detail.
- Hypothesis Testing Exercises in R gives broader inference practice across t-tests, proportions, non-parametrics, and power analysis.
- t-Test Exercises in R uses the same drill-sheet format for comparing means instead of proportions.