ANOVA Exercises in R: 15 One-Way & Two-Way Practice Problems, Solved Step-by-Step
These 15 ANOVA exercises in R cover one-way and two-way designs with runnable solutions, so you practise picking the right model, checking assumptions, interpreting F and p, running Tukey post-hoc, reading interaction effects, and reporting effect sizes end-to-end.
Which ANOVA design matches your data?
One R function, three different ANOVA designs. aov() fits a one-way ANOVA when the right-hand side has a single factor, a two-way additive model when you use +, and a full factorial model with interaction when you use *. The decision hinges on how many grouping variables you have and whether they might interact. Here is the same function used all three ways on the ToothGrowth dataset so you can see the shape of each output before drilling into the 15 exercises.
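A minimal sketch of those three fits (the object names `one_way`, `additive`, and `full` are mine, not fixed by the text):

```r
# Same data, three designs: only the formula changes
one_way  <- aov(len ~ supp, data = ToothGrowth)                 # one factor
additive <- aov(len ~ supp + factor(dose), data = ToothGrowth)  # two factors, no interaction
full     <- aov(len ~ supp * factor(dose), data = ToothGrowth)  # main effects + interaction

summary(one_way)   # supp row only
summary(additive)  # supp and factor(dose) rows
summary(full)      # adds the supp:factor(dose) interaction row
```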
All three fits use the same function on the same data; only the formula changes. The one-way model looks at supplement alone and narrowly misses significance (p = 0.060). Adding dose as a second factor reveals what was hidden: supplement does matter once you account for dose (p = 0.0004), and dose itself is massively significant. The interaction model adds one more row, supp:factor(dose) with p = 0.022, which says the effect of supplement is not the same at every dose level. The formula operator is all that changed, but the scientific conclusion changed with it.
Here is the one-line decision rule:
| Your design... | Formula | Fits |
|---|---|---|
| One grouping factor | y ~ g | One-way ANOVA |
| Two factors, no interaction assumed | y ~ g1 + g2 | Two-way additive |
| Two factors, interaction allowed | y ~ g1 * g2 | Two-way with interaction |
| Three factors with full interactions | y ~ g1 * g2 * g3 | Three-way factorial |
If supp:dose is significant, the OJ-vs-VC difference is not the same at every dose. If it's non-significant, the additive model is enough and you can report main effects cleanly. Always fit the interaction first, then drop it only if it is clearly not needed.

Try it: You have three grouping variables a, b, c and you want every two-way interaction but not the three-way. Which formula do you use? Set ex_formula to the correct string.
Solution
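One possible solution, using the ^2 operator the explanation below describes:

```r
# ^2 caps interaction order at two: expands to a + b + c + a:b + a:c + b:c
ex_formula <- "y ~ (a + b + c)^2"

# Verify the expansion: two-way terms present, three-way absent
attr(terms(as.formula(ex_formula)), "term.labels")
```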
Explanation: The ^2 operator expands to all main effects plus every two-way interaction but stops short of higher-order terms. y ~ a * b * c would give you all three main effects, all three two-way interactions, and the three-way interaction. Using ^2 is the clean way to cap interaction order.
How do you read aov() output and follow up with post-hoc?
summary(aov_fit) prints one row per effect and one residual row. The columns are always Df (degrees of freedom), Sum Sq (sum of squares, how much variation the effect explains), Mean Sq (sum of squares divided by df), F value (effect mean square over residual mean square), and Pr(>F) (the p-value). ANOVA only tells you whether the group means differ, not which pairs differ. For that you need a post-hoc test like Tukey's HSD.
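A runnable sketch of that two-step workflow on PlantGrowth (pg_fit and pg_tukey are the object names the surrounding text refers to):

```r
# Step 1: omnibus ANOVA -- do the group means differ at all?
pg_fit <- aov(weight ~ group, data = PlantGrowth)
summary(pg_fit)

# Step 2: Tukey's HSD -- which pairs differ, with family-wise adjustment?
pg_tukey <- TukeyHSD(pg_fit)
pg_tukey
```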
The ANOVA row says some group differs (F(2, 27) = 4.85, p = 0.016), but it does not say which. Tukey's HSD fills that gap with all three pairwise comparisons: a diff (mean difference), a 95% confidence interval (lwr, upr), and a family-wise-adjusted p-value (p adj). Only trt2-trt1 is significant (p = 0.012): the 0.87-gram difference in plant weight between the two treatments is real. trt2-ctrl and trt1-ctrl both fail to reach significance on their own. The family-wise adjustment is what keeps the overall false-positive rate at 5% despite running three tests.
Here are the five columns of the summary table you need to know:
| Column | What it is | How to read it |
|---|---|---|
| Df | Degrees of freedom for the effect | Groups − 1 for one-way |
| Sum Sq | Variation explained by the effect | Bigger is more explanatory |
| Mean Sq | Sum Sq divided by Df | Variance-like quantity |
| F value | Effect MS over residual MS | Bigger → stronger signal |
| Pr(>F) | Tail probability under H₀ | The decision number |
summary(aov_fit) and anova(lm_fit) can give different answers for unbalanced two-way designs. aov() uses Type I (sequential) sums of squares, so the order of factors in the formula changes the p-values. When your design is unbalanced, switch to car::Anova(fit, type = 2) or type = 3 to get order-independent tests. For balanced designs (equal cells), all three types agree.

broom::tidy() converts any aov or TukeyHSD object into a tidy data frame. broom::tidy(pg_fit) gives one row per term with statistic and p.value columns; broom::tidy(pg_tukey) gives one row per pair. Much easier to slot into reports or combine across models than parsing summary() text.

Try it: Extract just the adjusted p-value for the trt2-trt1 comparison from pg_tukey. Save it to ex_p.
Solution
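A solution sketch, indexing the comparison matrix by name as the explanation below recommends (the fit is reproduced so the block runs on its own):

```r
pg_fit   <- aov(weight ~ group, data = PlantGrowth)
pg_tukey <- TukeyHSD(pg_fit)

# $group is a matrix with named rows and columns: index both by name
ex_p <- pg_tukey$group["trt2-trt1", "p adj"]
ex_p  # ~0.012
```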
Explanation: pg_tukey$group is a plain matrix with named rows. You can index it either by name ("trt2-trt1") or by position (3). The name-based index is self-documenting and survives re-ordering, so prefer it for reporting code.
Practice Exercises
Fifteen capstone problems, ordered roughly easier to harder. Every exercise uses a distinct ex<N>_ prefix so the solutions do not clobber tutorial variables.
Exercise 1: One-way ANOVA on PlantGrowth
Fit a one-way ANOVA predicting weight from group on the built-in PlantGrowth dataset. Save the fit to ex1_fit and print the summary. State whether you reject H₀ at α = 0.05.
Solution
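A solution along these lines reproduces the numbers the explanation cites:

```r
# One-way ANOVA: weight predicted by the three-level group factor
ex1_fit <- aov(weight ~ group, data = PlantGrowth)
summary(ex1_fit)  # F(2, 27) = 4.85, p = 0.016 -> reject H0 at alpha = 0.05
```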
Explanation: F(2, 27) = 4.85 with p = 0.016 rejects the null that all three group means are equal. The three groups (control, treatment 1, treatment 2) do not all produce the same mean weight. Exercises 5 and 6 will locate which pair is responsible.
Exercise 2: Fit the same one-way ANOVA as a linear model
ANOVA is a linear model with a categorical predictor. Refit Exercise 1 using lm(), run anova() on it, and verify the F statistic and p-value match.
Solution
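A sketch of the lm() route (the name ex2_lm is mine, following the exercise prefix convention):

```r
# Same model, regression framing: categorical predictor in lm()
ex2_lm <- lm(weight ~ group, data = PlantGrowth)
anova(ex2_lm)  # F and p match summary(ex1_fit) exactly
```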
Explanation: F = 4.85, p = 0.016, identical to summary(ex1_fit). ANOVA is a reparametrisation of a linear regression where the predictor is categorical. Once you internalise this, techniques like emmeans, contrasts(), and predict() become available for ANOVA-style analyses too.
aov() is a thin wrapper around lm() that reformats the output as an ANOVA table. This means the full R modelling ecosystem (residual plots, predict(), emmeans, robust standard errors) applies to ANOVA too. Memorising one framework unlocks both.

Exercise 3: Check equal-variance assumption with Levene's test
One-way ANOVA assumes homogeneous variances across groups. Use car::leveneTest to test that assumption on PlantGrowth. Save the result to ex3_lev and state whether the assumption is met at α = 0.05.
Solution
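A sketch assuming the car package is installed:

```r
# Levene's test: H0 is equal variances across the three groups
ex3_lev <- car::leveneTest(weight ~ group, data = PlantGrowth)
ex3_lev  # Pr(>F) = 0.34 -> fail to reject, assumption met
```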
Explanation: Levene's p = 0.34 fails to reject equal variances, so the standard ANOVA's homoscedasticity assumption is safe for PlantGrowth. If Levene's had been significant, you would switch to Welch's one-way ANOVA (next exercise) instead of aov().
Exercise 4: Welch's one-way ANOVA (unequal variances)
When Levene's test flags unequal variances, oneway.test() with var.equal = FALSE (the default) runs Welch's approximation, which does not assume homoscedasticity. Run it on PlantGrowth and compare the F statistic to the standard ANOVA from Exercise 1.
Solution
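A sketch of the Welch fit (ex4_welch is my name for the result):

```r
# Welch's one-way ANOVA: var.equal = FALSE is already the default
ex4_welch <- oneway.test(weight ~ group, data = PlantGrowth)
ex4_welch  # F = 5.18, p = 0.017 -- close to the classical F = 4.85
```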
Explanation: Welch's F = 5.18 with p = 0.017, very close to the standard ANOVA's F = 4.85 and p = 0.016. When variances are roughly equal (as Levene's confirmed), Welch's and classical ANOVA agree. When variances differ sharply, Welch's is the safer default, it adjusts the denominator df downward to protect the error rate.
Exercise 5: TukeyHSD post-hoc on PlantGrowth
Rejecting ANOVA's null tells you some groups differ but not which. Run TukeyHSD() on ex1_fit and identify the pair with the smallest adjusted p-value.
Solution
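A self-contained sketch (the Exercise 1 fit is reproduced so the block runs alone; ex5_tukey is my name):

```r
ex1_fit  <- aov(weight ~ group, data = PlantGrowth)  # from Exercise 1
ex5_tukey <- TukeyHSD(ex1_fit)
ex5_tukey  # trt2-trt1 has the smallest adjusted p-value (0.012)
```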
Explanation: Only trt2 − trt1 = 0.865 grams reaches significance (p adj = 0.012). Neither treatment differs from the control on its own. Tukey's HSD adjusts all three p-values so the family-wise Type I rate stays at 5%. Without that correction, running three naive t-tests inflates the false-positive rate to about 14%.
Exercise 6: Pairwise t-tests with Bonferroni adjustment
Run pairwise.t.test() with p.adjust.method = "bonferroni" on PlantGrowth and compare the adjusted p-values to Tukey's.
Solution
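A sketch (ex6_pw is my name for the result):

```r
# Bonferroni-adjusted pairwise t-tests: each raw p multiplied by 3
ex6_pw <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                          p.adjust.method = "bonferroni")
ex6_pw  # only trt2 vs trt1 falls below 0.05
```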
Explanation: Bonferroni and Tukey both flag trt2 vs trt1 (p = 0.013 Bonferroni, 0.012 Tukey) and leave the other two comparisons non-significant. Bonferroni is more conservative in general (multiplies each p-value by the number of comparisons), but with only three pairs the two adjustments converge. For many comparisons, Tukey's HSD usually beats Bonferroni on power.
Exercise 7: Effect size, eta-squared by hand and via a package
The F statistic and p-value tell you whether the effect is real, not how big it is. Eta-squared (η²) is the proportion of total variance explained by the group factor. Compute it two ways: from the ANOVA sums of squares by hand, and with effectsize::eta_squared(). Compare the two.
The formula:
$$\eta^2 = \dfrac{\text{SS}_{\text{effect}}}{\text{SS}_{\text{total}}}$$
Where:
- $\text{SS}_{\text{effect}}$ is the effect's sum of squares from the ANOVA table
- $\text{SS}_{\text{total}}$ is the total sum of squares, effect plus residual
Solution
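A sketch of both routes; the package call is guarded because it assumes effectsize is installed:

```r
ex7_fit <- aov(weight ~ group, data = PlantGrowth)
ex7_tab <- summary(ex7_fit)[[1]]

# By hand: effect sum of squares over total (effect + residual)
ss <- ex7_tab[["Sum Sq"]]      # c(SS_group, SS_residual)
ex7_eta <- ss[1] / sum(ss)
ex7_eta                        # ~0.264

# Cross-check with the effectsize package, if available
if (requireNamespace("effectsize", quietly = TRUE)) {
  print(effectsize::eta_squared(ex7_fit))
}
```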
Explanation: η² = 3.77 / (3.77 + 10.49) = 0.264, matching the package. About 26% of the variance in plant weight is explained by the group factor. Cohen's rough thresholds are 0.01 (small), 0.06 (medium), 0.14 (large), so 0.26 is a large effect, the significant p-value is backed by a meaningful magnitude.
Exercise 8: Two-way additive ANOVA on ToothGrowth
Fit aov(len ~ supp + factor(dose)) on ToothGrowth. Save to ex8_fit, print the summary, and report which main effects are significant.
Solution
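A sketch matching the exercise statement:

```r
# Two-way additive model: supp and dose main effects, no interaction term
ex8_fit <- aov(len ~ supp + factor(dose), data = ToothGrowth)
summary(ex8_fit)  # supp p < 0.001, dose p < 2e-16: both significant
```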
Explanation: Both main effects are highly significant. Supplement type contributes F = 14.0 (p < 0.001), and dose is enormous at F = 82.8 (p < 2e-16). This model assumes the supplement effect is the same at every dose (no interaction). Exercise 9 relaxes that assumption.
Exercise 9: Two-way ANOVA with interaction
Re-fit the Exercise 8 model allowing interaction: len ~ supp * factor(dose). Save as ex9_fit. Is the interaction significant? What does that mean scientifically?
Solution
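A sketch matching the exercise statement:

```r
# Full factorial: * expands to supp + factor(dose) + supp:factor(dose)
ex9_fit <- aov(len ~ supp * factor(dose), data = ToothGrowth)
summary(ex9_fit)  # interaction row: F = 4.11, p = 0.022
```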
Explanation: The supp:factor(dose) interaction is significant (F = 4.11, p = 0.022). Scientifically, the OJ-vs-VC gap is not the same at every dose level. At low doses OJ substantially beats VC; at 2 mg, the two supplements produce similar growth. That nuance disappears in the additive model of Exercise 8, which is why you fit interactions first and drop them only when clearly unneeded.
Exercise 10: Interaction plot
Visualise the interaction from Exercise 9 using interaction.plot(). Plot factor(dose) on the x-axis, supp as separate lines, and mean len on the y-axis.
Click to reveal solution
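A sketch using base R's interaction.plot():

```r
# x-axis: dose; one line per supplement; y-axis: mean tooth length
with(ToothGrowth,
     interaction.plot(factor(dose), supp, len,
                      xlab = "dose (mg)", ylab = "mean len",
                      trace.label = "supp"))
```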
Explanation: Non-parallel lines indicate an interaction. The OJ line sits well above the VC line at 0.5 and 1.0 mg, then the two converge at 2.0 mg where both supplements hit the biological ceiling. Parallel lines would say the two supplements maintain the same gap across all doses, which is the additive model of Exercise 8. The visible convergence at 2.0 mg is exactly what the significant interaction term is detecting.
Exercise 11: Simple-effects follow-up
When the interaction is significant, interpret each factor within levels of the other. Split ToothGrowth by supp and run a one-way ANOVA of len ~ factor(dose) within each subset. Save the fits to ex11_oj and ex11_vc.
Solution
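A sketch using the names the exercise specifies:

```r
# One-way ANOVA of dose within each supplement separately
ex11_oj <- aov(len ~ factor(dose), data = subset(ToothGrowth, supp == "OJ"))
ex11_vc <- aov(len ~ factor(dose), data = subset(ToothGrowth, supp == "VC"))
summary(ex11_oj)  # F = 32.0
summary(ex11_vc)  # F = 67.1, steeper dose response
```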
Explanation: Dose is strongly significant within each supplement, F = 32.0 for OJ and F = 67.1 for VC. Both supplements respond to dose, but the magnitude is larger for VC (higher F, bigger sum of squares). This is the numerical backing for the simple-effects interpretation: dose matters for both, but more steeply for VC, which is why the lines converge at the top.
Exercise 12: Kruskal-Wallis, the non-parametric alternative
If residuals are badly non-normal and transformations do not help, Kruskal-Wallis replaces ANOVA without the normality assumption. Run kruskal.test(weight ~ group, data = PlantGrowth) and compare the decision to the standard ANOVA in Exercise 1.
Solution
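A sketch (ex12_kw is my name for the result):

```r
# Rank-based alternative: no normality assumption on the residuals
ex12_kw <- kruskal.test(weight ~ group, data = PlantGrowth)
ex12_kw  # chi-squared = 7.99, p = 0.018
```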
Explanation: Kruskal-Wallis χ² = 7.99, p = 0.018, the same qualitative decision as ANOVA (p = 0.016). The rank-based test ignores the actual weight values and uses only their ranks, which costs some power when normality holds but pays off when distributions are skewed or have outliers. Use it as a robustness check and report both when they disagree.
Exercise 13: Planned contrast on PlantGrowth
Run a planned contrast comparing group trt2 to the average of ctrl and trt1 using the contrasts() function. The contrast vector is c(-1, -1, 2) (scaled so it sums to zero).
Solution
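A sketch along these lines; it works on a copy of the data, and R fills in the remaining contrast column automatically when only one is supplied (ex13_fit and the copy name pg are mine):

```r
pg <- PlantGrowth  # copy, so the built-in dataset is untouched

# Planned contrast: trt2 against the average of ctrl and trt1
contrasts(pg$group) <- cbind(trt2_vs_rest = c(-1, -1, 2))

ex13_fit <- lm(weight ~ group, data = pg)
summary(ex13_fit)  # the grouptrt2_vs_rest row tests the planned contrast
```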
Explanation: The trt2_vs_rest row tests whether trt2's mean is different from the average of ctrl and trt1. With p = 0.015, it is. This is more powerful than the corresponding Tukey pairs because it asks one focused question per contrast instead of all pairwise differences. Planned contrasts must be specified before seeing the data to be valid, otherwise you are back in post-hoc territory.
Exercise 14: Diagnostic plots for an ANOVA fit
Fit aov(len ~ supp * factor(dose), data = ToothGrowth) as ex14_fit, then produce the four standard diagnostic plots with plot(ex14_fit). Comment briefly on the residuals-vs-fitted and the normal QQ plots.
Solution
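A sketch that arranges the four panels in one 2 × 2 grid:

```r
ex14_fit <- aov(len ~ supp * factor(dose), data = ToothGrowth)

op <- par(mfrow = c(2, 2))  # four diagnostic panels on one device
plot(ex14_fit)              # residuals vs fitted, QQ, scale-location, leverage
par(op)
```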
Explanation: The residuals-vs-fitted panel checks constant variance and should show a roughly horizontal cloud, no funnel or bow shape. The normal QQ plot checks residual normality, points should lie close to the diagonal line. On ToothGrowth the QQ plot is nearly straight with slight deviation in the tails, and the residuals cloud has no systematic pattern. ANOVA is fairly robust to mild deviations, so modest curvature is usually acceptable; major deviations point to a transformation or a non-parametric alternative.
Exercise 15: Full report-ready pipeline
Combine everything into one pipeline on the ToothGrowth factorial design: fit the full interaction model, check Levene's, compute partial η², run TukeyHSD on the main effects, and tidy the ANOVA table with broom::tidy().
Solution
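A pipeline sketch, assuming car, effectsize, and broom are installed (ex15_fit is my name for the fit):

```r
ex15_fit <- aov(len ~ supp * factor(dose), data = ToothGrowth)

car::leveneTest(len ~ supp * factor(dose), data = ToothGrowth)  # assumption check
summary(ex15_fit)                                               # omnibus tests
effectsize::eta_squared(ex15_fit, partial = TRUE)               # partial eta^2
TukeyHSD(ex15_fit, which = "factor(dose)")                      # dose pairwise
broom::tidy(ex15_fit)                                           # report-ready frame
```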
Explanation: This is the full inferential workflow for a factorial design: check the assumption (Levene's p = 0.15, safe), fit the model, quantify each effect (dose is a huge partial η² of 0.77), pin down which dose levels differ (all three Tukey pairs are highly significant), and dump the whole ANOVA table into a tidy frame for downstream reporting. This one block is the template you can adapt to almost any two-factor experiment.
Complete Example: warpbreaks Factorial Design
The warpbreaks dataset records the number of warp breaks per loom for two wool types (A, B) at three tension levels (L, M, H). This is a 2 × 3 factorial design, perfect for an end-to-end walkthrough.
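A sketch of the full walkthrough (wb_fit is my name for the fit):

```r
# 2 x 3 factorial: wool type crossed with tension level
wb_fit <- aov(breaks ~ wool * tension, data = warpbreaks)
summary(wb_fit)                                   # interaction p = 0.021

with(warpbreaks, interaction.plot(tension, wool, breaks))  # visualise the reversal
TukeyHSD(wb_fit, which = "tension")               # pairwise tension comparisons
```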
The wool:tension interaction is significant (p = 0.021), which means the wool effect flips direction across tension levels (wool A is worse at L, similar at M, slightly better at H). Reporting only the main effect of wool would hide that reversal. Run simple effects or plot interaction means before drawing conclusions.

An APA-style summary for this analysis: A two-way ANOVA examined the effects of wool type (A, B) and tension (L, M, H) on number of warp breaks per loom. Tension had a significant main effect, $F(2, 48) = 8.50, p < .001, \eta^2_p = .26$, and interacted significantly with wool, $F(2, 48) = 4.19, p = .021, \eta^2_p = .15$. Tukey HSD on the tension main effect showed medium and high tension produced significantly fewer breaks than low tension (both $p \leq .02$), while medium and high did not differ ($p = .39$). That one paragraph communicates the design, the effects, their sizes, and the post-hoc landing: the ideal write-up for a factorial ANOVA.
Summary
| Goal | R call | Key field |
|---|---|---|
| One-way ANOVA | aov(y ~ g) then summary() | F value, Pr(>F) |
| Two-way additive | aov(y ~ g1 + g2) | each main-effect row |
| Two-way with interaction | aov(y ~ g1 * g2) | g1:g2 row |
| Variance check | car::leveneTest(y ~ g, data) | p > 0.05 is safe |
| Welch's (unequal variances) | oneway.test(y ~ g) | F, p |
| Post-hoc pairs | TukeyHSD(fit) | p adj column |
| Bonferroni pairs | pairwise.t.test(y, g, p.adjust.method = "bonferroni") | adjusted p-matrix |
| Non-parametric | kruskal.test(y ~ g) | chi-squared, p |
| Effect size | effectsize::eta_squared(fit) | Eta2 |
| Planned contrasts | contrasts(g) <- cbind(...) then lm() | per-contrast t, p |
| Tidy output | broom::tidy(fit) | data frame with one row per term |
References
- R Core Team. aov: Fit an Analysis of Variance Model, R Stats Reference. Link
- Fox, J. & Weisberg, S. An R Companion to Applied Regression, 3rd edition. SAGE (2019). Source of car::leveneTest and car::Anova.
- Tukey, J. W. "The Problem of Multiple Comparisons." Unpublished manuscript, Princeton University (1953).
- Welch, B. L. "On the Comparison of Several Mean Values: An Alternative Approach." Biometrika, 38(3), 330-336 (1951).
- Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd edition. Lawrence Erlbaum (1988). Source of eta-squared thresholds.
- Kruskal, W. H. & Wallis, W. A. "Use of Ranks in One-Criterion Variance Analysis." Journal of the American Statistical Association, 47(260), 583-621 (1952).
- Robinson, D., Hayes, A., & Couch, S. broom: Convert Statistical Objects into Tidy Tibbles. Link
- Ben-Shachar, M. S., Lüdecke, D. & Makowski, D. effectsize: Estimation of Effect Size Indices and Standardized Parameters. Link
Continue Learning
- One-Way ANOVA in R: F-Test, Levene's Test, and Post-Hoc Complete Walkthrough covers the parent theory, the F-distribution derivation, assumption diagnostics, and post-hoc procedures in full.
- Hypothesis Testing Exercises in R widens the practice to t-tests, proportion tests, non-parametric alternatives, and power analysis.
- Multiple Regression Exercises in R uses the same drill-sheet format for continuous predictors and interaction terms in a regression framing.