Nonparametric Tests Exercises in R: 10 Practice Problems, Solved Step-by-Step
These 10 nonparametric tests exercises in R cover one-sample Wilcoxon signed-rank, two-sample Mann-Whitney U, paired signed-rank, Kruskal-Wallis for three or more groups, post-hoc pairwise comparisons, Hodges-Lehmann confidence intervals, tie handling, and rank-based effect sizes, each one runnable in your browser.
Which non-parametric test matches your design?
One of three R calls covers 90% of rank-based problems. The choice depends only on how your data is laid out: one group, two independent groups, two paired groups, or three-plus groups. Here are all three tests run against built-in datasets so you can see the output shape before starting the exercises.
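A sketch of those three calls against mtcars and iris (the object names one_sample, mw, and kw are mine; the ties in mpg trigger an expected "cannot compute exact p-value" warning):

```r
# One-sample Wilcoxon signed-rank: is the median mpg equal to 20?
one_sample <- wilcox.test(mtcars$mpg, mu = 20)

# Mann-Whitney U: mpg for 4-cylinder vs 8-cylinder cars
mw <- wilcox.test(mpg ~ cyl, data = subset(mtcars, cyl != 6))

# Kruskal-Wallis: Sepal.Width across the three iris species
kw <- kruskal.test(Sepal.Width ~ Species, data = iris)

# Same $p.value field on every result
c(one_sample = one_sample$p.value, mw = mw$p.value, kw = kw$p.value)
```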
Three calls, three p-values, same $p.value field on every result. The one-sample test fails to reject that median mpg equals 20 (p = 0.40). The Mann-Whitney test strongly rejects equal mpg across 4-cyl and 8-cyl cars (p ≈ 3e-05). The Kruskal-Wallis test crushes the null that Sepal.Width is interchangeable across iris species (p ≈ 9e-14). Same function family, same result shape, just different input layouts.
Here is the one-line decision rule:
| You have... | Are measurements paired? | R call |
|---|---|---|
| One sample + a claimed centre | N/A | wilcox.test(x, mu = m) |
| Two independent groups | No | wilcox.test(y ~ g) or wilcox.test(x, y) |
| Two groups, same subjects | Yes | wilcox.test(x, y, paired = TRUE) |
| Three or more groups | No | kruskal.test(y ~ g) |
R has no separate Mann-Whitney function; the two-sample U test runs through wilcox.test(), which returns a W statistic equal to the Wilcoxon rank sum shifted by a constant. Whether a paper calls the result "U" or "W", the p-value is identical. "Non-parametric" means the test does not assume a specific distribution shape (like normal), not that it is assumption-free. It still assumes independence and, for the paired test, a symmetric distribution of differences.

Try it: A trainer measures blood pressure for 15 patients before and after an intervention. Same patients, two measurements each. Which flag in wilcox.test() must you set? Save "paired" or "mu" to ex_flag.
Click to reveal solution
Explanation: Same subjects measured twice is a paired design. Set paired = TRUE and pass the two vectors, wilcox.test(before, after, paired = TRUE). mu is the one-sample null centre, irrelevant when you have two measurements per subject.
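In code (the before/after vectors here are simulated stand-ins, not real patient data):

```r
set.seed(42)
before <- rnorm(15, mean = 140, sd = 10)        # simulated baseline BP
after  <- before - rnorm(15, mean = 5, sd = 4)  # simulated post-intervention BP

# Same subjects, two measurements each -> paired design
ex_flag <- "paired"
ex_res  <- wilcox.test(before, after, paired = TRUE)
ex_res$p.value
```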
How do you read wilcox.test() and kruskal.test() output?
Both functions return an htest list with everything you need for a write-up: the test statistic, p-value, alternative hypothesis, and (for Wilcoxon with conf.int = TRUE) a confidence interval plus Hodges-Lehmann location estimate. Pulling fields by name gives one-line access for report tables, assumption checks, and downstream plots.
The Wilcoxon $estimate is the Hodges-Lehmann estimator, the median of all pairwise differences between the two samples. It is a rank-based analogue of a mean difference, and its 95% CI of 6.3 to 14.2 mpg is the interval you report instead of mean(set4) - mean(set8). For Kruskal-Wallis, $statistic is H and $parameter is df = groups minus 1. Those fields plus the p-value are everything a reviewer asks for.
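A sketch of how those fields come out, using the 4- vs 8-cylinder mpg comparison (set4, set8, and mw_full named as in the text):

```r
set4 <- mtcars$mpg[mtcars$cyl == 4]
set8 <- mtcars$mpg[mtcars$cyl == 8]

# conf.int = TRUE adds the Hodges-Lehmann estimate and its 95% CI
mw_full <- wilcox.test(set4, set8, conf.int = TRUE)

mw_full$statistic   # W
mw_full$p.value     # the decision number
mw_full$estimate    # Hodges-Lehmann difference in location
mw_full$conf.int    # 95% CI on that difference
```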
Here are the five fields you will reach for most often:
| Field | wilcox.test | kruskal.test | When to use |
|---|---|---|---|
| $statistic | W (rank-sum / signed-rank) | H (chi-sq approximation) | Always report |
| $parameter | not present | df = k − 1 | KW write-ups |
| $p.value | Tail probability | Tail probability | The decision number |
| $estimate | Hodges-Lehmann (with conf.int) | not present | Effect magnitude |
| $conf.int | 95% CI on HL (with conf.int) | not present | Uncertainty range |
broom::tidy(mw_full) returns a clean estimate / statistic / p.value / method / alternative row that drops into markdown tables or ggplot panels. Useful when you run the same test across many subsets with purrr::map().

With ties present, wilcox.test() falls back to a normal approximation with a continuity correction (disable the correction with correct = FALSE). For small samples with many ties the fallback is inaccurate; consider coin::wilcox_test() for an exact permutation-based alternative.

Try it: Extract just the p-value from a one-sample Wilcoxon test on mtcars$mpg with null median mu = 20. Save it to ex_p, rounded to 4 decimals.
Click to reveal solution
Explanation: wilcox.test(x, mu = 20) runs the one-sample signed-rank test of whether the pseudo-median of x equals 20. The observed median of mtcars$mpg is 19.2, close enough to 20 that with n = 32 the test cannot reject, p = 0.40.
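One way to write it, chaining the $p.value extraction straight off the htest result:

```r
# One-sample signed-rank test of median mpg = 20, p-value only
ex_p <- round(wilcox.test(mtcars$mpg, mu = 20)$p.value, 4)
ex_p
```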
Practice Exercises
Ten capstone problems, ordered roughly easier to harder. Every exercise uses a distinct ex<N>_ prefix so solutions do not clobber earlier tutorial state.
Exercise 1: One-sample Wilcoxon signed-rank against a claimed median
Test whether the median weight in mtcars$wt differs from 3.2 (thousand pounds). Save the full result to ex1_res. State the decision at α = 0.05 and explain why the pseudo-median is a better target than the simple median when the distribution is skewed.
Click to reveal solution
Explanation: The signed-rank test converts each x - 3.2 difference to a signed rank and asks whether the positive and negative ranks balance. V = 269, p = 0.31, so we fail to reject. The 95% CI on the pseudo-median (2.998 to 3.610) contains 3.2, telling the same story from a location-estimate angle. Use the pseudo-median instead of median() when you want a signed-rank-compatible centre for a potentially asymmetric distribution.
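A possible solution, with conf.int = TRUE so the CI on the pseudo-median is available:

```r
ex1_res <- wilcox.test(mtcars$wt, mu = 3.2, conf.int = TRUE)

ex1_res$statistic   # V, the sum of positive signed ranks
ex1_res$p.value
ex1_res$conf.int    # 95% CI on the pseudo-median
```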
Exercise 2: Mann-Whitney U on two independent groups
Using the iris dataset, compare Petal.Length between setosa and versicolor. Save the result to ex2_res. Report W and the p-value.
Click to reveal solution
Explanation: Setosa petals range 1.0 to 1.9 cm and versicolor petals range 3.0 to 5.1 cm. Zero overlap means every setosa value ranks below every versicolor value, so W = 0 (the minimum possible). The p-value is below .Machine$double.eps, R prints < 2.2e-16. When two samples are perfectly separable, the rank-sum test returns the most extreme p-value it can produce.
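A possible solution (the vector names are mine):

```r
ex2_set <- iris$Petal.Length[iris$Species == "setosa"]
ex2_ver <- iris$Petal.Length[iris$Species == "versicolor"]

ex2_res <- wilcox.test(ex2_set, ex2_ver)
ex2_res$statistic   # W = 0: every setosa value ranks below every versicolor value
ex2_res$p.value
```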
Exercise 3: One-tailed Mann-Whitney (direction matters)
Re-run Exercise 2 with the alternative hypothesis "versicolor petals are longer than setosa petals". Save the result to ex3_res. Explain why a one-tailed test halves the p-value when the effect is in the hypothesised direction.
Click to reveal solution
Explanation: Putting versicolor first in wilcox.test(ver, set, alternative = "greater") tests whether the first sample is stochastically larger than the second. W = 2500 (the maximum possible for 50 × 50 samples), and the one-tailed p-value equals the right-tail area only. With perfect separation it saturates at the floating-point limit. For a less extreme effect, the one-tailed p would be exactly half the two-tailed p when the observed direction matches the alternative.
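One way to write it, with versicolor as the first sample so "greater" points the right way:

```r
ex3_set <- iris$Petal.Length[iris$Species == "setosa"]
ex3_ver <- iris$Petal.Length[iris$Species == "versicolor"]

# "greater" tests whether the FIRST sample is stochastically larger
ex3_res <- wilcox.test(ex3_ver, ex3_set, alternative = "greater")
ex3_res$statistic   # W = 2500, the maximum for 50 x 50 samples
ex3_res$p.value
```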
alternative = "two.sided" is the safe default; change it only when the direction is pre-specified. Picking one-tailed *after* peeking at group means inflates false positives and is a common reviewer red flag. Lock in one-tailed from your pre-registration or hypothesis, not from the data.

Exercise 4: Paired Wilcoxon signed-rank on the sleep dataset
R's built-in sleep dataset records extra hours of sleep from the same 10 subjects taking two soporific drugs (groups 1 and 2). Run a paired Wilcoxon signed-rank test and save the result to ex4_res.
Click to reveal solution
Explanation: With paired = TRUE, R computes d <- ex4_d1 - ex4_d2 per subject, ranks the |d| values, signs them, and sums the positive ranks into V. Here every subject slept longer on drug 2 than drug 1, so every signed difference is negative, positive ranks sum to V = 0 (the minimum). The asymptotic p-value of 0.0091 rejects "no difference" at α = 0.05, drug 2 outperforms drug 1 in this sample.
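A possible solution (one subject has a zero difference, which R drops with a warning):

```r
ex4_d1 <- sleep$extra[sleep$group == 1]
ex4_d2 <- sleep$extra[sleep$group == 2]

ex4_res <- wilcox.test(ex4_d1, ex4_d2, paired = TRUE)
ex4_res$statistic   # V = 0: every non-zero signed difference is negative
ex4_res$p.value
```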
Exercise 5: Tie handling, exact vs approximate p-values
Here is a small dataset with deliberate ties. Run wilcox.test() on it first with defaults (watch for the ties warning), then suppress the exact calculation with exact = FALSE. Save the final result to ex5_res.
Click to reveal solution
Explanation: The repeated 7 in ex5_tied blocks the exact computation. R falls back to a normal approximation with continuity correction and tie-corrected variance, W = 22.5, p = 0.94. Setting exact = FALSE explicitly makes the fallback deterministic and silences the warning. For small n with heavy ties consider coin::wilcox_test() which runs an exact permutation test.
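The original inline vectors were not preserved here, so these are stand-in values with the same deliberate feature, a repeated 7; the W and p they produce will differ from the numbers quoted in the explanation:

```r
# Stand-in data with deliberate ties (the repeated 7s)
ex5_tied  <- c(5, 7, 7, 7, 9, 11)
ex5_other <- c(6, 7, 8, 10, 12)

# Default: R tries an exact test, hits the ties, warns, and falls back
ex5_default <- wilcox.test(ex5_tied, ex5_other)

# Explicit: request the normal approximation up front, no warning, same p
ex5_res <- wilcox.test(ex5_tied, ex5_other, exact = FALSE)
ex5_res$p.value
```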
Exercise 6: Kruskal-Wallis on three groups
Use the built-in PlantGrowth dataset (yields under a control and two treatments, n = 10 each). Save a Kruskal-Wallis test of weight ~ group to ex6_res.
Click to reveal solution
Explanation: H = 7.99 on df = 2 gives p = 0.018. At α = 0.05 we reject the null that all three groups share the same location. Kruskal-Wallis tells us somewhere in the three groups there is a difference, but it does not tell us where, that is the job of the post-hoc comparison in Exercise 7.
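A possible solution:

```r
ex6_res <- kruskal.test(weight ~ group, data = PlantGrowth)

ex6_res$statistic   # H
ex6_res$parameter   # df = 3 groups - 1 = 2
ex6_res$p.value
```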
Exercise 7: Post-hoc pairwise comparisons with BH adjustment
After a significant Kruskal-Wallis, use pairwise.wilcox.test() with p.adjust.method = "BH" (Benjamini-Hochberg) to identify which pairs of plant-growth groups differ. Save the result to ex7_res.
Click to reveal solution
Explanation: Of the three pairs, only trt1 vs trt2 clears α = 0.05 after BH adjustment (adjusted p = 0.027). The ctrl vs trt2 contrast is borderline (p = 0.095) and ctrl vs trt1 is not significant. BH controls the expected false discovery rate, less conservative than Bonferroni (which would multiply raw p-values by 3). Always report the adjustment method, unadjusted pairwise p-values inflate the false-positive risk.
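A possible solution (ties in PlantGrowth may trigger exact-p warnings, which is fine here):

```r
ex7_res <- pairwise.wilcox.test(PlantGrowth$weight, PlantGrowth$group,
                                p.adjust.method = "BH")
ex7_res$p.value   # matrix of BH-adjusted pairwise p-values
```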
Choose p.adjust.method by the error you want to control. "bonferroni" controls the family-wise error rate (the probability of rejecting any true null), strict but stable. "holm" does the same with a step-down gain in power. "BH" (Benjamini-Hochberg) controls the false discovery rate, the expected fraction of rejections that are wrong, and is the default in most modern biological and psychological reporting.

Exercise 8: Rank-biserial effect size for Mann-Whitney
The p-value from Exercise 2 said "the two groups differ". Quantify how much with the rank-biserial correlation $r$, computed as $r = 1 - \frac{2U}{n_1 n_2}$ where $U$ is the smaller of the two U statistics. Save ex8_r and classify the magnitude.
Click to reveal solution
Explanation: With U1 = 0, U2 = 2500, the smaller is 0. Plug into the formula: $r = 1 - (2 \cdot 0)/(50 \cdot 50) = 1.0$. Perfect separation maps to $r = 1$, and that lines up with what we saw in Exercise 2. Cohen's thresholds on rank-biserial: 0.1 small, 0.3 medium, 0.5 large. An $r$ of 1.0 is as large as it goes.
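One way to compute it, using the fact that R's W from wilcox.test(x, y) is the U statistic for the first sample:

```r
ex8_set <- iris$Petal.Length[iris$Species == "setosa"]
ex8_ver <- iris$Petal.Length[iris$Species == "versicolor"]
ex8_n1  <- length(ex8_set)
ex8_n2  <- length(ex8_ver)

ex8_U1 <- unname(wilcox.test(ex8_set, ex8_ver)$statistic)  # U for sample 1
ex8_U  <- min(ex8_U1, ex8_n1 * ex8_n2 - ex8_U1)            # the smaller U

ex8_r <- 1 - (2 * ex8_U) / (ex8_n1 * ex8_n2)
ex8_r   # 1: perfect separation
```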
Exercise 9: Epsilon-squared effect size for Kruskal-Wallis
A Kruskal-Wallis p-value says there is some difference, but $\varepsilon^2$ quantifies it. Use the formula $\varepsilon^2 = \frac{H (n + 1)}{n^2 - 1}$ on the iris Sepal.Width vs Species Kruskal-Wallis from the opening section. Classify by the convention 0.01 small, 0.08 medium, 0.26 large.
Click to reveal solution
Explanation: With $H = 63.57$ and $n = 150$, $\varepsilon^2 = 63.57 \cdot 151 / (150^2 - 1) = 0.427$. That sits well above the 0.26 "large" threshold: Species explains a meaningful share of the rank variance in Sepal.Width. Always pair a KW p-value with $\varepsilon^2$ (or $\eta^2_H$, which is nearly identical); a highly significant KW on a huge n can have a trivial effect.
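The computation, pulling H straight from the htest result:

```r
ex9_kw <- kruskal.test(Sepal.Width ~ Species, data = iris)

ex9_H    <- unname(ex9_kw$statistic)
ex9_n    <- nrow(iris)
ex9_eps2 <- ex9_H * (ex9_n + 1) / (ex9_n^2 - 1)   # epsilon-squared
ex9_eps2
```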
Exercise 10: Full pipeline on chickwts, test → post-hoc → effect size
Real data usually arrives as one row per observation with a categorical grouping. R's built-in chickwts has the weight of 71 chicks across 6 feed types. Put the full workflow together: Kruskal-Wallis, pairwise post-hoc with Holm adjustment, and $\varepsilon^2$.
Click to reveal solution
Explanation: Three steps capture the whole story. Kruskal-Wallis says weights differ across feeds ($p = 5 \times 10^{-7}$). Holm-adjusted pairwise comparisons say casein beats horsebean and linseed but ties with meatmeal / soybean / sunflower, and horsebean is clearly the weakest feed. Epsilon-squared of 0.53 flags this as a very large effect, much of the weight variance is explained by feed choice. That three-line sketch, test / post-hoc / effect size, is the template for any categorical comparison in real data.
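A possible three-step solution:

```r
# Step 1: omnibus Kruskal-Wallis
ex10_kw <- kruskal.test(weight ~ feed, data = chickwts)

# Step 2: Holm-adjusted post-hoc pairwise comparisons
ex10_posthoc <- pairwise.wilcox.test(chickwts$weight, chickwts$feed,
                                     p.adjust.method = "holm")

# Step 3: epsilon-squared effect size
ex10_n    <- nrow(chickwts)
ex10_eps2 <- unname(ex10_kw$statistic) * (ex10_n + 1) / (ex10_n^2 - 1)
ex10_eps2
```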
Complete Example: end-to-end chickwts analysis with reporting
Let's stitch the full workflow, including an assumption check, into one block that mirrors how a real report would read.
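A sketch of that block. The assumption check here is a per-group Shapiro-Wilk test, one common justification for going rank-based; the original page's exact check may have differed:

```r
# Assumption check: per-feed Shapiro-Wilk p-values; small p-values (or the
# small group sizes) are a common reason to prefer a rank-based test
by(chickwts$weight, chickwts$feed, function(w) shapiro.test(w)$p.value)

# Omnibus test
kw <- kruskal.test(weight ~ feed, data = chickwts)

# Effect size: epsilon-squared
n    <- nrow(chickwts)
eps2 <- unname(kw$statistic) * (n + 1) / (n^2 - 1)

# Post-hoc: Holm-adjusted pairwise Wilcoxon tests
ph <- pairwise.wilcox.test(chickwts$weight, chickwts$feed,
                           p.adjust.method = "holm")

# Numbers for the write-up
round(c(H = unname(kw$statistic), df = unname(kw$parameter),
        p = kw$p.value, eps2 = eps2), 4)
```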
A one-paragraph APA-style write-up: A Kruskal-Wallis rank-sum test compared chick weights across six feeds in the chickwts dataset (N = 71). Weight differed significantly across feeds, $H(5) = 37.34$, $p < .001$, with a large effect, $\varepsilon^2 = 0.53$. Holm-adjusted pairwise Wilcoxon rank-sum tests indicated casein- and sunflower-fed chicks weighed significantly more than horsebean-fed chicks (all adjusted $p < .01$), while differences among the remaining feeds were not reliable. That sentence answers three questions at once: is the effect real, is it big, and where does it live.
Summary
| Exercise | Test | Key R call | Effect size |
|---|---|---|---|
| 1 | One-sample signed-rank | wilcox.test(x, mu) | n/a |
| 2-3 | Mann-Whitney (2- and 1-sided) | wilcox.test(x, y) | Rank-biserial $r$ |
| 4 | Paired signed-rank | wilcox.test(x, y, paired = TRUE) | n/a |
| 5 | Tie handling | wilcox.test(..., exact = FALSE) | n/a |
| 6 | Kruskal-Wallis | kruskal.test(y ~ g) | Epsilon-squared |
| 7 | Post-hoc pairwise | pairwise.wilcox.test(y, g, "BH") | n/a |
| 8-9 | Effect sizes | Formulas above | $r$, $\varepsilon^2$ |
| 10 | Full pipeline | KW + post-hoc + $\varepsilon^2$ | $\varepsilon^2$ |
Three principles carry across every exercise: pick the test from the design (one / two / paired / many groups); always pair the p-value with an effect size so the reader knows how big the effect is, not just whether it exists; and adjust pairwise post-hoc p-values so your family-wise error or false discovery rate is under control.
References
- R Core Team. wilcox.test, R Stats Reference.
- R Core Team. kruskal.test, R Stats Reference.
- Hollander, M., Wolfe, D. A., and Chicken, E. Nonparametric Statistical Methods, 3rd edition. Wiley (2013).
- Conover, W. J. Practical Nonparametric Statistics, 3rd edition. Wiley (1999).
- Wilcoxon, F. "Individual Comparisons by Ranking Methods." Biometrics Bulletin 1(6), 80-83 (1945).
- Mann, H. B. and Whitney, D. R. "On a test of whether one of two random variables is stochastically larger than the other." Annals of Mathematical Statistics 18(1), 50-60 (1947).
- Kruskal, W. H. and Wallis, W. A. "Use of Ranks in One-Criterion Variance Analysis." Journal of the American Statistical Association 47(260), 583-621 (1952).
- Tomczak, M. and Tomczak, E. "The need to report effect size estimates revisited." Trends in Sport Sciences 21(1), 19-25 (2014).
Continue Learning
- Wilcoxon, Mann-Whitney, and Kruskal-Wallis in R: Non-Parametric When Normality Fails covers the parent theory, decision rules, and assumption checks in detail.
- t-Test Exercises in R is the parametric counterpart, use it when the normality assumption holds.
- Hypothesis Testing Exercises in R gives broader inference practice across proportions, means, and non-parametrics with the same drill-sheet format.