dplyr group_by() & summarise() Exercises: 10 Aggregation Problems
Ten runnable, increasingly tricky aggregation problems built around dplyr::group_by() and summarise(), counts, multi-column groups, across(), NA handling, group shares, and per-group ranking. Each problem hides a fully worked solution behind a click-to-reveal so you can try first and verify after.
How do you count rows and take averages per group?
Counts and means are where almost every group-wise analysis starts. Before you touch missing values or per-group percentages, you need to be fluent at "split this data into groups, then collapse each group to a single number." The first three exercises drill exactly that, and the payoff block right below shows the shape of answer you are aiming at.
Three lines of dplyr produce a clean three-row answer. group_by(cyl) splits the 32 rows of mtcars into three pieces, n() counts the rows inside each piece, and mean(mpg) averages the mpg column inside each piece. .groups = "drop" returns a plain ungrouped tibble so the result behaves predictably in any later step. This is the shape every exercise below builds on.
count(df, group_col) is shorter than the full group_by() + summarise(n = n()) pipeline. As soon as you also need a mean or a sum, switch back to the full form.Exercise 1: Count cars per cylinder
Count how many cars in mtcars have 4, 6, and 8 cylinders. Save the result to my_counts. The output should have two columns: cyl and n.
Click to reveal solution
Explanation: group_by(cyl) splits the data into three pieces, one per unique cylinder value. n() counts the rows inside each piece. .groups = "drop" removes the grouping after summarising, so my_counts is a plain tibble. Without it you would see a friendly message from dplyr explaining which groups remain.
Exercise 2: Average mpg per cylinder
Compute the average mpg for each cyl group in mtcars. Round to one decimal. Save to my_mpg. The mean column should be named avg_mpg.
Click to reveal solution
Explanation: Four-cylinder cars average 26.7 mpg while eight-cylinder cars average 15.1 mpg, a 75 percent gap that mirrors what you would expect from engine size. mean(mpg) runs once per group because group_by(cyl) already split the rows. round(..., 1) formats the output. No na.rm needed here, mtcars has no missing values.
Exercise 3: Use count() as a shortcut
Repeat the count idea from Exercise 1, but use count() instead of group_by() + summarise(n = n()). Count cars per gear value. Save to my_gears.
Click to reveal solution
Explanation: count() is a one-line shortcut. It does the exact same thing as group_by(gear) |> summarise(n = n(), .groups = "drop"), but with less typing and no .groups argument to remember. Add sort = TRUE to order the result from largest group to smallest. Most dplyr users reach for count() whenever the question is purely "how many rows per group".
How do you group by multiple columns and handle missing values?
The first three exercises stayed inside one grouping column and one clean dataset. Real data is messier than that. The next three add two complications at once: a second grouping column, and missing values that quietly poison every summary you write. Here is a quick look at the dataset that brings the missing values, the starwars tibble that ships with dplyr.
The first five rows look clean, but starwars has 87 rows and several columns contain NA. That is the point, you will use this dataset to practise the na.rm = TRUE argument in Exercise 6. Exercise 4 stays on mtcars and Exercise 5 jumps to iris.
Exercise 4: Group by two columns
Group mtcars by both cyl and am (automatic = 0, manual = 1). Compute the count and mean mpg per combination. Save to my_combo. Use .groups = "drop" to return a plain tibble.
Click to reveal solution
Explanation: group_by(cyl, am) creates one group for every unique (cyl, am) combination present in the data, six groups in total. Manual four-cylinder cars average 28.1 mpg; automatic eight-cylinder cars average 15.0. The grouping order matters for the row order of the output, not for the values themselves.
Exercise 5: Summarise many columns with across()
Use the iris dataset. Group by Species and compute the mean of every numeric column in one call. Save to my_iris. The result should have five columns: Species plus the four numeric means.
Click to reveal solution
Explanation: across(where(is.numeric), mean) says: for every column that is numeric, apply mean(). This is the modern replacement for the old summarise_if() and summarise_at() helpers. If you needed two summaries per column instead of one, pass a named list: across(where(is.numeric), list(mean = mean, sd = sd)).
summarise(mean_a = mean(a), mean_b = mean(b), mean_c = mean(c), ...) breaks the moment you have twenty columns. With across() the same line of code works for four columns or four hundred, the data shape changes, the code does not.Exercise 6: Handle NA values with starwars
Use starwars. Group by species and compute mean height and mean mass. Drop any NA inputs from the means. Save to my_species. Sort the result by mean_height descending and keep only the top 5 rows.
Click to reveal solution
Explanation: na.rm = TRUE tells mean() to ignore missing values before averaging. Without it, any group containing one NA would return NA for the whole group. Notice the NaN in mean_mass for Quermian, that group had zero non-NA mass values, so the mean of an empty set is undefined. Real-world analysis almost always needs na.rm = TRUE.
NA values you did not expect. Always run summary(your_data) first to check which columns contain missing values, then plan na.rm = TRUE for every summariser that touches them.How do you filter groups, compute shares, and rank per group?
The last four exercises combine two or more dplyr ideas at once. You will filter groups by size, compute group shares as percentages, compare the .groups argument values side by side, and pull the top-k rows per group. These are the patterns that separate "I can call summarise()" from "I can write real analysis with it."
Exercise 7: Keep only groups with at least N rows
From starwars, compute the mean height per species, but keep only species with at least 2 characters in the dataset. Use na.rm = TRUE. Save to my_big_species and sort it from biggest group to smallest.
Click to reveal solution
Explanation: The summarise() call produces one row per species with both n and mean_height. Then filter(n >= 2) keeps only species that appear at least twice. This is the standard pattern for "ignore small or noisy groups" in analysis. Because .groups = "drop" was used, filter() operates on a plain tibble with no surprises.
Exercise 8: Group share as a percentage
For mtcars, compute each gear group's share of total mpg as a percentage. The output should have three columns: gear, sum_mpg, and pct_of_total. Save to my_share. The pct_of_total column should sum to 100.
Click to reveal solution
Explanation: The trick is .groups = "drop". After dropping, the mutate() call sees a flat three-row tibble and computes sum(sum_mpg) across all three rows, the grand total. Without dropping, mutate() would run inside each group and divide each value by itself, giving 100 percent for every row. That bug is silent.
mutate() is one of the top three sources of wrong percentage results in dplyr code.Exercise 9: Compare .groups = "drop" vs .groups = "keep"
Run two near-identical pipelines on mtcars: group by cyl and am, then summarise n = n(). In the first, use .groups = "drop". In the second, use .groups = "keep". After each, call group_vars() to see which grouping remains. Save the results to my_drop and my_keep.
Click to reveal solution
Explanation: group_vars() reports the active grouping columns. "drop" removes all grouping after summarise, so my_drop is ungrouped (an empty character vector). "keep" retains every grouping variable, both cyl and am. The other options are "drop_last" (removes only the rightmost grouping, this is dplyr's default when you do not specify) and "rowwise" (rare). Use "drop" as a safe default unless you specifically need the grouping later.
Exercise 10: Top-k per group using slice_max
From starwars, find the two heaviest characters per homeworld. Only consider rows where mass and homeworld are not NA. Save to my_top2. Sort the result by homeworld, then by mass descending within each homeworld.
Click to reveal solution
Explanation: slice_max(mass, n = 2) keeps the top 2 rows per group based on mass. Because group_by(homeworld) is still active, "top 2" means top 2 per homeworld, not top 2 overall. with_ties = FALSE stops slice_max() from keeping extra rows when two characters share the same mass. Filtering out NA rows first is essential, slice_max() treats NA as larger than any finite value by default, which gives surprising results otherwise.
top_n() still works but is superseded. Use slice_max() and slice_min() going forward, they have cleaner behaviour around ties, have an explicit with_ties argument, and accept a vector of any sortable type.What mistakes should you avoid with group_by() and summarise()?
Four mistakes trip up almost everyone who is new to these verbs. Each one runs without an error message, that is what makes them dangerous. Read the wrong-then-right pattern below and keep an eye out for the same shapes in your own code.
Mistake 1: Forgetting na.rm with missing data
The mean of a vector that contains even one NA is NA. Unless you tell R to ignore missing values, every group that contains a missing value returns NA for that group's summary.
The fix is one extra argument. The habit is to always check summary(your_data) first, if any column has NA values, plan for na.rm = TRUE on every summariser that touches it.
Mistake 2: Leaving groups attached after summarise
A grouped tibble behaves differently in downstream verbs. Percentages, joins, and even mutate() all change their meaning when grouping is silently still active.
Use .groups = "drop" or call ungroup() explicitly the moment your grouped step is finished.
Mistake 3: Using mean() on non-numeric columns
across(everything(), mean) errors out the instant any column is a character or factor. The fix is to scope across() to numeric columns only.
where(is.numeric) is a tidy-select helper that picks columns by type. Use it inside across() whenever you are not 100 percent sure every column is numeric.
Mistake 4: Not specifying .groups and being surprised by the message
dplyr prints a helpful note when you leave .groups off, but new users often mistake the note for an error.
Always set .groups explicitly. It documents your intent in the code and silences the message.
.groups behaviour is "drop_last", which prints a message; "drop" silences it. When you leave .groups out of a multi-column grouping, dplyr keeps every grouping column except the last and prints a friendly note. That message is helpful once and annoying forever, set .groups explicitly in every pipeline.Practice Exercises
The ten numbered problems above ARE your practice. Below are two bonus capstone challenges that combine three or more concepts at once. Each one is harder than any single exercise above, and each one is solvable using only verbs you have already met.
Capstone 1: Top 3 homeworlds by average mass
Using starwars, find the three homeworlds with the highest average character mass, but only consider homeworlds that have at least two characters. Drop any rows with missing mass or homeworld values. Save the result to cap1_top_homeworlds with three columns: homeworld, n, and mean_mass.
Click to reveal solution
Explanation: Four steps stacked into one pipeline. filter() drops the NA rows so mean() does not need na.rm and slice_max() does not get tricked by missing values. group_by() |> summarise() collapses to one row per homeworld with both the count and the mean. filter(n >= 2) excludes single-character homeworlds. slice_max(mean_mass, n = 3) returns the three biggest. This is the canonical pattern for "ranked groups, but only groups with enough support".
Capstone 2: Per-cylinder/transmission share of total horsepower
Using mtcars, group by both cyl and am. For each combination compute the total horsepower (total_hp), the mean mpg (mean_mpg), and that combination's share of grand-total horsepower as a percentage (pct_of_grand_total). Save the result to cap2_share. The percentage column should sum to 100.
Click to reveal solution
Explanation: group_by(cyl, am) makes six groups. summarise() collapses each group to one row containing total horsepower and mean mpg, then .groups = "drop" removes the grouping. The mutate() runs on a flat tibble, so sum(total_hp) is the grand total across all six rows, exactly what you want for the percentage. Automatic eight-cylinder cars alone account for 50 percent of the total horsepower in mtcars. Sorting by pct_of_grand_total makes the dominance obvious.
Complete Example
The exercises above each isolated one idea. Real analysis combines them. Here is a small, end-to-end fuel-economy report using mtcars that uses every core technique from this page in a single pipeline: multi-column grouping, multi-summary summarise(), .groups = "drop", post-summarise mutate() with a global denominator, and a final sort.
The report answers four questions in one tibble. Counting (Exercise 1, 4) gives n and the share pct_of_cars. Multi-column grouping (Exercise 4) splits the rows by cylinder and transmission. Multiple summaries in one summarise() (Exercise 5 in spirit) builds mean_mpg and mean_hp together. Post-summarise mutate on an ungrouped tibble (Exercise 8) computes the share of total cars and a derived mpg_per_hp efficiency metric. The sort then surfaces the headline finding: manual four-cylinder cars are the most fuel-efficient combination by both mean mpg and mpg-per-horsepower.
Summary
Nine patterns to keep at your fingertips when reaching for group_by() and summarise().
| Pattern | Use when | |
|---|---|---|
group_by() + summarise(n = n()) |
Counting rows per group | |
count(df, group_col) |
Shortcut when you only need counts | |
summarise(across(where(is.numeric), mean)) |
Many numeric columns at once | |
mean(x, na.rm = TRUE) |
Data has missing values | |
summarise(..., .groups = "drop") |
Default to drop grouping after summarising | |
filter(n >= k) after summarise |
Exclude tiny or noisy groups | |
| Two-step share: `summarise() | > mutate()` | Group share as percent of total |
group_by() + slice_max(x, n = k) |
Top-k rows per group | |
ungroup() before any join or mutate |
Avoid silent per-group behaviour |
References
- dplyr,
summarise()reference. tidyverse.org - dplyr,
group_by()reference. tidyverse.org - dplyr, Grouped data vignette. tidyverse.org
- dplyr,
across()reference. tidyverse.org - dplyr,
slice_max()reference. tidyverse.org - Wickham H. & Grolemund G., R for Data Science, 2nd edition, Chapter 4 (Data transformation). r4ds.hadley.nz
Continue Learning
- dplyr group_by() + summarise(): Aggregate Data by Group (10 Examples), the parent tutorial behind these exercises.
- dplyr filter() & select() Exercises: 12 Practice Problems, companion exercise set for row filtering and column picking.
- dplyr Exercises, broader dplyr practice spanning the full verb family.