dplyr group_by() + summarise(): Aggregate Data by Group (10 Examples)
group_by() splits data into groups. summarise() collapses each group to a single summary row. Together they replace complex tapply() and aggregate() calls with readable, chainable code.
"What's the average mpg per cylinder count?" is a grouped summary. Without dplyr you'd write tapply(mtcars$mpg, mtcars$cyl, mean). With dplyr: mtcars |> group_by(cyl) |> summarise(avg = mean(mpg)). Same answer, clearer intent.
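The two approaches can be run side by side to confirm they agree. This is a minimal sketch using the built-in mtcars data; the object names `base_result` and `dplyr_result` are only illustrative.

```r
library(dplyr)

# Base R: returns a named array with one element per cylinder count
base_result <- tapply(mtcars$mpg, mtcars$cyl, mean)

# dplyr: returns a tibble with one row per cylinder count
dplyr_result <- mtcars |>
  group_by(cyl) |>
  summarise(avg = mean(mpg))

print(base_result)
print(dplyr_result)
```

The base R result carries the group labels as names, while the dplyr result keeps them in an ordinary `cyl` column, which is usually easier to pass on to further pipeline steps.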
**Explanation:** `group_by()` followed by `mutate()` computes `mean(mpg)` within each group but keeps all rows. Each car is assigned its own group's mean, which is then used in the percentage calculation.
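The exercise statement is not reproduced here, so the following is only a sketch of the pattern the explanation describes. It assumes the groups are cylinder counts in mtcars, and the column names `group_mean` and `pct_of_group_mean` are hypothetical.

```r
library(dplyr)

pct_tbl <- mtcars |>
  group_by(cyl) |>
  mutate(
    group_mean = mean(mpg),                     # same value repeated for every row in the group
    pct_of_group_mean = mpg / group_mean * 100  # each car relative to its own group
  ) |>
  ungroup()                                     # drop grouping so later steps are not affected

print(head(pct_tbl))
```

Unlike `summarise()`, this keeps all 32 rows of mtcars; only the two new columns are added.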
Exercise 3: Per-Species Summary
For each iris Species, compute count, mean Sepal.Length, and the coefficient of variation (sd/mean * 100).
**Explanation:** Coefficient of variation (CV) = sd/mean × 100. It measures relative variability, making it comparable across groups with different means.
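One possible solution, written as a sketch rather than the reference answer. The output column names `n`, `mean_sepal_length`, and `cv_pct` are illustrative choices.

```r
library(dplyr)

species_summary <- iris |>
  group_by(Species) |>
  summarise(
    n = n(),                                               # rows per species
    mean_sepal_length = mean(Sepal.Length),
    cv_pct = sd(Sepal.Length) / mean(Sepal.Length) * 100   # coefficient of variation in percent
  )

print(species_summary)
```

Because the CV is expressed as a percentage of each group's own mean, the three species can be compared directly even though their mean sepal lengths differ.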
Summary
| Function | Purpose |
| --- | --- |
| `group_by(col)` | Split data into groups |
| `summarise(stat = fn(col))` | One row per group |
| `n()` | Count rows in group |
| `across(cols, fn)` | Apply to multiple columns |
| `count(col)` | Shortcut: group + count |
| `ungroup()` | Remove grouping |
| `.groups = "drop"` | Ungroup after summarise |

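
As a quick illustration of the `count()` shortcut, the sketch below computes the same per-group counts both ways on mtcars. The object names `long_form` and `short_form` are illustrative.

```r
library(dplyr)

# Explicit version: group, then count rows with n()
long_form <- mtcars |>
  group_by(cyl) |>
  summarise(n = n())

# Shortcut: count() groups, counts, and returns ungrouped output
short_form <- mtcars |>
  count(cyl)

print(long_form)
print(short_form)
```

Both results contain a `cyl` column and an `n` column with identical counts.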
FAQ
What's the difference between summarise and summarize?
Nothing. They are aliases for the same function, so use whichever spelling you prefer.
Why do I get a ".groups" warning?
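It is a message rather than a true warning. dplyr prints it when you call summarise() on data grouped by more than one column without setting the .groups argument, because by default ("drop_last") the result stays grouped by all but the last grouping column. Add .groups = "drop" to summarise() for ungrouped output (the most common choice) or .groups = "keep" to retain the full grouping.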
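The effect of each setting can be checked with `group_vars()`. This is a minimal sketch on mtcars grouped by `cyl` and `gear`; the object names are illustrative.

```r
library(dplyr)

# No .groups argument: dplyr prints its message and drops only the last grouping column
default_out <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg = mean(mpg))

# Fully ungrouped result
dropped <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg = mean(mpg), .groups = "drop")

# Grouping structure kept exactly as it was
kept <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg = mean(mpg), .groups = "keep")

print(group_vars(default_out))  # still grouped by cyl
print(group_vars(dropped))      # no grouping columns left
print(group_vars(kept))         # grouped by cyl and gear
```

The summary values are the same in all three cases; only the grouping attached to the result differs, which matters for any `mutate()` or `summarise()` call that follows.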
Can I use custom functions in summarise?
Yes. Any function that takes a vector and returns a single value works: summarise(result = my_function(column)).
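For example, the sketch below defines a small helper and uses it inside summarise(). The function `value_range()` is a hypothetical name introduced for illustration and is not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: takes a numeric vector, returns a single number
value_range <- function(x) {
  max(x) - min(x)
}

range_tbl <- mtcars |>
  group_by(cyl) |>
  summarise(mpg_range = value_range(mpg))

print(range_tbl)
```

The helper is called once per group, each time receiving only that group's `mpg` values.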
How do I summarise all numeric columns at once?
summarise(across(where(is.numeric), mean)) applies mean to every numeric column.
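A short sketch on iris, grouped by Species so that the only remaining non-numeric column is the grouping column. The second call shows one way to pass extra arguments such as `na.rm` through an anonymous function (the `\(x)` shorthand assumes R 4.1 or later).

```r
library(dplyr)

# Mean of every numeric column, per species
numeric_means <- iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), mean))

# Same idea, ignoring missing values via an anonymous function
numeric_means_na <- iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE)))

print(numeric_means)
```

iris has no missing values, so both calls return the same numbers here; the second form matters on data that contains `NA`.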