dplyr group_by() in R: Grouped Operations Made Easy
The group_by() function in dplyr tags a data frame with one or more grouping variables. Subsequent verbs like summarise(), mutate(), and filter() then operate within each group instead of across the whole data frame.
group_by(df, cyl)                              # tag with one grouping var
group_by(df, cyl, gear)                        # tag with multiple
group_by(df, cyl) |> summarise(m = mean(mpg))  # group + summarise
group_by(df, cyl) |> mutate(z = scale(mpg))    # group + mutate
group_by(df, cyl) |> filter(mpg == max(mpg))   # group + top per group
df |> ungroup()                                # remove all grouping
summarise(df, m = mean(mpg), .by = cyl)        # alternative without group_by
Need explanation? Read on for examples and pitfalls.
What group_by() does in one sentence
group_by() tags a data frame so subsequent verbs operate within groups instead of globally. It does not change the data; it adds a grouping attribute. Functions like mean() inside summarise() or mutate() then run once per group instead of once across all rows.
Unlike base R's tapply() or aggregate(), group_by() is composable: you can chain a group_by() into multiple operations, and the grouping persists until you ungroup() or summarise away.
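As a minimal sketch of the "tag, do not change" idea, the snippet below assumes the built-in mtcars dataset stands in for df (the examples above never define df) and checks what group_by() actually alters:

```r
library(dplyr)

# Assumption: mtcars plays the role of the article's `df`.
grouped <- group_by(mtcars, cyl)

# Same shape as before; only grouping metadata was attached.
dim(grouped)             # same dimensions as mtcars
is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "cyl"

# The same summarise() call now runs once per group instead of once overall.
overall <- summarise(mtcars, m = mean(mpg))    # 1 row
per_cyl <- summarise(grouped, m = mean(mpg))   # 3 rows, one per cyl value
```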
Syntax
group_by() takes a data frame plus one or more grouping columns. The result is a "grouped data frame" that prints with a Groups: line. Use ungroup() to remove the grouping. As of dplyr 1.1, the .by argument of summarise(), mutate(), and filter() provides one-off grouping that does not persist.
The full signature is:
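group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
.data is the data frame. The ... argument takes one or more grouping columns. .add = TRUE adds to existing groups instead of replacing them. .drop defaults to group_by_drop_default(.data), which is TRUE unless .data was previously grouped with .drop = FALSE; when TRUE, groups formed by factor levels that have no rows are dropped. Set .drop = FALSE to keep all factor levels.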
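To make the .add argument concrete, here is a small sketch (again assuming mtcars as the data frame) that compares replacing a grouping with extending it:

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Default .add = FALSE: a second group_by() replaces the existing grouping.
replaced <- group_by(by_cyl, gear)
group_vars(replaced)   # "gear"

# .add = TRUE: the new column is appended to the existing grouping.
added <- group_by(by_cyl, gear, .add = TRUE)
group_vars(added)      # "cyl" "gear"
```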
group_by(df, cyl) returns the same rows in the same order, just tagged with grouping info. The rows do not get reordered, sliced, or duplicated. The tag only matters when a downstream verb honors it.
Six common patterns
1. Group + summarise
The most common use of group_by() is feeding into summarise() to compute per-group aggregates.
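A minimal sketch of this pattern, assuming mtcars as the data; the output column names (n_cars, mean_mpg, max_hp) are illustrative choices:

```r
library(dplyr)

cyl_summary <- mtcars |>
  group_by(cyl) |>
  summarise(
    n_cars   = n(),        # rows in each cyl group
    mean_mpg = mean(mpg),  # per-group mean
    max_hp   = max(hp)     # per-group maximum
  )

cyl_summary   # one row per cyl value
```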
2. Group + mutate
mutate() inside a grouped data frame runs computations within each group. Each row gets a value relative to its own group's stats.
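For instance, the following sketch (assuming mtcars; the new column names are illustrative) attaches each car's group mean and its deviation from that mean:

```r
library(dplyr)

with_group_stats <- mtcars |>
  group_by(cyl) |>
  mutate(
    cyl_mean_mpg = mean(mpg),           # group mean, repeated on every row of the group
    mpg_vs_group = mpg - cyl_mean_mpg   # deviation from the row's own group mean
  ) |>
  ungroup()

nrow(with_group_stats)   # still one row per car; mutate() never collapses rows
```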
3. Group + filter for top per group
filter() inside a grouped data frame keeps rows that satisfy the condition WITHIN their group.
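A short sketch of the top-per-group filter, assuming mtcars. Because max(mpg) is evaluated per group, each cyl value keeps its own best row:

```r
library(dplyr)

best_per_cyl <- mtcars |>
  group_by(cyl) |>
  filter(mpg == max(mpg)) |>   # max() is computed within each cyl group
  ungroup()

best_per_cyl[, c("cyl", "mpg")]
```

If several rows tie for the maximum within a group, all of them are kept.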
4. Group by multiple columns
Multi-column grouping creates one group per unique combination of values, so summarise() returns one row per combination. Use .groups = "drop" to fully ungroup the result.
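As an illustration, assuming mtcars, grouping by cyl and gear together might look like this:

```r
library(dplyr)

by_cyl_gear <- mtcars |>
  group_by(cyl, gear) |>
  summarise(mean_mpg = mean(mpg), .groups = "drop")

nrow(by_cyl_gear)            # one row per cyl/gear combination present in the data
is_grouped_df(by_cyl_gear)   # FALSE, because .groups = "drop"
```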
5. Removing groups with ungroup
summarise() peels off the last grouping level by default, so a frame grouped by several columns stays grouped by the remaining ones. For complex chains, use .groups = "drop" or call ungroup() to be explicit.
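The sketch below, assuming mtcars, shows what is left of the grouping after summarise() and how ungroup() clears it:

```r
library(dplyr)

# Default: summarise() drops only the last grouping column (gear),
# so the result is still grouped by cyl. dplyr prints a message about this.
peeled <- mtcars |>
  group_by(cyl, gear) |>
  summarise(mean_mpg = mean(mpg))
group_vars(peeled)   # "cyl"

# Explicit: remove whatever grouping is left.
flat <- ungroup(peeled)
group_vars(flat)     # character(0)
```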
6. Using .by instead of group_by
.by (added in dplyr 1.1) groups for the single verb call only. The result is always ungrouped. Use this when grouping is a one-off and you do not want it to leak into subsequent operations.
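A minimal sketch of the .by form, assuming mtcars and dplyr 1.1.0 or later:

```r
library(dplyr)

# Requires dplyr >= 1.1.0, where .by was introduced.
by_result <- mtcars |>
  summarise(mean_mpg = mean(mpg), .by = cyl)

is_grouped_df(by_result)   # FALSE: nothing to ungroup afterwards
by_result$cyl              # groups appear in order of first appearance in the data
```

One difference worth knowing: group_by() sorts the group keys in the summarised output, while .by keeps them in the order they first appear, so add arrange() if you need a sorted result.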
group_by() persists; .by does not. That is the entire mental model. After group_by(df, cyl) |> summarise(...), summarise() has dropped the last grouping column, so the result is ungrouped here but would still be grouped by the remaining columns had you grouped by more than one. After summarise(df, ..., .by = cyl), the result is always ungrouped. Pick .by for clarity in single-verb operations; use group_by() when several chained verbs need the same grouping.
group_by() vs .by argument vs base R
Base R has no general grouping primitive; you use tapply(), aggregate(), split(), or by() depending on the case. dplyr's group_by() and .by provide one consistent grammar.
| Task | dplyr group_by | dplyr .by | Base R |
|---|---|---|---|
| Mean by group | `df \|> group_by(g) \|> summarise(m = mean(x))` | `summarise(df, m = mean(x), .by = g)` | `aggregate(x ~ g, df, mean)` |
| Within-group transform | `df \|> group_by(g) \|> mutate(z = scale(x))` | `mutate(df, z = scale(x), .by = g)` | `df$z <- ave(df$x, df$g, FUN = scale)` |
| Top per group | `df \|> group_by(g) \|> filter(x == max(x))` | `filter(df, x == max(x), .by = g)` | `do.call(rbind, lapply(split(df, df$g), function(d) d[d$x == max(d$x), ]))` |
| Multi-key group | `group_by(df, g1, g2)` | `.by = c(g1, g2)` | (multiple awkward options) |
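To check that the three styles in the "Mean by group" row agree, here is a small sketch assuming mtcars, with mpg as x and cyl as g (the .by call needs dplyr 1.1.0 or later):

```r
library(dplyr)

via_group_by <- mtcars |>
  group_by(cyl) |>
  summarise(m = mean(mpg))

via_by <- mtcars |>
  summarise(m = mean(mpg), .by = cyl) |>
  arrange(cyl)   # .by keeps first-appearance order, so sort for comparison

via_base <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

all.equal(via_group_by$m, via_by$m)
all.equal(via_group_by$m, via_base$mpg)
```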
When to use which:
- Use .by for single-verb grouped operations (cleanest).
- Use group_by() when several chained verbs need the same grouping.
- Use base R primitives only when you cannot use the tidyverse.
Common pitfalls
Pitfall 1: forgetting to ungroup. A grouped data frame "remembers" its grouping. If you group_by() once and then run multiple unrelated operations, all of them inherit the grouping. Symptom: weird results in mutate() or slice() later. Fix: ungroup() explicitly, use .by instead, or rely on summarise's auto-peeling.
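A sketch of how leftover grouping can surprise you, assuming mtcars; the object name leaky is illustrative:

```r
library(dplyr)

leaky <- mtcars |>
  group_by(cyl) |>
  mutate(mpg_centered = mpg - mean(mpg))   # per-cyl centering, as intended

# Still grouped, so slice_head() returns 2 rows per cyl group, not 2 rows total.
nrow(slice_head(leaky, n = 2))            # 6

# After ungroup(), the same call acts on the whole data frame.
nrow(slice_head(ungroup(leaky), n = 2))   # 2
```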
Pitfall 2: .groups message noise. group_by(df, a, b) |> summarise(...) may print "summarise() has grouped output by 'a'". This is an informational message, not a warning: dplyr is telling you it dropped the last grouping level (b) but kept the rest. To silence it, set .groups explicitly to "drop" (fully ungroup), "drop_last", "keep", or "rowwise".
Grouping changes what n(), cur_data() (deprecated since dplyr 1.1.0 in favor of pick()), and many other helpers return. Inside a grouped frame, n() returns the per-group count. Inside an ungrouped frame, n() returns the total. Same code, different result. When debugging grouped operations, always verify whether you are inside a group context.
Pitfall 3: factor .drop = TRUE silently drops empty groups. If your grouping variable is a factor and some levels have zero rows, those levels disappear by default. To keep all factor levels in the result (even with zero observations), set .drop = FALSE in group_by().
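The .drop behavior is easiest to see on a tiny example. The data frame toy below is hypothetical, built so that one factor level has no rows:

```r
library(dplyr)

# Hypothetical toy data: factor level "c" is declared but has no rows.
toy <- data.frame(
  g = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = c(1, 2, 3)
)

dropped <- toy |> group_by(g) |> summarise(n = n())
kept    <- toy |> group_by(g, .drop = FALSE) |> summarise(n = n())

nrow(dropped)   # 2: level "c" is gone
kept            # 3 rows: level "c" appears with n = 0
```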
Try it yourself
Try it: Group mtcars by cyl, then add a column mpg_rank that ranks mpg within each cyl group (highest mpg = rank 1). Save the result to ex_ranked.
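One possible solution, sketched from the steps the exercise describes (group, rank on negated mpg, then ungroup):

```r
library(dplyr)

ex_ranked <- mtcars |>
  group_by(cyl) |>
  mutate(mpg_rank = rank(-mpg)) |>   # negate so the highest mpg gets rank 1
  ungroup()

head(ex_ranked[, c("cyl", "mpg", "mpg_rank")])
```

Note that rank() averages tied values, so tied cars can receive fractional ranks such as 1.5; min_rank(desc(mpg)) is an alternative if you prefer integer ranks.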
Explanation: group_by(cyl) tags the data with cylinder grouping. Inside the grouped frame, rank(-mpg) ranks each row within its own cyl group (negation makes higher mpg = rank 1). ungroup() removes the grouping after, so subsequent operations are not surprised by it.
Related dplyr functions
After mastering group_by(), look at:
- ungroup(): remove grouping
- .by argument: alternative one-off grouping in summarise(), mutate(), filter()
- summarise(), mutate(), filter(), slice(): the verbs that honor grouping
- n(), cur_group(), cur_group_id(), cur_group_rows(): helpers that use grouping context
- group_split(), group_keys(), group_data(): introspect a grouped data frame
- nest() plus unnest(): alternative grouping via list-columns
For very large data, also look at data.table, which supports the same kinds of grouped operations and is often faster per group.
FAQ
What is the difference between group_by and .by in dplyr?
group_by() tags the data frame with persistent grouping that affects every downstream verb until ungroup(). .by is a per-call argument that groups for the single operation and auto-ungroups the result. Use .by for one-off operations; group_by() when chained verbs all need the grouping.
How do I group by multiple columns in dplyr?
Pass them comma-separated: group_by(df, cyl, gear). The result has one group per unique combination. With .by, use a vector: summarise(df, ..., .by = c(cyl, gear)).
How do I ungroup a data frame in dplyr?
Pipe to ungroup(): df |> group_by(g) |> summarise(...) |> ungroup(). Or set .groups = "drop" inside the summarise call: summarise(df, ..., .groups = "drop").
Does group_by sort the data?
No. group_by() only adds a grouping attribute; it does not reorder rows. The internal grouping index lets dplyr operate on each group efficiently, but the visible row order stays the same. To reorder by group, chain arrange() after group_by() with .by_group = TRUE.
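A quick sketch of both claims, assuming mtcars:

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

# group_by() alone leaves the row order untouched.
identical(grouped$mpg, mtcars$mpg)   # TRUE

# arrange() ignores grouping unless .by_group = TRUE is set;
# with it, rows are sorted by cyl first, then by descending mpg within each cyl.
sorted <- arrange(grouped, desc(mpg), .by_group = TRUE)
head(sorted[, c("cyl", "mpg")])
```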
Can I use group_by inside a function?
Yes, but use the {{ }} (curly-curly) operator to forward column names: my_fn <- function(df, grp) df |> group_by({{ grp }}) |> summarise(n = n()). Without {{ }}, dplyr looks for a column literally named grp rather than the column the caller passed.
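Written out as a runnable sketch, with a hypothetical helper named count_by and mtcars as the input:

```r
library(dplyr)

# Hypothetical helper: counts rows per group for any column the caller names.
count_by <- function(df, grp) {
  df |>
    group_by({{ grp }}) |>                 # forward the caller's column
    summarise(n = n(), .groups = "drop")
}

count_by(mtcars, cyl)    # one row per cyl value
count_by(mtcars, gear)   # same helper, different grouping column
```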