dplyr group_by() + summarise(): The Combination That Answers Most Business Questions

In dplyr, group_by() splits your data into groups and summarise() collapses each group into a single row of aggregated values. Together they answer almost every "what's the average X by Y?" question, the single most common pattern in day-to-day data analysis.

By Selva Prabhakaran · Published May 13, 2026 · Last updated May 13, 2026

What does group_by() + summarise() actually do?

Most analytical questions sound the same: what's the average by category? which segment spends the most? how many orders per month? Every one of them follows one pattern, split rows into groups, apply a function, combine the results into a tidy table. group_by() marks the split and summarise() does the apply-and-combine in one step. Here's the payoff on the built-in mtcars dataset, answering "what's the average fuel economy by cylinder count?":

Rgroupby with summarise basics

library(dplyr) mpg_by_cyl <- mtcars |> group_by(cyl) |> summarise( avg_mpg = mean(mpg), n_cars = n() ) mpg_by_cyl #> # A tibble: 3 × 3 #> cyl avg_mpg n_cars #> <dbl> <dbl> <int> #> 1 4 26.7 11 #> 2 6 19.7 7 #> 3 8 15.1 14

The 32-row mtcars table collapsed into 3 rows, one per unique value of cyl. For each group, mean(mpg) was computed over just the rows in that group, and n() counted them. Four-cylinder cars average 26.7 mpg while eight-cylinder cars manage only 15.1, exactly the kind of answer a business analyst needs in one line of code. That's the whole pattern: one verb to split, one verb to apply-and-combine.

Split-apply-combine with group_by() + summarise()

Figure 1: The split-apply-combine pattern that group_by() + summarise() implements.

Key Insight

Think of group_by() as placing invisible dividers between rows. The rows themselves don't move, dplyr just marks which rows belong to which group. The real work happens in summarise(), which runs your functions once per marker boundary and stacks the results.

Try it: Use group_by() and summarise() on the built-in iris dataset to find the mean Sepal.Length for each species. Save the result to ex_iris_stats.

RExercise: Mean sepal by species

# Try it: mean Sepal.Length by Species ex_iris_stats <- iris |> group_by(Species) |> summarise( # your code here ) ex_iris_stats #> Expected: #> # A tibble: 3 × 2 #> Species avg_sepal #> <fct> <dbl> #> 1 setosa 5.01 #> 2 versicolor 5.94 #> 3 virginica 6.59

Click to reveal solution

RSepal length solution

ex_iris_stats <- iris |> group_by(Species) |> summarise(avg_sepal = mean(Sepal.Length)) ex_iris_stats #> # A tibble: 3 × 2 #> Species avg_sepal #> <fct> <dbl> #> 1 setosa 5.01 #> 2 versicolor 5.94 #> 3 virginica 6.59

Explanation: Inside summarise(), any function that returns one value per group works. mean(Sepal.Length) runs once per species, producing three rows.

How do you summarise multiple columns at once?

A real business question is rarely "just the average." You usually want the mean, the spread, the count, and the extremes, all in the same table so you can compare segments at a glance. summarise() happily takes as many name = function(col) expressions as you hand it, separated by commas, and each becomes a new column in the result.

RMultiple summary columns at once

mtcars_stats <- mtcars |> group_by(cyl) |> summarise( n_cars = n(), avg_mpg = mean(mpg), sd_mpg = sd(mpg), min_mpg = min(mpg), max_mpg = max(mpg) ) mtcars_stats #> # A tibble: 3 × 6 #> cyl n_cars avg_mpg sd_mpg min_mpg max_mpg #> <dbl> <int> <dbl> <dbl> <dbl> <dbl> #> 1 4 11 26.7 4.51 21.4 33.9 #> 2 6 7 19.7 1.45 17.8 21.4 #> 3 8 14 15.1 2.56 10.4 19.2

Five summary columns in one pass. Notice how n() is a special helper, it doesn't take a column name because it simply counts rows in the current group. The standard deviation column reveals something the mean alone hides: six-cylinder cars cluster tightly (sd = 1.45), while four-cylinder cars are all over the map (sd = 4.51). That's a story a mean-only table would completely bury.

Tip

Name summary columns by what they contain, not by how you computed them. Prefer avg_mpg to mean_mpg_value and n_orders to count_of_orders. Short, business-friendly names keep downstream filter() and arrange() calls readable.

Try it: Extend the mtcars summary above to also include median_mpg and mpg_range (max minus min). Save to ex_mtcars_stats.

RExercise: Add median and range

# Try it: add median and range ex_mtcars_stats <- mtcars |> group_by(cyl) |> summarise( avg_mpg = mean(mpg), # your code here ) ex_mtcars_stats #> Expected: 3 rows × 4 columns (cyl, avg_mpg, median_mpg, mpg_range)

Click to reveal solution

RMedian and range solution

ex_mtcars_stats <- mtcars |> group_by(cyl) |> summarise( avg_mpg = mean(mpg), median_mpg = median(mpg), mpg_range = max(mpg) - min(mpg) ) ex_mtcars_stats #> # A tibble: 3 × 4 #> cyl avg_mpg median_mpg mpg_range #> <dbl> <dbl> <dbl> <dbl> #> 1 4 26.7 26 12.5 #> 2 6 19.7 19.7 3.6 #> 3 8 15.1 15.2 8.8

Explanation: Any R expression returning length 1 per group works, including arithmetic on other summaries like max(mpg) - min(mpg).

How do you group by more than one variable?

Business questions often nest: average mpg by cylinder count and transmission type, monthly revenue by region and product. Pass multiple columns to group_by() and dplyr creates one group for every unique combination. The result table then has one row per combination.

RGroup by two variables

cyl_gear_stats <- mtcars |> group_by(cyl, gear) |> summarise( n_cars = n(), avg_mpg = mean(mpg), .groups = "drop" ) cyl_gear_stats #> # A tibble: 8 × 4 #> cyl gear n_cars avg_mpg #> <dbl> <dbl> <int> <dbl> #> 1 4 3 1 21.5 #> 2 4 4 8 26.9 #> 3 4 5 2 28.2 #> 4 6 3 2 19.8 #> 5 6 4 4 19.8 #> 6 6 5 1 19.7 #> 7 8 3 12 15.0 #> 8 8 5 2 15.4

Three cylinder counts times up-to-three gear counts gives eight populated combinations (not nine, no eight-cylinder car in mtcars has four gears). The .groups = "drop" line at the end removes all grouping from the result, which you almost always want. The next section explains why.

Try it: In the built-in starwars dataset, count how many characters exist for each combination of species and sex. Save the result to ex_sw_counts with column name n_chars.

RExercise: Count species and sex

# Try it: count by species × sex ex_sw_counts <- starwars |> group_by(species, sex) |> summarise( # your code here .groups = "drop" ) head(ex_sw_counts, 4) #> Expected: 4 rows with columns species, sex, n_chars

Click to reveal solution

RSpecies and sex solution

ex_sw_counts <- starwars |> group_by(species, sex) |> summarise(n_chars = n(), .groups = "drop") head(ex_sw_counts, 4) #> # A tibble: 4 × 3 #> species sex n_chars #> <chr> <chr> <int> #> 1 Aleena male 1 #> 2 Besalisk male 1 #> 3 Cerean male 1 #> 4 Chagrian male 1

Explanation: n() needs no arguments, it always counts rows in the current group, which after a two-variable group_by() means rows sharing both values.

Which summary functions work inside summarise()?

summarise() accepts any R function, as long as that function returns a single value when given a vector. That rule is simple but important: mean(x) returns one number, so it works. range(x) returns two numbers, so it doesn't (you'd need min(x) and max(x) separately). Here are the functions you'll use 90% of the time:

Function	What it returns	Example use
`mean(x)`	Arithmetic mean	Average order value
`median(x)`	Median	Typical salary, robust to outliers
`sd(x)`, `var(x)`	Spread	Volatility, consistency
`min(x)`, `max(x)`	Extremes	Fastest/slowest, cheapest/priciest
`n()`	Row count in current group	Number of orders per customer
`n_distinct(x)`	Unique value count	Distinct products bought per customer
`sum(x)`	Total	Total revenue per region
`first(x)`, `last(x)`	First/last value	Opening price, closing price
`quantile(x, 0.9)`	90th percentile	SLA thresholds

Here's n_distinct() in action, a question prose can barely ask concisely: how many unique hair colors does each species in Star Wars have?

RCount distinct values per group

sw_stats <- starwars |> filter(!is.na(hair_color)) |> group_by(species) |> summarise( n_chars = n(), distinct_hair = n_distinct(hair_color) ) |> arrange(desc(distinct_hair)) head(sw_stats, 5) #> # A tibble: 5 × 3 #> species n_chars distinct_hair #> <chr> <int> <int> #> 1 Human 31 6 #> 2 Wookiee 2 2 #> 3 Droid 4 1 #> 4 Twi'lek 2 1 #> 5 Gungan 3 1

Humans come out on top with six distinct hair colors across 31 characters; most other species have just one. The filter() before group_by() drops rows where hair_color is missing so they don't inflate our distinct counts with NA. Chaining arrange(desc()) at the end produces an immediately readable leaderboard.

Try it: Count the distinct eye_color values per species in starwars. Keep only species with 2 or more characters. Save to ex_hair_counts.

RExercise: Distinct eye colors

# Try it: n_distinct eye_color by species ex_hair_counts <- starwars |> filter(!is.na(eye_color)) |> group_by(species) |> summarise( n_chars = n(), # your code here ) |> filter(n_chars >= 2) |> arrange(desc(distinct_eye)) head(ex_hair_counts, 3) #> Expected: top 3 species by distinct eye colors

Click to reveal solution

RDistinct eye colors solution

ex_hair_counts <- starwars |> filter(!is.na(eye_color)) |> group_by(species) |> summarise( n_chars = n(), distinct_eye = n_distinct(eye_color) ) |> filter(n_chars >= 2) |> arrange(desc(distinct_eye)) head(ex_hair_counts, 3) #> # A tibble: 3 × 3 #> species n_chars distinct_eye #> <chr> <int> <int> #> 1 Human 35 10 #> 2 Droid 5 3 #> 3 Gungan 3 2

Explanation: n_distinct() counts unique non-NA values by default, which is almost always what you want for a distinct count.

What does the .groups argument do after summarise()?

Run a multi-variable group_by() followed by summarise() without .groups and dplyr prints this message:

`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.

That message confuses almost every new user. Here's what it means: when you group by cyl, gear and then summarise, dplyr assumes you might want to run another summary step on the remaining grouping (cyl), so it keeps the result grouped by all-but-the-last variable. Each call to summarise() peels off the innermost grouping layer.

Each summarise() call peels one grouping layer

Figure 2: Each summarise() call peels off one grouping layer. Use .groups = "drop" to stop peeling and return an ungrouped tibble.

The four options for .groups:

Value	Behavior
`"drop_last"`	Peel the last grouping variable (the default, shows the warning)
`"drop"`	Drop all grouping, return a plain ungrouped tibble
`"keep"`	Keep the full original grouping
`"rowwise"`	Treat each resulting row as its own group

RDrop grouping after summarise

cyl_gear_dropped <- mtcars |> group_by(cyl, gear) |> summarise(avg_mpg = mean(mpg), .groups = "drop") # Confirm: no grouping on result group_vars(cyl_gear_dropped) #> character(0)

group_vars() returns an empty character vector, confirming the result is fully ungrouped. That's the state you want 95% of the time, any further mutate() or filter() calls will act on the whole table instead of being silently group-aware.

Warning

Lingering groups cause silent bugs. A mutate(rank = row_number()) after an unintentionally grouped summarise will restart ranks at 1 inside each leftover group rather than numbering the whole table. Always finish with .groups = "drop" or ungroup() unless you specifically want grouped behavior downstream.

Try it: Group mtcars by am and cyl, summarise mean hp, and explicitly drop all grouping from the result. Save to ex_drop_stats.

RExercise: Drop groups explicitly

# Try it: drop grouping explicitly ex_drop_stats <- mtcars |> group_by(am, cyl) |> summarise( # your code here ) group_vars(ex_drop_stats) #> Expected: character(0)

Click to reveal solution

RDrop groups solution

ex_drop_stats <- mtcars |> group_by(am, cyl) |> summarise( avg_hp = mean(hp), .groups = "drop" ) group_vars(ex_drop_stats) #> character(0)

Explanation: Adding .groups = "drop" inside summarise() is cleaner than a separate ungroup() call at the end of the pipeline.

When should you use the new .by argument instead?

dplyr 1.1.0 (released January 2023) introduced a simpler alternative for one-shot aggregations: the .by argument, available directly on summarise(), mutate(), and filter(). It groups inline for that single call and always returns an ungrouped result, sidestepping the .groups confusion entirely.

RInline grouping with .by

# Old way, group_by() + .groups = "drop" by_cyl_old <- mtcars |> group_by(cyl) |> summarise(avg_mpg = mean(mpg), .groups = "drop") # New way, .by argument by_cyl <- mtcars |> summarise(avg_mpg = mean(mpg), .by = cyl) by_cyl #> # A tibble: 3 × 2 #> cyl avg_mpg #> <dbl> <dbl> #> 1 6 19.7 #> 2 4 26.7 #> 3 8 15.1

Same numbers, two fewer lines. For multiple grouping columns, wrap them in c(): .by = c(cyl, gear). The .by form shines when grouping is a one-off and no downstream operations need the grouping to persist. When grouping should carry through several pipeline steps, keep using group_by().

Note

.by requires dplyr 1.1.0 or newer. Check your version with packageVersion("dplyr"). On an older release, stick with group_by() + .groups = "drop", it works identically.

Try it: Rewrite this group_by() call using .by instead. Save to ex_by_rewrite.

RExercise: Rewrite groupby with .by

# Original: # mtcars |> group_by(gear) |> summarise(avg_mpg = mean(mpg), .groups = "drop") # Try it: rewrite with .by ex_by_rewrite <- mtcars |> summarise( # your code here ) ex_by_rewrite #> Expected: 3 rows, columns gear and avg_mpg

Click to reveal solution

Rby rewrite solution

ex_by_rewrite <- mtcars |> summarise(avg_mpg = mean(mpg), .by = gear) ex_by_rewrite #> # A tibble: 3 × 2 #> gear avg_mpg #> <dbl> <dbl> #> 1 4 24.5 #> 2 3 16.1 #> 3 5 21.4

Explanation: .by = gear replaces the whole group_by() |> ... |> ungroup() dance for a single summarise call.

How do you filter groups after summarising?

A common workflow: compute a per-group summary, then keep only the groups meeting some criterion, "regions with revenue above target," "products with more than 100 orders," "customers who bought at least three distinct items." Because summarise() returns a fresh tibble with one row per group, you can pipe it straight into filter() and arrange().

RFilter groups after summarising

top_cyl <- mtcars |> summarise( avg_mpg = mean(mpg), n_cars = n(), .by = cyl ) |> filter(avg_mpg > 18) |> arrange(desc(avg_mpg)) top_cyl #> # A tibble: 2 × 3 #> cyl avg_mpg n_cars #> <dbl> <dbl> <int> #> 1 4 26.7 11 #> 2 6 19.7 7

Two rows survive the avg_mpg > 18 filter, and arrange(desc(avg_mpg)) ranks them. This post-summarise filtering is how dplyr expresses SQL's HAVING clause, filter() before group_by() is WHERE, filter() after summarise() is HAVING. The mental model is the same in both worlds: filter rows, aggregate, then filter groups.

Try it: Find the top 2 gear groups in mtcars by mean hp. Use .by, then arrange and take the top 2. Save to ex_top_gear.

RExercise: Top 2 gears by hp

# Try it: top 2 gears by mean hp ex_top_gear <- mtcars |> summarise( # your code here .by = gear ) |> arrange(desc(avg_hp)) |> head(2) ex_top_gear #> Expected: 2 rows with gear and avg_hp columns

Click to reveal solution

RTop gears solution

ex_top_gear <- mtcars |> summarise(avg_hp = mean(hp), .by = gear) |> arrange(desc(avg_hp)) |> head(2) ex_top_gear #> # A tibble: 2 × 2 #> gear avg_hp #> <dbl> <dbl> #> 1 5 195.6 #> 2 3 176.1

Explanation: After summarise the result is a regular tibble, arrange() + head(2) gives the top-n rows by any column.

Practice Exercises

These capstones combine several patterns from above. Use my_* prefixed variable names so exercise code doesn't clobber variables used earlier in the tutorial.

Exercise 1: Homeworld leaderboard

In the starwars dataset, compute the mean height and character count for each homeworld. Drop rows with missing homeworld or height. Keep only homeworlds with 2 or more characters. Save the result sorted by avg_height descending to my_homeworld.

RExercise: Homeworld leaderboard

# Exercise 1: homeworld leaderboard # Hint: filter NA, group_by homeworld, summarise, filter n>=2, arrange my_homeworld <- starwars |> # write your code arrange(desc(avg_height)) head(my_homeworld, 5)

Click to reveal solution

RHomeworld leaderboard solution

my_homeworld <- starwars |> filter(!is.na(homeworld), !is.na(height)) |> group_by(homeworld) |> summarise( n_chars = n(), avg_height = mean(height), .groups = "drop" ) |> filter(n_chars >= 2) |> arrange(desc(avg_height)) head(my_homeworld, 5) #> # A tibble: 5 × 3 #> homeworld n_chars avg_height #> <chr> <int> <dbl> #> 1 Kamino 3 208. #> 2 Kashyyyk 2 231 #> 3 Corellia 2 175 #> 4 Tatooine 10 170. #> 5 Naboo 11 175.

Explanation: Two filter() calls play different roles, the first drops bad rows before grouping, the second drops small groups after summarising (SQL's WHERE vs HAVING).

Exercise 2: Cylinder efficiency ranking

For each cyl value in mtcars, compute mean mpg, then add a rank column (1 = best mpg). Save to my_mpg_rank. The final result should have columns cyl, avg_mpg, and rank, sorted by rank.

RExercise: Rank cylinders by mpg

# Exercise 2: rank cyl groups by mpg # Hint: summarise with .by, then arrange and mutate a rank column my_mpg_rank <- mtcars |> # write your code my_mpg_rank

Click to reveal solution

RCylinder rank solution

my_mpg_rank <- mtcars |> summarise(avg_mpg = mean(mpg), .by = cyl) |> arrange(desc(avg_mpg)) |> mutate(rank = row_number()) my_mpg_rank #> # A tibble: 3 × 3 #> cyl avg_mpg rank #> <dbl> <dbl> <int> #> 1 4 26.7 1 #> 2 6 19.7 2 #> 3 8 15.1 3

Explanation: Using .by keeps the result ungrouped, so row_number() numbers the whole table instead of restarting inside groups.

Exercise 3: Diamond cut efficiency

In the ggplot2::diamonds dataset, for each cut level compute the median price and the median price / carat (a crude "price efficiency" metric). Sort the result so the cut with the highest median price-per-carat is at the top. Save to my_cut_eff.

RExercise: Diamonds price per carat

# Exercise 3: diamonds price efficiency by cut library(ggplot2) # for diamonds my_cut_eff <- diamonds |> # write your code my_cut_eff

Click to reveal solution

RDiamonds price solution

library(ggplot2) my_cut_eff <- diamonds |> summarise( median_price = median(price), median_price_car = median(price / carat), .by = cut ) |> arrange(desc(median_price_car)) my_cut_eff #> # A tibble: 5 × 3 #> cut median_price median_price_car #> <ord> <dbl> <dbl> #> 1 Premium 3185 4453. #> 2 Fair 3282 3823. #> 3 Very Good 2648 3805. #> 4 Good 3050 3757. #> 5 Ideal 1810 3715. #> # Premium cut has the highest median price-per-carat.

Explanation: Derived quantities (price / carat) can be computed directly inside summarise(), no need to mutate() first. Premium beats Ideal here because Premium diamonds tend to be larger on average.

Complete Example

Here's a full end-to-end workflow on starwars that stitches every verb from this tutorial together: filter bad rows, group, compute several summaries, filter groups, sort. This is what 80% of real-world dplyr code looks like.

REnd-to-end species summary

species_summary <- starwars |> filter(!is.na(mass), !is.na(height)) |> summarise( n_chars = n(), avg_height = mean(height), avg_mass = mean(mass), max_mass = max(mass), .by = species ) |> filter(n_chars >= 2) |> arrange(desc(avg_mass)) head(species_summary, 5) #> # A tibble: 5 × 5 #> species n_chars avg_height avg_mass max_mass #> <chr> <int> <dbl> <dbl> <dbl> #> 1 Wookiee 2 231 124 124 #> 2 Gungan 2 235. 74 74 #> 3 Zabrak 2 173 80 80 #> 4 Twi'lek 2 179 55 55 #> 5 Human 22 177. 82.8 120

Every line corresponds to one question: drop missing mass or height, for each species, count and summarise, keep species with at least 2 characters, sort by average mass. That readability is the whole reason dplyr exists.

Summary

Concept	What to remember
`group_by(var)`	Marks invisible dividers between rows, no physical sorting or movement
`summarise(name = f(x))`	Collapses each group into one row; any length-1 function works
`n()`	Counts rows in the current group, no arguments needed
`n_distinct(x)`	Counts unique non-NA values per group
Multiple groupings	`group_by(a, b)` creates one group per unique combination
`.groups = "drop"`	Return an ungrouped tibble, almost always what you want
`.by` (dplyr 1.1+)	Inline grouping for one call; always returns ungrouped
Filter → Summarise → Filter	SQL's `WHERE` → `GROUP BY` → `HAVING` translated to dplyr

References

dplyr reference, group_by(). Link
dplyr reference, summarise(). Link
dplyr 1.1.0 release notes, introducing .by. Link
Wickham, H., & Grolemund, G., R for Data Science, Chapter 4: Data Transformation. Link
dplyr grouping vignette. Link
Wickham, H., "The Split-Apply-Combine Strategy for Data Analysis." Journal of Statistical Software 40(1), 2011. Link
Posit tidyverse blog, dplyr 1.0.0 summarise changes. Link

Continue Learning

dplyr filter() and select(), Row and column subsetting, the verbs you reach for before group_by.
dplyr mutate() and rename(), Add computed columns; pairs naturally with summarise for feature engineering.
R Pipe Operator, The |> and %>% operators that glue these pipelines together.

Navigate

Tidyverse packages

Deep dives

Wrangling & EDA

Statistics

Machine Learning

Time Series

By Industry

Reporting & Apps

Levels

dplyr group_by() + summarise(): The Combination That Answers Most Business Questions

What does group_by() + summarise() actually do?

How do you summarise multiple columns at once?

How do you group by more than one variable?

Which summary functions work inside summarise()?

What does the .groups argument do after summarise()?

When should you use the new .by argument instead?

How do you filter groups after summarising?

Practice Exercises

Exercise 1: Homeworld leaderboard

Exercise 2: Cylinder efficiency ranking

Exercise 3: Diamond cut efficiency

Complete Example

Summary

References

Continue Learning

Further Reading

Navigate

Tidyverse packages

Deep dives

Wrangling & EDA

Statistics

Machine Learning

Time Series

By Industry

Reporting & Apps

Levels

dplyr group_by() + summarise(): The Combination That Answers Most Business Questions

What does group_by() + summarise() actually do?

How do you summarise multiple columns at once?

How do you group by more than one variable?

Which summary functions work inside summarise()?

What does the .groups argument do after summarise()?

When should you use the new .by argument instead?

How do you filter groups after summarising?

Practice Exercises

Exercise 1: Homeworld leaderboard

Exercise 2: Cylinder efficiency ranking

Exercise 3: Diamond cut efficiency

Complete Example

Summary

References

Continue Learning

Further Reading

Related Tutorials