dplyr summarise() in R: Aggregate Data with Stats
The summarise() function in dplyr collapses many rows into one summary row using aggregation functions like mean(), sum(), or n(). Combined with group_by() or the .by argument, it produces one summary per group.
summarise(df, avg = mean(mpg))                                        # one summary
summarise(df, n = n(), avg = mean(mpg))                               # multiple stats
summarise(df, avg = mean(mpg), .by = cyl)                             # one row per group
group_by(df, cyl) |> summarise(avg = mean(mpg))                       # same via group_by
summarise(df, across(where(is.numeric), mean))                        # all numeric cols
summarise(df, p25 = quantile(mpg, 0.25), p75 = quantile(mpg, 0.75))   # custom
summarise(df, .by = cyl, n = n(), mn = min(mpg), mx = max(mpg))       # multi
Need explanation? Read on for examples and pitfalls.
What summarise() does in one sentence
summarise() collapses rows into a single summary row. You give it a data frame and one or more aggregation expressions of the form name = function(column). The result has one row total (or one row per group when grouped) and contains only the columns you named, plus the grouping columns when grouped.
Note: dplyr accepts both summarise() (British) and summarize() (American). They are aliases. Use either.
Unlike base R aggregate(), summarise integrates into pipelines, computes multiple statistics in one call, names result columns clearly, and pairs naturally with group_by() or .by for per-group aggregation.
Syntax
summarise() takes a data frame plus aggregation expressions. Each expression must return a scalar (single value) per group. Use n() for row counts, n_distinct() for unique counts, and across() for applying the same function to many columns.
The full signature is:
summarise(.data, ..., .by = NULL, .groups = NULL)
.data is the data frame. The ... argument takes one or more name = aggregation_expr pairs. .by provides ad-hoc grouping. .groups controls what to do with the grouping after summarise (relevant only when chained from group_by()).
Tip: Use .by for one-off grouped summaries; use group_by() when downstream verbs also need the grouping. .by always returns an ungrouped result; group_by() with several keys leaves the result grouped by all but the last key, which can surprise the next mutate or filter. Reach for .by first.
Seven common patterns
1. Single summary across all rows
The result is a one-row data frame with the column name you supplied.
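A minimal sketch of this pattern, assuming the built-in mtcars data set stands in for df (the name overall is only illustrative):

```r
library(dplyr)

# Collapse every row of mtcars into a single summary row.
overall <- summarise(mtcars, avg = mean(mpg))

# One row, one column named "avg".
overall
```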
2. Multiple statistics in one call
n() returns the row count. Each named expression becomes a column in the result.
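As a rough illustration, again assuming mtcars as the input (the column names n, avg, and sd are arbitrary choices):

```r
library(dplyr)

# Three named expressions produce three columns in a one-row result.
stats <- summarise(
  mtcars,
  n   = n(),        # number of rows
  avg = mean(mpg),  # mean miles per gallon
  sd  = sd(mpg)     # standard deviation of mpg
)

stats
```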
3. Per-group summary with .by
.by = cyl groups for this single call only. The result is automatically ungrouped.
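A small sketch of the .by form, assuming dplyr 1.1 or later and mtcars as the data (by_cyl is an illustrative name):

```r
library(dplyr)

# Group by cyl for this call only; no group_by() or ungroup() needed.
by_cyl <- summarise(mtcars, avg = mean(mpg), .by = cyl)

by_cyl

# The result carries no grouping.
is_grouped_df(by_cyl)
```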
4. Per-group summary with group_by
Functionally identical to the .by version. Choose .by when grouping is local to this call; group_by() when subsequent verbs in the pipeline also need it.
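The following sketch compares the two forms on mtcars (an assumed stand-in for df). The values match; the row order may differ, because group_by() sorts the group keys while .by keeps them in order of first appearance, so the comparison sorts the .by result first.

```r
library(dplyr)

# Persistent grouping, then summarise.
via_group_by <- mtcars |>
  group_by(cyl) |>
  summarise(avg = mean(mpg))

# Per-call grouping with .by (dplyr 1.1+).
via_by <- summarise(mtcars, avg = mean(mpg), .by = cyl)

# Compare the per-group means after putting both in the same key order.
all.equal(via_group_by$avg, arrange(via_by, cyl)$avg)
```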
5. Apply same function to many columns
across() plus a tidyselect helper applies one function to multiple columns. Replaces legacy summarise_at(), summarise_if(), summarise_all().
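For illustration, here is a sketch on the built-in iris data (chosen here, not in the original, because it mixes numeric columns with a factor column that where(is.numeric) should skip):

```r
library(dplyr)

# Mean of every numeric column; the Species factor is not selected.
num_means <- summarise(iris, across(where(is.numeric), mean))

num_means
```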
6. Custom quantile statistics
Any function that returns a scalar from a vector works inside summarise.
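A brief sketch with quantile(), assuming mtcars as the input; each call asks for a single probability, so each expression yields one value:

```r
library(dplyr)

# One quantile per named expression keeps every result a single value.
quartiles <- summarise(
  mtcars,
  p25 = quantile(mpg, 0.25),
  p75 = quantile(mpg, 0.75)
)

quartiles
```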
7. Distinct counts and presence checks
n_distinct() counts unique values. Predicates like any() and all() produce TRUE/FALSE summaries.
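One way this might look on mtcars, grouped by transmission type (am); the thresholds and column names are illustrative choices, not taken from the original:

```r
library(dplyr)

checks <- summarise(
  mtcars,
  n_cyl        = n_distinct(cyl),  # how many different cylinder counts
  any_over_30  = any(mpg > 30),    # does any car exceed 30 mpg?
  all_positive = all(hp > 0),      # is every horsepower value positive?
  .by = am
)

checks
```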
mean(x) returns one number, fine. range(x) returns two numbers, which is not a valid summary: dplyr 1.1 deprecates multi-value results in summarise() and warns. To split multi-value returns into separate columns, name each one explicitly: mn = min(x), mx = max(x). As of dplyr 1.1+, reframe() is the alternative when you genuinely need vector-valued summaries.
summarise() vs base R aggregation
Base R offers aggregate() and tapply() for per-group aggregation; summarise() covers the same ground in pipeline-friendly syntax. The result of summarise() is always a data frame; tapply() returns an array, and aggregate() returns a data frame whose columns become awkward (a matrix column) once you compute more than one statistic.
| Task | dplyr | Base R |
|---|---|---|
| Mean of one column | summarise(df, m = mean(x)) | mean(df$x) |
| Mean by group | summarise(df, m = mean(x), .by = g) | aggregate(x ~ g, df, mean) |
| Multiple stats by group | summarise(df, m = mean(x), s = sd(x), .by = g) | aggregate(x ~ g, df, function(v) c(m = mean(v), s = sd(v))) (awkward: yields a matrix column) |
| Count rows by group | summarise(df, n = n(), .by = g) | table(df$g) |
| Across many columns | summarise(df, across(where(is.numeric), mean), .by = g) | aggregate(. ~ g, df, mean) |
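To make the comparison concrete, here is a sketch of the "mean by group" task done both ways, assuming mtcars with mpg as x and cyl as g. The dplyr result is sorted before comparing because aggregate() returns groups in sorted order.

```r
library(dplyr)

# dplyr: one row per cyl, sorted so it lines up with aggregate().
dplyr_means <- summarise(mtcars, m = mean(mpg), .by = cyl) |>
  arrange(cyl)

# Base R: formula interface, groups come back sorted by cyl.
base_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Both approaches compute the same per-group means.
all.equal(dplyr_means$m, base_means$mpg)
```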
When to use which:
- Use summarise() for any multi-statistic or pipelined aggregation.
- Use base R mean(), sum(), tapply(), etc. for one-line scripts on a vector.
Common pitfalls
Pitfall 1: forgetting NA handling. mean(starwars$mass) returns NA because some mass values are missing. mean(starwars$mass, na.rm = TRUE) ignores the missing values. Inside summarise: summarise(starwars, avg = mean(mass, na.rm = TRUE)).
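A short sketch of the difference, using the starwars data set that ships with dplyr (the column names naive and robust are illustrative):

```r
library(dplyr)

na_demo <- summarise(
  starwars,
  naive  = mean(mass),               # NA, because mass has missing values
  robust = mean(mass, na.rm = TRUE)  # mean of the non-missing values
)

na_demo
```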
Pitfall 2: result is still grouped after group_by() |> summarise(). By default summarise() peels off only the last level of grouping, so the result stays grouped if you grouped by multiple keys. Either chain ungroup() or use .groups = "drop" to fully ungroup. Or use .by instead.
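The sketch below illustrates the leftover grouping with two keys on mtcars (cyl and am are assumed keys chosen for the example) and two ways to avoid it:

```r
library(dplyr)

two_keys <- group_by(mtcars, cyl, am)

# Default: only the last key (am) is dropped, so cyl grouping remains.
default_result <- summarise(two_keys, avg = mean(mpg))
group_vars(default_result)

# Option 1: drop all grouping explicitly.
dropped <- summarise(two_keys, avg = mean(mpg), .groups = "drop")

# Option 2: per-call grouping with .by (dplyr 1.1+), never grouped afterwards.
by_result <- summarise(mtcars, avg = mean(mpg), .by = c(cyl, am))

is_grouped_df(dropped)
is_grouped_df(by_result)
```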
Note: The tidyverse accepts summarise() and summarize() interchangeably, and the same for colour/color. Mixing them in the same codebase confuses readers and grep searches. Pick the spelling your team prefers and use it consistently.
Pitfall 3: trying to return multiple values per group. summarise(df, q = quantile(x, c(0.25, 0.75))) does not give a clean one-row summary, because each expression should return a single value per group; dplyr 1.1 deprecates multi-value results and warns. Either name each value (p25 = quantile(x, 0.25), p75 = quantile(x, 0.75)) or use reframe() (dplyr 1.1+) for vector returns.
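Both workarounds can be sketched as follows, assuming dplyr 1.1 or later and mtcars as the data (the prob and q column names are illustrative):

```r
library(dplyr)

# Wide form: one named single-value expression per quantile, one row per group.
quartile_cols <- summarise(
  mtcars,
  p25 = quantile(mpg, 0.25),
  p75 = quantile(mpg, 0.75),
  .by = cyl
)

# Long form: reframe() allows several rows per group.
quartile_rows <- reframe(
  mtcars,
  prob = c(0.25, 0.75),
  q    = quantile(mpg, c(0.25, 0.75)),
  .by = cyl
)

nrow(quartile_cols)  # one row per cyl value
nrow(quartile_rows)  # two rows per cyl value
```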
Try it yourself
Try it: For each cyl group in mtcars, compute the mean mpg and the count of rows. Save the result to ex_by_cyl.
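One possible solution, sketched with the .by form (the column names mean_mpg and n are a free choice; the exercise only fixes the name ex_by_cyl):

```r
library(dplyr)

# Mean mpg and row count for each cyl group in mtcars.
ex_by_cyl <- summarise(
  mtcars,
  mean_mpg = mean(mpg),
  n        = n(),
  .by = cyl
)

ex_by_cyl
```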
Explanation: summarise() with .by = cyl produces one row per unique cyl value. n() counts rows in each group; mean(mpg) averages within each group. The .by form auto-ungroups the result.
Related dplyr functions
After mastering summarise(), look at:
- group_by(), ungroup(): persistent grouping for chained operations
- count(): shortcut for summarise(n = n(), .by = ...)
- tally(), add_count(): variations on row counting
- reframe(): for summaries that return multiple rows per group
- n(), n_distinct(), cur_group(), cur_group_id(): helpers that work inside summarise
- across() with tidyselect: bulk column summaries
For very large data, also check data.table, which offers comparable grouped aggregation in its own syntax, often with faster execution.
FAQ
What is the difference between summarise and summarize in dplyr?
They are identical aliases. dplyr accepts both spellings to support British and American English users. Pick one and stick with it for consistency.
How do I summarise multiple columns in dplyr?
Use across() with a tidyselect helper: summarise(df, across(where(is.numeric), mean)) computes the mean of every numeric column. To apply multiple functions: summarise(df, across(where(is.numeric), list(mean = mean, sd = sd))).
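A sketch of the multi-function form on the built-in iris data (an assumed example data set); with a named list, across() names the output columns by combining the column name and the function name:

```r
library(dplyr)

# Mean and standard deviation for every numeric column.
multi <- summarise(
  iris,
  across(where(is.numeric), list(mean = mean, sd = sd))
)

# Output names look like Sepal.Length_mean, Sepal.Length_sd, and so on.
names(multi)
```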
How do I count rows in dplyr summarise?
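Use n() inside summarise: summarise(df, n = n()). For unique-value counts use n_distinct(col). For a quick per-group shortcut, count(df, group_col) gives the same counts as df |> summarise(n = n(), .by = group_col), although the row order can differ: count() sorts by the group key, while .by keeps groups in order of first appearance.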
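The counting shortcut can be checked with a small sketch on mtcars (an assumed data set); the summarise() result is sorted so that both outputs list the groups in the same order:

```r
library(dplyr)

# Shortcut: count rows per cyl value.
via_count <- count(mtcars, cyl)

# Long form: n() inside summarise, then sort by the key.
via_summarise <- summarise(mtcars, n = n(), .by = cyl) |>
  arrange(cyl)

# Same counts from both approaches.
identical(via_count$n, via_summarise$n)
```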
What is the difference between summarise with .by vs group_by?
.by groups for the single summarise call and auto-ungroups the result. group_by() sets a persistent grouping that affects subsequent verbs (filter, mutate, summarise) until you call ungroup(). Use .by for one-off summaries; use group_by() for chains that need the grouping throughout.
Can I use custom functions inside summarise?
Yes, any function that returns a scalar per group works: summarise(df, p90 = quantile(x, 0.9)). For functions returning multiple values, use reframe() (dplyr 1.1+) instead of summarise().