dplyr summarise() in R: Aggregate Data with Stats

The summarise() function in dplyr collapses many rows into one summary row using aggregation functions like mean(), sum(), or n(). Combined with group_by() or the .by argument, it produces one summary per group.

⚡ Quick Answer
summarise(df, avg = mean(mpg))                       # one summary
summarise(df, n = n(), avg = mean(mpg))              # multiple stats
summarise(df, avg = mean(mpg), .by = cyl)            # one row per group
group_by(df, cyl) |> summarise(avg = mean(mpg))      # same via group_by
summarise(df, across(where(is.numeric), mean))       # all numeric cols
summarise(df, p25 = quantile(mpg, 0.25), p75 = quantile(mpg, 0.75))  # custom
summarise(df, .by = cyl, n = n(), mn = min(mpg), mx = max(mpg))      # multi

Need explanation? Read on for examples and pitfalls.

📊 Is summarise() the right tool?
  • collapse rows to one per group: summarise(df, m = mean(x), .by = g)
  • keep all rows but add summary col: mutate(df, m = mean(x), .by = g)
  • just count rows per group: count(df, g)
  • one row per group from raw data: distinct(df, g, .keep_all = TRUE)
  • multiple values per group (vector): reframe(df, q = quantile(x, c(.25, .75)), .by = g)
  • window function (rank, lag): mutate(df, rk = rank(x), .by = g)
  • aggregate without grouping: summarise(df, m = mean(x))

What summarise() does in one sentence

summarise() collapses rows into a single summary row. You give it a data frame and one or more aggregation expressions of the form name = function(column). The result has one row total (or one row per group when grouped), with only the columns you named.

Note: dplyr accepts both summarise() (British) and summarize() (American). They are aliases. Use either.

Unlike base R aggregate(), summarise integrates into pipelines, computes multiple statistics in one call, names result columns clearly, and pairs naturally with group_by() or .by for per-group aggregation.

Syntax

summarise() takes a data frame plus aggregation expressions. Each expression should return a single value per group. Use n() for row counts, n_distinct() for unique counts, and across() to apply the same function to many columns.

R: Load dplyr and inspect mtcars
library(dplyr)

mtcars |> select(mpg, cyl, hp) |> head(3)
#>                mpg cyl  hp
#> Mazda RX4     21.0   6 110
#> Mazda RX4 Wag 21.0   6 110
#> Datsun 710    22.8   4  93

  

The full signature is:

summarise(.data, ..., .by = NULL, .groups = NULL)

.data is the data frame. The ... argument takes one or more name = aggregation_expr pairs. .by provides ad-hoc grouping. .groups controls what to do with the grouping after summarise (relevant only when chained from group_by()).
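As a minimal sketch of the .by argument with more than one grouping column (assuming dplyr 1.1.0 or later, where .by is available, and using the built-in mtcars data; the object name by_cyl_am is only illustrative):

```r
library(dplyr)

# Group by two columns for this call only; .by accepts tidyselect syntax.
by_cyl_am <- mtcars |>
  summarise(avg_mpg = mean(mpg), n = n(), .by = c(cyl, am))

by_cyl_am

# The result carries no grouping, so later verbs operate on plain rows.
is_grouped_df(by_cyl_am)
```

Each distinct cyl/am combination produces one row, and the grouping does not persist past the call.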

Tip
Use .by for one-off grouped summaries; use group_by() when downstream verbs also need the grouping. .by always returns an ungrouped result. group_by() followed by summarise() drops only the last grouping key, so with two or more keys the result stays grouped, which can surprise the next mutate or filter. Reach for .by first.

Seven common patterns

1. Single summary across all rows

R: Mean mpg of all cars
mtcars |> summarise(avg_mpg = mean(mpg))
#>    avg_mpg
#> 1 20.09062

  

The result is a one-row data frame with the column name you supplied.

2. Multiple statistics in one call

R: Count, mean, sd, min, max of mpg
mtcars |>
  summarise(
    n = n(),
    avg = mean(mpg),
    sd = sd(mpg),
    mn = min(mpg),
    mx = max(mpg)
  )
#>    n      avg       sd   mn   mx
#> 1 32 20.09062 6.026948 10.4 33.9

  

n() returns the row count. Each named expression becomes a column in the result.

3. Per-group summary with .by

R: Mean mpg per cylinder count
mtcars |> summarise(avg_mpg = mean(mpg), n = n(), .by = cyl)
#>   cyl  avg_mpg  n
#> 1   6 19.74286  7
#> 2   4 26.66364 11
#> 3   8 15.10000 14

  

.by = cyl groups for this single call only. The result is automatically ungrouped.

4. Per-group summary with group_by

R: Same result via group_by then summarise
mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg), n = n())
#> # A tibble: 3 x 3
#>     cyl avg_mpg     n
#>   <dbl>   <dbl> <int>
#> 1     4    26.7    11
#> 2     6    19.7     7
#> 3     8    15.1    14

  

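The values match the .by version, with two visible differences: group_by() sorts the rows by group key and returns a tibble, while .by keeps groups in order of first appearance and returns the same type as the input. Choose .by when grouping is local to this call; choose group_by() when subsequent verbs in the pipeline also need it.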

5. Apply same function to many columns

R: Mean of every numeric column, by cyl
mtcars |> summarise(across(where(is.numeric), mean), .by = cyl)
#>   cyl      mpg     disp        hp     drat       wt     qsec ...
#> 1   6 19.74286 183.3143 122.28571 3.585714 3.117143 17.97714
#> 2   4 26.66364 105.1364  82.63636 4.070909 2.285727 19.13727
#> 3   8 15.10000 353.1000 209.21429 3.229286 3.999214 16.77214

  

across() plus a tidyselect helper applies one function to multiple columns. It replaces the superseded summarise_at(), summarise_if(), and summarise_all().

6. Custom quantile statistics

R: 25th and 75th percentiles of mpg
mtcars |>
  summarise(
    p25 = quantile(mpg, 0.25),
    median = median(mpg),
    p75 = quantile(mpg, 0.75)
  )
#>      p25 median  p75
#> 1 15.425   19.2 22.8

  

Any function that returns a scalar from a vector works inside summarise.

7. Distinct counts and presence checks

R: Distinct counts and presence flags
starwars |>
  summarise(
    n_chars = n(),
    n_species = n_distinct(species, na.rm = TRUE),
    has_gold = any(skin_color == "gold", na.rm = TRUE)
  )
#> # A tibble: 1 x 3
#>   n_chars n_species has_gold
#>     <int>     <int> <lgl>
#> 1      87        37 TRUE

  

n_distinct() counts unique values. Predicates like any() and all() produce TRUE/FALSE summaries.

Key Insight
Every expression inside summarise should return a single value per group. mean(x) returns one number, which is fine. range(x) returns two, which has been deprecated since dplyr 1.1.0: it triggers a warning and yields two rows per group instead of one. To split multi-value returns into separate columns, name each one explicitly: mn = min(x), mx = max(x). When you genuinely need vector-valued summaries, use reframe() (dplyr 1.1+).

summarise() vs base R aggregation

Base R offers aggregate() and tapply() for per-group aggregation; summarise() covers the same ground in pipeline-friendly syntax. summarise() always returns a data frame. tapply() returns an array, and aggregate() returns a data frame whose value column keeps the input column's name (or becomes a matrix column when the function returns several values).

Task | dplyr | Base R
Mean of one column | summarise(df, m = mean(x)) | mean(df$x)
Mean by group | summarise(df, m = mean(x), .by = g) | aggregate(x ~ g, df, mean)
Multiple stats by group | summarise(df, m = mean(x), s = sd(x), .by = g) | aggregate(x ~ g, df, function(v) c(m = mean(v), s = sd(v))) (awkward: the stats land in one matrix column)
Count rows by group | summarise(df, n = n(), .by = g) | table(df$g) (returns a table, not a data frame)
Across many columns | summarise(df, across(where(is.numeric), mean), .by = g) | aggregate(. ~ g, df, mean)
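To make the "mean by group" comparison concrete, here is a small sketch on mtcars (the object names are illustrative). It computes the same group means both ways and checks that the numbers agree once the rows are in the same order:

```r
library(dplyr)

# dplyr: one row per cyl, in order of first appearance, column named m
dplyr_means <- mtcars |> summarise(m = mean(mpg), .by = cyl)

# base R: formula interface, rows sorted by cyl, value column keeps the name "mpg"
base_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Sort the dplyr result the same way before comparing the numbers
dplyr_sorted <- dplyr_means |> arrange(cyl)
all.equal(dplyr_sorted$m, base_means$mpg)
```

The main practical differences are row order and column naming, not the computed values.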

When to use which:

  • Use summarise() for any multi-statistic or pipelined aggregation.
  • Use base R mean(), sum(), tapply(), etc. for one-line scripts on a vector.

Common pitfalls

Pitfall 1: forgetting NA handling. mean(starwars$mass) returns NA because some rows are missing. mean(starwars$mass, na.rm = TRUE) ignores missing values. Inside summarise: summarise(starwars, avg = mean(mass, na.rm = TRUE)).
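A short sketch of the NA pitfall using the starwars data that ships with dplyr (the column names in the result are illustrative). Reporting how many values were missing alongside the cleaned mean makes the summary easier to trust:

```r
library(dplyr)

mass_summary <- starwars |>
  summarise(
    avg_naive = mean(mass),                 # NA: at least one mass is missing
    avg_clean = mean(mass, na.rm = TRUE),   # missing values dropped first
    n_missing = sum(is.na(mass)),           # how many rows were dropped
    n_used    = sum(!is.na(mass))           # how many rows the mean is based on
  )

mass_summary
```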

Pitfall 2: result is grouped after group_by() |> summarise(). By default summarise() peels off only the last level of grouping, so the result is still grouped if you grouped by multiple keys. Either chain ungroup() or use .groups = "drop" to fully ungroup. Or use .by instead.
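The sketch below illustrates this pitfall on mtcars with two grouping keys, using group_vars() to inspect what grouping is left on each result (object names are illustrative, and the .by variant assumes dplyr 1.1.0 or later):

```r
library(dplyr)

# Two grouping keys: the default drops only the last one (am),
# so the result is still grouped by cyl. dplyr prints a message saying so.
still_grouped <- mtcars |>
  group_by(cyl, am) |>
  summarise(avg_mpg = mean(mpg))
group_vars(still_grouped)

# Option 1: drop all grouping inside summarise()
dropped <- mtcars |>
  group_by(cyl, am) |>
  summarise(avg_mpg = mean(mpg), .groups = "drop")
group_vars(dropped)

# Option 2: skip group_by() entirely
via_by <- mtcars |>
  summarise(avg_mpg = mean(mpg), .by = c(cyl, am))
group_vars(via_by)
```

Only the first result reports a remaining grouping variable; the other two come back ungrouped.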

Warning
British vs American spelling: pick one and stick with it. dplyr accepts summarise() and summarize() interchangeably, and ggplot2 does the same for colour/color. Mixing them in the same codebase confuses readers and grep searches.

Pitfall 3: trying to return multiple values per group. summarise(df, q = quantile(x, c(0.25, 0.75))) has been deprecated since dplyr 1.1.0: it warns and returns two rows per group, because each expression is expected to return a single value. Either name each value (p25 = quantile(x, 0.25), p75 = quantile(x, 0.75)) or use reframe() (dplyr 1.1+) for vector returns.
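As a sketch of the reframe() route (assuming dplyr 1.1.0 or later; the prob and mpg_q column names are illustrative), the call below returns two rows per cyl group, with a label column so each quantile stays identifiable:

```r
library(dplyr)

# reframe() allows any number of rows per group
quartiles <- mtcars |>
  reframe(
    prob  = c(0.25, 0.75),                    # which quantile each row holds
    mpg_q = quantile(mpg, c(0.25, 0.75)),     # the quantile values themselves
    .by = cyl
  )

quartiles
```

With three cyl groups and two quantiles each, the result has six rows and is ungrouped.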

Try it yourself

Try it: For each cyl group in mtcars, compute the mean mpg and the count of rows. Save the result to ex_by_cyl.

R: Your turn: per-group summary
# Try it: mean mpg and row count per cylinder
ex_by_cyl <- # your code here
ex_by_cyl
#> Expected: 3 rows, one per cyl, with avg_mpg and n columns

  
R: Solution
ex_by_cyl <- mtcars |>
  summarise(avg_mpg = mean(mpg), n = n(), .by = cyl)
ex_by_cyl
#>   cyl  avg_mpg  n
#> 1   6 19.74286  7
#> 2   4 26.66364 11
#> 3   8 15.10000 14

  

Explanation: summarise() with .by = cyl produces one row per unique cyl value. n() counts rows in each group; mean(mpg) averages within each group. The .by form auto-ungroups the result.

After mastering summarise(), look at:

  • group_by(), ungroup(): persistent grouping for chained operations
  • count(): shortcut for summarise(n = n(), .by = ...)
  • tally(), add_count(): variations on row counting
  • reframe(): for summaries that return multiple rows per group
  • n(), n_distinct(), cur_group(), cur_group_id(): helpers that work inside summarise
  • across() with tidyselect: bulk column summaries

For very large data, also check data.table, which offers equivalent grouped aggregation and is often faster.

FAQ

What is the difference between summarise and summarize in dplyr?

They are identical aliases. dplyr accepts both spellings to support British and American English users. Pick one and stick with it for consistency.

How do I summarise multiple columns in dplyr?

Use across() with a tidyselect helper: summarise(df, across(where(is.numeric), mean)) computes the mean of every numeric column. To apply multiple functions: summarise(df, across(where(is.numeric), list(mean = mean, sd = sd))).
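A minimal sketch of the multiple-function form on mtcars, restricted to two columns to keep the output narrow. The .names argument controls how result columns are labelled; the object name multi is illustrative:

```r
library(dplyr)

multi <- mtcars |>
  summarise(
    across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.col}_{.fn}"),
    .by = cyl
  )

# One column per (input column, function) pair, after the grouping column
names(multi)
```

Naming the functions in the list is what gives the output columns readable suffixes.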

How do I count rows in dplyr summarise?

Use n() inside summarise: summarise(df, n = n()). For unique-value counts use n_distinct(col). For a quick per-group shortcut, count(df, group_col) returns the same counts as df |> summarise(n = n(), .by = group_col), with rows sorted by group_col.
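The relationship between count() and summarise(n = n()) can be checked with a small sketch on mtcars (object names are illustrative, and the .by form assumes dplyr 1.1.0 or later):

```r
library(dplyr)

via_count     <- mtcars |> count(cyl)                      # rows sorted by cyl
via_summarise <- mtcars |> summarise(n = n(), .by = cyl)   # order of first appearance

# Put both in the same row order, then compare the counts
sorted <- via_summarise |> arrange(cyl)
identical(via_count$n, sorted$n)
```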

What is the difference between summarise with .by vs group_by?

.by groups for the single summarise call and auto-ungroups the result. group_by() sets a persistent grouping that affects subsequent verbs (filter, mutate, summarise) until you call ungroup(). Use .by for one-off summaries; use group_by() for chains that need the grouping throughout.

Can I use custom functions inside summarise?

Yes, any function that returns a scalar per group works: summarise(df, p90 = quantile(x, 0.9)). For functions returning multiple values, use reframe() (dplyr 1.1+) instead of summarise().
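For example, a user-defined helper can sit next to built-in functions in the same call. The coef_var() function below is a hypothetical helper written for this sketch, not part of dplyr; it returns one number per input vector, which is what summarise() expects:

```r
library(dplyr)

# Hypothetical helper: coefficient of variation (sd divided by mean)
coef_var <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv_by_cyl <- mtcars |>
  summarise(
    cv_mpg  = coef_var(mpg),          # custom function
    p90_mpg = quantile(mpg, 0.9),     # built-in function with an extra argument
    .by = cyl
  )

cv_by_cyl
```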