data.table by in R: Group, Aggregate, and Summarize Rows

The by keyword in data.table groups rows so any expression in DT[i, j, by] runs once per unique value of the grouping columns and returns one result row per group.

⚡ Quick Answer
dt[, .N, by = cyl]                          # count rows per group
dt[, mean(mpg), by = cyl]                   # one summary per group
dt[, .(avg = mean(mpg)), by = cyl]          # name the output column
dt[, mean(mpg), by = .(cyl, gear)]          # group on two columns
dt[, mean(mpg), keyby = cyl]                # group and sort by group
dt[, mean(mpg), by = cyl > 6]               # group by an expression
dt[, lapply(.SD, mean), by = cyl]           # summarize every column

Need explanation? Read on for examples and pitfalls.

📊 Is by the right tool?
STARTaggregate per groupdt[, mean(x), by = grp]aggregate and sort the resultdt[, mean(x), keyby = grp]add a column without aggregatingdt[, new := x * 2]filter rows by conditiondt[x > 5]count rows per groupdt[, .N, by = grp]pick one row per groupdt[, .SD[1], by = grp]summarize many columns at oncedt[, lapply(.SD, mean), by = grp]

What by does in one sentence

The by keyword splits a data.table into groups and evaluates j on each group. You pass one or more grouping columns, and data.table runs the j expression separately for every unique combination, then stacks the results into a new table. The grouping columns appear alongside whatever j returns.

Because by lives inside DT[i, j, by], you can filter rows with i and group what survives in the same call. The split happens on the rows passed by i, not on the whole table.

Syntax

The grouping clause is the third argument in DT[i, j, by]. You can pass by as one column, several columns wrapped in .(), a character vector, or an expression that returns one or more vectors.

Run live
Run live, no install needed. Every R block on this page runs in your browser. Click Run, edit the code, re-run instantly. No setup.
Rdata.table query signature
DT[i, j, by] # i = filter, j = expression, by = group DT[i, j, keyby = ...] # like by, but also sort and key the result

  

The accepted forms are:

  • by = col: one unquoted column name.
  • by = .(col1, col2): several unquoted column names.
  • by = "col" or by = c("col1", "col2"): column names as character.
  • by = expression: any expression that returns a vector (or list of vectors) the same length as the rows being grouped.
  • keyby = ...: identical to by, but the result is sorted and keyed on the grouping columns.

The output is always a new data.table with one row per group.

Examples by use case

Build a data.table once and reuse it for every example. All blocks share the same dt object created from mtcars.

RCreate a data.table from mtcars
library(data.table) dt <- as.data.table(mtcars, keep.rownames = "model") dt[1:3, .(model, mpg, cyl, gear)] #> model mpg cyl gear #> <char> <num> <num> <num> #> 1: Mazda RX4 21.0 6 4 #> 2: Mazda RX4 Wag 21.0 6 4 #> 3: Datsun 710 22.8 4 4

  

Count rows per group with .N and a single grouping column. .N is the row count of the current group.

RCount rows per group
dt[, .N, by = cyl] #> cyl N #> <num> <int> #> 1: 6 7 #> 2: 4 11 #> 3: 8 14

  

Name the summary column inside j. Wrap j in .() and pass name = expression.

ROne named summary per group
dt[, .(avg_mpg = mean(mpg)), by = cyl] #> cyl avg_mpg #> <num> <num> #> 1: 6 19.74286 #> 2: 4 26.66364 #> 3: 8 15.10000

  

Group on two columns by passing .(). Order in the call drives the column order in the output.

RGroup by two columns
dt[, .(avg_mpg = mean(mpg), n = .N), by = .(cyl, gear)] #> cyl gear avg_mpg n #> <num> <num> <num> <int> #> 1: 6 4 19.75000 4 #> 2: 4 4 26.92500 8 #> 3: 6 3 19.75000 2 #> 4: 8 3 15.05000 12 #> 5: 4 3 21.50000 1 #> 6: 4 5 28.20000 2 #> 7: 8 5 15.40000 2 #> 8: 6 5 19.70000 1

  

Group by an expression to create ad-hoc bins. The expression is evaluated once per row and the result drives the grouping.

RGroup by an expression
dt[, .(avg_mpg = mean(mpg), n = .N), by = .(big_engine = cyl > 6)] #> big_engine avg_mpg n #> <lgcl> <num> <int> #> 1: TRUE 15.10000 14 #> 2: FALSE 23.97222 18

  

Summarize many columns at once with .SD and lapply. .SD is the Subset of Data, every column except those listed in by.

RApply a function to every column per group
num_cols <- c("mpg", "hp", "wt") dt[, lapply(.SD, mean), by = cyl, .SDcols = num_cols] #> cyl mpg hp wt #> <num> <num> <num> <num> #> 1: 6 19.742857 122.28571 3.117143 #> 2: 4 26.663636 82.63636 2.285727 #> 3: 8 15.100000 209.21429 3.999214

  
Key Insight
by is the row-splitter; j is the per-group recipe. data.table runs j once for every group, so anything you can write for one table works inside j for every group: mean(), .N, .SD[1], a fitted model, even a plot. The grouping columns are added to the result automatically.

by vs keyby and i grouping

by and keyby are the same operation with different output guarantees. Plain by returns groups in the order they first appear; keyby sorts the result ascending and stores those columns as the table's key, so later queries on it can use binary search.

Form Sorts result? Keys result? Best when
by = cyl No No One-off summaries; row order does not matter
keyby = cyl Yes, ascending Yes You will join, subset, or further query the result
dt[cyl == 4, j] (no by) N/A N/A You want one summary across a filtered slice
by = .(cyl > 6) No No Ad-hoc bins without adding a column

The decision rule is short. Reach for keyby whenever the grouped table will be subset, joined, or queried again. Use plain by when you only need to print or write the result.

Common pitfalls

Forgetting .() around two columns silently groups by only the first one. by = cyl, gear parses gear as a separate argument in some calls and is a parse error in others. Always wrap multi-column groupings in .() or pass a character vector.

RWrong vs right multi-column by
nrow(dt[, .N, by = .(cyl, gear)]) #> [1] 8 nrow(dt[, .N, by = c("cyl", "gear")]) #> [1] 8

  
Warning
by keeps input order inside each group, but does not sort across groups. Plain by returns groups in the order they first appear in the data, which is fast but not alphabetical. If your j depends on order (head(), cummax(), shift()), sort the table with setorder() first or use keyby for sorted output.

.SD includes every non-by column by default, including character columns. lapply(.SD, mean) then warns or returns NA on non-numeric columns. Pass .SDcols explicitly to keep the summary clean.

Tip
Use by = .(name = expr) to label the grouping column in the output. Without a name, data.table calls the column V1 for unnamed expressions, which is hard to read. Naming inside by works the same way as naming inside j.

Try it yourself

Try it: Group airquality by Month and compute the mean Ozone (ignoring NA) and the row count per month. Save the result to ex_by.

RYour turn: group airquality by Month
# Try it: mean Ozone and row count per Month ex_by <- as.data.table(airquality) # your code here ex_by #> Expected: 5 rows with columns Month, avg_ozone, n

  
Click to reveal solution
RSolution
ex_by <- as.data.table(airquality) ex_by <- ex_by[, .(avg_ozone = mean(Ozone, na.rm = TRUE), n = .N), by = Month] ex_by #> Month avg_ozone n #> <int> <num> <int> #> 1: 5 23.61538 31 #> 2: 6 29.44444 30 #> 3: 7 59.11538 31 #> 4: 8 59.96154 31 #> 5: 9 31.44828 30

  

Explanation: Wrapping the two summaries in .() makes the output a data.table with named columns. na.rm = TRUE keeps the means from collapsing to NA on months that have missing Ozone readings.

by is one part of the DT[i, j, by] toolkit. Explore these next:

  • keyby: identical to by but sorts and keys the result.
  • setkey(): sort and key a table for fast binary-search subsets.
  • .SD / .SDcols: the subset of columns visible inside j for each group.
  • frollmean(): rolling means that compose with by for window summaries.
  • dcast(): pivot the long output of a by query into wide form.

See the official data.table introduction for the canonical reference on DT[i, j, by].

FAQ

What does the by keyword do in data.table?

by tells data.table to split the rows by one or more columns and run the j expression separately on each group. The result is a new data.table with one row per unique value of the grouping columns. Because by is the third argument of DT[i, j, by], you can filter and group in a single call, and the grouping happens on the filtered rows.

What is the difference between by and keyby in data.table?

Both group the data and evaluate j per group; the difference is the order and indexing of the output. by returns groups in the order they first appear; keyby sorts the result ascending and stores those columns as the table's key. Use keyby when the grouped output will be subset, joined, or queried again. Use by for one-off summaries.

How do I group by multiple columns in data.table?

Wrap the column names in .(), for example dt[, .N, by = .(cyl, gear)]. You can also pass a character vector, dt[, .N, by = c("cyl", "gear")], which is useful when the names sit in a variable. Both forms produce one row per unique combination, with the grouping columns added to the result.

Can by accept an expression instead of a column name?

Yes. by accepts any expression that returns a vector the same length as the rows being grouped. dt[, .N, by = cyl > 6] groups by the logical result TRUE or FALSE. Naming the expression with .() keeps the output readable, for example by = .(big_engine = cyl > 6). The expression is evaluated once per row.

Does by sort the result?

No. Plain by returns groups in the order they first appear in the input, which matches the speed of the implementation. Use keyby if you need the result sorted by the grouping columns; that form also stores a key, so later subsets and joins on the result use binary search.