data.table by in R: Group, Aggregate, and Summarize Rows
The by keyword in data.table groups rows so any expression in DT[i, j, by] runs once per unique value of the grouping columns and returns one result row per group.
dt[, .N, by = cyl] # count rows per group dt[, mean(mpg), by = cyl] # one summary per group dt[, .(avg = mean(mpg)), by = cyl] # name the output column dt[, mean(mpg), by = .(cyl, gear)] # group on two columns dt[, mean(mpg), keyby = cyl] # group and sort by group dt[, mean(mpg), by = cyl > 6] # group by an expression dt[, lapply(.SD, mean), by = cyl] # summarize every column
Need explanation? Read on for examples and pitfalls.
What by does in one sentence
The by keyword splits a data.table into groups and evaluates j on each group. You pass one or more grouping columns, and data.table runs the j expression separately for every unique combination, then stacks the results into a new table. The grouping columns appear alongside whatever j returns.
Because by lives inside DT[i, j, by], you can filter rows with i and group what survives in the same call. The split happens on the rows passed by i, not on the whole table.
Syntax
The grouping clause is the third argument in DT[i, j, by]. You can pass by as one column, several columns wrapped in .(), a character vector, or an expression that returns one or more vectors.
The accepted forms are:
by = col: one unquoted column name.by = .(col1, col2): several unquoted column names.by = "col"orby = c("col1", "col2"): column names as character.by = expression: any expression that returns a vector (or list of vectors) the same length as the rows being grouped.keyby = ...: identical toby, but the result is sorted and keyed on the grouping columns.
The output is always a new data.table with one row per group.
Examples by use case
Build a data.table once and reuse it for every example. All blocks share the same dt object created from mtcars.
Count rows per group with .N and a single grouping column. .N is the row count of the current group.
Name the summary column inside j. Wrap j in .() and pass name = expression.
Group on two columns by passing .(). Order in the call drives the column order in the output.
Group by an expression to create ad-hoc bins. The expression is evaluated once per row and the result drives the grouping.
Summarize many columns at once with .SD and lapply. .SD is the Subset of Data, every column except those listed in by.
by is the row-splitter; j is the per-group recipe. data.table runs j once for every group, so anything you can write for one table works inside j for every group: mean(), .N, .SD[1], a fitted model, even a plot. The grouping columns are added to the result automatically.by vs keyby and i grouping
by and keyby are the same operation with different output guarantees. Plain by returns groups in the order they first appear; keyby sorts the result ascending and stores those columns as the table's key, so later queries on it can use binary search.
| Form | Sorts result? | Keys result? | Best when |
|---|---|---|---|
by = cyl |
No | No | One-off summaries; row order does not matter |
keyby = cyl |
Yes, ascending | Yes | You will join, subset, or further query the result |
dt[cyl == 4, j] (no by) |
N/A | N/A | You want one summary across a filtered slice |
by = .(cyl > 6) |
No | No | Ad-hoc bins without adding a column |
The decision rule is short. Reach for keyby whenever the grouped table will be subset, joined, or queried again. Use plain by when you only need to print or write the result.
Common pitfalls
Forgetting .() around two columns silently groups by only the first one. by = cyl, gear parses gear as a separate argument in some calls and is a parse error in others. Always wrap multi-column groupings in .() or pass a character vector.
by keeps input order inside each group, but does not sort across groups. Plain by returns groups in the order they first appear in the data, which is fast but not alphabetical. If your j depends on order (head(), cummax(), shift()), sort the table with setorder() first or use keyby for sorted output..SD includes every non-by column by default, including character columns. lapply(.SD, mean) then warns or returns NA on non-numeric columns. Pass .SDcols explicitly to keep the summary clean.
by = .(name = expr) to label the grouping column in the output. Without a name, data.table calls the column V1 for unnamed expressions, which is hard to read. Naming inside by works the same way as naming inside j.Try it yourself
Try it: Group airquality by Month and compute the mean Ozone (ignoring NA) and the row count per month. Save the result to ex_by.
Click to reveal solution
Explanation: Wrapping the two summaries in .() makes the output a data.table with named columns. na.rm = TRUE keeps the means from collapsing to NA on months that have missing Ozone readings.
Related data.table functions
by is one part of the DT[i, j, by] toolkit. Explore these next:
keyby: identical tobybut sorts and keys the result.setkey(): sort and key a table for fast binary-search subsets..SD/.SDcols: the subset of columns visible insidejfor each group.frollmean(): rolling means that compose withbyfor window summaries.dcast(): pivot the long output of abyquery into wide form.
See the official data.table introduction for the canonical reference on DT[i, j, by].
FAQ
What does the by keyword do in data.table?
by tells data.table to split the rows by one or more columns and run the j expression separately on each group. The result is a new data.table with one row per unique value of the grouping columns. Because by is the third argument of DT[i, j, by], you can filter and group in a single call, and the grouping happens on the filtered rows.
What is the difference between by and keyby in data.table?
Both group the data and evaluate j per group; the difference is the order and indexing of the output. by returns groups in the order they first appear; keyby sorts the result ascending and stores those columns as the table's key. Use keyby when the grouped output will be subset, joined, or queried again. Use by for one-off summaries.
How do I group by multiple columns in data.table?
Wrap the column names in .(), for example dt[, .N, by = .(cyl, gear)]. You can also pass a character vector, dt[, .N, by = c("cyl", "gear")], which is useful when the names sit in a variable. Both forms produce one row per unique combination, with the grouping columns added to the result.
Can by accept an expression instead of a column name?
Yes. by accepts any expression that returns a vector the same length as the rows being grouped. dt[, .N, by = cyl > 6] groups by the logical result TRUE or FALSE. Naming the expression with .() keeps the output readable, for example by = .(big_engine = cyl > 6). The expression is evaluated once per row.
Does by sort the result?
No. Plain by returns groups in the order they first appear in the input, which matches the speed of the implementation. Use keyby if you need the result sorted by the grouping columns; that form also stores a key, so later subsets and joins on the result use binary search.