data.table keyby in R: Group, Sort, and Key the Result
The keyby keyword in data.table groups rows for aggregation inside DT[i, j, keyby], then sorts the output and sets a key on the grouping columns so downstream joins and subsets run with binary search.
dt[, .N, keyby = cyl] # count per group, sorted by cyl dt[, mean(mpg), keyby = cyl] # one summary per group dt[, .(avg = mean(mpg)), keyby = cyl] # name the output column dt[, mean(mpg), keyby = .(cyl, gear)] # group on two columns dt[, mean(mpg), keyby = "cyl"] # column as character dt[mpg > 18, .N, keyby = cyl] # filter then group dt[, lapply(.SD, mean), keyby = cyl] # summarize every column
Need explanation? Read on for examples and pitfalls.
What keyby does in one sentence
keyby is by with two extra side effects on the result. You pass one or more grouping columns; data.table splits the table, runs the j expression per group, and stacks the rows into a new table whose grouping columns are sorted ascending. It also calls setkey() on that result, so subsequent lookups can use binary search.
The function name reads as a verb phrase: "key by these columns". Read the call site the same way and the behavior follows: group, sort, key.
Syntax
keyby is the third argument in DT[i, j, keyby]. It accepts the same forms as by, plus the result inherits a key:
The accepted forms are:
keyby = col: one unquoted column name.keyby = .(col1, col2): several unquoted column names.keyby = "col"orkeyby = c("col1", "col2"): column names as character.keyby = expression: any expression that returns a vector or list of vectors with one entry per grouped row.
The output is always a new data.table with one row per group, sorted by the grouping columns, with key(result) set to those columns.
Examples by use case
Build a data.table once and reuse it across examples. Every block below shares the same dt object created from mtcars.
Count rows per group. The classic group-and-count idiom returns one row per unique cylinder count, sorted ascending:
.N is the row count for each group, and key() confirms the result is keyed on cyl.
Summarize one column per group. Wrap the aggregation in .() to name the output column:
The result has three columns (cyl, avg_mpg, n) and is keyed on cyl, so result[J(4)] would return the 4-cylinder row instantly.
Group on multiple columns. Pass a list with .():
Rows are sorted first by cyl, then by gear within each cyl. Both columns become the compound key.
Filter then group. Anything in i runs before grouping, so you can restrict the rows being aggregated:
Filter, group, sort, key, in a single expression.
keyby is one keystroke past by, but it changes the result's shape. The summary table becomes a sorted, keyed lookup, which is exactly what you want when the aggregation feeds the next join, plot, or report.keyby vs by
The two arguments compute identical aggregates; they differ only in what they leave behind. by preserves the order in which groups first appear in the source; keyby sorts groups ascending and attaches a key.
Use by when you need to preserve a meaningful row order from the source (for instance time-of-arrival). Use keyby when the output will be joined, indexed, or printed in a table that benefits from sorted groups.
The full contrast in one table:
| Behavior | by |
keyby |
|---|---|---|
Group rows for j |
yes | yes |
| Output row order | first-seen in source | ascending by group columns |
| Sets key on result | no | yes (binary search enabled) |
| Modifies source data.table | no | no |
| Typical use case | preserve source order | feed joins, lookups, reports |
Why a keyed result is faster
Binary search on a keyed data.table runs in O(log n), not O(n). After keyby, the result is set up for instant lookups using the J() helper:
J(6) is shorthand for a one-row data.table that becomes the join target. Because summary_dt is keyed, data.table jumps straight to the matching row.
keyby whenever the aggregation feeds another join or filter. A keyed summary makes the next call cheaper without an extra setkey() line.Common pitfalls
NA groups appear first, not last. data.table's key ordering places NA values before all other values. If you forget that, an NA row at the top of your summary can look like a bug:
keyby keys the RESULT, not the source. The original dt is untouched. If you want the source itself sorted and keyed, use setkey(dt, cyl) or setorder(dt, cyl) instead. Confusing the two is the most common keyby mistake.Quoted column names need character form. Inside keyby, you can pass names unquoted (keyby = cyl) or as character (keyby = "cyl"), but not as a symbol stored in a variable unless you wrap with get() or use ..varname. Mixing the two breaks at runtime.
Expression grouping creates an unnamed key column. dt[, .N, keyby = mpg > 20] returns a column literally named mpg > 20. Name it explicitly with keyby = .(high_mpg = mpg > 20) if you plan to reference it later.
Try it yourself
Try it: Group mtcars by gear and compute mean hp per group, sorted by gear. Save the result to ex_hp_by_gear and confirm it has a key on gear.
Click to reveal solution
Explanation: keyby = gear groups by gear, sorts the three groups ascending, and sets gear as the key on the returned data.table. mean(hp) runs once per group.
Related data.table functions
Pair keyby with these neighbors to shape and query grouped output:
by: aggregate without sorting or keying the result.setkey(): key an existing data.table in place, without aggregating.setorder(): sort an existing data.table in place by any columns..SDand.SDcols: applyjexpressions to many columns at once inside grouped queries.frollmean(): rolling means that often follow akeybystep.
For a side-by-side syntax tour against the tidyverse, see the data.table vs dplyr hub.
FAQ
What is the difference between by and keyby in data.table?
Both arguments group rows for the j expression, and both return one summary row per group. by keeps the order in which groups first appear in the source table. keyby sorts the result ascending by the grouping columns and calls setkey() on the result so future joins and lookups can use binary search. The aggregation itself is identical; only the post-processing differs.
Does keyby modify the original data.table?
No. keyby keys the returned summary, not the source. Your original dt retains whatever key it had (or none) before the call. To key the source itself, use setkey(dt, col). To sort the source rows without changing key state, use setorder(dt, col).
Can I keyby on more than one column?
Yes. Pass the columns inside .(), for example keyby = .(cyl, gear). The result is sorted first by the leftmost column, then by the next, and so on. The compound key matches the column order, so key(result) returns c("cyl", "gear").
Why does my keyby result have NA at the top?
data.table's key ordering places NA values before all other values. After keyby, any group whose grouping column is NA appears in the first row of the result. This is consistent with how setkey() treats missing values across the package.
Is keyby slower than by because of the sort step?
For a single aggregation, the sort and key cost is negligible compared with the group-by work itself. When the result feeds another join or repeated subset, keyby is faster overall because the downstream binary-search lookups avoid an extra setkey() pass.
For the canonical reference, see the data.table FAQ on grouping.