data.table .SD in R: Subset of Data for Per-Group Operations
The .SD symbol in data.table stands for Subset of Data and represents a data.table of the current group's rows inside j, so any function that takes a data frame can be applied once per group.
    dt[, lapply(.SD, mean), by = cyl]                              # mean of every column per group
    dt[, lapply(.SD, mean), by = cyl, .SDcols = c("mpg","hp")]     # restrict to columns
    dt[, .SD[1L], by = cyl]                                        # first row per group
    dt[, .SD[.N], by = cyl]                                        # last row per group
    dt[, .SD[which.max(mpg)], by = cyl]                            # row with max mpg per group
    dt[, head(.SD, 2), by = cyl]                                   # top 2 rows per group
    dt[, (cols) := lapply(.SD, scale), .SDcols = cols]             # update many columns in place
Need explanation? Read on for examples and pitfalls.
What .SD does in one sentence
.SD is a data.table that holds the current group's rows. Inside j, .SD refers to all columns of the table except the ones listed in by, unless you narrow it with .SDcols. data.table builds a fresh .SD for every group and exposes it as a local variable, so lapply(.SD, fn) runs fn once per column per group.
Without by, .SD is the whole table. The name reads as "Subset of Data" and matches that intuition: it is whichever slice j is currently working on.
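To make the "current slice" idea concrete, here is a minimal sketch that only inspects the shape of .SD with and without by. It assumes the data.table package is installed and uses a table built from mtcars; the names whole and per_group are illustrative only.

```r
library(data.table)

dt <- as.data.table(mtcars)  # 32 rows, 11 columns; row names are dropped

# No by: .SD is the whole table, so its shape matches dt.
whole <- dt[, .(n_rows = nrow(.SD), n_cols = ncol(.SD))]

# With by: one .SD per cyl value, and the cyl column itself is left out.
per_group <- dt[, .(n_rows = nrow(.SD), n_cols = ncol(.SD)), by = cyl]

print(whole)
print(per_group)
```

The per-group row counts add up to the row count of the full table, and each group's .SD is one column narrower because cyl moved into by.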
Syntax
.SD and .SDcols are paired symbols that only have meaning inside DT[i, j, by]. You use .SD as a value (often inside lapply or head) and .SDcols as an argument to the bracket call that controls which columns .SD contains.
.SDcols accepts:
- A character vector of column names, like c("mpg", "hp").
- A numeric or negative index, like 2:4 or -1.
- A patterns() helper that matches by regex, like patterns("^d").
- A function predicate, like is.numeric (a column is kept if the predicate returns TRUE).
The grouping columns named in by are always excluded from .SD so you do not double-summarize them.
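The accepted .SDcols forms can be sketched side by side. This assumes a reasonably recent data.table release (the patterns() and function-predicate forms are not available in old versions) and a table built from mtcars; the result names are illustrative only.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Character vector of column names.
by_name <- dt[, lapply(.SD, mean), .SDcols = c("mpg", "hp")]

# Numeric positions: columns 2 to 4 of mtcars are cyl, disp, hp.
by_index <- dt[, lapply(.SD, mean), .SDcols = 2:4]

# Negative index: every column except the first (mpg).
by_negative <- dt[, lapply(.SD, mean), .SDcols = -1]

# Regex on column names: "^d" matches disp and drat.
by_pattern <- dt[, lapply(.SD, mean), .SDcols = patterns("^d")]

# Predicate applied to each column: every mtcars column is numeric.
by_predicate <- dt[, lapply(.SD, mean), .SDcols = is.numeric]

print(names(by_pattern))
```

None of these calls uses by, so each result is a one-row data.table whose columns are exactly the ones .SDcols selected.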
Examples by use case
Build a single dt and reuse it for every example. Every block below works on the same table created from mtcars.
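A minimal setup sketch, assuming the data.table package is installed. The object name dt matches the one used in the snippets throughout this page.

```r
library(data.table)

# as.data.table() copies mtcars into a data.table and drops the row names.
dt <- as.data.table(mtcars)

# Quick check of what we are working with.
print(dim(dt))
print(names(dt))
```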
Summarize every numeric column per group with lapply(.SD, mean). Without .SDcols, .SD contains every column except cyl.
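A sketch of the per-group summary, with and without .SDcols. It rebuilds dt from mtcars so the block runs on its own; the result names are illustrative only.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Mean of every non-grouping column, one row per cyl value.
means_by_cyl <- dt[, lapply(.SD, mean), by = cyl]

# Same summary restricted to two columns.
mpg_hp_by_cyl <- dt[, lapply(.SD, mean), by = cyl, .SDcols = c("mpg", "hp")]

print(means_by_cyl)
print(mpg_hp_by_cyl)
```

The grouping column cyl comes first in each result, followed by one mean column per column of .SD.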
Pick the first or last row per group with .SD[1L] or .SD[.N]. .SD is a data.table, so you can subscript it like any other table.
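A sketch of positional row selection per group, again on a table built from mtcars. "First" and "last" refer to the current row order of dt, so sort beforehand if another order is intended.

```r
library(data.table)

dt <- as.data.table(mtcars)

# First row of each cyl group, in the table's current row order.
first_rows <- dt[, .SD[1L], by = cyl]

# Last row of each cyl group; .N is the number of rows in the group.
last_rows <- dt[, .SD[.N], by = cyl]

print(first_rows)
print(last_rows)
```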
Pick the row with the highest value per group using .SD[which.max(...)]. This pattern returns whole rows, not just the maximum value.
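A sketch of the row-with-maximum pattern on the mtcars-based table. Note that which.max() returns the first position of the maximum, so ties within a group yield a single row.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Whole row holding the highest mpg within each cyl group.
best_mpg <- dt[, .SD[which.max(mpg)], by = cyl]

print(best_mpg)
```

The result keeps every column of the winning row, which is the difference from dt[, max(mpg), by = cyl], where only the value survives.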
.SD turns "per group" into "per data frame". Any function that already takes a data frame, like head(), tail(), lm(), or a custom function returning a data.table, slots straight in as fn(.SD) and runs once per group. The group key columns are stitched back on automatically.

Update many columns in place with := and lapply(.SD, fn). Wrap the target column names in parentheses on the left so data.table treats them as a vector of names.
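A sketch of the in-place update, under two assumptions that are not from the text above: it works on a copy() so the original dt stays untouched, and it uses a small hypothetical helper, standardize(), instead of scale(), because scale() returns a one-column matrix rather than a plain vector.

```r
library(data.table)

dt <- as.data.table(mtcars)

# Work on a copy so dt keeps its original values (:= modifies by reference).
dt_scaled <- copy(dt)

cols <- c("mpg", "hp")

# Hypothetical helper: centre and scale a numeric vector, returning a vector.
standardize <- function(x) (x - mean(x)) / sd(x)

# (cols) on the left is evaluated as a character vector of target names.
dt_scaled[, (cols) := lapply(.SD, standardize), .SDcols = cols]

print(head(dt_scaled[, ..cols]))
```

Without the parentheses, data.table would create a single new column literally named cols instead of overwriting mpg and hp.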
.SD vs .SDcols vs explicit columns
Pick .SD when the function applies uniformly to many columns. Pick a named j when each output column needs its own expression.
| Pattern | Best when | Example |
|---|---|---|
| lapply(.SD, fn) with .SDcols | Same function on many columns | dt[, lapply(.SD, mean), .SDcols = num_cols] |
| Named j in .() | Different function per column | dt[, .(m = mean(mpg), s = sd(hp)), by = cyl] |
| .SD[1L] or .SD[.N] | Pick rows by position per group | dt[, .SD[1L], by = cyl] |
| (cols) := lapply(.SD, fn) | Update many columns in place | dt[, (cols) := lapply(.SD, scale), .SDcols = cols] |
| lapply(.SD, fn, ...) extra args | Pass arguments to fn per call | dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = cols] |
The decision rule is short. Reach for .SD whenever a column-loop would otherwise repeat the same expression with different column names.
Common pitfalls
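Omitting .SDcols makes .SD include every non-grouping column. On a character column, mean() warns and returns NA, and a grouped lapply(.SD, mean) can stop with an error instead, so the summary breaks as soon as the table contains a non-numeric column. Pass .SDcols whenever the table has mixed types.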
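A sketch of guarding against mixed types. It assumes a table with one character column, created here by keeping the mtcars row names as a column called model (an illustrative choice), and a data.table version recent enough to accept a function in .SDcols.

```r
library(data.table)

# keep.rownames adds the car names as a character column named "model".
dt <- as.data.table(mtcars, keep.rownames = "model")

# Option 1: let a predicate pick the numeric columns (no grouping here).
overall_means <- dt[, lapply(.SD, mean), .SDcols = is.numeric]

# Option 2: build the column list explicitly and leave out the grouping column.
num_cols <- setdiff(names(dt)[vapply(dt, is.numeric, logical(1))], "cyl")
group_means <- dt[, lapply(.SD, mean), by = cyl, .SDcols = num_cols]

print(names(overall_means))
print(names(group_means))
```

Either way the character column never reaches mean(), so no warning or error is triggered.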
.SD is read-only inside j. Assigning to .SD$col does nothing useful; the change is discarded when j returns. To modify columns in place, use the bracketed (cols) := lapply(.SD, fn) form with .SDcols, which writes back through :=.

Use .SDcols = patterns("regex") to select columns by name pattern. This avoids hard-coding column lists when the table has many similarly named columns. patterns("^q|^m") matches every column whose name starts with q or m.

Try it yourself
Try it: Using the airquality dataset as a data.table, compute the mean of Ozone, Solar.R, Wind, and Temp per Month, ignoring NA. Save the result to ex_sd.
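One possible solution, sketched under the assumption that airquality is first converted with as.data.table(); the intermediate name aq is illustrative, while ex_sd is the name the exercise asks for.

```r
library(data.table)

# airquality ships with base R; convert it to a data.table first.
aq <- as.data.table(airquality)

# Mean of the four measurement columns per Month, skipping NA values.
ex_sd <- aq[,
  lapply(.SD, mean, na.rm = TRUE),
  by = Month,
  .SDcols = c("Ozone", "Solar.R", "Wind", "Temp")
]

print(ex_sd)
```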
Explanation: Passing na.rm = TRUE as an extra argument to lapply(.SD, mean, ...) forwards it to every per-column call. .SDcols restricts .SD to the four numeric columns, so the result is one row per Month with four mean columns.
Related data.table functions
.SD is one part of the DT[i, j, by] toolkit. Explore these next:
- .SDcols: the column filter that controls what .SD contains.
- by and keyby: the row-splitters that drive per-group .SD creation.
- :=: in-place update operator, often paired with lapply(.SD, fn).
- setDT(): convert a data frame to a data.table without copying.
- frollmean(): rolling means that compose with .SD for window summaries.
See the official data.table .SD vignette for the canonical reference.
FAQ
What does .SD stand for in data.table?
.SD stands for Subset of Data. Inside j, it is a data.table containing the rows of the current group and all columns except the ones in by, unless restricted by .SDcols. The name reflects what it holds: whichever slice the current j call is working on. data.table builds a fresh .SD for every group and discards it once j returns.
When should I use .SD instead of a named j expression?
Use .SD when the same operation applies to many columns and would otherwise be repeated with different column names. lapply(.SD, mean) replaces a list of named means. Use a named j wrapped in .() when each output column needs a different expression, like mean(x) for one column and sd(y) for another. The two patterns coexist freely in the same query.
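One way the two patterns can be combined in a single query is to concatenate a named list with the lapply(.SD, ...) result. This is a sketch on the mtcars-based table; the column name n and the result name summary_by_cyl are illustrative.

```r
library(data.table)

dt <- as.data.table(mtcars)

# c() joins a named list (.() is an alias for list()) with the per-column means.
summary_by_cyl <- dt[,
  c(.(n = .N), lapply(.SD, mean)),
  by = cyl,
  .SDcols = c("mpg", "hp")
]

print(summary_by_cyl)
```

Each group contributes one row holding its size plus the mean of every column listed in .SDcols.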
What is the difference between .SD and .SDcols?
.SD is the data; .SDcols is the filter that picks which columns end up in .SD. Without .SDcols, .SD contains every non-grouping column. With .SDcols = c("mpg", "hp"), .SD is restricted to those two columns. .SDcols accepts character vectors, indices, patterns() regex, and function predicates like is.numeric.
Can I modify columns through .SD?
No, not directly. .SD is read-only inside j, so .SD$col <- value is discarded when j returns. To update many columns in place, use dt[, (cols) := lapply(.SD, fn), .SDcols = cols]. The (cols) form on the left tells data.table to write back through :=, and .SDcols ensures lapply walks only the target columns.
Why does lapply(.SD, mean) return NA on character columns?
Because .SD contains every non-grouping column by default, including non-numeric ones such as character columns, where mean() is undefined. Pass .SDcols with only the numeric columns, or use .SDcols = is.numeric to let data.table pick numeric columns automatically. The same applies to other column-summary functions like sum() and sd().