data.table Update by Reference in R: := and set()
Update by reference in data.table modifies a column in place using the := operator or the set() function. The original object changes directly and no copy of the table is made, which is typically much faster and lighter on memory than R's default copy-on-modify behavior.
    DT[, new_col := x * 2]                          # add or overwrite one column
    DT[, c("a", "b") := list(x + 1, y + 1)]         # several columns at once
    DT[, `:=`(a = x + 1, b = y + 1)]                # functional form, same effect
    DT[cyl == 4, mpg := mpg * 1.1]                  # update rows matching a filter
    DT[, new_col := NULL]                           # delete a column by reference
    set(DT, i = 1:3, j = "x", value = NA)           # set() updates specific cells
    for (j in names(DT)) set(DT, j = j, value = 0)  # set() shines inside loops
Need explanation? Read on for examples and pitfalls.
What update by reference means in one sentence
Update by reference changes an existing object in memory instead of returning a modified copy. When you write DT[, x := 1], the column x is added or overwritten inside the same data.table that lives at DT. Nothing is copied, and no assignment with <- is needed. Any other variable bound to the same object sees the change too.
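The point about shared bindings can be checked directly. The following is a minimal sketch with a made-up three-row table; the names DT and alias are purely illustrative.

```r
library(data.table)

DT <- data.table(x = 1:3)
alias <- DT              # plain assignment binds a second name to the same object

DT[, y := x * 2L]        # adds column y in place, no `<-` involved

names(alias)             # the alias lists the new column too
identical(address(DT), address(alias))  # both names refer to one object in memory
```

address() is a data.table helper that reports where an object lives in memory, which makes it a convenient way to confirm that no copy was made.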
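This is the opposite of how base R and tidyverse data frames behave. df$x <- 1 builds a modified copy of df and rebinds the name df to it. Since R 3.1.0 that copy is shallow when a whole column is added or replaced, but changing values inside an existing column still duplicates that column. The distinction looks subtle on small data but can become the dominant cost on large tables.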
How update by reference differs from copy-on-modify
R's default rule is copy-on-modify. Whenever you change a value inside a data frame, R quietly duplicates the underlying storage so the original is not affected. That guarantees safety but pays for it with memory and time, especially on wide or tall tables.
data.table opts out of this rule for a specific set of operations. :=, set(), and the set* family (setnames(), setcolorder(), setorder(), setkey(), setattr()) all modify the existing object directly. The trade-off is awareness: aliases of the same table now point to one shared object, so changes propagate.
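As a rough illustration of the set* helpers, the sketch below renames, reorders, and sorts a small invented table without ever reassigning DT. The column names are hypothetical.

```r
library(data.table)

DT <- data.table(b = c(3, 1, 2), a = c("z", "x", "y"))
before <- address(DT)                    # remember where the table lives

setnames(DT, old = "b", new = "score")   # rename a column in place
setcolorder(DT, c("a", "score"))         # reorder columns in place
setorder(DT, score)                      # sort rows in place, ascending by score

print(DT)
identical(before, address(DT))           # compare the table's location before and after
```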
With update by reference, <- is unnecessary and aliases share state. Once that mental model clicks, the rest of the data.table API stops feeling surprising.
The two main entry points: := and set()
:= is the everyday tool; set() is the loop tool. Both modify by reference; they differ in syntax and where each shines.
:= lives inside the j slot of DT[i, j, by]. It can add, overwrite, or delete columns, and it can be combined with an i filter to update only matching rows. set() is a standalone function that takes i, j, and value arguments. It avoids the small overhead of the [.data.table dispatcher, which matters when you call it thousands of times in a loop.
Examples by use case
Add a single column by reference. No <- is needed; the table itself changes.
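A minimal sketch of this case, assuming the built-in mtcars data as the example table. The column name kpl and the conversion factor are illustrative choices, not part of the original article.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Add a kilometres-per-litre column in place.
# 1 mile per US gallon is roughly 0.4251 km per litre.
DT[, kpl := mpg * 0.4251]

"kpl" %in% names(DT)     # the column now exists on DT itself
```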
Add several columns in one call. The multi-column form takes character vectors and a list() of values.
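Both multi-column spellings can be sketched on a toy table. The column names here are invented for the example.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c(10, 20, 30))

# Left-hand side: character vector of names. Right-hand side: list of values.
DT[, c("a", "b") := list(x + 1L, y + 1)]

# Functional form: name = value pairs inside `:=`().
DT[, `:=`(x_sq = x^2, y_half = y / 2)]

names(DT)
```

The functional form keeps each name next to its value, which tends to be easier to read once more than two or three columns are involved.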
Update only the rows that match a filter. Combine i and := so the assignment touches just the targeted rows.
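The filtered update from the cheat sheet can be tried on mtcars, used here only as a convenient example dataset.

```r
library(data.table)

DT <- as.data.table(mtcars)

# Raise mpg by 10% for four-cylinder cars only; other rows are left untouched.
DT[cyl == 4, mpg := mpg * 1.1]

# Compare against the untouched source data to see which rows changed.
changed <- DT$mpg != mtcars$mpg
table(cyl = DT$cyl, changed)
```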
Delete a column by setting it to NULL. This is the canonical drop pattern; the column vanishes from the table in place.
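A short sketch of the drop pattern, using hypothetical scratch columns. Passing a character vector on the left-hand side removes several columns in one call.

```r
library(data.table)

DT <- data.table(id = 1:3, tmp = letters[1:3], scratch1 = 4:6, scratch2 = 7:9)

DT[, tmp := NULL]                           # drop one column
DT[, c("scratch1", "scratch2") := NULL]     # drop several at once

names(DT)
```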
Use set() when you are looping. The set() function avoids the per-call overhead of [.data.table, which adds up when you touch many cells.
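One common loop of this kind is replacing missing values column by column. The sketch below assumes a small invented table and treats "replace NA with 0 in numeric columns" as the task; adapt the condition and value to your own data.

```r
library(data.table)

DT <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6), label = c("u", NA, "w"))

# Pick the numeric columns, then patch each one in place.
num_cols <- names(DT)[vapply(DT, is.numeric, logical(1))]

for (col in num_cols) {
  na_rows <- which(is.na(DT[[col]]))             # row indices to overwrite
  set(DT, i = na_rows, j = col, value = 0)       # no [.data.table call involved
}

print(DT)
```

The character column is skipped on purpose, so its NA stays as it was.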
Compare with copy-on-modify alternatives
Choose the right tool by how much data you touch and whether you need the original untouched. The table below summarises the trade-offs.
| Operation | Copies data? | Returns object? | Best for |
|---|---|---|---|
| DT[, x := value] | No | Yes (invisibly) | Most column updates |
| set(DT, i, j, value) | No | Yes (invisibly) | Many small updates in a loop |
| setnames(DT, ...) | No | Yes (invisibly) | Rename columns in place |
| DT$x <- value (base R) | Yes (whole DT) | No | Avoid on data.tables |
| dplyr::mutate(df, ...) | Yes | New data frame | When you must keep the original |
| copy(DT) then modify | Yes (once) | New data.table | When aliases must not see the change |
Decision rule: default to := for clarity, switch to set() when profiling shows the [.data.table overhead matters, and reach for copy() only when an alias must stay frozen.
Use set() in a loop, not :=. Each DT[, x := value] call goes through [.data.table, which adds a fixed overhead per call. With set(), that overhead is avoided, and updating thousands of cells one by one becomes practical.
Common pitfalls
Pitfall 1: assigning the result back to a new name. := returns the modified table invisibly, so writing DT2 <- DT[, x := 1] binds DT2 to the same object as DT. Both names now point to one mutable table, and the next update touches both.
The fix is to use copy() whenever you need an independent snapshot.
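A minimal sketch of the difference between an alias and a copy(), with illustrative names:

```r
library(data.table)

DT <- data.table(x = 1:3)

snapshot <- copy(DT)     # independent duplicate
DT[, y := x * 10L]       # modifies DT only

names(DT)                # "x" "y"
names(snapshot)          # "x"
```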
Pitfall 2: silent output at the console. DT[, x := 1] prints nothing because the return value is invisible. New users sometimes think the call failed. It did not; DT was updated. Print DT explicitly to confirm, or chain [] at the end: DT[, x := 1][].
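The printing behaviour is easiest to see at an interactive console. A small sketch with a throwaway table:

```r
library(data.table)

DT <- data.table(x = 1:3)

DT[, y := x + 1L]        # updates DT but prints nothing
DT[, z := y * 2L][]      # the trailing [] makes the updated table print
print(DT)                # an explicit print always works
```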
Pitfall 3: using := on a regular data.frame. := is a data.table feature. Calling it on a plain data.frame raises a confusing error. Convert with setDT(df) first.
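The conversion step looks like this in practice. The sketch uses a tiny hand-made data.frame called df.

```r
library(data.table)

df <- data.frame(x = 1:3)
is.data.table(df)        # FALSE: := is not available yet

setDT(df)                # converts df to a data.table in place, no reassignment
is.data.table(df)        # TRUE

df[, y := x * 2L]        # := now works
print(df)
```

setDT() itself works by reference, so it avoids the copy that as.data.table(df) would make.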
Try it yourself
Try it: Convert iris to a data.table, then add a column petal_area equal to Petal.Length * Petal.Width by reference. Confirm the column appears in the original object without reassignment.
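One possible solution, sketched with the object name ex_iris:

```r
library(data.table)

ex_iris <- as.data.table(iris)

# Add the new column by reference; note there is no `ex_iris <-` here.
ex_iris[, petal_area := Petal.Length * Petal.Width]

"petal_area" %in% names(ex_iris)   # confirm the column exists on the original object
head(ex_iris, 3)
```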
Explanation: The := call adds petal_area to ex_iris directly. No assignment with <- is needed because data.table modifies the object in place.
Related data.table functions
A short list of reference-modifying helpers worth knowing:
- setnames(): rename columns in place
- setcolorder(): reorder columns in place
- setorder(): sort rows in place
- setkey(): set a key in place for fast joins and lookups
- setattr(): change attributes (class, names) in place
- copy(): opt out of reference semantics when you need a true duplicate
For the broader contrast with tidyverse copy-on-modify, see the parent guide on data.table vs dplyr. The official reference is the data.table reference semantics vignette.
FAQ
Why does data.table modify in place when most R does not?
Modifying in place avoids the cost of duplicating large objects. On a million-row table, copy-on-modify can mean copying tens or hundreds of megabytes for each column change. data.table sacrifices the safety net of automatic copies to gain that speed and memory headroom, and exposes copy() for cases where the safety net is what you actually want.
Does := work on a data.frame?
No. The := operator only has meaning inside [.data.table. Calling it on a regular data.frame raises an error; with data.table loaded, the message tells you to check that is.data.table(DT) is TRUE. Convert the data.frame first with setDT(df) or as.data.table(df), then := becomes available.
When should I prefer set() over :=?
Use set() inside loops or when you are touching many individual cells. := is more readable for single-column or multi-column updates outside a loop, while set() avoids the small dispatch overhead of [.data.table and is faster for high-frequency updates.
Does := return the updated table?
:= returns the table invisibly, so a console call shows nothing. If you want to print the result in one line, append []: DT[, x := 1][]. Inside a pipeline, the invisible return still works; only the auto-print behavior is suppressed.
How do I undo an update by reference?
There is no automatic undo. The original column values are gone once := runs. If you might need the previous state, call copy(DT) before the update and keep the copy. For column deletion specifically, you can reassign the column with := if you still have its values in another object.