dplyr rows_upsert() in R: Insert New OR Update Existing
The rows_upsert() function in dplyr is the SQL-style UPSERT: rows from y replace matching rows in x by key, and rows in y with new keys are appended. It is "insert or update" in one verb.
rows_upsert(x, y, by = "id") # insert new + update existing rows_insert(x, y, by = "id") # insert only; error on dup rows_update(x, y, by = "id") # update only; error on new rows_patch(x, y, by = "id") # update only NA cells bind_rows(x, y) |> distinct(id, .keep_all = TRUE) # rough alternative
Need explanation? Read on for examples and pitfalls.
What rows_upsert() does in one sentence
rows_upsert(x, y, by) returns x with: existing rows whose key matches y replaced by y's values, AND new rows from y (whose keys aren't in x) appended. It is "insert or update".
This is the standard incremental-sync operation: take a fresh batch, merge it with the master, end up with the union of both.
Syntax
rows_upsert(x, y, by = NULL, in_place = FALSE). Combines rows_insert + rows_update.
rows_upsert is the dplyr equivalent of SQL INSERT ... ON CONFLICT (key) DO UPDATE. It is the standard sync verb when y might contain a mix of new and updated records.Five common patterns
1. Standard upsert
2. Multi-column update
All non-key columns in batch overwrite (or extend) master's columns.
3. Composite key
4. Subset of columns updated
batch_score_only has no name column, so master's name is preserved.
5. Idempotent batch
upsert is idempotent: applying the same batch twice gives the same result.
rows_upsert is "merge with replacement". Existing rows are OVERWRITTEN by their match in y; new rows are APPENDED. The semantics match SQL MERGE and Postgres INSERT ON CONFLICT DO UPDATE. Use it for incremental sync where the batch is the source of truth for the keys it contains.rows_upsert() vs rows_insert() vs rows_update() vs bind_rows()
Four "merge x and y" strategies in dplyr.
| Function | Updates? | Inserts? | Errors |
|---|---|---|---|
rows_insert |
No | Yes | On duplicate key |
rows_update |
Yes | No | On new key in y |
rows_upsert |
Yes | Yes | Never (in normal usage) |
bind_rows |
No | Appends all | Never (no key) |
rows_patch |
Only NA cells | No | On new key in y |
When to use which:
rows_upsertfor incremental sync.rows_insertwhen duplicates indicate a bug.rows_updatefor corrections-only.bind_rowswhen keys don't matter.
A practical workflow
The "incremental ETL load" pattern uses rows_upsert.
Each day's load merges with the master. New keys append; existing keys update. No row count surprises; idempotent.
For audit logging, capture diffs first:
Common pitfalls
Pitfall 1: column order matters when types differ. y's columns must match x's types. A type mismatch errors at the call site.
Pitfall 2: NAs in batch overwrite good data. If batch has NA in a column, the corresponding master row gets NA. Use rows_patch if you only want to fill NAs in master.
rows_upsert is NOT in-place by default. It returns a new data frame. For data.tables with in_place = TRUE, modification can be in place; for tibbles always assign the result.Idempotency makes upsert pipelines safe
Idempotency means applying the same operation twice produces the same result as applying it once. rows_upsert is idempotent: re-running yesterday's batch on an already-updated master leaves the master unchanged. This property is enormously useful for ETL pipelines that may retry on failure or be re-run after manual fixes. Without idempotency, retries can multiply rows or corrupt data; with it, retries are safe by construction. The same property holds for rows_update and rows_patch, but not for rows_insert (which errors on retry due to duplicate keys) or bind_rows (which appends every time).
Try it yourself
Try it: Apply a batch update + insert to a small master table. Save to ex_synced.
Click to reveal solution
Explanation: rows_upsert updates id=2 with the batch value and appends id=4 as a new row.
Related dplyr functions
After mastering rows_upsert, look at:
rows_insert(): insert onlyrows_update(): update onlyrows_patch(): only fill NA cellsrows_delete(): remove by keyrows_append(): bind_rows alias (dplyr 1.1+)mutate(): modify without key matching
For point-in-time data versioning (preserve history), tools like dplyr::rows_* aren't sufficient: you need a proper SCD-2 pattern with effective dates.
FAQ
What does rows_upsert do in dplyr?
rows_upsert(x, y, by) updates rows of x whose keys match y, AND inserts y's rows whose keys aren't in x. The "insert or update" verb.
What is the difference between rows_upsert and rows_insert?
rows_insert errors if any of y's keys already exist. rows_upsert overwrites the existing rows AND inserts new ones. Use upsert for sync; insert for strict no-duplicates.
Is rows_upsert idempotent?
Yes. Applying the same batch twice produces the same result. Useful for retry-safe pipelines.
Does rows_upsert overwrite NAs?
Yes. If y has NA in a column, the corresponding x row gets NA. Use rows_patch to fill ONLY x's existing NAs without overwriting non-NA values.
Can rows_upsert handle composite keys?
Yes: rows_upsert(x, y, by = c("col1","col2")). Tuple matching applies.