dplyr union() in R: Combine Rows From X and Y, Deduplicated
The union() function in dplyr stacks rows from x and y vertically, REMOVING DUPLICATES. It is the SQL UNION equivalent. Both inputs must have the same columns.
union(x, y) # rows in x OR y (deduplicated) union_all(x, y) # union without dedup (= bind_rows) intersect(x, y) # rows in BOTH setdiff(x, y) # rows in x not in y bind_rows(x, y) # stack vertically (no dedup)
Need explanation? Read on for examples and pitfalls.
What union() does in one sentence
union(x, y) stacks rows from both x and y and removes duplicates based on whole-row equality. Both inputs must have the same columns and types.
It is the data-frame version of mathematical set union. Use it when you want a deduplicated combined table.
Syntax
union(x, y). Both must have identical columns (same names and order).
union_all (or bind_rows) if you need to keep duplicates. union always deduplicates, which can hide important counts.Five common patterns
1. Combine and dedup
2. Multi-column dedup
3. union_all (no dedup)
union_all and bind_rows produce the same result when columns align.
4. Combine three tables
union takes 2 args; chain with reduce for many.
5. Compare with intersect and setdiff
union, intersect, setdiff operate on WHOLE ROWS, not on keys. Two rows are "equal" only if every column matches. For key-based set operations, use semi_join (intersect) or anti_join (setdiff). For non-deduplicating combination, use bind_rows.union() vs union_all() vs bind_rows() vs base union
Four ways to combine rows in R.
| Function | Dedup | Same columns required |
|---|---|---|
dplyr::union(x, y) |
Yes | Yes |
dplyr::union_all(x, y) |
No | Yes |
dplyr::bind_rows(x, y) |
No | No (fills missing with NA) |
base::union(x, y) |
Yes | Vector input, not df |
rbind(x, y) |
No | Yes (errors if mismatch) |
When to use which:
unionfor set-style combination (dedup).union_allfor "stack but enforce same columns".bind_rowsfor flexible combination (different columns ok).base::unionfor vector inputs.
A practical workflow
Use union for "merge two sources of records, drop duplicates" workflows.
Both sources contribute rows; identical rows appear once. For "any row in either" without caring about duplication, union_all is faster.
For multi-source combination:
Common pitfalls
Pitfall 1: column mismatch errors. union(x, y) requires identical column names AND order. Use bind_rows or select to align first.
Pitfall 2: dedup hides count information. union(x, x) returns x (deduplicated), not 2x. If row counts matter, use union_all or bind_rows.
dplyr::union masks base::union. After library(dplyr), plain union(...) calls dplyr's data-frame version. For vector inputs, use base::union(...) explicitly.When union beats bind_rows + distinct
The two equivalent ways to "combine and dedup" are union(x, y) and bind_rows(x, y) |> distinct(). They produce the same result when x and y have identical columns. union is more concise and signals intent (set semantics). The bind_rows + distinct chain is more flexible (allows different columns, fills with NA) but verbose. For pipelines where you control both inputs and want strict structure validation, union is preferred. For combining heterogeneous sources where columns may differ, bind_rows is more forgiving.
Try it yourself
Try it: Combine two halves of mtcars (rows 1-16 and 15-32, with overlap) and dedup. Save to ex_full.
Click to reveal solution
Explanation: Rows 15 and 16 appear in both halves. union deduplicates them, returning 32 unique rows.
Related dplyr / base functions
After mastering union, look at:
dplyr::intersect(): rows in bothdplyr::setdiff(): rows in x not in ydplyr::union_all(): union without dedupdplyr::bind_rows(): flexible combinationbase::rbind(): strict same-shape rbindbase::union()/base::intersect()/base::setdiff(): vector versions
For combining tables with different columns, bind_rows() is more flexible than union.
FAQ
What does union do in dplyr?
union(x, y) combines rows from x and y, removing duplicates based on whole-row equality. Both inputs must have the same columns.
What is the difference between union and union_all in dplyr?
union deduplicates; union_all keeps all rows (equivalent to bind_rows when columns align).
What is the difference between union and bind_rows?
union dedupes and requires identical columns. bind_rows keeps duplicates and fills missing columns with NA. bind_rows is more flexible; union enforces structure.
Does dplyr::union mask base::union?
Yes. After loading dplyr, plain union calls the data-frame version. For vector union, use base::union(...).
Can I union three or more tables?
Yes, via reduce: purrr::reduce(list(x, y, z), union) or chained: x |> union(y) |> union(z).