dplyr intersect() in R: Rows in Both X and Y
The intersect() function in dplyr returns rows that appear in BOTH x and y, using whole-row equality. It is the data-frame version of mathematical set intersection.
intersect(x, y) # rows in both (whole-row eq) union(x, y) # rows in either, deduplicated setdiff(x, y) # rows in x not in y semi_join(x, y, by = "id") # rows in x with key match in y base::intersect(c(1,2), c(2,3)) # vector intersect (different)
Need explanation? Read on for examples and pitfalls.
What intersect() does in one sentence
intersect(x, y) returns rows present in BOTH x and y, using whole-row equality, with duplicates removed. Both inputs must have the same columns and types.
The opposite of setdiff: where setdiff returns rows unique to x, intersect returns rows shared by both.
Syntax
intersect(x, y). Both inputs must have identical columns. Returns deduplicated unique rows.
dplyr::intersect masks base::intersect. Inside dplyr workflows, you get the data-frame version. For vector intersection, prefix with base::.Five common patterns
1. Find common rows
2. Multi-column intersection
3. Using set operations together
4. vs semi_join (key-based)
5. With reduce for many tables
dplyr::intersect operates on WHOLE ROWS, not on keys. Two rows are equal only if every column matches. For key-based "rows present in both", use semi_join. For value-level "are these snapshots identical", use intersect.intersect() vs semi_join() vs base intersect
Three "intersection" operations in R.
| Function | Scope | Matches by |
|---|---|---|
dplyr::intersect(x, y) |
Whole rows of df | All columns |
semi_join(x, y, by) |
Rows of df | Key columns only |
base::intersect(x, y) |
Vector elements | Equality |
When to use which:
dplyr::intersectfor whole-row dataset comparison.semi_joinfor key-based row filter.base::intersectfor vector intersection.
A practical workflow
Use intersect to find rows that are identical across two snapshots.
For cohort overlap analysis:
Common pitfalls
Pitfall 1: column mismatch errors. Both inputs must have identical column names AND order. Use select to align first.
Pitfall 2: confusing with semi_join. intersect is whole-row; semi_join is by-key. Pick by question.
dplyr::intersect masks base::intersect. After loading dplyr, plain intersect(c(1,2,3), c(2)) errors because dplyr expects data frames. For vectors, use base::intersect(...).Why use intersect at all
Intersect occupies a niche between joins and filtering: it answers "are these two row sets the same?". For most data work the answer is "no, use a join", but a small number of patterns are cleaner with intersect. Think of intersect as one of the four set operations (union, intersect, setdiff, complement) that complete the algebra of row-set comparisons. They form a closed family: any pairwise comparison of two row sets can be expressed using these. For one-off snapshot diffs, this algebra is direct and self-documenting; for real analytic work joins are usually clearer.
Try it yourself
Try it: Find which rows appear in both top_5_v1 and top_5_v2 snapshots. Save to ex_common.
Click to reveal solution
Explanation: Datsun 710 is in v1 but not v2; the intersection has the other 4 rows.
Related dplyr / base functions
After mastering intersect, look at:
dplyr::union(): combine + dedupdplyr::setdiff(): rows in x not in ysemi_join(): key-based intersectionbase::intersect(): vector intersectiondplyr::distinct(): dedup without intersectiondplyr::all_equal()/waldo::compare(): rich diff output
For element-wise intersection in pipelines (vectors), base::intersect() is the right tool.
When to use intersect over the alternatives
Intersect is the rare-case tool: most data work is better served by joins. It is right when (a) both tables represent SETS of rows where order doesn't matter, (b) you care about whole-row equality not key matching, and (c) duplicates are noise to be removed. For all other cases, semi_join (key-based) or inner_join (key-based + adds columns) is more appropriate. The narrow but important use case: comparing two SNAPSHOTS to find unchanged rows, or finding overlap between two LISTS of records.
FAQ
What does intersect do in dplyr?
dplyr::intersect(x, y) returns rows present in both x and y, using whole-row equality. Both inputs need the same columns. Duplicates are removed.
What is the difference between intersect and semi_join?
intersect uses WHOLE-ROW equality. semi_join uses KEY equality. If x has (id=2, val="B") and y has (id=2, val="b"), intersect drops the row but semi_join keeps it.
Does dplyr::intersect mask base::intersect?
Yes. After library(dplyr), intersect calls the data-frame version. For vectors, use base::intersect(...) explicitly.
How do I intersect more than 2 tables?
Chain or reduce: purrr::reduce(list(x, y, z), intersect) or x |> intersect(y) |> intersect(z).
Can I intersect tables with different columns?
No. dplyr::intersect errors. Use select() to align columns first, or use semi_join if you only care about key matching.