dplyr anti_join() in R: Keep Rows Without Match in Right
The anti_join() function in dplyr keeps rows from the LEFT table that DO NOT have a match in the right table. It is the opposite of semi_join() and the standard tool for "find rows missing from the master list".
anti_join(x, y, by = "id") # x rows NOT in y anti_join(x, y, by = c("id","date")) # multi-key semi_join(x, y, by = "id") # opposite: x rows IN y filter(x, !(id %in% y$id)) # equivalent for one-col key anti_join(orders, customers, by = "id") # orphan orders anti_join(actual, expected, by = "id") # missing from expected
Need explanation? Read on for examples and pitfalls.
What anti_join() does in one sentence
anti_join(x, y, by) returns rows of x whose by key DOES NOT appear in y, with NO columns added from y. It is a filter: same column count as x, fewer (or equal) rows.
The cleanest tool for "what's in A but not in B?" questions, especially with multi-column keys.
Syntax
anti_join(x, y, by = NULL). Same key arguments as semi_join.
anti_join is the dplyr idiom for data-quality audits: "what records exist in source A but not in source B?". Cleaner than the filter(... !%in% ...) workaround, especially with multi-column keys.Five common patterns
1. Find missing customers
2. Multi-column anti-join
For multi-column negation, filter(... %in% ...) is awkward; anti_join is clean.
3. Find orphan records
4. New rows since last snapshot
A diff pattern: rows in current but not previous.
5. Pair with semi_join for partition
semi + anti partitions x into two non-overlapping groups.
anti_join is the ONLY clean way to express "in left but not in right" for multi-column keys. With single columns you could use filter(x, !id %in% y$id); with two-column keys, the equivalent filter is awkward and error-prone. anti_join is unambiguous.anti_join() vs semi_join() vs filter() vs setdiff()
Four ways to express "exclude" in R.
| Approach | Best for |
|---|---|
anti_join(x, y) |
Multi-key membership negation |
semi_join(x, y) |
Opposite (positive membership) |
filter(x, !(col %in% y$col)) |
Single-column negation in a quick script |
setdiff(x, y) |
Whole-row difference (all columns must match) |
When to use which:
anti_joinfor key-based exclusion.setdifffor whole-row set difference.filterfor single-column scripts.
A practical workflow
Use anti_join in data audits to find what's expected vs what's there.
A pair of anti_joins gives a complete diff between two sources.
For incremental loads:
Only insert rows that don't already exist.
Common pitfalls
Pitfall 1: forgetting which side has the negation. anti_join(x, y) returns rows from X NOT in Y. Reversing the args inverts the meaning.
Pitfall 2: NA in the join column. NA == NA is NA, so rows with NA in the key may NOT match (and thus appear in anti_join output). Filter NAs first if surprising.
anti_join does NOT count duplicates in y. A single match in y is enough to exclude an x row, regardless of how many duplicates y has. Same semantics as semi_join (membership only).Why anti_join beats filter(!%in%)
For multi-column keys, anti_join is the only clean expression of "exclude these tuples". The filter equivalent filter(!(col1 %in% y$col1) & !(col2 %in% y$col2)) is wrong: it tests each column independently, not as a tuple. The correct filter form requires building a composite key string or using paste0 tricks, both of which are clunky. anti_join handles tuple membership natively and clearly. Single-column negation can use filter(!col %in% y$col), which is fine for quick scripts; for production code, anti_join is more robust against accidental key changes.
Try it yourself
Try it: Find all mtcars cars whose name is NOT in a featured lookup. Save to ex_unfeatured.
Click to reveal solution
Explanation: anti_join keeps cars whose name is NOT in featured. 30 of 32 mtcars survive.
Related dplyr functions
After mastering anti_join, look at:
semi_join(): opposite filtering joininner_join()/left_join(): mutating joinssetdiff(): whole-row set difference (requires same columns)filter(): single-column negationmatch(),%in%: base R membership tests
For "rows that exist in both AND match value-for-value", intersect() works on whole-row equality.
FAQ
What does anti_join do in dplyr?
anti_join(x, y, by) keeps rows of x whose by key does NOT appear in y. Used to find records missing from a master list or new since a previous snapshot.
What is the difference between anti_join and semi_join?
They are opposites. semi_join keeps x rows that DO match in y. anti_join keeps x rows that DO NOT match in y. Together they partition x.
How does anti_join differ from setdiff?
anti_join uses a key (subset of columns); setdiff requires whole-row equality. Use anti_join for key-based exclusion; setdiff for full-row difference.
Can anti_join handle multi-column keys?
Yes. anti_join(x, y, by = c("region","product")) excludes x rows whose (region, product) tuple appears in y.
Why does anti_join handle NA strangely?
Because NA == NA is NA, not TRUE. NA values in the join key may not match, so rows with NA appear in anti_join output. Filter NAs first if needed.