dplyr inner_join() in R: Keep Only Matched Rows
The inner_join() function in dplyr keeps ONLY rows that have matches in BOTH tables, returning the intersection. Unmatched rows from either side are DROPPED.
inner_join(x, y, by = "id") # only matched rows inner_join(x, y, by = c("id","date")) # multi-column key inner_join(x, y, by = c("id" = "uid")) # mapped key inner_join(x, y, by = join_by(id)) # tidy syntax left_join(x, y, by = "id") # different: keep all x semi_join(x, y, by = "id") # filter only, no y columns
Need explanation? Read on for examples and pitfalls.
What inner_join() does in one sentence
inner_join(x, y, by) returns rows where the by key appears in BOTH x and y, with columns from both sides combined. Unmatched rows from either table are dropped.
It is the strictest of the mutating joins: rows survive only if they have a partner. Useful when you only care about complete records.
Syntax
inner_join(x, y, by = NULL, suffix = c(".x", ".y")). Same arguments as left_join.
inner_join when you only want to analyze records that have data in BOTH tables. It is the cleanest way to express "I need a match" without filtering after a left_join.Five common patterns
1. Standard inner join
Drops customers with no orders AND orders with no customer.
2. Multi-column key
3. Different column names
4. Self-inner-join
Self-join with filter to find pairs.
5. Chained inner joins
Each step drops unmatched rows from the prior result.
inner_join is the SAFEST join for downstream analysis. No NAs introduced by missing matches. The trade-off: rows you forgot existed (without partners) silently disappear. Always check nrow(result) after an inner_join to make sure the drop count matches your expectation.inner_join() vs left_join() vs semi_join() vs filter
Four ways to combine "match against y" semantics in dplyr.
| Approach | Keeps | Adds y columns? |
|---|---|---|
inner_join(x, y) |
x rows that match y | Yes |
left_join(x, y) |
All x rows | Yes (NA where no match) |
semi_join(x, y) |
x rows that match y | No |
filter(x %in% y) |
x rows where col is in y | No |
When to use which:
inner_joinfor "match AND augment".semi_joinfor "match WITHOUT augmenting" (filter only).left_joinfor "augment, keep unmatched as NA".filter(... %in% ...)for simple membership tests.
A practical workflow
The "complete cases only" pattern is inner_join's signature use case.
Only experiments with both metadata and outcomes survive. NAs are impossible (assuming each table has no internal NAs).
For an outer report (showing what was DROPPED):
Reports experiments without metadata; useful for data-quality auditing.
Common pitfalls
Pitfall 1: silent row drops. inner_join silently removes rows without a match. Always check nrow(result) against expectations, especially in production pipelines.
Pitfall 2: row multiplication. If both tables have duplicate keys, inner_join multiplies rows (cartesian product per key). Use relationship = "one-to-one" or "many-to-one" to error instead.
inner_join is NOT the same as left_join + filter(!is.na(y_col)). They look similar but the filter approach is slower and may drop rows that legitimately had NA in y_col before the join.Try it yourself
Try it: Find cars in mtcars with cyl values present in a lookup table cyl_info. Save to ex_match.
Click to reveal solution
Explanation: inner_join drops cars whose cyl is not in cyl_info (cyl=6) and adds the label column.
Related dplyr functions
After mastering inner_join, look at:
left_join()/right_join()/full_join(): other mutating joinssemi_join(): same row-filtering as inner_join, no extra columnsanti_join(): opposite of semi_joinjoin_by(): dplyr 1.1+ key spec; supports inequalityintersect(): set intersection on whole rowsdata.table::merge(): very fast for big data
For very large data, the data.table package's joins are typically much faster.
FAQ
What does inner_join do in dplyr?
inner_join(x, y, by) returns rows where the by key matches in both x and y, with columns from both. Unmatched rows from either side are dropped.
What is the difference between inner_join and left_join?
inner_join keeps only rows that match in BOTH tables. left_join keeps ALL rows of x, putting NA in y's columns where no match. inner may return fewer rows; left always has at least as many as x.
What is the difference between inner_join and semi_join?
inner_join adds y's columns; semi_join does not. Use semi_join when you only need to filter, not augment.
How do I do a many-to-many inner join?
Pass relationship = "many-to-many" to suppress the warning. Result is the cartesian product per key.
Why did my inner_join return fewer rows than expected?
Because some rows in one table had keys that didn't appear in the other. Use anti_join(x, y) to inspect which rows of x were dropped.