dplyr semi_join() in R: Filter to Rows With a Match in the Right Table
The semi_join() function in dplyr keeps rows from the left table that have a MATCH in the right table, WITHOUT adding any of right's columns. It is the filtering join: subset by membership.
semi_join(x, y, by = "id")              # x rows that match y
semi_join(x, y, by = c("id", "date"))   # multi-key
inner_join(x, y, by = "id")             # different: ALSO adds y's columns
anti_join(x, y, by = "id")              # opposite: x rows that DON'T match
filter(x, id %in% y$id)                 # equivalent for one-col key
Need explanation? Read on for examples and pitfalls.
What semi_join() does in one sentence
semi_join(x, y, by) returns rows of x whose by key appears at least once in y, with NO columns added from y. It is a filter: same column count as x, possibly fewer rows.
Use semi_join when you want to filter x without changing its columns. inner_join also filters, but it adds y's columns and can repeat x rows; semi_join does neither.
Syntax
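semi_join(x, y, by = NULL). The key arguments are the same as for inner_join. There is no suffix argument: no columns from y are added, so column names cannot conflict. With by = NULL, the join uses every column name that x and y have in common.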
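A minimal sketch of the basic call. The customers and orders tables are hypothetical example data, set up so that customer id 1 has two orders and customer id 3 has none.

```r
library(dplyr)

# Hypothetical example data (not from any real dataset)
customers <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
orders    <- data.frame(id = c(1, 1, 2), amount = c(20, 35, 10))

# Keep customers that have at least one order; no columns from orders are added
with_orders <- semi_join(customers, orders, by = "id")
with_orders
```

The result should contain customers 1 and 2 only, with the same two columns as customers.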
semi_join deduplicates the right table for matching purposes. If customer id 1 has 2 orders, that customer still appears only once in the result. semi_join never duplicates rows of the left table: each row of x is either kept once or dropped.
Five common patterns
1. Filter by membership
Cleaner than filter(customers, id %in% unique(orders$id)).
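As a sketch of this pattern, the snippet below runs both forms on hypothetical customers and orders tables (the data is invented for illustration) so the two results can be compared.

```r
library(dplyr)

# Hypothetical example data
customers <- data.frame(id = c(1, 2, 3, 4), name = c("Ana", "Ben", "Cy", "Dee"))
orders    <- data.frame(id = c(2, 2, 4), amount = c(5, 7, 9))

# Filtering join: customers that appear in orders
via_semi <- semi_join(customers, orders, by = "id")

# Base-style membership test for a single key column
via_filter <- filter(customers, id %in% unique(orders$id))
```

Both approaches should keep customers 2 and 4 here; the semi_join form states the intent (membership in another table) more directly.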
2. Multi-column key
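A minimal sketch of a two-column key, assuming hypothetical readings and flagged tables keyed by id and date. A row of readings is kept only when its (id, date) pair appears in flagged.

```r
library(dplyr)

# Hypothetical example data
readings <- data.frame(
  id    = c(1, 1, 2, 2),
  date  = as.Date(c("2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02")),
  value = c(5.1, 5.3, 6.0, 6.2)
)
flagged <- data.frame(
  id   = c(1, 2),
  date = as.Date(c("2024-01-02", "2024-01-01"))
)

# Match on the pair (id, date), not on each column separately
kept <- semi_join(readings, flagged, by = c("id", "date"))
```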
3. Inverse membership filter
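The inverse filter is anti_join. A short sketch on hypothetical customers and orders tables, keeping the customers that have no orders:

```r
library(dplyr)

# Hypothetical example data
customers <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
orders    <- data.frame(id = c(1, 1, 2), amount = c(20, 35, 10))

# Customers with NO matching row in orders
no_orders <- anti_join(customers, orders, by = "id")
```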
4. Cardinality-preserving (no duplication)
This is the key advantage of semi_join over inner_join for filtering tasks.
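To illustrate the difference in row counts, the sketch below compares the two joins on hypothetical data in which customer id 1 has two orders.

```r
library(dplyr)

# Hypothetical example data: id 1 appears twice in orders
customers <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
orders    <- data.frame(id = c(1, 1, 2), amount = c(20, 35, 10))

semi_result  <- semi_join(customers, orders, by = "id")   # one row per matching customer
inner_result <- inner_join(customers, orders, by = "id")  # one row per customer-order match

nrow(semi_result)
nrow(inner_result)
```

With this data the semi_join result has 2 rows and the inner_join result has 3, because customer 1 is repeated once per order.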
5. Filter with mapped keys
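When the key has different names in the two tables, a named by vector maps one to the other. A minimal sketch, assuming a hypothetical orders table keyed by customer_id and a hypothetical vip table keyed by id:

```r
library(dplyr)

# Hypothetical example data with differently named key columns
orders <- data.frame(customer_id = c(1, 2, 2, 3), amount = c(10, 20, 30, 40))
vip    <- data.frame(id = c(2, 3))

# Left-hand name is the column in x, right-hand name is the column in y
vip_orders <- semi_join(orders, vip, by = c("customer_id" = "id"))
```

The result keeps the column names of orders; nothing from vip is carried over.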
semi_join and its inverse anti_join are the only joins that filter without ever multiplying rows. Each row of x appears at most once. inner_join can multiply (each match in y becomes a separate row). Use semi_join when you only care about membership, not the join data itself.
semi_join() vs inner_join() vs anti_join() vs filter
Four ways to filter x by relationship to y.
| Approach | Keeps | Adds y cols | Multiplies rows |
|---|---|---|---|
| semi_join(x, y) | x rows that match | No | No |
| inner_join(x, y) | x rows that match | Yes | Yes (if y has duplicate keys) |
| anti_join(x, y) | x rows that DON'T match | No | No |
| filter(x, col %in% y$col) | Same as semi_join for a one-column key | No | No |
When to use which:
- semi_join for a clean filter, multi-column keys, and no row duplication.
- inner_join when you also want y's columns.
- anti_join for the opposite filter.
- filter(... %in% ...) for simple single-column tests in scripts.
A practical workflow
The "filter to known-good keys" pattern is semi_join's main use case.
This keeps only the transactions whose user_id is in approved_users. No columns are added and no rows are multiplied. It reads more cleanly than filter(user_id %in% approved_users$user_id), and unlike that form it extends directly to multi-column keys.
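A sketch of this workflow. The transactions and approved_users tables and their columns are hypothetical stand-ins for real data.

```r
library(dplyr)

# Hypothetical example data
transactions <- data.frame(
  user_id = c(101, 102, 103, 101),
  amount  = c(9.99, 25.00, 4.50, 12.00)
)
approved_users <- data.frame(
  user_id     = c(101, 103),
  approved_by = c("ops", "ops")
)

# Keep only transactions made by approved users
clean_transactions <- transactions %>%
  semi_join(approved_users, by = "user_id")
```

User 101 has two transactions and both are kept, since semi_join filters rows of x and does not collapse them.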
For data quality:
Split data by membership in a master list.
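One way to sketch the split, assuming a hypothetical records table and a hypothetical master list keyed by sku: semi_join gives the rows found in the master list, anti_join gives the rest.

```r
library(dplyr)

# Hypothetical example data
records <- data.frame(sku = c("A1", "B2", "C3", "Z9"), qty = c(3, 1, 7, 2))
master  <- data.frame(sku = c("A1", "B2", "C3"))

valid_records  <- semi_join(records, master, by = "sku")  # sku found in master
orphan_records <- anti_join(records, master, by = "sku")  # sku missing from master
```

Every row of records lands in exactly one of the two results, so their row counts add up to nrow(records).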
Common pitfalls
Pitfall 1: confusing semi_join with inner_join. semi_join filters without adding y's columns. inner_join filters AND adds. Pick based on whether you need the y data downstream.
Pitfall 2: forgetting that semi_join deduplicates the y side automatically. If y has multiple matching rows per x key, the result is still ONE row per x. inner_join would produce multiple rows.
semi_join errors if the by columns don't exist in both tables, the same as other joins. Make sure the column names match (or use a named by for differently-named columns).
Try it yourself
Try it: From mtcars, keep only cars with cyl values listed in a separate vector valid_cyl. Save to ex_filtered.
Solution:
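One possible solution sketch. It assumes valid_cyl is c(4, 6), which is not stated in the exercise, and wraps the vector in a one-column data frame because semi_join expects a table on the right.

```r
library(dplyr)

# Assumption: the allowed cylinder counts are 4 and 6
valid_cyl <- c(4, 6)

# semi_join needs a data frame, so wrap the vector in one
ex_filtered <- semi_join(mtcars, data.frame(cyl = valid_cyl), by = "cyl")
```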
Explanation: semi_join keeps rows whose cyl is in valid_cyl. cyl=8 cars are dropped.
Related dplyr functions
After mastering semi_join, look at:
- anti_join(): opposite filtering join
- inner_join(): filter + add columns
- filter(): scalar membership tests
- intersect(): set intersection on whole rows
- match(): base R; index lookups
- %in%: base R; one-column membership
For one-column key membership, filter(x, col %in% y$col) is shorter and equally clear.
FAQ
What does semi_join do in dplyr?
semi_join(x, y, by) keeps rows of x whose by key appears in y, WITHOUT adding any of y's columns. It is a filtering join.
What is the difference between semi_join and inner_join?
semi_join filters without adding columns. inner_join also filters but ADDS y's columns. semi_join never multiplies rows; inner_join can if y has duplicate keys.
Why use semi_join instead of filter(... %in% ...)?
For multi-column keys: semi_join(x, y, by = c("region","product")) is clean. filter(x, region %in% y$region & product %in% y$product) is wrong (it tests each column independently). semi_join handles tuple membership.
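The difference shows up in a small sketch. The sales and targets tables are hypothetical; targets lists two specific (region, product) pairs.

```r
library(dplyr)

# Hypothetical example data
sales <- data.frame(
  region  = c("east", "east", "west", "west"),
  product = c("A", "B", "A", "B"),
  units   = c(10, 20, 30, 40)
)
targets <- data.frame(region = c("east", "west"), product = c("A", "B"))

# Tests each column independently: every region and every product is "in" targets
independent <- filter(sales, region %in% targets$region & product %in% targets$product)

# Tests the (region, product) pair as a unit
paired <- semi_join(sales, targets, by = c("region", "product"))
```

Here the independent test keeps all 4 rows, while semi_join keeps only the 2 rows whose pair is actually listed in targets (east/A and west/B).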
Does semi_join multiply rows?
No. semi_join keeps each x row at most once, regardless of how many matches exist in y. inner_join would multiply.
What is the inverse of semi_join?
anti_join(x, y, by) keeps x rows that do NOT have a match in y. Together they partition x by membership in y.