dplyr Joins in R: left_join, inner_join, full_join
dplyr joins combine two data frames on one or more matching key columns. Use left_join() to keep all rows of the first; inner_join() to keep only matched rows; full_join() to keep everything; semi_join() and anti_join() as filtering joins.
left_join(x, y, by = "id") # keep all rows of x right_join(x, y, by = "id") # keep all rows of y inner_join(x, y, by = "id") # keep only matched rows full_join(x, y, by = "id") # keep all from both semi_join(x, y, by = "id") # x rows that match (no y cols) anti_join(x, y, by = "id") # x rows with NO match left_join(x, y, by = c("id", "date")) # multi-column join
Need explanation? Read on for examples and pitfalls.
What dplyr joins do in one sentence
A dplyr join combines two data frames by matching values in one or more key columns. The six join verbs differ in WHICH rows they keep: left_join() keeps all rows of the first table, right_join() keeps all of the second, inner_join() keeps only rows present in both, full_join() keeps everything, and semi_join()/anti_join() are filtering joins that return only rows from the first table without adding columns from the second.
Unlike base R's merge(), dplyr joins make the join type explicit in the function name, never accidentally turn factors into characters, and integrate cleanly into pipelines.
Syntax
Every join takes two data frames x and y plus a by argument that names the matching keys. If columns share the same name in both tables, dplyr can auto-detect them, but always passing by explicitly is clearer.
The full signature for left_join() (others are similar):
left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"),
..., keep = FALSE, na_matches = "na", multiple = "all",
unmatched = "drop", relationship = NULL)
Most useful arguments: by for keys, suffix for resolving column-name conflicts, relationship for asserting one-to-one or many-to-one expectations.
by explicitly, even when keys share names. left_join(x, y, by = "id") is more readable than left_join(x, y). The latter prints "Joining with by = join_by(id)" warnings that clutter logs and miss when the auto-detected keys are wrong.Seven common patterns
1. left_join: keep all rows of x
Freddie has no instrument in the second table, so instrument is NA. left_join keeps every row of the left table regardless of match.
2. inner_join: only matched rows
Freddie is dropped because he is not in the second table. Brian is dropped because he is not in the first.
3. full_join: keep all rows from both
Brian appears with band = NA because he is not in the first table. Freddie has instrument = NA because he is not in the second.
4. semi_join: filter, do not add columns
semi_join() returns rows of x that match in y, but does NOT bring over columns from y. Use it as a presence filter.
5. anti_join: rows in x with NO match in y
anti_join() is the natural way to find "rows in x that are missing in y". Useful for reconciliation and validation.
6. Joining on multiple columns
Pass a character vector to by to match on multiple columns. Both keys must match for a row to join.
7. Joining when key columns have different names
Pass a named vector: c("name_in_x" = "name_in_y"). The result keeps only the left table's name for the joined column.
left_join keeps all left, right_join keeps all right, inner_join keeps only the intersection, full_join keeps everything. semi_join and anti_join have no SQL one-word equivalent; they are filtering joins, returning only x's columns.dplyr joins vs base R merge()
Base R has one merge() function with all.x, all.y, all arguments to select join type. dplyr uses six explicit verbs. Explicit beats positional: left_join(x, y) is clearer than merge(x, y, all.x = TRUE).
| Task | dplyr | Base R |
|---|---|---|
| Left join | left_join(x, y, by = "id") |
merge(x, y, by = "id", all.x = TRUE) |
| Inner join | inner_join(x, y, by = "id") |
merge(x, y, by = "id") |
| Full join | full_join(x, y, by = "id") |
merge(x, y, by = "id", all = TRUE) |
| Semi join | semi_join(x, y, by = "id") |
x[x$id %in% y$id, ] |
| Anti join | anti_join(x, y, by = "id") |
x[!x$id %in% y$id, ] |
| Different key names | left_join(x, y, by = c("xid"="yid")) |
merge(x, y, by.x="xid", by.y="yid", all.x=TRUE) |
When to use which:
- Use dplyr joins inside any pipeline.
- Use base
merge()only when you want zero package dependencies.
Common pitfalls
Pitfall 1: row duplication from many-to-many joins. If both x and y have multiple matches for a key, the result is the cross product. left_join(x, y) where x has 100 rows for id=1 and y has 5 rows for id=1 returns 500 rows for id=1. If you do not expect this, set relationship = "one-to-many" and dplyr will error if your assumption is violated.
Pitfall 2: column name conflicts get suffix.x and suffix.y. If both tables have a column named "value" (besides the key), the result has "value.x" and "value.y". To control naming, pass suffix = c("_left", "_right") or rename one table's column before joining.
id = NA in different tables WILL join. This is na_matches = "na" (the default). To treat NAs as non-matching (SQL convention), set na_matches = "never". For most data this default is fine; for SQL-style strict behavior, override.Pitfall 3: forgetting to dedupe before joining. If x has duplicate keys you did not expect, the join silently produces duplicates. Always inspect x |> count(key) |> filter(n > 1) before a join if you suspect duplication. Or use relationship = "one-to-one" to assert and fail loudly.
Try it yourself
Try it: Given the two tibbles below, perform a left_join on id so every row of customers is kept and orders data is added where it matches. Save the result to ex_joined.
Click to reveal solution
Explanation: left_join() keeps every row of x (customers) and brings in matching columns from y (orders). Cara has no order, so her amount is NA. The by = "id" argument tells dplyr which column to match on.
Related dplyr functions
After mastering dplyr joins, look at:
bind_rows(),bind_cols(): stack tables vertically or horizontally without keyscross_join(): explicit cross product (Cartesian)nest_join(): each x row gets a list-column of matching y rowsjoin_by(): rich syntax for inequality joins, rolling joins, and overlap joins (dplyr 1.1+)setdiff(),union(),intersect(): set operations on whole rows
For very large data, also check data.table joins (X[Y, on = "id"] syntax), often faster.
FAQ
What is the difference between left_join and inner_join in dplyr?
left_join(x, y) keeps every row of x, filling NA where there is no match in y. inner_join(x, y) keeps only rows that match in both. Use left_join when x is your primary table and you want to enrich it with optional info from y. Use inner_join when you want only fully-matched records.
How do I join two data frames in R with dplyr?
Pick a join function based on which rows you want to keep, then call it with by = "key_column": left_join(x, y, by = "id"). For multi-column keys, pass a character vector: by = c("a", "b"). For different column names: by = c("x_col" = "y_col").
What is the difference between semi_join and inner_join?
inner_join(x, y) returns matched rows AND brings over columns from y. semi_join(x, y) returns the same rows but does NOT bring y's columns. Use semi_join when you only need to filter x by membership in y, not to add y's data.
How do I do a full outer join in dplyr?
Use full_join(x, y, by = "id"). Every row from both tables appears; unmatched cells become NA. This is the equivalent of SQL FULL OUTER JOIN.
Why do I have duplicate rows after a dplyr join?
The join key has multiple matches in one or both tables. The result is the cross product of matches. To diagnose: x |> count(key) |> filter(n > 1) and same for y. To prevent: deduplicate before joining or use distinct(). To assert: pass relationship = "one-to-many" (dplyr 1.1+) and dplyr errors if your assumption is wrong.