dplyr left_join() in R: Keep All Rows From the Left Table
The left_join() function in dplyr merges two tables by key, KEEPING all rows from the left table and adding matching columns from the right. Unmatched rows get NA in the new columns.
left_join(x, y, by = "id") # match by id left_join(x, y, by = c("id", "date")) # multi-column key left_join(x, y, by = c("id" = "user_id")) # different column names left_join(x, y, by = join_by(id)) # tidy-style key spec inner_join(x, y, by = "id") # only matching rows right_join(x, y, by = "id") # all rows from RIGHT
Need explanation? Read on for examples and pitfalls.
What left_join() does in one sentence
left_join(x, y, by) returns a data frame with EVERY row of x, plus matching columns from y joined on the by key; rows in x with no match get NA in y's columns. This is the most common merge in data analysis: "augment my main table with lookup info".
The "left" preserves the left table's row count regardless of matches in y.
Syntax
left_join(x, y, by = NULL, suffix = c(".x", ".y")). by is a character vector or named vector; auto-detected if NULL (matches all common columns, with a message).
by explicitly. Auto-detection works but emits a message and can match unintended columns. by = "id" is unambiguous.Five common patterns
1. Single-column key
2. Multi-column key
3. Differently-named keys
4. Modern join_by syntax (dplyr 1.1+)
join_by() supports inequality and rolling joins (helpful for time-series).
5. Many-to-many warning
Each combination of left and right rows for the same id is produced. Specifying relationship = "one-to-one" (or "many-to-one") errors instead, helping catch surprises.
left_join PRESERVES the LEFT table's row count UNLESS keys are duplicated in the right table. If the right table has multiple matching rows per left key, the left rows are DUPLICATED. This is why left_join can grow your data unexpectedly.left_join() vs inner_join() vs right_join() vs full_join()
The four mutating-join family in dplyr.
| Function | Keeps |
|---|---|
left_join(x, y) |
All rows of x; adds y's matches |
inner_join(x, y) |
Only rows in BOTH x and y |
right_join(x, y) |
All rows of y; adds x's matches |
full_join(x, y) |
All rows of x and y; NA for unmatched |
semi_join(x, y) |
Rows of x that have a match in y (no y columns) |
anti_join(x, y) |
Rows of x that have NO match in y |
When to use which:
left_joinfor "augment my main table".inner_joinfor "where both have data".right_joinrarely; usually flip and left_join.full_joinfor outer joins (rarely needed).semi_join/anti_joinfor filter-without-augment.
A practical workflow
The "augment with lookup" pattern is left_join's most common use.
Successive left_joins enrich the sales fact table with dimension data. Each join keeps all sales rows; missing matches show NA.
For point-in-time joins (e.g., "match each event to the most recent price"), use dplyr::join_by(event_date >= price_date) (1.1+).
Common pitfalls
Pitfall 1: row count multiplication. If the right table has duplicate keys, left_join multiplies left rows. Always check nrow(result) == nrow(x) after a join.
Pitfall 2: column name collision. When both tables have value, the result has value.x and value.y. Use suffix to customize, or rename before joining.
left_join with auto-by is convenient but RISKY. It joins on ALL common column names, which may include columns you didn't intend. Always specify by explicitly in production code.Try it yourself
Try it: Augment mtcars with an efficiency label by joining a lookup table mapping cyl to label. Save to ex_aug.
Click to reveal solution
Explanation: left_join(cyl_labels, by = "cyl") augments each car with its cyl-group label. All 32 mtcars rows preserved.
Related dplyr functions
After mastering left_join, look at:
inner_join()/right_join()/full_join(): other join typessemi_join()/anti_join(): filtering joinsjoin_by(): dplyr 1.1+ key spec, supports inequalitybind_cols(): position-based combine (no key)merge(): base R alternativedata.table::merge(): very fast for large data
For very large tables, the data.table package's joins are often 10x faster than dplyr's.
FAQ
What does left_join do in dplyr?
left_join(x, y, by) keeps every row of x and adds matching columns from y. Rows in x without a match get NA in y's columns.
What is the difference between left_join and inner_join?
left_join keeps ALL rows of x. inner_join keeps only rows that match in BOTH x and y. inner_join may return fewer rows; left_join never drops left rows.
How do I join on multiple columns?
Pass a character vector: by = c("region", "product"). Both must match for rows to merge.
How do I join on differently-named columns?
Use a named vector: by = c("id" = "user_id"). The left side's id matches the right side's user_id.
Why did my left_join return more rows than the original?
Because the right table has duplicate keys. left_join multiplies left rows when right has multiple matches. Use relationship = "one-to-one" to error instead, or deduplicate the right table first.