dplyr left_join() in R: Keep All Rows From the Left Table

The left_join() function in dplyr merges two tables by key, KEEPING all rows from the left table and adding matching columns from the right. Unmatched rows get NA in the new columns.

⚡ Quick Answer
left_join(x, y, by = "id")                 # match by id
left_join(x, y, by = c("id", "date"))      # multi-column key
left_join(x, y, by = c("id" = "user_id"))  # different column names
left_join(x, y, by = join_by(id))          # tidy-style key spec
inner_join(x, y, by = "id")                # only matching rows
right_join(x, y, by = "id")                # all rows from RIGHT

Need explanation? Read on for examples and pitfalls.

📊 Is left_join() the right tool?
STARTkeep ALL rows from left, add matches from rightleft_join()keep ONLY matched rowsinner_join()keep ALL from rightright_join()keep ALL from BOTHfull_join()keep left rows that DO matchsemi_join()keep left rows that do NOT matchanti_join()combine without key (by position)bind_cols()

What left_join() does in one sentence

left_join(x, y, by) returns a data frame with EVERY row of x, plus matching columns from y joined on the by key; rows in x with no match get NA in y's columns. This is the most common merge in data analysis: "augment my main table with lookup info".

The "left" preserves the left table's row count regardless of matches in y.

Syntax

left_join(x, y, by = NULL, suffix = c(".x", ".y")). by is a character vector or named vector; auto-detected if NULL (matches all common columns, with a message).

Run live
Run live, no install needed. Every R block on this page runs in your browser. Click Run, edit the code, re-run instantly. No setup.
RAugment customer table with order count
library(dplyr) customers <- data.frame(id = 1:4, name = c("a","b","c","d")) orders <- data.frame(id = c(1, 1, 3, 5), amount = c(10, 20, 30, 40)) left_join(customers, orders, by = "id") #> id name amount #> 1 1 a 10 #> 2 1 a 20 <-- duplicates because customer 1 has 2 orders #> 3 2 b NA <-- no orders #> 4 3 c 30 #> 5 4 d NA

  
Tip
Always specify by explicitly. Auto-detection works but emits a message and can match unintended columns. by = "id" is unambiguous.

Five common patterns

1. Single-column key

RMatch by id
left_join(customers, orders, by = "id")

  

2. Multi-column key

RMatch by composite key
sales <- data.frame( region = c("NA","NA","EU"), product = c("X","Y","X"), qty = c(100, 200, 50) ) catalog <- data.frame( region = c("NA","EU"), product = c("X","X"), price = c(10, 12) ) left_join(sales, catalog, by = c("region","product")) #> region product qty price #> 1 NA X 100 10 #> 2 NA Y 200 NA <-- no Y in catalog #> 3 EU X 50 12

  

3. Differently-named keys

RMap customer.id to order.user_id
customers <- data.frame(id = 1:3, name = c("a","b","c")) orders <- data.frame(user_id = c(1, 1, 2), amount = c(10, 20, 30)) left_join(customers, orders, by = c("id" = "user_id")) #> id name amount #> 1 1 a 10 #> 2 1 a 20 #> 3 2 b 30 #> 4 3 c NA

  

4. Modern join_by syntax (dplyr 1.1+)

RSame as above, modern syntax
left_join(customers, orders, by = join_by(id == user_id))

  

join_by() supports inequality and rolling joins (helpful for time-series).

5. Many-to-many warning

RWatch for unintended row multiplication
df_a <- data.frame(id = c(1, 1), x = c("a","b")) df_b <- data.frame(id = c(1, 1), y = c("p","q")) left_join(df_a, df_b, by = "id", relationship = "many-to-many") #> id x y #> 1 1 a p #> 2 1 a q #> 3 1 b p #> 4 1 b q

  

Each combination of left and right rows for the same id is produced. Specifying relationship = "one-to-one" (or "many-to-one") errors instead, helping catch surprises.

Key Insight
left_join PRESERVES the LEFT table's row count UNLESS keys are duplicated in the right table. If the right table has multiple matching rows per left key, the left rows are DUPLICATED. This is why left_join can grow your data unexpectedly.

left_join() vs inner_join() vs right_join() vs full_join()

The four mutating-join family in dplyr.

Function Keeps
left_join(x, y) All rows of x; adds y's matches
inner_join(x, y) Only rows in BOTH x and y
right_join(x, y) All rows of y; adds x's matches
full_join(x, y) All rows of x and y; NA for unmatched
semi_join(x, y) Rows of x that have a match in y (no y columns)
anti_join(x, y) Rows of x that have NO match in y

When to use which:

  • left_join for "augment my main table".
  • inner_join for "where both have data".
  • right_join rarely; usually flip and left_join.
  • full_join for outer joins (rarely needed).
  • semi_join / anti_join for filter-without-augment.

A practical workflow

The "augment with lookup" pattern is left_join's most common use.

RStar schema augmentation
sales |> left_join(products, by = "product_id") |> left_join(regions, by = "region_id") |> left_join(dates, by = "date")

  

Successive left_joins enrich the sales fact table with dimension data. Each join keeps all sales rows; missing matches show NA.

For point-in-time joins (e.g., "match each event to the most recent price"), use dplyr::join_by(event_date >= price_date) (1.1+).

Common pitfalls

Pitfall 1: row count multiplication. If the right table has duplicate keys, left_join multiplies left rows. Always check nrow(result) == nrow(x) after a join.

Pitfall 2: column name collision. When both tables have value, the result has value.x and value.y. Use suffix to customize, or rename before joining.

Warning
left_join with auto-by is convenient but RISKY. It joins on ALL common column names, which may include columns you didn't intend. Always specify by explicitly in production code.

Try it yourself

Try it: Augment mtcars with an efficiency label by joining a lookup table mapping cyl to label. Save to ex_aug.

RYour turn: augment mtcars with cyl labels
cyl_labels <- data.frame(cyl = c(4, 6, 8), label = c("eco","balanced","power")) ex_aug <- mtcars |> # your code here head(ex_aug[, c("mpg","cyl","label")]) #> Expected: original + label column

  
Click to reveal solution
RSolution
ex_aug <- mtcars |> left_join(cyl_labels, by = "cyl") head(ex_aug[, c("mpg","cyl","label")]) #> mpg cyl label #> Mazda RX4 21.0 6 balanced #> Mazda RX4 Wag 21.0 6 balanced #> Datsun 710 22.8 4 eco #> Hornet 4 Drive 21.4 6 balanced #> Hornet Sportabout 18.7 8 power #> Valiant 18.1 6 balanced

  

Explanation: left_join(cyl_labels, by = "cyl") augments each car with its cyl-group label. All 32 mtcars rows preserved.

After mastering left_join, look at:

  • inner_join() / right_join() / full_join(): other join types
  • semi_join() / anti_join(): filtering joins
  • join_by(): dplyr 1.1+ key spec, supports inequality
  • bind_cols(): position-based combine (no key)
  • merge(): base R alternative
  • data.table::merge(): very fast for large data

For very large tables, the data.table package's joins are often 10x faster than dplyr's.

FAQ

What does left_join do in dplyr?

left_join(x, y, by) keeps every row of x and adds matching columns from y. Rows in x without a match get NA in y's columns.

What is the difference between left_join and inner_join?

left_join keeps ALL rows of x. inner_join keeps only rows that match in BOTH x and y. inner_join may return fewer rows; left_join never drops left rows.

How do I join on multiple columns?

Pass a character vector: by = c("region", "product"). Both must match for rows to merge.

How do I join on differently-named columns?

Use a named vector: by = c("id" = "user_id"). The left side's id matches the right side's user_id.

Why did my left_join return more rows than the original?

Because the right table has duplicate keys. left_join multiplies left rows when right has multiple matches. Use relationship = "one-to-one" to error instead, or deduplicate the right table first.