R Joins Explained: inner_join, left_join, full_join — With Visual Diagrams
Joins combine two data frames by matching rows on shared columns. dplyr provides six join types: inner_join (only matches), left_join (all left rows), right_join (all right rows), full_join (everything), plus semi_join and anti_join for filtering.
Joins are how you combine information from different tables — employees with departments, orders with products, students with grades. The join type determines what happens to rows that don't have a match.
Setup: Two Example Tables
library(dplyr)
employees <- data.frame(
id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Carol", "David", "Eve"),
dept_id = c(10, 20, 10, 30, 20)
)
departments <- data.frame(
dept_id = c(10, 20, 40),
dept_name = c("Engineering", "Marketing", "Sales")
)
cat("Employees:\n"); print(employees)
cat("\nDepartments:\n"); print(departments)
cat("\nNote: David has dept_id=30 (no match). Sales has dept_id=40 (no employees).\n")
inner_join: Only Matching Rows
Keeps rows that have a match in BOTH tables. David (dept 30) and Sales (dept 40) are both dropped.
Keeps ALL rows from the left table. Unmatched right values become NA. This is the most common join type.
library(dplyr)
employees <- data.frame(id=1:5, name=c("Alice","Bob","Carol","David","Eve"), dept_id=c(10,20,10,30,20))
departments <- data.frame(dept_id=c(10,20,40), dept_name=c("Engineering","Marketing","Sales"))
left_join(employees, departments, by = "dept_id")
# David (dept 30) has NA for dept_name — no matching department
right_join: All Right Rows + Matches
library(dplyr)
employees <- data.frame(id=1:5, name=c("Alice","Bob","Carol","David","Eve"), dept_id=c(10,20,10,30,20))
departments <- data.frame(dept_id=c(10,20,40), dept_name=c("Engineering","Marketing","Sales"))
right_join(employees, departments, by = "dept_id")
# Sales (dept 40) appears with NA for id, name — no matching employee
full_join: Keep Everything
Keeps ALL rows from BOTH tables. Missing matches become NA on either side.
library(dplyr)
employees <- data.frame(id=1:5, name=c("Alice","Bob","Carol","David","Eve"), dept_id=c(10,20,10,30,20))
departments <- data.frame(dept_id=c(10,20,40), dept_name=c("Engineering","Marketing","Sales"))
full_join(employees, departments, by = "dept_id")
# Both David (no dept match) AND Sales (no employee match) kept
Join Type Summary
Join
Left rows kept
Right rows kept
Adds columns?
inner_join
Matched only
Matched only
Yes
left_join
All
Matched only
Yes
right_join
Matched only
All
Yes
full_join
All
All
Yes
semi_join
Matched only
—
No
anti_join
Unmatched only
—
No
semi_join and anti_join: Filtering Joins
These don't add columns — they filter the left table based on whether matches exist in the right table.
library(dplyr)
employees <- data.frame(id=1:5, name=c("Alice","Bob","Carol","David","Eve"), dept_id=c(10,20,10,30,20))
departments <- data.frame(dept_id=c(10,20,40), dept_name=c("Engineering","Marketing","Sales"))
# semi_join: employees who HAVE a matching department
cat("Employees with matching dept:\n")
semi_join(employees, departments, by = "dept_id")
library(dplyr)
employees <- data.frame(id=1:5, name=c("Alice","Bob","Carol","David","Eve"), dept_id=c(10,20,10,30,20))
departments <- data.frame(dept_id=c(10,20,40), dept_name=c("Engineering","Marketing","Sales"))
# anti_join: employees who DON'T have a matching department
cat("Employees without matching dept:\n")
anti_join(employees, departments, by = "dept_id")
Use semi_join when you want "give me rows from A that exist in B" without adding B's columns. Use anti_join for "give me rows from A that DON'T exist in B" — perfect for finding orphan records.
Joining on Multiple Keys
When a single column isn't enough to identify a match, join on multiple columns.
library(dplyr)
midterms <- data.frame(
student = c("Alice","Alice","Bob","Bob"),
subject = c("Math","English","Math","English"),
midterm = c(88, 92, 76, 81)
)
finals <- data.frame(
student = c("Alice","Alice","Bob"),
subject = c("Math","English","Math"),
final = c(91, 88, 82)
)
# Join on BOTH student AND subject
left_join(midterms, finals, by = c("student", "subject"))
# Bob's English final is NA — no match for that combination
Different Column Names
When the key columns have different names in each table, use a named vector.
library(dplyr)
orders <- data.frame(order_id = 1:3, customer_id = c(101, 102, 101))
customers <- data.frame(cust_id = c(101, 102, 103), name = c("Alice", "Bob", "Carol"))
# Key is customer_id in orders, cust_id in customers
left_join(orders, customers, by = c("customer_id" = "cust_id"))
Handling Column Name Conflicts
When both tables have a non-key column with the same name, dplyr adds suffixes.
library(dplyr)
df1 <- data.frame(id = 1:3, value = c(10, 20, 30))
df2 <- data.frame(id = 1:3, value = c(100, 200, 300))
left_join(df1, df2, by = "id", suffix = c("_old", "_new"))
Join customers with orders, then find customers who have never ordered.
library(dplyr)
customers <- data.frame(id = 1:5, name = c("Alice","Bob","Carol","David","Eve"))
orders <- data.frame(order_id = 1:4, cust_id = c(1,1,3,5), amount = c(50,30,75,20))
# 1. Show all customers with their orders (keep all customers)
# 2. Find customers who have zero orders
Click to reveal solution
```r
library(dplyr)
customers <- data.frame(id = 1:5, name = c("Alice","Bob","Carol","David","Eve"))
orders <- data.frame(order_id = 1:4, cust_id = c(1,1,3,5), amount = c(50,30,75,20))
cat("All customers with orders:\n")
left_join(customers, orders, by = c("id" = "cust_id"))
cat("\nCustomers with no orders:\n")
anti_join(customers, orders, by = c("id" = "cust_id"))
**Explanation:** `left_join` keeps all customers; those without orders have NA in order columns. `anti_join` returns only customers with no match in the orders table.
Exercise 2: Three-Table Join
Join students → enrollments → courses to get a complete picture.
library(dplyr)
students <- data.frame(student_id = 1:4, name = c("Alice","Bob","Carol","David"))
enrollments <- data.frame(student_id = c(1,1,2,3), course_id = c("CS101","MATH201","CS101","MATH201"))
courses <- data.frame(course_id = c("CS101","MATH201","ENG101"), title = c("Intro CS","Calculus","English"))
# Chain two joins to get: student name + course title
Click to reveal solution
```r
library(dplyr)
students <- data.frame(student_id = 1:4, name = c("Alice","Bob","Carol","David"))
enrollments <- data.frame(student_id = c(1,1,2,3), course_id = c("CS101","MATH201","CS101","MATH201"))
courses <- data.frame(course_id = c("CS101","MATH201","ENG101"), title = c("Intro CS","Calculus","English"))
students |>
left_join(enrollments, by = "student_id") |>
left_join(courses, by = "course_id") |>
select(name, title)
**Explanation:** Chain joins with the pipe. David has no enrollment → NA for course_id and title. ENG101 has no students → doesn't appear (it's not in the left tables).
Exercise 3: Find Mismatches Both Ways
Find employees without valid departments AND departments without any employees.
library(dplyr)
emps <- data.frame(name = c("Alice","Bob","Carol","David"), dept_id = c(1,2,1,4))
depts <- data.frame(dept_id = c(1,2,3), dept_name = c("Eng","Mkt","Sales"))
cat("Employees without valid dept:\n")
anti_join(emps, depts, by = "dept_id")
cat("\nDepartments with no employees:\n")
anti_join(depts, emps, by = "dept_id")
**Explanation:** `anti_join(A, B)` returns rows in A with no match in B. Swap the arguments to find the opposite direction's mismatches.
Summary
Join
Keeps
Use case
inner_join(a, b)
Only matching rows
"Give me rows that exist in both"
left_join(a, b)
All from a + matches from b
"Keep all of a, add b's info" (most common)
right_join(a, b)
Matches from a + all from b
Same as left_join with arguments swapped
full_join(a, b)
Everything from both
"Keep everything, fill gaps with NA"
semi_join(a, b)
a rows that match b
"Filter a to things that exist in b"
anti_join(a, b)
a rows that DON'T match b
"Find orphans — what's in a but not b"
FAQ
What happens with duplicate keys?
If a key matches multiple rows in the other table, you get all combinations. For a one-to-many join (one customer, many orders), this is correct — each order gets the customer info. For accidental many-to-many joins, rows multiply unexpectedly.
Should I use dplyr joins or merge()?
Use dplyr joins. They preserve row order, offer semi/anti joins, produce clear error messages, and work naturally with the pipe. Use merge() only when dplyr isn't available.
What if the key columns have different names?
Use a named vector: left_join(a, b, by = c("emp_id" = "employee_id")). The left side is the column name in a, the right side in b.
How do I join on ALL shared column names?
Omit by = — dplyr will auto-detect shared column names and join on all of them. But being explicit is safer to avoid accidental joins on unintended columns.