dplyr join() Exercises: 10 Left, Right, Inner & Full Join Problems
Ten progressive exercises turn left_join(), right_join(), inner_join(), and full_join() into reflexes — each problem runs in your browser, with a worked solution you can reveal after you try.
Introduction
Joins look simple in theory. Then your real tables arrive with mismatched keys, different column names, and duplicate rows, and the join you wrote quietly returns the wrong answer. These ten exercises fix that faster than any tutorial.
You will solve ten problems against four tiny hand-built tables (employees, departments, salaries, projects). Each problem states the expected row count so you know when you are right. Difficulty climbs from a three-row inner join to debugging an unintended many-to-many explosion.
If the four mutating joins (left_join, right_join, inner_join, full_join) are new to you, skim the parent dplyr joins tutorial first. Otherwise, scroll to the setup block, click Run, and begin.
All code here runs in one shared R session. Variables from one block are available in the next. Use distinct names like my_result in your own attempts so you do not overwrite the tutorial tables.
Setup: The Datasets We Will Use
Before the exercises, we build four small tables. They are tiny on purpose — you can eyeball every row and spot join bugs by hand. Run this block once. Every later exercise assumes these objects exist.
The employees table has a dept_id key that points to departments. The salaries table keys on dept_id too. The projects table uses a different column name (employee_id) on purpose, to force you to handle mismatched keys.
Look closely at the keys. employees$dept_id contains 10, 20, 30. departments$dept_id contains 10, 20, 40. They overlap on {10, 20} only. Dept 30 (Dan) has no department row. Dept 40 (Legal) has no employees. These gaps drive every exercise below.
employees, departments, salaries, and projects. If you reset the R session, run Setup again before continuing.Warm-Up: Your First Joins (Exercises 1-3)
These three exercises use one join each. If you can get them right, you understand the core mechanics. Expected row counts are given — match them exactly.
Exercise 1: Attach department names (inner join)
Join employees to departments so every row has a dept_name. Drop any employee whose dept_id is missing from departments. Save to ans1. Expected: 4 rows (Dan is dropped because dept 30 is not in departments).
Click to reveal solution
Explanation: inner_join() keeps rows only when the key exists in both tables. Dan has dept_id = 30, which is absent from departments, so Dan is dropped. Dept 40 (Legal) has no employees and is also dropped. Four rows survive.
Exercise 2: Keep every employee (left join)
Same tables, same key. But this time keep every employee, even the ones without a department. Save to ans2. Expected: 5 rows (Dan stays, with dept_name = NA).
Click to reveal solution
Explanation: left_join() keeps every row from the left table (employees). When the right table has no match, the new columns fill with NA. This is the single most common join — use it whenever you want to enrich a primary table without losing rows.
Exercise 3: See what is missing on both sides (full join)
Join departments and salaries on dept_id. Show every department and every salary record, even orphans. Save to ans3. Expected: 4 rows (depts 10, 20 match; dept 40 is in departments only; dept 30 is in salaries only).
Click to reveal solution
Explanation: full_join() returns the union of keys from both tables. Legal (dept 40) has no salary record, so avg_salary is NA. Dept 30 has a salary but no department name. A full join is the join you use when you need to audit data — it shows gaps on both sides.
Core Problems: Join Types That Feel Alike (Exercises 4-7)
Warm-up done. These four exercises mix join types, chain joins together, and handle mismatched column names.
Exercise 4: Right join, and rewrite it as a left join
First: use right_join() to join employees to departments on dept_id. Save to ans4a. Expected: 5 rows (the 4 matched employees + Legal with NA name).
Then: rewrite the same result using left_join() instead, by swapping the arguments. Save to ans4b. The two should match.
Click to reveal solution
Explanation: right_join(A, B) and left_join(B, A) return the same rows — only the column order differs. Most R style guides recommend left_join() because it reads naturally: "start with this table, attach extras." Use right_join only when pipe order makes it clearer.
Exercise 5: Chain two joins
Build a single table that has every employee, their department name, and their department's average salary. Use two joins. Save to ans5. Expected: 5 rows (all employees; Dan has a salary but no dept name; Cara and Asha share dept 10).
Click to reveal solution
Explanation: Chaining joins is just piping the result of one join into the next. Because both joins share the dept_id key, you can write both by = "dept_id" the same way. Dan keeps his salary (dept 30 exists in salaries) but has no dept_name (dept 30 is missing from departments).
Exercise 6: Join when key columns have different names
employees uses emp_id. projects uses employee_id. Join them so every employee gets their project list. Keep every employee, even those with no project. Save to ans6. Expected: 6 rows (Asha has 2 projects; Cara has 1; Bo, Dan, Evi have NA project rows; the orphan project "Ghost" is dropped).
Click to reveal solution
Explanation: The by = c("left_col" = "right_col") syntax maps differently named keys. The result keeps the left table's column name (emp_id). Asha appears twice because she has two projects — this row expansion is normal and expected for one-to-many joins.
Exercise 7: Join on multiple keys
Two tiny tables are built inline below. sales_q has one row per (region, quarter). targets_q has the target for each (region, quarter). Inner join them on both keys. Save to ans7. Expected: 3 rows (three combinations match; two targets rows have no sales).
Click to reveal solution
Explanation: Pass a character vector to by = to join on multiple keys. Both columns must match simultaneously. Only three (region, quarter) pairs appear in both tables: (N, Q1), (N, Q2), (S, Q1). The rest are dropped by the inner join.
Advanced Challenges: Filter Joins and Debugging (Exercises 8-10)
The last three exercises move past basic joining. You will filter rows by membership in another table and debug joins that silently produce wrong answers.
Exercise 8: Filter employees by project membership
Use semi_join() to return only employees who have at least one project. Save to ans8a. Expected: 2 rows (Asha and Cara).
Then use anti_join() to return employees with no projects. Save to ans8b. Expected: 3 rows (Bo, Dan, Evi).
Click to reveal solution
Explanation: semi_join() and anti_join() are filtering joins. They return rows from the left table only — no columns from the right table are added. semi_join keeps rows that have a match; anti_join keeps rows that do not. Use them when you want to subset one table by membership in another without altering its columns.
Exercise 9: Spot the row explosion
The coupons table below has duplicated keys — the same customer_id appears several times. Join customers to coupons with a left join on customer_id and examine how many rows come back. Save to ans9. Expected: 5 rows, not 3, because customer 1 has three coupons.
Then answer for yourself: is this a bug or expected behaviour?
Click to reveal solution
Explanation: The result has 5 rows because Amy matches three coupons. This is expected for one-to-many joins and is exactly the behaviour you want when you need all coupons per customer. It becomes a bug only when you think you are doing a one-to-one join. If your join produces more rows than the left table had, check your right-side table for duplicated keys — that is the signal.
Exercise 10: Detect and fix an unintended many-to-many join
Two new tables below share the key product_id. Both sides have duplicates — orders has two rows for product 101, and prices has two rows for product 101 (two price tiers). A naive left_join multiplies them into a 4-row cross product for that one product.
Do the naive join first and observe the row count. Then fix it by deduplicating prices to keep only the latest price per product before joining. Save the fixed result to ans10. Expected after fix: 3 rows (one row per order).
Click to reveal solution
Explanation: The naive join has 4 rows because product 101 has 2 orders and 2 prices, so the join creates 2 x 2 = 4 rows for that product. Real order tables expect one price per order, not two. The fix: reduce the right side to one row per key before joining. slice_max(price_date, n = 1) keeps the most recent price per product. Since dplyr 1.1.0, an uncontrolled many-to-many join also warns you — that warning is a red flag, not noise.
Common Mistakes and How to Fix Them
Mistake 1: Omitting by = and hoping for the best
Leaving by = out lets dplyr guess keys by common column names. That works until your tables gain a shared column (updated_at, id) that was not meant as a join key.
Bad:
Good:
Always name the key. It is one extra argument and it stops silent bugs.
Mistake 2: Ignoring the many-to-many warning
Since dplyr 1.1.0, left_join() warns when the right side has duplicate keys and the match is not one-to-one. Many learners ignore it. If the row count jumps, the warning was pointing at a real bug — fix the right side before joining.
Mistake 3: Type mismatch on keys drops every match
If the left key is character ("101") and the right key is numeric (101), every join returns zero matches without error. Check key types before joining:
If they differ, coerce one side with as.character() or as.integer().
Mistake 4: Using inner_join() when you meant left_join()
inner_join() drops left-table rows that have no match. If you joined a primary table to a lookup and your row count fell, you probably wanted left_join(). Rule of thumb: **when enriching a primary table, default to left_join()**.
nrow(result) to the left table's row count. If the counts differ and you did not expect them to, stop and inspect before continuing.Summary
These four mutating joins are the workhorses. Commit the row-keeping rules to memory:
| Join | Keeps from Left | Keeps from Right | Typical Use |
|---|---|---|---|
inner_join() |
matches only | matches only | Strict — both sides required |
left_join() |
all rows | matches only | Enrich the primary table |
right_join() |
matches only | all rows | Rarely used; prefer left_join |
full_join() |
all rows | all rows | Audit gaps on both sides |
semi_join() |
matches only | none (filter) | "Keep rows that exist over there" |
anti_join() |
non-matches only | none (filter) | "Keep rows missing over there" |
FAQ
How is left_join() different from base R merge()?
merge() is slower, reorders rows by default, and renames duplicate columns with .x and .y suffixes. left_join() preserves the left table's row order, is faster on large tables, and matches the dplyr style. Both work; left_join() is idiomatic tidyverse.
Why does my join produce more rows than the left table had?
The right table has duplicate keys. Each duplicate creates a new row in the result. Check for duplicates on the right side with count(right_table, key) |> filter(n > 1). If the duplication is intentional (like Asha's two projects above), leave it; otherwise deduplicate with distinct() or slice_max() first.
When should I use semi_join() instead of inner_join() |> select()?
Use semi_join() when you only want to filter the left table by membership in the right table — you do not need any columns from the right table. It is clearer and avoids accidentally creating duplicates when the right side has many rows per key.
Can I join on more than two columns?
Yes. Pass a character vector: by = c("region", "quarter", "product_id"). All listed columns must match for a row to join. This is exactly what Exercise 7 demonstrated.
References
- dplyr mutating joins documentation. Link
- dplyr filtering joins documentation. Link
- Wickham, H., Cetinkaya-Rundel, M., & Grolemund, G. — R for Data Science (2e), Chapter 19: Joins. Link
- Posit — dplyr 1.1.0: Joins announcement (introduces
join_by()and many-to-many warning). Link - STAT 545 — Join cheatsheet. Link
What's Next?
- R Joins: left, right, inner, full — the parent tutorial that walks through each mutating join with diagrams.
- dplyr group_by() & summarise() Exercises — ten more problems covering aggregation and per-group slicing.
- dplyr filter() and select() — upstream verbs you often use before joining.