dplyr join() Exercises: 10 Left, Right, Inner & Full Join Problems
Ten progressive exercises turn left_join(), right_join(), inner_join(), and full_join() into reflexes. Every problem runs in your browser, states the expected row count so you know when you are right, and hides a worked solution you can reveal after trying.
What datasets will you use for these exercises?
Every problem on this page reuses four tiny hand-built tables: employees, departments, salaries, and projects. They are deliberately small so you can eyeball every row and spot join bugs by hand. The keys are picked to expose the situations that trip learners up, employees with no department, departments with no employees, and a table that uses a different column name for its foreign key. Run the setup block once, then work through the problems in order.
Look closely at the keys before you touch any join. The employees$dept_id column contains 10, 20, 30. The departments$dept_id column contains 10, 20, 40. They overlap on just {10, 20}. Dept 30 (Dan) has no department row. Dept 40 (Legal) has no employees. These gaps drive every exercise below.
employees, departments, salaries, and projects. If your WebR session resets, come back and re-run Setup before continuing.Warm-Up, Your First Joins (Exercises 1–3)
These three exercises use one join each. If you get them right, you understand the core mechanics. Match the expected row counts exactly.
Exercise 1: Attach department names (inner join)
Join employees to departments so every row has a dept_name. Drop any employee whose dept_id is missing from departments. Save to ans1. Expected: 4 rows (Dan is dropped because dept 30 is not in departments).
Click to reveal solution
Explanation: inner_join() keeps rows only when the key exists in both tables. Dan has dept_id = 30, which is absent from departments, so he is dropped. Dept 40 (Legal) has no employees and is also dropped. Four rows survive.
Exercise 2: Keep every employee (left join)
Same tables, same key. This time keep every employee, even those without a department. Save to ans2. Expected: 5 rows (Dan stays, with dept_name = NA).
Click to reveal solution
Explanation: left_join() keeps every row from the left table (employees). When the right table has no match, the new columns fill with NA. This is the single most common join, use it whenever you want to enrich a primary table without losing rows.
Exercise 3: See what's missing on both sides (full join)
Join departments and salaries on dept_id. Show every department and every salary record, even orphans. Save to ans3. Expected: 4 rows (depts 10 and 20 match; dept 40 is in departments only; dept 30 is in salaries only).
Click to reveal solution
Explanation: full_join() returns the union of keys from both tables. Legal (dept 40) has no salary record, so avg_salary is NA. Dept 30 has a salary but no department name. Full join is the join you reach for when you need to audit data, it shows gaps on both sides.
Core Problems, Join Types That Feel Alike (Exercises 4–7)
Warm-up done. These four exercises mix join types, chain joins together, and handle keys whose column names do not match.
Exercise 4: Right join, then rewrite it as a left join
First, use right_join() to join employees to departments on dept_id. Save to ans4a. Expected: 5 rows (the 4 matched employees plus Legal with NA name).
Then rewrite the same result using left_join() by swapping the arguments. Save to ans4b. The two should contain the same rows.
Click to reveal solution
Explanation: right_join(A, B) and left_join(B, A) return the same rows, only the column order differs. Most R style guides recommend left_join() because it reads naturally: "start with this table, attach extras." Reach for right_join() only when pipe order makes the intent clearer.
Exercise 5: Chain two joins into one pipeline
Build a single table that has every employee, their department name, and their department's average salary. Use two joins. Save to ans5. Expected: 5 rows (Dan has a salary but no dept name; Cara and Asha share dept 10).
Click to reveal solution
Explanation: Chaining joins is just piping the result of one join into the next. Both joins share the dept_id key, so each by = looks identical. Dan keeps his salary (dept 30 exists in salaries) but has no dept_name (dept 30 is missing from departments).
Exercise 6: Join when key columns have different names
The employees table uses emp_id. The projects table uses employee_id. Join them so every employee gets their project list. Keep every employee, even those with no project. Save to ans6. Expected: 6 rows (Asha has 2 projects; Cara has 1; Bo, Dan, Evi have NA project rows; the orphan project "Ghost" is dropped).
Click to reveal solution
Explanation: The by = c("left_col" = "right_col") syntax maps differently named keys. The result keeps the left table's column name (emp_id). Asha appears twice because she has two projects, this row expansion is expected for one-to-many joins, not a bug.
join_by() is the modern alternative to by = strings. You can write by = join_by(emp_id == employee_id) for clearer intent and support for inequality joins. The string form still works everywhere and is fine for simple equality matches.Exercise 7: Join on multiple keys
Two tiny tables are built inline below. sales_q has one row per (region, quarter). targets_q has the target for each (region, quarter). Inner join them on both keys. Save to ans7. Expected: 3 rows (three combinations match; two targets rows have no matching sales).
Click to reveal solution
Explanation: Pass a character vector to by = to join on multiple keys. Both columns must match simultaneously. Only three (region, quarter) pairs appear in both tables: (N, Q1), (N, Q2), (S, Q1). The rest are dropped by the inner join.
Advanced Challenges, Filter Joins and Debugging (Exercises 8–10)
The last three exercises move past basic joining. You will filter rows by membership in another table, and debug joins that silently produce wrong answers.
Exercise 8: Filter employees by project membership
Use semi_join() to return only employees who have at least one project. Save to ans8a. Expected: 2 rows (Asha and Cara).
Then use anti_join() to return employees with no projects. Save to ans8b. Expected: 3 rows (Bo, Dan, Evi).
Click to reveal solution
Explanation: semi_join() and anti_join() are filtering joins. They return rows from the left table only, no columns from the right table are added. semi_join() keeps rows that have a match; anti_join() keeps rows that do not. Reach for them when you want to subset one table by membership in another without altering its columns.
semi_join() on duplicated keys in the right table still returns the left table's row count, not an inflated one.Exercise 9: Spot the row explosion
The coupons table below has duplicated keys, the same customer_id appears several times. Left-join customers to coupons on customer_id and examine how many rows come back. Save to ans9. Expected: 5 rows, not 3, because customer 1 has three coupons.
Then ask yourself: is this a bug or expected behaviour?
Click to reveal solution
Explanation: The result has 5 rows because Amy matches three coupons. This is expected for one-to-many joins and is exactly what you want when you need all coupons per customer. It becomes a bug only when you thought you were doing a one-to-one join. Rule of thumb: if the result has more rows than the left table did, inspect the right side for duplicated keys before trusting the output.
Exercise 10: Detect and fix an unintended many-to-many join
Two new tables below share the key product_id. Both sides have duplicates, orders has two rows for product 101, and prices has two rows for product 101 (two price tiers). A naive left join multiplies them into a 4-row cross product for that one product.
Do the naive join first and observe the row count. Then fix it by deduplicating prices to keep only the latest price per product, and join again. Save the fixed result to ans10. Expected after the fix: 3 rows, one per order.
Click to reveal solution
Explanation: The naive join has 4 rows because product 101 has 2 orders and 2 prices, so the join creates 2 × 2 = 4 rows for that product. Real order tables expect one price per order, not two. The fix reduces the right side to one row per key before joining, slice_max(price_date, n = 1) keeps the most recent price per product. Since dplyr 1.1.0, an uncontrolled many-to-many join also emits a warning; that warning is a red flag, not noise.
Detected an unexpected many-to-many relationship, stop and inspect, the join is almost certainly producing wrong row counts. Silence the warning only after you have verified the duplication is intentional by setting relationship = "many-to-many".How can you avoid the most common join bugs?
Four mistakes catch almost every dplyr learner. Each one has a simple fix.
Mistake 1: Omitting by = and hoping for the best
Leaving by = out lets dplyr guess keys by common column names. That works today, until a future table gains a shared column (updated_at, id) that was not meant as a join key.
Always name the key. It is one extra argument, and it stops silent bugs.
Mistake 2: Ignoring the many-to-many warning
Since dplyr 1.1.0, left_join() warns when the right side has duplicate keys and the match is not one-to-one. Many learners ignore it. If the row count jumps, that warning was pointing at a real bug, fix the right side before the join.
Mistake 3: Type mismatch on keys drops every match
If the left key is character ("101") and the right key is numeric (101), every join returns zero matches without an error. Check key types before joining:
If they differ, coerce one side with as.character() or as.integer().
Mistake 4: Using inner_join() when you meant left_join()
inner_join() drops left-table rows that have no match. If you joined a primary table to a lookup and the row count fell, you probably wanted left_join(). Default to left_join() when enriching a primary table, use inner_join() only when missing matches truly mean "exclude this row."
nrow(result) to the left table's row count. If the counts differ and you did not expect them to, stop and inspect before continuing.Putting It All Together: A Full Employee Report
Now combine what you practised into one pipeline. The goal is a single report that, for every employee, shows their department name, department salary, and the list of projects they are on. Missing data should show "—" instead of NA for readability.
This pipeline leans on every idea from the exercises above. The two left_join() calls keep every employee regardless of missing lookups. The inline group_by() |> summarise() collapses multiple projects per employee into a single string before joining, this is the defensive version of Exercise 10, preventing a row explosion up front. The coalesce() calls replace NA with "—" for display. The final result has exactly 5 rows, one per employee, with every piece of information needed for a management report.
Summary
Commit the row-keeping rules to memory. The join type is the answer to one question: which rows must survive?
| Join | Keeps from Left | Keeps from Right | Typical Use |
|---|---|---|---|
inner_join() |
matches only | matches only | Strict, both sides required |
left_join() |
all rows | matches only | Enrich the primary table |
right_join() |
matches only | all rows | Rarely used; prefer left_join |
full_join() |
all rows | all rows | Audit gaps on both sides |
semi_join() |
matches only | none (filter) | "Keep rows that exist over there" |
anti_join() |
non-matches only | none (filter) | "Keep rows missing over there" |
References
- dplyr mutating joins documentation. Link
- dplyr filtering joins documentation. Link
- Wickham, H., Cetinkaya-Rundel, M., & Grolemund, G., R for Data Science (2e), Chapter 19: Joins. Link
- Posit, dplyr 1.1.0 Joins announcement (introduces
join_by()and the many-to-many warning). Link - STAT 545, Join cheatsheet. Link
Continue Learning
- R Joins: left, right, inner, full, the parent tutorial that walks through each mutating join with diagrams.
- dplyr group_by() & summarise() Exercises, ten more problems covering aggregation and per-group slicing.
- dplyr filter() and select(), upstream verbs you often use before joining.