dplyr Exercises: 15 Data Manipulation Practice Problems
A set of 15 hands-on dplyr problems covering filter(), mutate(), summarise(), group_by(), joins, and across(), each with a runnable solution you can execute on this page. Difficulty progresses from beginner to advanced, so you can stop where you get comfortable and come back for the harder problems later.
How should you tackle these 15 dplyr problems?
The 15 problems below are grouped into three blocks of five. Exercises 1-5 cover one or two verbs at a time. Exercises 6-10 mix conditions, across(), case_when(), and joins. Exercises 11-15 stitch three or more concepts into real pipelines like grouped ranking and share of total. Every exercise gives you a starter block, hides the solution behind a reveal, and explains the result. Solve, compare, read the explanation, and move on. All code runs in a single shared R session, so the warm-up below loads dplyr once; the exercises after it do not need to reload it.
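A warm-up in this spirit (a sketch; the page's original block may differ slightly in naming):

```r
# Load dplyr once; every later exercise block reuses this shared session
library(dplyr)

mtcars |>
  filter(cyl == 4) |>                        # keep only the 4-cylinder cars
  summarise(n = n(),                         # how many rows survived the filter
            mean_mpg = round(mean(mpg), 1))  # their average fuel economy
```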
Read the pipe top-to-bottom: take mtcars, keep only 4-cylinder cars, then reduce the 11 surviving rows into a single row showing the count and the rounded mean mpg. If that pipeline looks natural to you, Exercises 1-5 should feel fast. If it does not, that is exactly what the first five problems are for.
The warm-up runs library(dplyr); none of the 15 exercise blocks repeat it. Variables you create in one block are visible in every later block, just like cells in a Jupyter notebook. Use my_-prefixed names in your own exercise code so you do not accidentally overwrite warm-up variables.
Easy (1-5): filter, select, mutate, summarise basics
These five exercises use one or two verbs each. If you have read the dplyr filter() and select() tutorial, you already have what you need.
Exercise 1: Filter and select the fuel-efficient cars
Keep the cars with mpg > 25 and show only mpg, cyl, hp, and the car name. mtcars stores the name as a row name, not a column, so you will need to lift it into a column before select() can see it.
Click to reveal solution
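One possible solution, matching the explanation below (column order may differ from the page's original):

```r
mtcars |>
  mutate(car = rownames(mtcars)) |>  # lift row names into a regular column
  filter(mpg > 25) |>                # keep the six fuel-efficient cars
  select(car, mpg, cyl, hp)
```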
Explanation: mutate(car = rownames(mtcars)) copies the row names into a regular column so select() can include them. Six cars clear the 25-mpg bar, and all six are 4-cylinder; no surprise given a 1974 dataset.
Exercise 2: Add a power-to-weight column
Add a new column pwr_wt equal to hp / wt rounded to one decimal, and show the first 5 rows with only the relevant columns (mpg, hp, wt, pwr_wt).
Click to reveal solution
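One way to write it, consistent with the explanation below:

```r
mtcars |>
  mutate(pwr_wt = round(hp / wt, 1)) |>  # horsepower per unit of weight
  select(mpg, hp, wt, pwr_wt) |>         # just the reporting columns
  head(5)                                # first five rows only
```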
Explanation: mutate() adds the new column; select() picks the reporting set; head() trims to the first five rows. The pwr_wt column measures horsepower per unit of car weight, a cleaner "how punchy is this car" index than raw hp.
Common pitfall: inside filter() and mutate(), = is assignment and equality is ==. Writing filter(mtcars, mpg = 20) attempts to assign 20 to an argument called mpg and fails with a confusing error. Always use filter(mtcars, mpg == 20) when testing equality. The same rule applies to any condition in case_when() or if_else().
Exercise 3: Sort by mpg and show the top 5
Arrange the cars in descending mpg order and show the top 5 with their name, mpg, and cylinder count.
Click to reveal solution
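A solution along these lines:

```r
mtcars |>
  mutate(car = rownames(mtcars)) |>
  arrange(desc(mpg)) |>    # most efficient cars first
  select(car, mpg, cyl) |>
  head(5)
```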
Explanation: arrange(desc(mpg)) sorts in descending order; without desc() you would get the least efficient cars first. The Toyota Corolla wins the 1974 fuel-economy crown, and every car in the top 5 is 4-cylinder.
Exercise 4: Count and mean mpg per cylinder group
For each cylinder group (4, 6, 8), compute the count of cars and the mean mpg rounded to one decimal.
Click to reveal solution
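One way to write it, matching the numbers in the explanation below:

```r
mtcars |>
  group_by(cyl) |>
  summarise(n = n(),
            mean_mpg = round(mean(mpg), 1),
            .groups = "drop")  # return a plain, ungrouped tibble
```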
Explanation: group_by(cyl) creates three implicit sub-tables; summarise() collapses each into a single row. The .groups = "drop" argument returns a plain tibble instead of a still-grouped one, saving you an ungroup() step later. As expected, fewer cylinders mean better mileage: 26.7 mpg for 4-cyl, 15.1 mpg for 8-cyl.
Exercise 5: Count iris flowers by species and add percentage
Use count() to count iris rows per Species, then mutate() a pct column that gives each species' share as a rounded percentage.
Click to reveal solution
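A solution consistent with the explanation below:

```r
iris |>
  count(Species) |>                         # rows per species, in column n
  mutate(pct = round(100 * n / sum(n), 1))  # each species' share of the total
```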
Explanation: count(Species) is shorthand for group_by(Species) |> summarise(n = n()). The iris dataset is perfectly balanced: each species has exactly 50 observations, so each is 33.3% of the total. In an unbalanced real-world dataset this same pattern is how you spot class imbalance for a classifier.
Medium (6-10): compound logic, across(), case_when, joins
These five exercises combine two or more concepts. If you can solve them without looking at the reveal, you are comfortable with everyday dplyr.
Exercise 6: Filter with two conditions against the dataset mean
Keep only cars where mpg is above the dataset mean AND weight is below the dataset mean. Show mpg, hp, wt, sorted by mpg descending.
Click to reveal solution
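One possible solution, matching the explanation below:

```r
mtcars |>
  filter(mpg > mean(mpg),   # both means are computed on the full 32-row table
         wt < mean(wt)) |>  # comma between conditions means AND
  select(mpg, hp, wt) |>
  arrange(desc(mpg))
```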
Explanation: mean(mpg) and mean(wt) are evaluated against the full mtcars table before filtering, so both thresholds come from the original 32-row dataset. Separating conditions with a comma inside filter() is identical to chaining with &, but the comma reads more naturally for simple AND logic.
Exercise 7: Label cars as Economy, Standard, or Guzzler with case_when
Create a type column using case_when() with the rules: mpg > 25 → Economy, mpg >= 15 → Standard, everything else → Guzzler. Then count cars per type, sorted by count.
Click to reveal solution
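A solution along these lines:

```r
mtcars |>
  mutate(type = case_when(
    mpg > 25  ~ "Economy",   # checked first, so order matters
    mpg >= 15 ~ "Standard",
    TRUE      ~ "Guzzler"    # catch-all default
  )) |>
  count(type, sort = TRUE)   # cars per label, largest group first
```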
Explanation: case_when() checks each condition top to bottom and assigns the first matching label. Because a car with mpg = 30 matches both mpg > 25 and mpg >= 15, the order matters: putting mpg >= 15 first would classify every Economy car as Standard. The final TRUE ~ "Guzzler" is the catch-all default.
Exercise 8: Mean and standard deviation of every numeric column with across()
For each iris species, compute the mean and standard deviation of every numeric column in one call. Round both statistics to 2 decimals. The result columns should be named like Sepal.Length_mean, Sepal.Length_sd, etc.
Click to reveal solution
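One way to write it, consistent with the column-naming requirement:

```r
iris |>
  group_by(Species) |>
  summarise(
    across(where(is.numeric),                 # every numeric column
           list(mean = ~ round(mean(.x), 2),  # apply both functions to each
                sd   = ~ round(sd(.x), 2)),
           .names = "{.col}_{.fn}"),          # e.g. Sepal.Length_mean
    .groups = "drop"
  )
```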
Explanation: across(where(is.numeric), ...) picks every numeric column, then the named list applies both mean and sd to each. The .names = "{.col}_{.fn}" glue template produces tidy column names you can read directly. This single call replaces four separate summarise(mean_...)/summarise(sd_...) lines; it is one of the biggest ergonomic wins in modern dplyr.
across() is the modern successor to the _at, _if, and _all suffix family. If you see older tutorials using summarise_if(is.numeric, mean) or mutate_at(vars(a, b), scale), the current idiom is summarise(across(where(is.numeric), mean)) and mutate(across(c(a, b), scale)). One pattern to learn covers every multi-column reduction.
Exercise 9: Convert iris column names to snake_case
Rename every column of iris to snake_case (lowercase, dots replaced with underscores), then move species to the first position. Show the first 4 rows.
Click to reveal solution
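A solution along these lines:

```r
iris |>
  rename_with(~ tolower(gsub("\\.", "_", .x))) |>  # Sepal.Length -> sepal_length
  select(species, everything()) |>                 # species first, rest as-is
  head(4)
```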
Explanation: rename_with() takes a function and applies it to every column name. The lambda ~ tolower(gsub("\\.", "_", .x)) lowercases the name and swaps the literal dot for an underscore; the backslash escapes the dot, which is a regex metacharacter. select(species, everything()) is the standard idiom for "move this column first, keep the rest as-is."
Exercise 10: Find employees without a matching department
Create two small data frames as shown in the starter block, then use anti_join() to find the employees whose department is not listed in the departments table. As a sanity check, also left_join() the two tables so you can see every employee's budget (or NA when there is no match).
Click to reveal solution
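The original starter tables are not reproduced in this extract, so the data below is hypothetical: names, departments, and budget figures are invented, arranged only so that David's department has no match, as the explanation requires.

```r
# Hypothetical starter data; only the "David has no match" property is fixed
employees <- data.frame(
  name = c("Alice", "Bob", "Carol", "David"),
  dept = c("Sales", "IT", "Sales", "Legal")
)
departments <- data.frame(
  dept   = c("Sales", "IT", "HR"),
  budget = c(100000, 150000, 80000)
)

employees |> left_join(departments, by = "dept")  # David's budget comes back NA
employees |> anti_join(departments, by = "dept")  # only the orphan row: David
```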
Explanation: left_join() keeps every employee and attaches their department budget when available; David has no matching department, so his budget comes back as NA. anti_join() answers the same question inverted: give me only the rows from the left table that have no match in the right table. It is the standard dplyr tool for "find the orphans."
Hard (11-15): grouped ranking, share of total, real pipelines
These five stitch three or more concepts into the kind of pipelines you actually write on the job.
Exercise 11: Rank cars by mpg within each cylinder group
For each cylinder group, rank the cars by mpg (rank 1 = most efficient), keep the top 3 per group, and show the car name, cylinder count, mpg, and rank, sorted by cyl then rank.
Click to reveal solution
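One way to write it, matching the explanation below:

```r
mtcars |>
  mutate(car = rownames(mtcars)) |>
  group_by(cyl) |>
  mutate(rank = rank(-mpg)) |>     # ties receive averaged ranks, e.g. 3.5
  filter(rank <= 3) |>             # trim to the top ranks in each group
  ungroup() |>
  select(car, cyl, mpg, rank) |>
  arrange(cyl, rank)
```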
Explanation: rank(-mpg) ranks by negative mpg so the highest mpg gets rank 1. The grouped mutate() keeps the 32 rows but numbers each within its cylinder sub-table; filter(rank <= 3) then trims to the top ranks per group. One subtlety: Honda Civic and Lotus Europa tie at mpg 30.4, so rank() gives both the average rank 3.5, which falls outside rank <= 3 and leaves only two 4-cylinder cars in the output. Swap to min_rank() or dense_rank() if you want ties to share the integer rank 3 and survive the filter.
Exercise 12: Each car's hp as a percentage of its cylinder-group total
For each cylinder group, compute every car's horsepower as a rounded percentage of that group's total horsepower. Show car, cyl, hp, and hp_pct for the top 10 rows sorted by cyl then hp_pct descending.
Click to reveal solution
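A solution along these lines:

```r
mtcars |>
  mutate(car = rownames(mtcars)) |>
  group_by(cyl) |>
  mutate(hp_pct = round(100 * hp / sum(hp), 1)) |>  # sum(hp) is the group total
  ungroup() |>
  select(car, cyl, hp, hp_pct) |>
  arrange(cyl, desc(hp_pct)) |>
  head(10)
```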
Explanation: Inside a grouped mutate(), aggregate functions like sum() operate on the current group, not the whole table. So sum(hp) for 4-cylinder cars returns the total hp across the 11 4-cyl cars, and dividing each car's hp by that total gives its share of the group. This is the canonical "share of total" pattern in dplyr, re-use it for market share, portfolio weight, or any per-group proportion.
Call ungroup() after a grouped mutate if you want later verbs to act on the whole table again. A tibble that is still grouped silently changes the behaviour of mutate(), summarise(), and even slice(). Adding ungroup() at the end of a grouped pipeline is a one-line fix that prevents hours of debugging.
Exercise 13: Two heaviest cars per cylinder group
Use slice_max() to keep the two heaviest cars (by wt) in each cylinder group. Return car, cyl, wt, and mpg.
Click to reveal solution
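One possible solution:

```r
mtcars |>
  mutate(car = rownames(mtcars)) |>
  group_by(cyl) |>
  slice_max(wt, n = 2) |>   # two heaviest per group; keeps ties by default
  ungroup() |>
  select(car, cyl, wt, mpg)
```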
Explanation: slice_max(wt, n = 2) picks the two rows with the largest wt per group. You could write this yourself with arrange(desc(wt)) |> head(2), but slice_max() is clearer, handles ties with with_ties = TRUE by default, and makes your intent obvious to the next reader of the code.
Prefer slice_max(col, n = k) over the older top_n(k, col). slice_max() is the current dplyr idiom and has a matching slice_min() for the opposite end. top_n() still works but is marked "superseded" in the dplyr reference; new code should use slice_max/slice_min/slice_sample/slice_head/slice_tail.
Exercise 14: A five-step real pipeline, manual-transmission fuel economy
Chain this five-step pipeline:
- Keep only manual-transmission cars (am == 1).
- Add a kpl column equal to mpg * 0.425 (an approximate mpg-to-km-per-litre conversion), rounded to 2 decimals.
- Group by cyl.
- Summarise: count of cars and mean kpl per group.
- Sort by mean kpl descending.
Click to reveal solution
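One way to chain the five steps:

```r
mtcars |>
  filter(am == 1) |>                      # step 1: the 13 manual cars
  mutate(kpl = round(mpg * 0.425, 2)) |>  # step 2: rough mpg -> km/l
  group_by(cyl) |>                        # step 3
  summarise(n = n(),                      # step 4: count and mean kpl
            mean_kpl = mean(kpl),
            .groups = "drop") |>
  arrange(desc(mean_kpl))                 # step 5: most efficient group first
```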
Explanation: Five verbs, each doing exactly one job, composed with pipes. The intermediate shape changes three times: 32-row mtcars → 13 manual cars → 13 manual cars with kpl → 3 grouped rows. Writing each verb on its own line makes the pipeline readable and lets you comment out any single step to debug.
Exercise 15: Stratified 30% sample per iris species
Take a random 30% sample from each iris species (so roughly 15 rows per species, 45 total). Use set.seed(42) for reproducibility and count the rows per species in the result to verify the stratification worked.
Click to reveal solution
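A solution consistent with the explanation below:

```r
set.seed(42)                   # reproducible random draw
iris |>
  group_by(Species) |>
  slice_sample(prop = 0.3) |>  # 30% of each group, drawn independently
  ungroup() |>
  count(Species)               # verify: roughly 15 rows per species
```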
Explanation: Inside a grouped pipeline, slice_sample(prop = 0.3) takes 30% of each group independently; that is what "stratified sampling" means. Without group_by(), you would get 45 rows drawn uniformly from the full 150-row table, with no guarantee of species balance. set.seed(42) fixes the random draw so every reader sees identical counts. This is the standard dplyr recipe for building a stratified train/test split for classification.
Summary
The 15 problems together exercise every core dplyr verb and the two most common helper patterns (case_when() and across()).
| Verb / helper | Exercises that use it |
|---|---|
| filter() | 1, 6, 14 |
| select() | 1, 2, 3, 9, 11, 12, 13 |
| mutate() | 2, 7, 9, 11, 12, 14 |
| arrange() | 3, 6, 11, 12, 14 |
| group_by() + summarise() | 4, 8, 14 |
| count() | 5, 7, 15 |
| case_when() | 7 |
| across() + where() | 8 |
| rename_with() | 9 |
| left_join() + anti_join() | 10 |
| rank() / slice_max() / slice_sample() | 11, 13, 15 |
| Grouped share of total | 12 |
If you solved Exercises 1-10 without peeking, you are comfortable with everyday dplyr. If you solved 11-15 as well, you are ready for window functions, complex joins, and real analytical pipelines. Come back to the failed ones tomorrow; spaced practice beats cramming every time.
FAQ
Q: Should I use the native pipe |> or the magrittr pipe %>%? Both work. The native |> is built into base R from version 4.1 and is the current recommendation: no package needed, marginally faster, and simpler syntax. Use %>% when you need the dot placeholder in arbitrary positions (df %>% lm(y ~ x, data = .)) or the assignment pipe %<>%; the native pipe's _ placeholder (R 4.2+) covers the first case but only as a named argument. Every solution above uses |>.
Q: Does dplyr change my data frame in place? No. dplyr verbs always return a new tibble; they never modify the original. If you want to keep a transformed version, assign it to a variable: my_clean <- mtcars |> mutate(...). This immutability is what makes pipelines safe to compose and debug.
Q: Does group_by() stay active after summarise()? summarise() peels off one level of grouping. If you grouped by one variable, the result is ungrouped; if you grouped by two variables, the result is still grouped by the first. To be explicit, and to avoid surprising later verbs, pass .groups = "drop" to summarise() or add ungroup() after it. Every grouped solution above uses one of these.
Q: How does dplyr handle NA inside filter()? filter() drops any row where the condition evaluates to NA. So filter(df, col > 5) silently removes both col <= 5 rows and col == NA rows. If you want to keep NA rows explicitly, write filter(df, col > 5 | is.na(col)). For aggregate functions like mean() and sum(), pass na.rm = TRUE to ignore missing values.
Q: What is the difference between slice_max(col, n = k) and top_n(k, col)? slice_max() is the modern replacement for top_n(). Both keep the k rows with the largest value of col, but slice_max() has clearer argument order (column first, n second), a companion slice_min() for the opposite end, and a with_ties argument. top_n() still works but is marked "superseded"; prefer slice_max() in new code.
References
- Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G., R for Data Science, 2nd Edition. Chapter 3: Data transformation.
- dplyr documentation, filter() reference.
- dplyr documentation, mutate() reference.
- dplyr documentation, summarise() reference.
- dplyr documentation, across() for multi-column operations.
- dplyr documentation, mutating joins (left_join, inner_join, etc.) and filtering joins (anti_join, semi_join).
- dplyr documentation, slice_max(), slice_min(), slice_sample().
- Posit, Data transformation with dplyr cheatsheet.
- R Core Team, An Introduction to R.
Continue Learning
- dplyr filter() and select(): the parent tutorial with every filtering and selection pattern explained in depth
- dplyr group_by and summarise: the full story on grouped reductions, .groups, and multi-column summaries
- dplyr filter & select Exercises: a narrower 12-problem set focused just on filter() and select()
- R Joins: reference for inner_join, left_join, anti_join, and the rest of the join family