dplyr Exercises in R: 50 Real-World Practice Problems
Fifty scenario-based dplyr exercises: light on warm-ups, heavy on intermediate problems where you have to combine 2-3 verbs to land the answer. Solutions are hidden behind reveal toggles so you actually try first.
Section 1. Filter and select (10 problems)
Exercise 1.1: Find the fuel-efficient four-cylinders
Scenario: You're vetting cars for a fuel-economy review. From mtcars, list the names and mpg of every 4-cylinder car with mpg above 25.
Difficulty: Beginner
Click to reveal solution
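One possible solution, sketched (mtcars keeps car names as row names, so lift them into a column first; rownames_to_column is from tibble):

```r
library(dplyr)

mtcars |>
  tibble::rownames_to_column("car") |>   # car names live in the row names
  filter(cyl == 4, mpg > 25) |>          # both conditions must hold
  select(car, mpg)
```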
Explanation: filter() combines the two conditions with a comma, which acts as AND. select() picks just the columns you need.
Exercise 1.2: Filter by a list of names
Scenario: Marketing wants stats on three specific models. From mtcars, return rows where the car name is one of "Mazda RX4", "Honda Civic", or "Toyota Corolla".
Difficulty: Beginner
Click to reveal solution
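A sketch of one way to do it:

```r
library(dplyr)

targets <- c("Mazda RX4", "Honda Civic", "Toyota Corolla")

mtcars |>
  tibble::rownames_to_column("car") |>
  filter(car %in% targets)
```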
Explanation: %in% tests vector membership. Cleaner than three OR conditions for any list-based filter.
Exercise 1.3: Drop columns with negative selection
Scenario: You're preparing a streamlined view. From mtcars, keep all columns EXCEPT vs, am, and carb.
Difficulty: Beginner
Click to reveal solution
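A minimal sketch:

```r
library(dplyr)

mtcars |>
  select(-c(vs, am, carb))   # keep everything except these three
```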
Explanation: Prefixing column names with - excludes them. Cleaner than listing the 8 you want to keep.
Exercise 1.4: Find iris flowers in the upper sepal range
Scenario: A botanist asks for flowers with sepal length above the species median. Return rows from iris where Sepal.Length exceeds the median for that species.
Difficulty: Intermediate
Click to reveal solution
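One possible solution:

```r
library(dplyr)

iris |>
  group_by(Species) |>
  filter(Sepal.Length > median(Sepal.Length)) |>  # median computed per species
  ungroup()
```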
Explanation: group_by makes filter compute the median PER SPECIES, not globally. Without it, you'd pick the global median and get a biased species split.
Exercise 1.5: Filter on multiple conditions with NA awareness
Scenario: Air quality team needs valid summer readings. From airquality, return rows where Month is 6, 7, or 8 AND Ozone is not NA.
Difficulty: Intermediate
Click to reveal solution
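A sketch:

```r
library(dplyr)

airquality |>
  filter(Month %in% c(6, 7, 8), !is.na(Ozone))
```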
Explanation: filter drops NAs in the tested column by default (the test returns NA, treated as FALSE). Adding !is.na(Ozone) is the explicit, robust way and makes intent obvious to reviewers.
Exercise 1.6: Range filter with between()
Scenario: A climate analyst wants comfortable summer days. From airquality, return rows where Temp is between 70 and 85 (inclusive) using between().
Difficulty: Beginner
Click to reveal solution
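A minimal sketch:

```r
library(dplyr)

airquality |>
  filter(between(Temp, 70, 85))   # inclusive on both ends
```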
Explanation: between(x, lo, hi) is shorthand for x >= lo & x <= hi. Cleaner and reads naturally for range checks.
Exercise 1.7: Select columns by name pattern
Scenario: A botanist wants only the sepal measurements. From iris, select all columns whose names start with "Sepal".
Difficulty: Beginner
Click to reveal solution
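One way it might look:

```r
library(dplyr)

iris |>
  select(starts_with("Sepal"))
```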
Explanation: starts_with() is a tidyselect helper. Same family: ends_with(), contains(), matches() (regex). Useful when columns share a naming convention.
Exercise 1.8: Filter rows by a regex match in a string column
Scenario: A Mercedes recall list. From mtcars, return rows whose name starts with "Merc" (Mercedes models).
Difficulty: Intermediate
Click to reveal solution
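A sketch (again lifting the row names into a column):

```r
library(dplyr)

mtcars |>
  tibble::rownames_to_column("car") |>
  filter(grepl("^Merc", car))   # ^ anchors the match to the start
```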
Explanation: grepl returns TRUE/FALSE, perfect for filter. ^Merc anchors to the start of the string. For complex string filters, this is the canonical pattern.
Exercise 1.9: Reorder columns putting key fields first
Scenario: A reporting team wants the cyl column up front, then everything else in original order. Use select with everything().
Difficulty: Intermediate
Click to reveal solution
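A minimal sketch:

```r
library(dplyr)

mtcars |>
  select(cyl, everything())   # cyl first, the rest keep their original order
```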
Explanation: everything() matches all remaining columns. select(cyl, everything()) puts cyl first without losing anything. Common pattern for foregrounding ID or grouping columns.
Exercise 1.10: Filter rows where any numeric column is negative
Scenario: Data-quality audit. Build a tibble with some negative values, then return any row where ANY numeric column is negative using if_any().
Difficulty: Advanced
Click to reveal solution
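One possible solution, with a made-up sample tibble:

```r
library(dplyr)

dat <- tibble(
  id = c("a", "b", "c"),
  x  = c(1, -2, 3),
  y  = c(4, 5, -6)
)

dat |>
  filter(if_any(where(is.numeric), ~ .x < 0))
```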
Explanation: if_any(cols, predicate) returns TRUE if the predicate holds for ANY of the selected columns. Counterpart if_all() requires all. Tidyselect helpers like where() choose columns by predicate.
Section 2. Mutate and transform (10 problems)
Exercise 2.1: Compute fuel efficiency in km per litre
Scenario: You're publishing a non-US version of the mtcars data. Add a kpl column (kilometres per litre) using the conversion 1 mpg = 0.425 kpl.
Difficulty: Beginner
Click to reveal solution
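A sketch:

```r
library(dplyr)

mtcars |>
  mutate(kpl = mpg * 0.425)
```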
Explanation: mutate adds a new column derived from existing ones. Vectorized, no loop needed.
Exercise 2.2: Bin diamonds into price tiers
Scenario: A jeweller wants to bucket inventory. Add a tier column to diamonds with values "budget" (price < 1000), "mid" (1000-5000), or "premium" (>5000).
Difficulty: Intermediate
Click to reveal solution
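One way it might look (diamonds ships with ggplot2):

```r
library(dplyr)
library(ggplot2)   # for the diamonds dataset

diamonds |>
  mutate(tier = case_when(
    price < 1000  ~ "budget",
    price <= 5000 ~ "mid",
    TRUE          ~ "premium"
  ))
```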
Explanation: case_when reads top-to-bottom; the first matching condition wins. The trailing TRUE ~ ... is the catch-all default. Cleaner than nested if_else.
Exercise 2.3: Compute and rank fuel efficiency per cylinder group
Scenario: Your boss wants the relative ranking within each cylinder count. Add mpg_rank_within_cyl showing each car's rank by mpg within its cyl group, with rank 1 being the most efficient.
Difficulty: Intermediate
Click to reveal solution
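A possible solution:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  mutate(mpg_rank_within_cyl = min_rank(desc(mpg))) |>  # rank 1 = highest mpg
  ungroup()
```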
Explanation: group_by + min_rank ranks within each cylinder count. desc reverses so highest mpg gets rank 1. Always ungroup at the end to avoid surprising downstream code.
Exercise 2.4: Rescale all numeric columns to 0-1
Scenario: A machine-learning model needs features in [0, 1]. Min-max-rescale every numeric column of iris, leaving Species untouched.
Difficulty: Intermediate
Click to reveal solution
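A sketch:

```r
library(dplyr)

iris |>
  mutate(across(
    where(is.numeric),
    ~ (.x - min(.x)) / (max(.x) - min(.x))   # min-max rescale each column
  ))
```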
Explanation: across(where(is.numeric), fn) applies fn to every numeric column. The ~ defines a lambda; .x is the column.
Exercise 2.5: Replace negative values with zero AND log-transform
Scenario: A skewed numeric column needs cleaning. Take a vector x = c(-2, 0, 3, 10, NA). Convert it to a tibble and add a column log_x that floors negatives to 0 then takes log1p.
Difficulty: Advanced
Click to reveal solution
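One possible solution:

```r
library(dplyr)

tibble(x = c(-2, 0, 3, 10, NA)) |>
  mutate(log_x = log1p(pmax(x, 0)))   # floor negatives at 0, then log(1 + x)
```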
Explanation: pmax(x, 0) floors negatives at 0 vectorized. log1p handles 0 safely (log(1+0) = 0) where log() would give -Inf. NA propagates correctly.
Exercise 2.6: Build a binary flag column with if_else
Scenario: A reporting view needs a "high efficiency" flag. From mtcars, add is_efficient = TRUE when mpg >= 25, FALSE otherwise.
Difficulty: Beginner
Click to reveal solution
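A minimal sketch:

```r
library(dplyr)

mtcars |>
  mutate(is_efficient = if_else(mpg >= 25, TRUE, FALSE))
```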
Explanation: if_else() is type-stable (unlike base ifelse): both branches must share a compatible type, so the output type is predictable. The shorter mutate(is_efficient = mpg >= 25) works too; use if_else when the branches return non-boolean values.
Exercise 2.7: Compute day-over-day price direction
Scenario: Trading desk wants up/down/flat indicators. From a small price tibble, add a direction column based on whether today's price is above, below, or equal to yesterday's.
Difficulty: Intermediate
Click to reveal solution
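One way it might look, with a made-up price tibble:

```r
library(dplyr)

prices <- tibble(
  day   = as.Date("2024-01-01") + 0:4,
  price = c(100, 102, 102, 99, 101)
)

prices |>
  arrange(day) |>
  mutate(direction = case_when(
    price >  lag(price) ~ "up",
    price <  lag(price) ~ "down",
    price == lag(price) ~ "flat"
  ))
```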
Explanation: lag() returns the previous row; comparing with case_when gives a 3-way label. First row has lag()=NA, so direction is NA there. Always arrange first.
Exercise 2.8: Z-score multiple columns with custom names
Scenario: A stats team wants z-scored versions of selected columns kept ALONGSIDE originals. From iris, add z-score versions of Sepal.Length and Petal.Length with suffix _z.
Difficulty: Advanced
Click to reveal solution
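A sketch:

```r
library(dplyr)

iris |>
  mutate(across(
    c(Sepal.Length, Petal.Length),
    ~ (.x - mean(.x)) / sd(.x),
    .names = "{.col}_z"        # keep originals, add suffixed copies
  ))
```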
Explanation: The .names = "{.col}_z" argument keeps originals AND adds new columns with the suffix. Without it, across overwrites the originals. Crucial when you want to compare or audit.
Exercise 2.9: Row-wise mean of selected columns
Scenario: Each iris flower's "average measurement". Add avg_measurement = mean of the 4 numeric columns per row.
Difficulty: Intermediate
Click to reveal solution
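A possible solution:

```r
library(dplyr)

iris |>
  rowwise() |>
  mutate(avg_measurement = mean(c_across(where(is.numeric)))) |>
  ungroup()
```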
Explanation: rowwise() makes each row a "group of 1". c_across() inside row-wise context collects the values into a vector you can pass to mean. Always ungroup after rowwise().
Exercise 2.10: Impute NAs with the group mean
Scenario: Air-quality cleanup. In airquality, replace NA values in Ozone with the mean Ozone for that month.
Difficulty: Advanced
Click to reveal solution
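One way it might look (Ozone is stored as integer in airquality, so coerce it to match the double mean):

```r
library(dplyr)

airquality |>
  group_by(Month) |>
  mutate(Ozone = if_else(
    is.na(Ozone),
    mean(Ozone, na.rm = TRUE),
    as.numeric(Ozone)   # both branches of if_else must share a type
  )) |>
  ungroup()
```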
Explanation: group_by + mutate computes the mean per month, then if_else swaps it in for NAs only. na.rm = TRUE is critical, otherwise the mean itself would be NA. Cleaner than join-based imputation.
Section 3. Summarise and group_by (11 problems)
Exercise 3.1: Mean MPG per cylinder
Scenario: Standard summary table for a quick brief. Compute the mean mpg per cyl group.
Difficulty: Beginner
Click to reveal solution
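A minimal sketch:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarise(mean_mpg = mean(mpg))
```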
Explanation: group_by + summarise is the canonical aggregation pattern. mean() collapses each group to one number.
Exercise 3.2: Multiple stats per group
Scenario: Build a per-cylinder summary table with row count, mean, SD, and number of unique gear values. Sort by mean descending.
Difficulty: Intermediate
Click to reveal solution
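One possible solution:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarise(
    n        = n(),
    mean_mpg = mean(mpg),
    sd_mpg   = sd(mpg),
    n_gears  = n_distinct(gear)
  ) |>
  arrange(desc(mean_mpg))
```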
Explanation: Multiple stats inside one summarise. n() counts rows in the group; n_distinct counts unique values. arrange sorts the result.
Exercise 3.3: Filter groups by size
Scenario: A statistical test needs at least 10 observations per group. From mtcars, keep only the cyl groups with >= 10 cars and report the mean mpg of each.
Difficulty: Intermediate
Click to reveal solution
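A sketch:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  filter(n() >= 10) |>           # drop whole groups below the threshold
  summarise(mean_mpg = mean(mpg))
```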
Explanation: filter inside group_by uses n() for the group size. Drops whole groups whose size is below the threshold before the summarise.
Exercise 3.4: Most-recent record per group
Scenario: You have multi-row event logs and need each user's last action. Build a sample tibble of users with timestamps, then return the row with the latest timestamp per user.
Difficulty: Intermediate
Click to reveal solution
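One way it might look, with a made-up event log:

```r
library(dplyr)

events <- tibble(
  user   = c("a", "a", "b", "b", "b"),
  time   = as.POSIXct("2024-01-01 00:00:00", tz = "UTC") + c(0, 3600, 0, 60, 7200),
  action = c("login", "logout", "login", "click", "logout")
)

events |>
  slice_max(time, n = 1, by = user)   # latest row per user
```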
Explanation: slice_max with by = picks the row(s) with the maximum timestamp per group in one verb. dplyr 1.1+ syntax; cleaner than group_by + arrange + slice(1).
Exercise 3.5: Compute group-level proportion
Scenario: A merchandising team wants each diamond's clarity expressed as a share of the row's CUT total. Add a clarity_share column that is the proportion of rows of that clarity within each cut.
Difficulty: Intermediate
Click to reveal solution
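A possible solution:

```r
library(dplyr)
library(ggplot2)   # for diamonds

diamonds |>
  count(cut, clarity) |>
  group_by(cut) |>
  mutate(clarity_share = n / sum(n)) |>
  ungroup()
```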
Explanation: count creates a per-(cut, clarity) frequency. group_by(cut) + mutate(n / sum(n)) divides each row by its cut's total. ungroup at the end to avoid leaking grouping.
Exercise 3.6: Per-group quantile table
Scenario: A data scientist asks for the 25th, 50th, 75th percentiles of mpg per cyl group, in long format. Use reframe (dplyr 1.1+).
Difficulty: Advanced
Click to reveal solution
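One way it might look:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  reframe(
    p        = c(0.25, 0.5, 0.75),
    mpg_qtls = quantile(mpg, c(0.25, 0.5, 0.75))
  )
```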
Explanation: summarise enforces 1 row per group; reframe allows multiple rows. Perfect for per-group quantile tables.
Exercise 3.7: Top N rows per group
Scenario: Each cyl group's top 3 most fuel-efficient cars. Use slice_max with n = 3 and by =.
Difficulty: Intermediate
Click to reveal solution
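A minimal sketch:

```r
library(dplyr)

mtcars |>
  slice_max(mpg, n = 3, by = cyl)
```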
Explanation: slice_max with n = 3 and by = returns the top 3 per group in one expression. Ties can push the row count above n; use with_ties = FALSE to cap it at exactly 3.
Exercise 3.8: Pivot summary to wide format
Scenario: A dashboard needs mean mpg and mean hp per cyl as columns, not rows. Try summarise with across, then check whether you actually need pivot_longer/wider.
Difficulty: Advanced
Click to reveal solution
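One possible solution:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarise(across(c(mpg, hp), mean, .names = "{.col}_mean"))
```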
Explanation: across(c(mpg, hp), mean, .names = "{.col}_mean") produces mpg_mean and hp_mean columns directly; no pivot needed. The .names template controls naming. For more complex stats (mean and sd of each), pass a list of functions.
Exercise 3.9: Weighted mean per group
Scenario: Inventory team wants the weight-weighted mean price per diamond cut, where the weight is carat. Compute weighted mean price by cut.
Difficulty: Intermediate
Click to reveal solution
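A sketch:

```r
library(dplyr)
library(ggplot2)   # for diamonds

diamonds |>
  group_by(cut) |>
  summarise(wtd_mean_price = weighted.mean(price, w = carat))
```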
Explanation: weighted.mean(x, w) gives sum(x*w)/sum(w). Compare to plain mean: weighted version pulls toward heavier diamonds (which dominate inventory value).
Exercise 3.10: Count and percentage in one pass
Scenario: A demographics view: count and percent of each species in iris. Use count + mutate.
Difficulty: Intermediate
Click to reveal solution
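A minimal sketch:

```r
library(dplyr)

iris |>
  count(Species) |>
  mutate(pct = 100 * n / sum(n))
```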
Explanation: count() gives the n column; mutate adds pct. sum(n) is the total since we're not grouped after count(). Standard frequency-table pattern.
Exercise 3.11: Detect outliers per group with IQR
Scenario: A QC workflow flags outliers using the 1.5 * IQR rule. Per cyl group, count how many rows have mpg outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
Difficulty: Advanced
Click to reveal solution
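One way it might look:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  summarise(
    q1         = quantile(mpg, 0.25),
    q3         = quantile(mpg, 0.75),
    iqr        = q3 - q1,
    n_outliers = sum(mpg < q1 - 1.5 * iqr | mpg > q3 + 1.5 * iqr)
  )
```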
Explanation: summarise can use derived columns referenced later in the same call. sum() over a logical vector counts TRUEs. Standard non-parametric outlier detection per group.
Section 4. Joins (9 problems)
Exercise 4.1: Augment data with a lookup
Scenario: Engineering wants every mtcars row labelled with its cyl-group description. Build a 3-row lookup tibble (4 -> "small", 6 -> "medium", 8 -> "large") and join it to mtcars.
Difficulty: Beginner
Click to reveal solution
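A possible solution:

```r
library(dplyr)

cyl_lookup <- tibble(
  cyl   = c(4, 6, 8),
  label = c("small", "medium", "large")
)

mtcars |>
  left_join(cyl_lookup, by = "cyl")
```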
Explanation: left_join keeps every row of mtcars and adds the label column. Standard "augment with lookup" pattern.
Exercise 4.2: Find orphan records
Scenario: A data-quality check. Build two tibbles representing customers and orders. Identify orders whose customer_id does NOT appear in the customer master table.
Difficulty: Intermediate
Click to reveal solution
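A sketch with made-up tables (note the differing key names):

```r
library(dplyr)

customers <- tibble(id = c(1, 2, 3))
orders    <- tibble(order_id = 101:104, customer_id = c(1, 2, 4, 5))

orders |>
  anti_join(customers, by = c("customer_id" = "id"))   # orphan orders
```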
Explanation: anti_join returns rows of x with NO match in y. Different key names handled via named vector.
Exercise 4.3: Many-to-many lookup with filter
Scenario: A sales table has multiple rows per product; you want each product's most recent sale joined with that product's metadata. Combine slice_max + left_join.
Difficulty: Intermediate
Click to reveal solution
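One way it might look, with made-up tables:

```r
library(dplyr)

sales <- tibble(
  product = c("A", "A", "B", "B"),
  date    = as.Date(c("2024-01-01", "2024-02-01", "2024-01-15", "2024-03-01")),
  amount  = c(10, 12, 20, 25)
)
products <- tibble(product = c("A", "B"), category = c("tools", "toys"))

sales |>
  slice_max(date, n = 1, by = product) |>   # one row per product first
  left_join(products, by = "product")
```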
Explanation: Reduce to one row per product first, then enrich. Avoids the row-multiplication trap of joining first.
Exercise 4.4: Multi-column key join
Scenario: A regional sales report. Join a sales table with a regional pricing table on BOTH region and product.
Difficulty: Intermediate
Click to reveal solution
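A sketch, assuming sample tables like these (ASIA/X is deliberately missing from the pricing table):

```r
library(dplyr)

sales <- tibble(
  region  = c("EU", "US", "ASIA"),
  product = c("X", "X", "X"),
  units   = c(5, 3, 7)
)
pricing <- tibble(
  region  = c("EU", "US"),
  product = c("X", "X"),
  price   = c(9.99, 10.99)
)

sales |>
  left_join(pricing, by = c("region", "product"))   # ASIA/X gets NA price
```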
Explanation: Vector by does multi-column matching. ASIA/X has no price match; gets NA. Useful for data audits: NAs flag missing rows in the lookup.
Exercise 4.5: Reconcile two snapshots
Scenario: Compliance asks: between two daily snapshots of a customer table, which customers were ADDED, which were REMOVED? Use anti_join twice.
Difficulty: Intermediate
Click to reveal solution
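One possible solution, with made-up snapshots:

```r
library(dplyr)

day1 <- tibble(customer_id = c(1, 2, 3))
day2 <- tibble(customer_id = c(2, 3, 4))

added   <- anti_join(day2, day1, by = "customer_id")  # in day2 only
removed <- anti_join(day1, day2, by = "customer_id")  # in day1 only
```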
Explanation: Two anti_joins with x and y swapped give a complete diff. Standard reconciliation pattern; cleaner than full_join + classify.
Exercise 4.6: Inequality join with join_by()
Scenario: A range lookup. Match each transaction to the price tier whose threshold it falls under. Use join_by with >= (dplyr 1.1+).
Difficulty: Advanced
Click to reveal solution
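A sketch with made-up tiers and transactions:

```r
library(dplyr)

tiers <- tibble(
  min_amount = c(0, 100, 500),
  tier       = c("bronze", "silver", "gold")
)
transactions <- tibble(amount = c(50, 150, 700))

transactions |>
  left_join(tiers, join_by(closest(amount >= min_amount)))
```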
Explanation: join_by with closest() and >= finds the GREATEST min_amount that is still <= amount, i.e., the tier the value falls into. Critical for range-based lookups: rolling joins, time-window matches, price brackets.
Exercise 4.7: Detect many-to-many duplication
Scenario: Two tables joined on a key that has duplicates on both sides will silently multiply rows. Reproduce and detect this with relationship = "many-to-many".
Difficulty: Intermediate
Click to reveal solution
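One way to reproduce it, with made-up tables shaped to match the row counts below:

```r
library(dplyr)

x <- tibble(id = c(1, 1, 2), x_val = c("x1", "x2", "x3"))
y <- tibble(id = c(1, 1, 2), y_val = c("y1", "y2", "y3"))

inner_join(x, y, by = "id", relationship = "many-to-many")
# 5 rows: id = 1 matches 2 x 2 = 4 ways, id = 2 matches 1 way
```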
Explanation: dplyr 1.1+ warns by default if the join is many-to-many (likely accidental). Setting relationship = "many-to-many" says you mean it. Result has 5 rows (id=1 yields 2x2=4, id=2 yields 1). Always check row counts after a join.
Exercise 4.8: Merge two snapshots with coalesce
Scenario: Two dataframes with overlapping IDs but each has some columns the other lacks. Build a merged view that keeps all rows from both, with values from curr preferred over prev.
Difficulty: Intermediate
Click to reveal solution
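A sketch with made-up snapshots (prev and curr are assumed names):

```r
library(dplyr)

prev <- tibble(id = c(1, 2, 3), score = c(10, 20, 30))
curr <- tibble(id = c(2, 3, 4), score = c(25, NA, 40))

full_join(curr, prev, by = "id", suffix = c("_curr", "_prev")) |>
  mutate(score = coalesce(score_curr, score_prev)) |>  # prefer curr, fall back to prev
  select(id, score)
```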
Explanation: full_join keeps all ids. coalesce(a, b) returns first non-NA. By passing score_curr first, we prefer current; if it's NA, fall back to prev. Standard "upsert" pattern.
Exercise 4.9: Filter via semi_join
Scenario: Keep only orders whose customer is in a VIP list, but DO NOT add VIP columns. Use semi_join.
Difficulty: Intermediate
Click to reveal solution
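A minimal sketch with made-up tables:

```r
library(dplyr)

orders <- tibble(order_id = 1:4, customer_id = c(1, 2, 3, 1))
vips   <- tibble(customer_id = c(1, 3))

orders |>
  semi_join(vips, by = "customer_id")   # filter only; no VIP columns added
```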
Explanation: semi_join keeps only x rows that have a match in y, but adds NO columns from y. Great as a filter when you only need the membership check. Counterpart to anti_join.
Section 5. Window functions (7 problems)
Exercise 5.1: Day-over-day percentage change
Scenario: Trading desk wants daily return. Build a tibble of dates and prices in chronological order, then compute the percent change vs. the previous day.
Difficulty: Intermediate
Click to reveal solution
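One possible solution, with made-up prices:

```r
library(dplyr)

prices <- tibble(
  date  = as.Date("2024-01-01") + 0:3,
  price = c(100, 104, 101, 103)
)

prices |>
  arrange(date) |>
  mutate(pct_change = 100 * (price - lag(price)) / lag(price))
```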
Explanation: lag returns the previous row's value; the first row's pct_change is NA because there's no day before. arrange first to be safe.
Exercise 5.2: Running total per group
Scenario: Each user's lifetime spend at any point. Build a small events tibble and add a column cumulative_spend that is the running total per user, ordered by date.
Difficulty: Intermediate
Click to reveal solution
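A sketch with a made-up events tibble:

```r
library(dplyr)

events <- tibble(
  user  = c("a", "a", "b", "b"),
  date  = as.Date(c("2024-01-01", "2024-01-05", "2024-01-02", "2024-01-03")),
  spend = c(10, 15, 20, 5)
)

events |>
  arrange(user, date) |>
  group_by(user) |>
  mutate(cumulative_spend = cumsum(spend)) |>
  ungroup()
```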
Explanation: cumsum within group gives running totals. arrange first by (user, date) ensures the cumulative order is chronological per user.
Exercise 5.3: Days since first event per user
Scenario: Cohort analysis. Add a column days_since_first showing how many days since the user's first event for each row.
Difficulty: Intermediate
Click to reveal solution
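One way it might look, with made-up data:

```r
library(dplyr)

events <- tibble(
  user = c("a", "a", "b"),
  date = as.Date(c("2024-01-01", "2024-01-08", "2024-02-01"))
)

events |>
  arrange(user, date) |>
  group_by(user) |>
  mutate(days_since_first = as.integer(date - first(date))) |>
  ungroup()
```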
Explanation: first() inside a group returns the first value; subtracting gives the gap. as.integer drops the difftime class.
Exercise 5.4: Detect first big purchase per user
Scenario: Marketing wants each user's first purchase >= $100. From a small events tibble, return rows where amount >= 100 AND it is that user's first such row.
Difficulty: Advanced
Click to reveal solution
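A possible solution, with made-up events:

```r
library(dplyr)

events <- tibble(
  user   = c("a", "a", "a", "b"),
  date   = as.Date(c("2024-01-01", "2024-01-05", "2024-01-09", "2024-01-02")),
  amount = c(50, 120, 200, 80)
)

events |>
  filter(amount >= 100) |>
  slice_min(date, n = 1, by = user)   # earliest qualifying row per user
```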
Explanation: filter narrows to qualifying rows; slice_min picks the earliest per user. Users without any qualifying row are absent from the output, as desired.
Exercise 5.5: 3-day rolling mean
Scenario: Smooth a noisy daily price series with a 3-day moving average. Use a manual window via lag.
Difficulty: Advanced
Click to reveal solution
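A sketch with a made-up series:

```r
library(dplyr)

prices <- tibble(
  date  = as.Date("2024-01-01") + 0:5,
  price = c(100, 102, 101, 105, 107, 106)
)

prices |>
  arrange(date) |>
  mutate(ma3 = (price + lag(price) + lag(price, 2)) / 3)
```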
Explanation: Sum the current value with two lags then divide. First two rows are NA (insufficient history). For longer windows, switch to slider::slide_dbl or zoo::rollmean, but lag-based works for small N and stays inside dplyr.
Exercise 5.6: Detect gaps in date sequence per group
Scenario: A daily-events table should have one row per date per user. Identify any user/date combos where the gap to the next date exceeds 7 days.
Difficulty: Advanced
Click to reveal solution
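One possible solution, with made-up data:

```r
library(dplyr)

events <- tibble(
  user = c("a", "a", "a", "b", "b"),
  date = as.Date(c("2024-01-01", "2024-01-03", "2024-01-20",
                   "2024-01-01", "2024-01-02"))
)

events |>
  arrange(user, date) |>
  group_by(user) |>
  mutate(gap = as.integer(lead(date) - date)) |>   # days to next event
  ungroup() |>
  filter(gap > 7)
```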
Explanation: lead() reaches FORWARD (next row's value). Subtract to get the gap. Filter on gap > 7 yields rows whose next event is more than a week away. Useful for retention or telemetry monitoring.
Exercise 5.7: Rank with ties
Scenario: A leaderboard with three students tied at the top. Compare min_rank (1, 1, 1, 4) vs dense_rank (1, 1, 1, 2) vs row_number (1, 2, 3, 4).
Difficulty: Intermediate
Click to reveal solution
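A sketch with a made-up leaderboard:

```r
library(dplyr)

scores <- tibble(
  student = c("A", "B", "C", "D"),
  score   = c(95, 95, 95, 80)
)

scores |>
  mutate(
    min_r   = min_rank(desc(score)),    # 1, 1, 1, 4
    dense_r = dense_rank(desc(score)),  # 1, 1, 1, 2
    row_n   = row_number(desc(score))   # 1, 2, 3, 4
  )
```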
Explanation: Ties: min_rank skips positions ("Olympic" ranking), dense_rank doesn't, row_number breaks ties arbitrarily. Pick based on how a tie should be reported in your output.
Section 6. Multi-step wrap-up (3 problems)
Exercise 6.1: Build a customer scorecard
Scenario: You're building an executive scorecard. From the events tibble below, produce one row per user containing: number of events, total spend, average spend, days from first to last event, and a "tier" label ("VIP" if total spend >= $200, "Regular" otherwise). Order by total spend descending.
Difficulty: Advanced
Click to reveal solution
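One way it might look; the events tibble here is a made-up stand-in for the one the exercise references:

```r
library(dplyr)

events <- tibble(
  user  = c("a", "a", "b", "b", "b", "c"),
  date  = as.Date(c("2024-01-01", "2024-02-01", "2024-01-05",
                    "2024-01-20", "2024-03-01", "2024-02-10")),
  spend = c(120, 150, 40, 60, 30, 90)
)

events |>
  group_by(user) |>
  summarise(
    n_events  = n(),
    total     = sum(spend),
    avg       = mean(spend),
    span_days = as.integer(max(date) - min(date))
  ) |>
  mutate(tier = if_else(total >= 200, "VIP", "Regular")) |>
  arrange(desc(total))
```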
Explanation: group_by + summarise computes 4 stats at once. Then mutate adds a derived label. arrange sorts the scorecard. This pattern is the bread-and-butter of analytics work.
Exercise 6.2: Cohort retention table
Scenario: Marketing wants a retention view: users who first transacted in a given month, and how many of them came back in the following month. Compute a 2-column table: cohort (first-transaction month) and retained_next_month (count).
Difficulty: Advanced
Click to reveal solution
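One possible sketch, with made-up transactions and lubridate for the month arithmetic:

```r
library(dplyr)
library(lubridate)

events <- tibble(
  user = c("a", "a", "b", "c", "c"),
  date = as.Date(c("2024-01-03", "2024-02-10", "2024-01-15",
                   "2024-02-05", "2024-03-01"))
)

# One row per user per active month
monthly <- events |>
  mutate(month = floor_date(date, "month")) |>
  distinct(user, month)

# Each user's first-transaction month
cohorts <- monthly |>
  group_by(user) |>
  summarise(cohort = min(month))

# Users active in the month right after their cohort month
retention <- cohorts |>
  left_join(monthly, by = "user") |>
  filter(month == cohort %m+% months(1)) |>
  count(cohort, name = "retained_next_month")

cohorts |>
  count(cohort, name = "cohort_size") |>
  left_join(retention, by = "cohort") |>
  mutate(retained_next_month = coalesce(retained_next_month, 0L)) |>
  select(cohort, retained_next_month)
```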
Explanation: Step 1: each user's first month (cohort). Step 2: who had an event in cohort+1. Step 3: count cohorts and retentions and join. coalesce(., 0) fills in 0 for cohorts where no one was retained. Foundational analytics pattern.
Exercise 6.3: RFM segmentation
Scenario: A classic marketing segmentation. Compute R (days since last purchase), F (purchase count), M (total spend) per user, then assign each metric to a 1-3 quantile bucket (3 = best). Concatenate into an RFM score string.
Difficulty: Advanced
Click to reveal solution
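A sketch with made-up purchases; the reference date ("today") is an assumption:

```r
library(dplyr)

purchases <- tibble(
  user   = c("a", "a", "b", "b", "b", "c", "d", "d", "e"),
  date   = as.Date("2024-03-01") + c(-5, -1, -40, -30, -20, -90, -10, -3, -60),
  amount = c(50, 70, 20, 30, 10, 200, 90, 110, 15)
)
today <- as.Date("2024-03-01")

purchases |>
  group_by(user) |>
  summarise(
    recency   = as.integer(today - max(date)),   # R: days since last purchase
    frequency = n(),                             # F: purchase count
    monetary  = sum(amount)                      # M: total spend
  ) |>
  mutate(
    r_score = ntile(-recency, 3),   # lower recency is better, so negate
    f_score = ntile(frequency, 3),
    m_score = ntile(monetary, 3),
    rfm     = paste0(r_score, f_score, m_score)
  )
```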
Explanation: ntile splits a vector into n buckets by quantile. Negate R because LOWER recency is better. paste0 concatenates the three scores into a 3-char segment label. RFM "333" is the top tier; "111" is at-risk. Real campaigns use 5 buckets, but the pattern scales.
What to do next
After 50 exercises, you should be writing solutions before reading explanations. Natural follow-ups:
- Function deep-dive posts on this site cover each verb's full API: filter, mutate, summarise, group_by, joins, slice_*, across, case_when, ntile, lead/lag, and the rest.
- Topic sub-hubs (coming): joins-only, window-functions-only, and grouped-operations-only practice sets for targeted drilling.
- Combine with tidyr: pivot and nest hubs cover the reshape-and-iterate patterns dplyr alone cannot.