Tidyverse Exercises in R: 50 Real-World Practice Problems
Fifty cross-package practice problems combining dplyr, tidyr, stringr, lubridate, and purrr on real-world workflows. These sit in the intermediate sweet spot where you have to pick the right verb from the right package and chain them together. Solutions are hidden until you reveal them.
Section 1. Reshape and wrangle (8 problems)
Exercise 1.1: Wide to long
Scenario: You receive quarterly sales as a wide table with columns Q1-Q4. Pivot to long format with columns region, quarter, sales.
Difficulty: Beginner
Click to reveal solution
Explanation: pivot_longer collapses wide columns into 2: a names column and a values column. cols accepts tidyselect (Q1:Q4, starts_with("Q"), where(is.numeric), etc.).
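A sketch of one possible solution, with sample data made up for illustration:

```r
library(tidyr)

# Made-up quarterly sales in wide format
sales_wide <- tibble::tibble(
  region = c("North", "South"),
  Q1 = c(100, 80), Q2 = c(110, 85),
  Q3 = c(120, 90), Q4 = c(130, 95)
)

# Collapse Q1:Q4 into (quarter, sales) pairs
sales_long <- sales_wide |>
  pivot_longer(cols = Q1:Q4, names_to = "quarter", values_to = "sales")
# 8 rows: one per region x quarter
```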
Exercise 1.2: Long to wide
Scenario: Same data, opposite direction. Take a long table of (region, quarter, sales) and pivot to one column per quarter.
Difficulty: Beginner
Click to reveal solution
Explanation: pivot_wider goes the other way: each unique value of names_from becomes a new column, filled with the corresponding values_from values.
Exercise 1.3: Separate a combined column
Scenario: A column named full_name contains "Last, First" entries. Split into last and first.
Difficulty: Intermediate
Click to reveal solution
Explanation: separate_wider_delim (tidyr 1.3+) splits by a fixed delimiter into the named columns. Older separate() still works but is superseded.
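A sketch of one possible solution, with made-up names:

```r
library(tidyr)

people <- tibble::tibble(full_name = c("Doe, Jane", "Smith, John"))

# Split "Last, First" on the ", " delimiter into two named columns
people |>
  separate_wider_delim(full_name, delim = ", ", names = c("last", "first"))
```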
Exercise 1.4: Fill down missing values
Scenario: A pivoted table has merged cells; only the first row of each group has the value, others are NA. Carry the value down through the group.
Difficulty: Intermediate
Click to reveal solution
Explanation: tidyr::fill carries the last non-NA value forward. .direction = "down" (default), "up", "downup", "updown". Common when reading exported reports.
Exercise 1.5: Complete missing combinations
Scenario: A table of (region, quarter, sales) is missing some region/quarter combos. Add the missing combos with sales = 0.
Difficulty: Intermediate
Click to reveal solution
Explanation: complete() generates all combinations of the listed columns and fills missing rows. fill = list(...) sets defaults for added rows. Crucial for correct group-level summaries.
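One possible solution, with an invented table that is missing the South/Q2 combination:

```r
library(tidyr)

sales <- tibble::tibble(
  region  = c("North", "North", "South"),
  quarter = c("Q1", "Q2", "Q1"),
  sales   = c(100, 110, 80)
)

# Generate all region x quarter combos; fill new rows with sales = 0
sales |>
  complete(region, quarter, fill = list(sales = 0))
```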
Exercise 1.6: Drop rows with NA in specific columns only
Scenario: Drop rows where Ozone is NA, but tolerate NAs in other columns.
Difficulty: Intermediate
Click to reveal solution
Explanation: drop_na with no args drops rows with any NA; with column names, only those columns are checked. Cleaner than filter(!is.na(Ozone)) for multi-column drops.
Exercise 1.7: Nest a column for grouped operations
Scenario: Group iris by Species, then nest the remaining columns into a list-column.
Difficulty: Intermediate
Click to reveal solution
Explanation: nest() bundles each group's data into a tibble inside a list-column. Each row of the result is one group. Foundation for the "many models" pattern (one model per group).
Exercise 1.8: Unnest a list-column
Scenario: A list-column has variable-length numeric vectors. Expand to long format.
Difficulty: Advanced
Click to reveal solution
Explanation: unnest_longer expands each list element into its own row. Counterpart unnest_wider expands a list of named elements into columns. unnest() does both in older tidyr.
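A minimal sketch with made-up variable-length vectors:

```r
library(tidyr)

df <- tibble::tibble(
  id   = c("a", "b"),
  vals = list(c(1, 2, 3), c(4, 5))  # variable-length list-column
)

# Each element of each vector becomes its own row
df |> unnest_longer(vals)
# 5 rows total: 3 for "a", 2 for "b"
```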
Section 2. Strings and dates (8 problems)
Exercise 2.1: Detect a substring
Scenario: From a vector of email addresses, return only those containing "gmail.com".
Difficulty: Beginner
Click to reveal solution
Explanation: str_detect returns TRUE/FALSE per element. Subset with [. For tibbles, use filter(str_detect(col, "...")).
Exercise 2.2: Extract a phone area code
Scenario: From phone numbers like "(415) 555-1234", extract just the 3-digit area code.
Difficulty: Intermediate
Click to reveal solution
Explanation: str_extract returns the FIRST regex match. \\d{3} matches 3 digits. Since the leading "(" isn't a digit, the first three-digit run matched is the area code.
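One way to write it, with sample numbers:

```r
library(stringr)

phones <- c("(415) 555-1234", "(212) 555-9876")

# First run of exactly 3 digits in each string is the area code
str_extract(phones, "\\d{3}")
# "415" "212"
```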
Exercise 2.3: Pad a numeric ID
Scenario: Format integer IDs as 6-character zero-padded strings: 42 -> "000042".
Difficulty: Beginner
Click to reveal solution
Explanation: str_pad pads to a target width. side specifies left, right, or both. sprintf("%06d", ids) is the base R alternative.
Exercise 2.4: Replace by regex
Scenario: Phone numbers come with various separators: "(415) 555-1234", "415.555.1234", "415 555 1234". Normalize all to just digits.
Difficulty: Intermediate
Click to reveal solution
Explanation: str_replace_all with \\D (any non-digit) replaces every non-digit with the empty string. Robust to any separator pattern.
Exercise 2.5: Parse dates from various formats
Scenario: Mixed input formats: "2024-01-15", "01/15/2024", "Jan 15, 2024". Parse all to Date.
Difficulty: Intermediate
Click to reveal solution
Explanation: lubridate::parse_date_time tries each orders pattern in turn and uses the first that fits. The orders strings (e.g. "ymd", "mdy") give the order of year/month/day components; separators are flexible, and m matches numeric months as well as month names.
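A sketch of one possible solution; note that the single "mdy" order covers both "01/15/2024" and "Jan 15, 2024":

```r
library(lubridate)

raw <- c("2024-01-15", "01/15/2024", "Jan 15, 2024")

# Try each order in turn; convert the POSIXct result to Date
as.Date(parse_date_time(raw, orders = c("ymd", "mdy")))
# all three parse to 2024-01-15
```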
Exercise 2.6: Extract month from a Date column
Scenario: From a vector of dates, get the month name as an English string.
Difficulty: Beginner
Click to reveal solution
Explanation: lubridate::month with label = TRUE returns an ordered factor of month names. abbr = FALSE for full names. Counterparts: year(), day(), wday().
Exercise 2.7: Compute age from birth date
Scenario: Given birth dates and a reference date, compute age in years.
Difficulty: Intermediate
Click to reveal solution
Explanation: lubridate::interval creates a date range; dividing by years(1) gives fractional years. as.integer truncates to whole years (matches typical "age" semantics).
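One possible solution, with made-up birth dates and a fixed reference date:

```r
library(lubridate)

births <- as.Date(c("1990-06-15", "2000-12-31"))
ref    <- as.Date("2024-06-14")

# interval / years(1) gives fractional years; as.integer truncates
age <- as.integer(interval(births, ref) / years(1))
# 33 (one day short of the birthday) and 23
```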
Exercise 2.8: Round dates to month start
Scenario: A daily-events table needs to be aggregated monthly. Add a month_start column with each date snapped to the 1st of its month.
Difficulty: Advanced
Click to reveal solution
Explanation: floor_date snaps a date to the start of the unit (week, month, quarter, year). ceiling_date goes the other way; round_date snaps to the nearest. Standard for time-series resampling.
Section 3. Iteration with purrr (8 problems)
Exercise 3.1: Apply a function to each element
Scenario: Compute the square of each element in 1:10 using map_dbl.
Difficulty: Beginner
Click to reveal solution
Explanation: map_dbl returns a numeric vector. The ~ .x^2 is purrr's lambda syntax for function(.x) .x^2. Type-stable variants: map_chr, map_int, map_lgl, map_dfr.
Exercise 3.2: Read multiple CSVs
Scenario: You have a vector of CSV file paths. Read all into a single combined data frame, with a source column tagging each row's file.
Difficulty: Intermediate
Click to reveal solution
Explanation: map_dfr binds the per-file data frames together rowwise. Pass .id = "source" if you want a column tracking which input each row came from. For real files: map_dfr(files, readr::read_csv).
Exercise 3.3: Two-input map
Scenario: Given two vectors of equal length, compute element-wise: x^y.
Difficulty: Intermediate
Click to reveal solution
Explanation: map2 walks two vectors in parallel. .x and .y refer to the two inputs. For 3+ vectors, use pmap with a list of vectors.
Exercise 3.4: Filter a list with keep
Scenario: From a list of numeric vectors, keep only those whose mean exceeds 5.
Difficulty: Intermediate
Click to reveal solution
Explanation: keep retains list elements satisfying the predicate; discard is the inverse. Like Filter from base R but more readable.
Exercise 3.5: Find first element matching a condition
Scenario: Find the first list element whose length is greater than 2.
Difficulty: Intermediate
Click to reveal solution
Explanation: detect returns the first match (or NULL); detect_index returns its position; some/every test if any/all match. Like Find in base R.
Exercise 3.6: Reduce: cumulative join
Scenario: A list of three data frames, each with an id column. Inner-join them all into one. Use reduce.
Difficulty: Advanced
Click to reveal solution
Explanation: reduce applies a binary function repeatedly: f(f(f(a, b), c), d). For joining many data frames or accumulating any cumulative operation, this is the idiomatic pattern.
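A sketch with three invented tibbles sharing an id column:

```r
library(purrr)
library(dplyr)

dfs <- list(
  tibble(id = 1:3,     a = c("x", "y", "z")),
  tibble(id = 2:4,     b = c(10, 20, 30)),
  tibble(id = c(2, 3), c = c(TRUE, FALSE))
)

# inner_join(inner_join(dfs[[1]], dfs[[2]]), dfs[[3]])
reduce(dfs, inner_join, by = "id")
# keeps only ids present in all three: 2 and 3
```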
Exercise 3.7: Safely wrapper
Scenario: Apply log() to a list of values, where some are negative. Use safely to capture errors instead of throwing.
Difficulty: Advanced
Click to reveal solution
Explanation: safely wraps a function so it never throws; map then returns a list of list(result, error) pairs. Note that log() of a negative number returns NaN with a warning rather than an error, so to see safely capture a real error, wrap a function that actually calls stop(). Use possibly() when you just want a default value on failure.
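A sketch using a hypothetical wrapper that throws on negative input, so safely has a real error to capture:

```r
library(purrr)

# Made-up function: errors on negatives instead of returning NaN
risky_log <- function(x) if (x < 0) stop("negative input") else log(x)
safe_log  <- safely(risky_log)

out <- map(c(1, -1, 10), safe_log)
# Each element is list(result = ..., error = ...); exactly one is non-NULL
errs <- map_lgl(out, ~ !is.null(.x$error))
# FALSE TRUE FALSE
```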
Exercise 3.8: pmap with a tibble
Scenario: A tibble has columns x, y, z. Compute x*y + z per row using pmap.
Difficulty: Advanced
Click to reveal solution
Explanation: pmap takes a list of equal-length inputs; ..1, ..2, ..3 refer to them positionally (or name the function's arguments to match the columns). For 2 inputs use map2; pmap scales to N. For row-wise work in a tibble, rowwise() + mutate() is an alternative, and a plain mutate(result = x*y + z) suffices here since the expression is already vectorized.
Section 4. Group-and-iterate (8 problems)
Exercise 4.1: Run a model per group
Scenario: Fit a linear model of mpg ~ wt separately for each cyl group on mtcars. Return a tibble with cyl and model.
Difficulty: Intermediate
Click to reveal solution
Explanation: Nest each group's data into a list-column, then map a model function over it. Each row holds (cyl, data, model). Pattern of "many models" workflow.
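One way to express the pattern on mtcars:

```r
library(dplyr)
library(tidyr)
library(purrr)

# One row per cyl: a nested data tibble plus its fitted lm
models <- mtcars |>
  nest(data = -cyl) |>
  mutate(model = map(data, ~ lm(mpg ~ wt, data = .x)))
# 3 rows (cyl = 4, 6, 8), each with an lm object in `model`
```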
Exercise 4.2: Extract coefficients from per-group models
Scenario: Continuing from 4.1, extract the slope of wt for each cyl group as a tidy tibble.
Difficulty: Advanced
Click to reveal solution
Explanation: broom::tidy turns a model into a tidy tibble of coefficients. Mapping it over the model column then unnesting flattens. Filter to the term of interest.
Exercise 4.3: Per-group summary as a list-column
Scenario: For each Species in iris, compute summary() of Sepal.Length and store the result as a list-column.
Difficulty: Intermediate
Click to reveal solution
Explanation: Wrapping in list() is the trick to put a non-scalar into a summarise cell. Result is a tibble where each summary_stats entry is the named numeric vector returned by summary().
Exercise 4.4: Compute multiple stats with across
Scenario: Compute mean and sd for every numeric column of iris, per Species.
Difficulty: Intermediate
Click to reveal solution
Explanation: across() with a named list of functions runs each function on each column. .names template controls the resulting column names. {.col} is the column name; {.fn} is the function name.
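A sketch of one possible solution:

```r
library(dplyr)

# mean and sd of every numeric column, per Species
iris |>
  group_by(Species) |>
  summarise(across(where(is.numeric),
                   list(mean = mean, sd = sd),
                   .names = "{.col}_{.fn}"))
# columns like Sepal.Length_mean, Sepal.Length_sd, ...
```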
Exercise 4.5: First and last row per group
Scenario: Per stock, return both the first and last price entries (chronologically) in one tibble.
Difficulty: Advanced
Click to reveal solution
Explanation: slice with c(1, n()) picks position 1 and the last position per group. Tag rows with mutate(role = c("first","last")) if you need to distinguish them.
Exercise 4.6: Apply a custom function per group
Scenario: For each Species, compute the correlation between Sepal.Length and Petal.Length.
Difficulty: Intermediate
Click to reveal solution
Explanation: summarise can take any function returning a scalar. For multi-value returns (like cor.test), use reframe or wrap in list().
Exercise 4.7: Group-wise resampling
Scenario: From iris, draw 5 random rows per Species.
Difficulty: Intermediate
Click to reveal solution
Explanation: slice_sample with by = Species samples per group. Use prop = 0.1 for proportional sampling. set.seed for reproducibility.
Exercise 4.8: Efficient group iteration with group_modify
Scenario: Per Species, fit a quick polynomial regression and return the residuals.
Difficulty: Advanced
Click to reveal solution
Explanation: group_modify applies a function to each group's data frame and returns a combined tibble. .x is the group's data frame. Cleaner than nest+map+unnest for this pattern.
Section 5. End-to-end pipelines (10 problems)
Exercise 5.1: Filter, group, summarise
Scenario: From diamonds: keep only ideal-cut, group by clarity, compute mean price and count, sort by mean price descending.
Difficulty: Intermediate
Click to reveal solution
Explanation: Standard 4-step pipeline: filter -> group -> summarise -> arrange. Read top to bottom; pipe chains the result of each into the next.
Exercise 5.2: Build a customer activity summary
Scenario: From a transactions table, build per-customer: first purchase, last purchase, total spend, days active.
Difficulty: Intermediate
Click to reveal solution
Explanation: Multiple stats in one summarise. .groups = "drop" prevents the result from staying grouped (the next operation might surprise you otherwise).
Exercise 5.3: Clean text and aggregate
Scenario: A column has product names with inconsistent casing and trailing spaces. Standardize to lowercase trimmed, then count per name.
Difficulty: Intermediate
Click to reveal solution
Explanation: Always normalize before aggregating; otherwise duplicates inflate the cardinality. count(..., sort = TRUE) is shorthand for count + arrange(desc(n)).
Exercise 5.4: Extract date parts and aggregate monthly
Scenario: From economics dataset, compute mean unemployment per year.
Difficulty: Intermediate
Click to reveal solution
Explanation: lubridate::year extracts the year integer; group + summarise then aggregates. For monthly: floor_date(date, "month").
Exercise 5.5: Pivot then summarise
Scenario: From quarterly sales (region x Q1-Q4), compute total annual sales per region.
Difficulty: Intermediate
Click to reveal solution
Explanation: Reshape to long, then aggregate. Could also compute as mutate(annual = Q1+Q2+Q3+Q4) directly; the long format is more flexible if quarters change.
Exercise 5.6: Filter + mutate + arrange + slice
Scenario: From mtcars, find the 5 lightest cars among those with mpg > 20. Return name and weight.
Difficulty: Intermediate
Click to reveal solution
Explanation: Five-step pipeline. slice_head is the dplyr verb for picking the top N rows; slice_tail for bottom N.
Exercise 5.7: Join then aggregate
Scenario: Two tables: customers and orders. Compute total spend per customer, including customers with 0 orders.
Difficulty: Intermediate
Click to reveal solution
Explanation: left_join keeps all customers; sum with na.rm gives 0 for those without orders. Including the name column in group_by prevents losing it after summarise.
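A sketch with invented customers and orders tables, where one customer has no orders:

```r
library(dplyr)

customers <- tibble(id = 1:3, name = c("Ana", "Ben", "Cy"))
orders    <- tibble(customer_id = c(1, 1, 3), amount = c(10, 15, 99))

# left_join keeps Ben; his amount is NA, and na.rm makes his sum 0
customers |>
  left_join(orders, by = c("id" = "customer_id")) |>
  group_by(id, name) |>
  summarise(total = sum(amount, na.rm = TRUE), .groups = "drop")
```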
Exercise 5.8: Group-wise rate calculation
Scenario: A poll table has total respondents and yes responses per region. Compute the yes rate, sorted descending.
Difficulty: Intermediate
Click to reveal solution
Explanation: For grouped summaries that ALREADY have the components, just mutate + arrange. No group_by needed.
Exercise 5.9: Detect and fix data quality issues
Scenario: A messy column has email addresses with various trailing whitespace, mixed case, and some entries that aren't valid emails. Clean to lowercase trimmed, then keep only entries containing "@".
Difficulty: Advanced
Click to reveal solution
Explanation: Two-step cleanup then filter. str_trim removes leading/trailing whitespace; str_to_lower normalizes case. str_detect with "@" is a minimum sanity check.
Exercise 5.10: Reproducible analysis chain
Scenario: Build a per-cylinder summary of mtcars: count, mean mpg, top car name. Save as a tibble suitable for a report.
Difficulty: Advanced
Click to reveal solution
Explanation: Combining aggregations (n, mean) with row-extraction (which.max gives the position of max within the group; index back into car). Round for presentation. .groups = "drop" releases grouping.
Section 6. Advanced multi-package (8 problems)
Exercise 6.1: Many models per group
Scenario: Fit a linear regression of mpg ~ wt per cyl group, extract R-squared, slope, p-value into a tidy table.
Difficulty: Advanced
Click to reveal solution
Explanation: nest -> map model -> map broom::glance gets one-row-per-model summaries. unnest flattens. Foundation of comparative modeling across groups.
Exercise 6.2: Time-window aggregation
Scenario: From daily events, compute the rolling 7-day count per user.
Difficulty: Advanced
Click to reveal solution
Explanation: slider::slide_dbl walks a vector with a window; .before = 6 plus the current element gives a 7-element window. .complete = TRUE returns NA for windows that aren't fully populated. For true calendar windows (7 days rather than 7 rows), use slide_index_dbl() with the date column as the index.
Exercise 6.3: Detect changes between snapshots
Scenario: Two daily snapshots of a customer-status table. Find customers whose status changed.
Difficulty: Advanced
Click to reveal solution
Explanation: inner_join with suffixes creates side-by-side columns; filter keeps changed rows. Standard "diff snapshots" pattern for daily reconciliations.
Exercise 6.4: Extract structured data from text
Scenario: A column has free text like "Order #1234 placed by user_42 for $99.99". Extract order_id, user_id, and amount into separate columns.
Difficulty: Advanced
Click to reveal solution
Explanation: Lookbehind (?<=...) matches a position preceded by but not consuming the pattern. str_extract grabs the first match. as.integer/as.numeric coerce. Useful for log parsing.
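One possible solution, using lookbehinds to anchor each extraction:

```r
library(dplyr)
library(stringr)

logs <- tibble(text = c("Order #1234 placed by user_42 for $99.99"))

logs |>
  mutate(
    order_id = as.integer(str_extract(text, "(?<=#)\\d+")),
    user_id  = as.integer(str_extract(text, "(?<=user_)\\d+")),
    amount   = as.numeric(str_extract(text, "(?<=\\$)\\d+\\.?\\d*"))
  )
# order_id = 1234, user_id = 42, amount = 99.99
```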
Exercise 6.5: Compute per-group quantiles in long format
Scenario: Per cyl, compute the 25th/50th/75th percentile of mpg in long format (one row per quantile per group).
Difficulty: Advanced
Click to reveal solution
Explanation: reframe (dplyr 1.1+) allows multiple rows per group. summarise enforces 1 row per group. Use for quantile tables, multi-statistic outputs.
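A sketch on mtcars (cyl groups, mpg quantiles), using the per-operation .by grouping:

```r
library(dplyr)

# reframe may return multiple rows per group; summarise may not
mtcars |>
  reframe(
    p     = c(0.25, 0.5, 0.75),
    mpg_q = quantile(mpg, p),
    .by   = cyl
  )
# 9 rows: 3 quantiles x 3 cyl groups
```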
Exercise 6.6: Run an analysis safely on multiple datasets
Scenario: A list of tibbles. For each, fit lm(y ~ x). Some may have insufficient rows. Use safely + map.
Difficulty: Advanced
Click to reveal solution
Explanation: safely wraps the model fit so failures become $error entries instead of throwing. Inspect with map_lgl to filter or report. Standard pattern for batch processing.
Exercise 6.7: Build a typed summary across many columns
Scenario: For every numeric column of iris, compute mean and sd, with results in a tidy two-column-per-stat layout.
Difficulty: Advanced
Click to reveal solution
Explanation: Compute multi-stat with across, then pivot_longer with a regex names_sep to split column names into variable and stat. Foundation for tidy multi-statistic tables.
Exercise 6.8: End-to-end ETL slice
Scenario: Read messy raw data, clean strings, parse dates, join with a lookup, aggregate, and save a clean tibble. Build the whole pipeline.
Difficulty: Advanced
Click to reveal solution
Explanation: Several tidyverse packages in one pipeline: stringr (case/trim/remove), lubridate (mdy), dplyr (mutate/join/summarise/arrange), and tibble. This shape recurs in every real ETL job.
What to do next
Master the multi-package idiom and the rest of R falls into place. Natural follow-ups:
- Single-package hubs: dplyr-Exercises, ggplot2-Exercises (shipped) for deeper drilling on each.
- Topic hubs (coming): Joins, Window functions, Pivot, Regex, Date-Time.
- Domain practice (coming): Data-Wrangling-Exercises, Data-Cleaning-Exercises, EDA-Exercises target take-home interview shapes.