tidyr Reshaping Exercises in R: 18 Real-World Practice Problems
Eighteen scenario-based tidyr reshaping exercises grouped into five themed sections covering pivot_longer, pivot_wider, names_sep, names_pattern, the .value sentinel, separate, unite, complete, and fill. Every problem ships with an expected result, and solutions are hidden behind reveal toggles so you actually try first.
Section 1. pivot_longer basics (4 problems)
Exercise 1.1: Reshape relig_income into long format with pivot_longer
Task: A demographic researcher needs to reshape the built-in tidyr dataset relig_income, which stores 18 US religions across 10 income-bracket columns, into a long-format tibble where each religion-bracket pair is its own row. Use pivot_longer() with cols = -religion to fold the 10 bracket columns into a key column called bracket and a value column called count. Save the result to ex_1_1.
Expected result:
#> # A tibble: 180 x 3
#> religion bracket count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> # 174 more rows hidden
Difficulty: Beginner
Click to reveal solution
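A solution sketch (assumes the tidyr package is installed; tidyr re-exports the `%>%` pipe):

```r
library(tidyr)

ex_1_1 <- relig_income %>%
  pivot_longer(
    cols      = -religion,   # every column except religion
    names_to  = "bracket",   # old column names become the key column
    values_to = "count"      # cell values become the value column
  )
```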
Explanation: cols = -religion is the negative-selection idiom: pick every column EXCEPT religion. Eighteen religions times ten bracket columns produces 180 long rows. The names_to argument labels the new key column and values_to labels the value column. This is cleaner than gather(), its superseded predecessor, because pivot_longer() shares its column-selection helpers with dplyr::select().
Exercise 1.2: Pivot only the first six weeks of billboard
Task: The chart historian working with the tidyr billboard dataset only cares about the first six weeks of each track's run, not all 76 weekly columns. Use pivot_longer() with cols = wk1:wk6 so only those six columns are folded into a week key and rank value column, keeping artist, track, and date.entered as id columns, and drop NA ranks with values_drop_na = TRUE. Save to ex_1_2.
Expected result:
#> # A tibble: 1,902 x 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <chr> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#> ...
#> # 1,896 more rows hidden
Difficulty: Beginner
Click to reveal solution
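A solution sketch (assumes tidyr is installed):

```r
library(tidyr)

ex_1_2 <- billboard %>%
  pivot_longer(
    cols           = wk1:wk6,   # only the first six weekly columns
    names_to       = "week",
    values_to      = "rank",
    values_drop_na = TRUE       # drop weeks the track was off the chart
  )
```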
Explanation: wk1:wk6 uses the column-range selector (positional, between named columns) so only six of the 76 weekly columns participate. values_drop_na = TRUE discards rows where the track had not yet entered or had already exited the chart in that week, which is the natural behavior for time-series-style ranks. Each of the 317 tracks contributes at most six rows (317 × 6 = 1,902); keeping the NA rows would just carry useless padding.
Exercise 1.3: Strip the wk prefix and cast week to integer
Task: Continuing the billboard analysis, your downstream plotting code expects week to be a clean integer rather than strings like wk3. Redo the pivot from Exercise 1.2 using cols = wk1:wk6 but add names_prefix = "wk" to strip the literal prefix and names_transform = list(week = as.integer) to coerce the resulting key column to integer in one pass. Save to ex_1_3.
Expected result:
#> # A tibble: 1,902 x 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <int> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#> ...
Difficulty: Intermediate
Click to reveal solution
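A solution sketch (same pivot as 1.2, with the prefix stripped and the key coerced in one pass):

```r
library(tidyr)

ex_1_3 <- billboard %>%
  pivot_longer(
    cols            = wk1:wk6,
    names_to        = "week",
    names_prefix    = "wk",                     # strip the literal prefix
    names_transform = list(week = as.integer),  # "1" becomes 1L at pivot time
    values_to       = "rank",
    values_drop_na  = TRUE
  )
```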
Explanation: Doing the strip-and-cast inline avoids a follow-up mutate(week = as.integer(...)) and keeps the type clean from the moment the column is born. Note that names_prefix is treated as a regular expression matched at the start of each name, so regex metacharacters like . need escaping; applied to a hypothetical column named wks1, the pattern "wk" would strip only the wk and leave s1 behind. A common mistake is using parse_number() post-pivot, which works but loses the chain. For a column that already lacks a prefix, omit names_prefix and use only names_transform.
Exercise 1.4: Find each track's peak chart position
Task: The chart historian now wants every track's all-time peak ranking across the full 76 weeks. Pivot every wk* column to long form using cols = starts_with("wk") with values_drop_na = TRUE, then group by artist and track, summarise to take min(rank) as peak_rank, and sort ascending so the strongest performers float to the top. Save the result to ex_1_4.
Expected result:
#> # A tibble: 317 x 3
#> artist track peak_rank
#> <chr> <chr> <dbl>
#> 1 Destiny's Child Independent Women Part I 1
#> 2 Santana Maria, Maria 1
#> 3 Savage Garden I Knew I Loved You 1
#> 4 Madonna Music 1
#> 5 Aguilera, Christina Come On Over Baby (All I W... 1
#> # 312 more rows hidden
Difficulty: Intermediate
Click to reveal solution
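A solution sketch (assumes tidyr and dplyr are installed):

```r
library(dplyr)
library(tidyr)

ex_1_4 <- billboard %>%
  pivot_longer(
    cols           = starts_with("wk"),  # all 76 weekly columns
    names_to       = "week",
    values_to      = "rank",
    values_drop_na = TRUE
  ) %>%
  group_by(artist, track) %>%
  summarise(peak_rank = min(rank), .groups = "drop") %>%
  arrange(peak_rank)                     # strongest performers first
```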
Explanation: starts_with("wk") is the tidyselect helper that grabs every column whose name begins with the prefix, future-proofing against more weekly columns being added. The pivot turns each track-week pair into a row, and the group-and-summarise then collapses each track back to its best showing. Dropping NAs is essential here: without it, min(rank) would return NA for any track with at least one missing week, which is nearly every track, since most exit the chart well before week 76.
Section 2. pivot_wider basics (3 problems)
Exercise 2.1: Reshape fish_encounters from long to wide
Task: The conservation team has the built-in tidyr fish_encounters dataset listing whether each of 19 fish was seen at each of 11 monitoring stations in long form (fish, station, seen). Reshape it so each row is one fish and each column is a station with the seen flag in the cells, using pivot_wider(names_from = station, values_from = seen). Save the result to ex_2_1.
Expected result:
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA NA
#> # 15 more rows hidden
Difficulty: Beginner
Click to reveal solution
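A solution sketch (assumes tidyr is installed):

```r
library(tidyr)

ex_2_1 <- fish_encounters %>%
  pivot_wider(
    names_from  = station,  # station values become column names
    values_from = seen      # seen flags fill the cells
  )
```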
Explanation: pivot_wider() is the exact inverse of pivot_longer(): the column you point to with names_from supplies the new column names, and the column you point to with values_from supplies the cell values. Any combination of fish-and-station that did not appear in the long source becomes an NA in the wide result, which is why fish 4845 has NAs from station BCE onward. The default values_fill = NA produces this behavior; Exercise 2.2 swaps it out.
Exercise 2.2: Replace NA cells with zero using values_fill
Task: The previous reshape left NA in every cell where a fish was never observed at a station, but the downstream report expects zeros there instead so absence is encoded numerically. Repeat the pivot_wider() call on fish_encounters and add values_fill = 0 so absent observations are encoded as 0 rather than NA. Save the result to ex_2_2.
Expected result:
#> # A tibble: 19 x 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 0 0 0 0 0 0
#> ...
Difficulty: Intermediate
Click to reveal solution
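A solution sketch, identical to 2.1 except for the fill value:

```r
library(tidyr)

ex_2_2 <- fish_encounters %>%
  pivot_wider(
    names_from  = station,
    values_from = seen,
    values_fill = 0   # absent fish-station combinations become 0, not NA
  )
```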
Explanation: values_fill accepts either a single scalar (applied to every value column) or a named list (per-value-column fills). For binary presence flags, zero is the right semantic default: it means observed-absent rather than unknown. Passing values_fill = list(seen = 0L) would be the most explicit form and is preferred when you have multiple value columns with different natural defaults.
Exercise 2.3: Pivot a tidy summary back to a reporting matrix
Task: A finance analyst has a tidy three-column summary tibble (built inline below) with region, quarter, and revenue, and needs to hand it to leadership formatted as a region-by-quarter matrix where rows are regions and columns are the four quarters. Use pivot_wider(names_from = quarter, values_from = revenue) on the inline tibble and save to ex_2_3.
Expected result:
#> # A tibble: 3 x 5
#> region Q1 Q2 Q3 Q4
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 North 120 135 150 165
#> 2 South 90 100 110 125
#> 3 East 150 160 170 190
Difficulty: Intermediate
Click to reveal solution
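A solution sketch; the inline tibble (here named `quarterly`, an assumed name) is reconstructed from the expected output:

```r
library(tidyr)

quarterly <- tibble::tribble(
  ~region, ~quarter, ~revenue,
  "North", "Q1", 120,
  "North", "Q2", 135,
  "North", "Q3", 150,
  "North", "Q4", 165,
  "South", "Q1",  90,
  "South", "Q2", 100,
  "South", "Q3", 110,
  "South", "Q4", 125,
  "East",  "Q1", 150,
  "East",  "Q2", 160,
  "East",  "Q3", 170,
  "East",  "Q4", 190
)

ex_2_3 <- quarterly %>%
  pivot_wider(names_from = quarter, values_from = revenue)
```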
Explanation: This is the most common reporting flip: analysts model and aggregate in long form (one row per region-quarter) because that is the shape every dplyr verb expects, then pivot wide only at the very end for human-readable display. Resist the temptation to keep data wide upstream: a tidy long table joins, filters, and groups cleanly, while a wide one fights every verb you throw at it.
Section 3. Advanced pivot_longer (4 problems)
Exercise 3.1: Split encoded column names with names_sep
Task: A health-records inline tibble (built below) has columns male_old, male_young, female_old, female_young, where the column names encode TWO variables (sex and age group) joined by _. Pivot to long form using pivot_longer() with names_to = c("sex", "age") and names_sep = "_" so the joined column names split into two clean key columns plus one value column called n. Save to ex_3_1.
Expected result:
#> # A tibble: 12 x 4
#> country sex age n
#> <chr> <chr> <chr> <dbl>
#> 1 US male old 42
#> 2 US male young 18
#> 3 US female old 55
#> 4 US female young 23
#> 5 UK male old 31
#> # 7 more rows hidden
Difficulty: Intermediate
Click to reveal solution
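A solution sketch; the inline tibble (assumed name `health`) is rebuilt from the visible expected rows, with the hidden values filled by placeholders (marked):

```r
library(tidyr)

health <- tibble::tribble(
  ~country, ~male_old, ~male_young, ~female_old, ~female_young,
  "US",     42,        18,          55,          23,
  "UK",     31,        12,          40,          17,   # only male_old = 31 is shown; rest are placeholders
  "FR",     28,        15,          36,          20    # placeholder row (third country not shown)
)

ex_3_1 <- health %>%
  pivot_longer(
    cols      = -country,
    names_to  = c("sex", "age"),  # two key columns from one column name
    names_sep = "_",              # split each name at the underscore
    values_to = "n"
  )
```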
Explanation: When names_to is a character vector of length two and names_sep is given, each old column name is split into that many pieces and each piece becomes a value in the corresponding key column. Three countries times four columns produces 12 long rows. Without names_sep, you would get a single column called "sex_age" with values like "male_old", forcing a follow-up separate() call.
Exercise 3.2: Extract two key columns with a regex via names_pattern
Task: A macroeconomic inline tibble (built below) has columns like gdp_2020, gdp_2021, pop_2020, pop_2021. Use pivot_longer() with names_to = c("metric", "year") and names_pattern = "(gdp|pop)_(\\d+)" to capture both pieces, then add names_transform = list(year = as.integer) so year lands as integer. Save the result to ex_3_2.
Expected result:
#> # A tibble: 12 x 4
#> country metric year value
#> <chr> <chr> <int> <dbl>
#> 1 US gdp 2020 21000
#> 2 US gdp 2021 23000
#> 3 US pop 2020 331
#> 4 US pop 2021 332
#> 5 UK gdp 2020 2800
#> # 7 more rows hidden
Difficulty: Advanced
Click to reveal solution
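A solution sketch; the inline tibble (assumed name `econ`) uses the visible expected values plus placeholders (marked) for the hidden rows:

```r
library(tidyr)

econ <- tibble::tribble(
  ~country, ~gdp_2020, ~gdp_2021, ~pop_2020, ~pop_2021,
  "US",     21000,     23000,     331,       332,
  "UK",     2800,      3100,      67,        67,    # values after gdp_2020 are placeholders
  "DE",     3900,      4200,      83,        83     # placeholder row (third country not shown)
)

ex_3_2 <- econ %>%
  pivot_longer(
    cols            = -country,
    names_to        = c("metric", "year"),
    names_pattern   = "(gdp|pop)_(\\d+)",     # one capture group per key column
    names_transform = list(year = as.integer)
  )
```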
Explanation: Use names_pattern (regex with one capture group per names_to entry) when the separator is ambiguous or when you need precise control over what each piece looks like. The pattern (gdp|pop)_(\d+) says: first capture is the literal gdp or pop, second capture is one or more digits. Compared to names_sep = "_" this is safer because it rejects malformed column names instead of silently splitting them in surprising ways.
Exercise 3.3: Spread paired columns with the .value sentinel
Task: The built-in tidyr household dataset has columns dob_child1, dob_child2, name_child1, name_child2, where each child contributes two parallel columns. Use pivot_longer() with names_to = c(".value", "child") and names_sep = "_" so the dob and name prefixes become their own columns and child (containing "child1" or "child2") becomes the long-form id. Save the result to ex_3_3.
Expected result:
#> # A tibble: 10 x 4
#> family child dob name
#> <int> <chr> <date> <chr>
#> 1 1 child1 1998-11-26 Susan
#> 2 1 child2 2000-01-29 Jose
#> 3 2 child1 1996-06-22 Mark
#> 4 2 child2 NA NA
#> 5 3 child1 2002-07-11 Sam
#> 6 3 child2 2004-04-05 Seth
#> # 4 more rows hidden
Difficulty: Advanced
Click to reveal solution
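A solution sketch (assumes tidyr is installed):

```r
library(tidyr)

ex_3_3 <- household %>%
  pivot_longer(
    cols      = -family,
    names_to  = c(".value", "child"),  # the prefix becomes the NEW column name
    names_sep = "_"                    # dob_child1 splits into "dob" + "child1"
  )
```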
Explanation: The .value sentinel is the killer feature for paired columns. When names_to contains ".value", the matching piece of each old column name becomes the NEW column name in the result rather than a value inside a key column. Four old columns collapse into two value columns (dob, name) plus one new key column (child). Without .value you would get a single value column smashing dates and strings together, type-coerced to character.
Exercise 3.4: Convert numeric stems with names_transform
Task: A field-survey inline tibble (built below) stores depth readings in columns depth_5, depth_10, depth_15, depth_20 (the suffix is meters). Pivot to long using names_prefix = "depth_" to strip the literal stem, then ensure the new depth key is numeric using names_transform = list(depth = as.numeric) so it can sit on a continuous axis. Save to ex_3_4.
Expected result:
#> # A tibble: 12 x 3
#> site depth reading
#> <chr> <dbl> <dbl>
#> 1 A 5 18.2
#> 2 A 10 17.4
#> 3 A 15 16.1
#> 4 A 20 14.3
#> 5 B 5 19.0
#> # 7 more rows hidden
Difficulty: Intermediate
Click to reveal solution
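A solution sketch; the inline tibble (assumed name `survey`) uses the visible expected values, with placeholders (marked) for hidden readings:

```r
library(tidyr)

survey <- tibble::tribble(
  ~site, ~depth_5, ~depth_10, ~depth_15, ~depth_20,
  "A",   18.2,     17.4,      16.1,      14.3,
  "B",   19.0,     18.1,      16.8,      15.0,   # values after 19.0 are placeholders
  "C",   17.5,     16.9,      15.7,      13.9    # placeholder row (third site not shown)
)

ex_3_4 <- survey %>%
  pivot_longer(
    cols            = starts_with("depth_"),
    names_to        = "depth",
    names_prefix    = "depth_",                  # "depth_15" -> "15"
    names_transform = list(depth = as.numeric),  # "15" -> 15 (double)
    values_to       = "reading"
  )
```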
Explanation: Stripping the prefix turns depth_15 into "15", but it is still a character. names_transform lets you coerce in the same call so the column is born as numeric and behaves correctly on plot axes and joins. A common mistake is reaching for as.integer here: it would silently drop fractional depths if any appeared. as.numeric (double) is the safer default.
Section 4. Advanced pivot_wider (3 problems)
Exercise 4.1: Spread two value columns at once
Task: A student-scores inline tibble (built below) has one row per student-subject containing both a score and a grade letter. Use pivot_wider(names_from = subject, values_from = c(score, grade)) so each student gets one row with paired score and grade columns per subject, producing names like score_math and grade_math. Save the result to ex_4_1.
Expected result:
#> # A tibble: 3 x 5
#> student score_math score_eng grade_math grade_eng
#> <chr> <dbl> <dbl> <chr> <chr>
#> 1 Ada 92 88 A B
#> 2 Bea 78 95 C A
#> 3 Cy 85 80 B B
Difficulty: Intermediate
Click to reveal solution
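A solution sketch; the inline tibble (assumed name `scores`) is fully reconstructable from the expected output:

```r
library(tidyr)

scores <- tibble::tribble(
  ~student, ~subject, ~score, ~grade,
  "Ada",    "math",   92,     "A",
  "Ada",    "eng",    88,     "B",
  "Bea",    "math",   78,     "C",
  "Bea",    "eng",    95,     "A",
  "Cy",     "math",   85,     "B",
  "Cy",     "eng",    80,     "B"
)

ex_4_1 <- scores %>%
  pivot_wider(
    names_from  = subject,
    values_from = c(score, grade)  # two value columns spread at once
  )
```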
Explanation: Passing a vector to values_from produces one new column PER value-input PER name-input, with names auto-generated from the default glue spec {.value}_{name}. Two subjects times two value columns gives four new columns. If you do not like the auto naming, swap in names_glue to take control: names_glue = "{subject}.{.value}" would produce math.score, eng.score, and so on.
Exercise 4.2: Aggregate duplicate cells with values_fn = sum
Task: A retail manager has a transaction log inline tibble (built below) where each row is a sale with store, product, and units_sold, and a store may post multiple sales for the same product. Pivot to a store-by-product matrix that SUMS units, using values_fn = sum inside pivot_wider() to aggregate duplicate combinations rather than crashing with a list-column warning. Save the result to ex_4_2.
Expected result:
#> # A tibble: 2 x 4
#> store Widget Gadget Gizmo
#> <chr> <dbl> <dbl> <dbl>
#> 1 NorthSt 12 7 5
#> 2 SouthSt 8 11 3
Difficulty: Advanced
Click to reveal solution
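A solution sketch; the inline log (assumed name `sales_log`) is invented so that its per-store-product sums match the expected matrix (the split of NorthSt Widget into 5 + 7 and SouthSt Gadget into 6 + 5 is hypothetical):

```r
library(tidyr)

sales_log <- tibble::tribble(
  ~store,    ~product, ~units_sold,
  "NorthSt", "Widget", 5,
  "NorthSt", "Widget", 7,   # duplicate store-product pair to aggregate
  "NorthSt", "Gadget", 7,
  "NorthSt", "Gizmo",  5,
  "SouthSt", "Widget", 8,
  "SouthSt", "Gadget", 6,
  "SouthSt", "Gadget", 5,   # duplicate store-product pair to aggregate
  "SouthSt", "Gizmo",  3
)

ex_4_2 <- sales_log %>%
  pivot_wider(
    names_from  = product,
    values_from = units_sold,
    values_fn   = sum        # aggregate duplicates instead of list-columns
  )
```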
Explanation: When the combination of id columns and names_from values is not unique, pivot_wider() cannot decide which value to drop into a cell, so by default it stuffs all candidates into a list-column and warns. values_fn = sum says aggregate-then-drop; you could pass mean, max, or any reducer that accepts a numeric vector. The alternative pattern is to group_by() and summarise() upstream so the input is already unique, then pivot without values_fn. Either works; values_fn is faster to type.
Exercise 4.3: Build custom column names with names_glue
Task: The portfolio team wants column names like Q1_revenue, Q1_cost, Q2_revenue, Q2_cost from a tidy quarter-by-metric inline tibble (built below). Use pivot_wider() with names_from = c(quarter, metric) and names_glue = "{quarter}_{metric}" to produce the custom naming layout instead of the default underscore-joined order. Save the result to ex_4_3.
Expected result:
#> # A tibble: 2 x 5
#> region Q1_revenue Q1_cost Q2_revenue Q2_cost
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 North 120 80 135 85
#> 2 South 90 60 100 62
Difficulty: Intermediate
Click to reveal solution
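A solution sketch; the inline tibble (assumed name `fin`) is fully reconstructable from the expected output:

```r
library(tidyr)

fin <- tibble::tribble(
  ~region, ~quarter, ~metric,   ~value,
  "North", "Q1",     "revenue", 120,
  "North", "Q1",     "cost",    80,
  "North", "Q2",     "revenue", 135,
  "North", "Q2",     "cost",    85,
  "South", "Q1",     "revenue", 90,
  "South", "Q1",     "cost",    60,
  "South", "Q2",     "revenue", 100,
  "South", "Q2",     "cost",    62
)

ex_4_3 <- fin %>%
  pivot_wider(
    names_from  = c(quarter, metric),
    values_from = value,
    names_glue  = "{quarter}_{metric}"  # explicit template for new names
  )
```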
Explanation: names_glue accepts a glue template string where each placeholder refers to a column listed in names_from. The default join would also produce Q1_revenue-style names here, but names_glue becomes essential when you want to insert a constant, change separators per-column, or rearrange the order ("{metric}-{quarter}" would give revenue-Q1). Glue templates compose cleanly with dplyr::rename_with() if you want post-hoc adjustments.
Section 5. Companion verbs: separate, unite, complete, fill (4 problems)
Exercise 5.1: Split full_name into first and last with separate_wider_delim
Task: A data-entry sheet has a single full_name column with values like "Ada Lovelace" and "Alan Turing", but downstream code expects first and last as separate columns. Use the modern separate_wider_delim() from tidyr 1.3+ with delim = " " and names = c("first", "last") to split on the space, applied to the inline tibble of five names below. Save the result to ex_5_1.
Expected result:
#> # A tibble: 5 x 2
#> first last
#> <chr> <chr>
#> 1 Ada Lovelace
#> 2 Alan Turing
#> 3 Grace Hopper
#> 4 Donald Knuth
#> 5 Margaret Hamilton
Difficulty: Intermediate
Click to reveal solution
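A solution sketch; the inline tibble (assumed name `people`) holds the five names listed in the expected output:

```r
library(tidyr)

people <- tibble::tibble(
  full_name = c("Ada Lovelace", "Alan Turing", "Grace Hopper",
                "Donald Knuth", "Margaret Hamilton")
)

ex_5_1 <- people %>%
  separate_wider_delim(full_name, delim = " ", names = c("first", "last"))
```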
Explanation: separate_wider_delim() is the tidyr 1.3+ replacement for the older separate(). It is stricter by default: if any row has the wrong number of pieces, it errors loudly rather than silently filling with NA. For names with middle parts, pass too_many = "merge" to glue extras into the last piece. The _wider_ naming convention echoes the pivot family: _wider_ adds columns, _longer_ would add rows.
Exercise 5.2: Reunite year/month/day into an ISO date string
Task: An export pipeline split a timestamp into separate year, month, and day integer columns, but the downstream API expects a single ISO-style date string like "2024-03-15". Use unite() on the inline tibble below with col = "date", the three columns in order, and sep = "-" to merge the three components back into one column. Save the result to ex_5_2.
Expected result:
#> # A tibble: 5 x 1
#> date
#> <chr>
#> 1 2024-01-12
#> 2 2024-02-04
#> 3 2024-03-15
#> 4 2024-04-22
#> 5 2024-05-30
Difficulty: Intermediate
Click to reveal solution
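A solution sketch; the inline tibble (assumed name `parts`) is reconstructed from the expected dates. One wrinkle the task glosses over: integer month/day components need zero-padding first, otherwise unite() would produce "2024-1-12" rather than the ISO form shown:

```r
library(dplyr)
library(tidyr)

parts <- tibble::tibble(
  year  = 2024L,
  month = 1:5,
  day   = c(12L, 4L, 15L, 22L, 30L)
)

ex_5_2 <- parts %>%
  mutate(month = sprintf("%02d", month),  # zero-pad so unite() yields ISO format
         day   = sprintf("%02d", day)) %>%
  unite(col = "date", year, month, day, sep = "-")
```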
Explanation: unite() is the natural inverse of separate() and the same logic as base R paste(..., sep = "-") but column-aware. By default the source columns are dropped; pass remove = FALSE to keep them alongside the new combined column. To get a real Date object instead of a string, follow up with mutate(date = as.Date(date)) or skip unite entirely and use lubridate::make_date(year, month, day).
Exercise 5.3: Complete missing date-by-category combinations
Task: A sensor log inline tibble (built below) has readings only on days where a sensor reported, so days with no signal are silently absent. Use complete(date, sensor_id, fill = list(reading = 0)) to materialize every date-by-sensor combination explicitly, filling the previously absent reading cells with 0 instead of NA. Save the result to ex_5_3.
Expected result:
#> # A tibble: 6 x 3
#> date sensor_id reading
#> <date> <chr> <dbl>
#> 1 2024-06-01 S1 12
#> 2 2024-06-01 S2 0
#> 3 2024-06-02 S1 14
#> 4 2024-06-02 S2 9
#> 5 2024-06-03 S1 0
#> 6 2024-06-03 S2 11
Difficulty: Intermediate
Click to reveal solution
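A solution sketch; the inline log (assumed name `sensor_log`) contains only the four observed readings implied by the expected output, and complete() materializes the two missing combinations:

```r
library(tidyr)

sensor_log <- tibble::tibble(
  date      = as.Date(c("2024-06-01", "2024-06-02", "2024-06-02", "2024-06-03")),
  sensor_id = c("S1", "S1", "S2", "S2"),
  reading   = c(12, 14, 9, 11)
)

ex_5_3 <- sensor_log %>%
  complete(date, sensor_id, fill = list(reading = 0))  # every date x sensor pair
```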
Explanation: complete() builds the Cartesian product of the values seen in the columns you pass, joins it against the original data, and keeps the union. Missing rows are added with NA in unspecified columns, but fill = list(reading = 0) lets you replace those NAs at birth. This is the canonical move before plotting time series so gaps look like genuine zeros instead of disappearing lines. For dense full-range completion, pair with tidyr::full_seq() to enumerate every date.
Exercise 5.4: Carry forward report dates with fill
Task: A daily-report inline tibble (built below) lists the report date only on the first row of each batch and leaves it NA on subsequent rows so the printed report looks tidier. Use fill(date, .direction = "down") to carry the date forward into the NA rows so every observation knows its report date, then save the result to ex_5_4.
Expected result:
#> # A tibble: 6 x 3
#> date item value
#> <date> <chr> <dbl>
#> 1 2024-07-01 alpha 10
#> 2 2024-07-01 beta 12
#> 3 2024-07-01 gamma 14
#> 4 2024-07-02 alpha 11
#> 5 2024-07-02 beta 13
#> 6 2024-07-02 gamma 15
Difficulty: Intermediate
Click to reveal solution
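A solution sketch; the inline tibble (assumed name `report`) has the date only on the first row of each batch, matching the scenario:

```r
library(tidyr)

report <- tibble::tibble(
  date  = as.Date(c("2024-07-01", NA, NA, "2024-07-02", NA, NA)),
  item  = rep(c("alpha", "beta", "gamma"), 2),
  value = c(10, 12, 14, 11, 13, 15)
)

ex_5_4 <- report %>%
  fill(date, .direction = "down")  # carry each date forward into NA rows
```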
Explanation: fill() replaces NAs in the columns you name with the nearest non-NA value, defaulting to downward direction. This is the classic move when spreadsheets repeat a header value only at the top of each block. Pass .direction = "up" to back-fill from below, or "downup" to fill in both directions when an NA might sit between two known values. Always pair with group_by() if NAs should only fill within partitions, never across them.
What to do next
- pivot_longer and pivot_wider: Reshape Data in R is the parent tutorial covering every option in depth.
- dplyr Exercises in R practices the verbs you chain BEFORE and AFTER a reshape.
- tidyverse Exercises in R puts reshaping in cross-package pipelines.
- Data Wrangling Exercises in R covers the broader cleaning workflow these reshapes plug into.
r-statistics.co · Verifiable credential · Public URL