Data Wrangling Exercises in R: 50 Real Practice Problems
Fifty practice problems covering the data wrangling lifecycle in R: import, inspect, clean, reshape, combine, aggregate, and ETL pipelines. Real-world scenarios with hidden solutions.
Section 1. Import and inspection (8 problems)
Exercise 1.1: Quick structural inspection
Scenario: A new dataset lands. The first checks: dimensions, column types, sample rows. Run all three on airquality.
Difficulty: Beginner
Click to reveal solution
Explanation: Always start with these. dim() and str() catch type surprises; summary() reveals NAs and outliers. glimpse() is the tidyverse counterpart: it rotates the display vertically so wide tables remain readable.
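In base R, those checks could look like this, with glimpse() as the tidyverse alternative:

```r
dim(airquality)      # 153 rows, 6 columns
str(airquality)      # column types at a glance
summary(airquality)  # ranges plus NA counts per column
head(airquality)     # first sample rows

# tidyverse alternative:
dplyr::glimpse(airquality)
```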
Exercise 1.2: Read a CSV with custom NAs
Scenario: A CSV uses "N/A", "missing", and "" all as missing markers. Read it, treating all three as NA.
Difficulty: Intermediate
Click to reveal solution
Explanation: read_csv's na = argument accepts a character vector of strings to treat as missing. Inspect after with summary() to confirm NAs were caught.
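A sketch with readr (the filename is a placeholder):

```r
library(readr)

df <- read_csv("data.csv", na = c("N/A", "missing", ""))
summary(df)  # confirm the markers landed as NA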
Exercise 1.3: Read with column type override
Scenario: A CSV has IDs that are zero-padded ("00042"). read_csv guesses integer, dropping the zeros. Force the column to be character.
Difficulty: Intermediate
Click to reveal solution
Explanation: col_types takes a cols() spec. Override only the columns you need; the rest are still guessed. The compact-string shorthand instead specifies every column in order: col_types = "ci" (c = character, i = integer, d = double, l = logical).
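One possible spec, assuming the zero-padded column is named id:

```r
library(readr)

# only `id` is overridden; the remaining columns are still guessed
df <- read_csv("data.csv", col_types = cols(id = col_character()))
```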
Exercise 1.4: Detect column types
Scenario: Confirm what types read_csv inferred for each column.
Difficulty: Beginner
Click to reveal solution
Explanation: sapply(df, class) returns a named vector of types. For tibbles, str(df) or glimpse(df) does the same visually.
Exercise 1.5: Count NA per column
Scenario: Find which columns of airquality have the most NAs, sorted descending.
Difficulty: Intermediate
Click to reveal solution
Explanation: across applies the NA counter to every column; pivot_longer reshapes to a tidy report. Foundation for completeness audits.
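One way the audit could look:

```r
library(dplyr)
library(tidyr)

airquality |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") |>
  arrange(desc(n_missing))
# Ozone has 37 missing values, Solar.R has 7
```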
Exercise 1.6: Fast random sample for exploration
Scenario: A 1M-row dataset is too slow to explore interactively. Take a random sample of 5,000 rows with a seed.
Difficulty: Beginner
Click to reveal solution
Explanation: slice_sample(n = ...) draws a fixed count; prop = ... draws a proportion. For per-group sampling, add by =. Always set a seed for reproducibility.
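A minimal sketch (big_df stands in for the 1M-row tibble):

```r
library(dplyr)

set.seed(42)  # reproducible sample
sample_df <- slice_sample(big_df, n = 5000)
```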
Exercise 1.7: Distinct value counts
Scenario: Quickly understand the cardinality of each diamonds column.
Difficulty: Intermediate
Click to reveal solution
Explanation: n_distinct counts unique values. For factors and characters, this informs decisions about grouping or one-hot encoding. Combine with summarise(across(...)) for a one-liner audit.
Exercise 1.8: Build a column profile
Scenario: For each numeric column of mtcars, return min, mean, max, NA count.
Difficulty: Advanced
Click to reveal solution
Explanation: Multi-stat across with .names, then pivot to a tidy column-by-column profile. skimr::skim() is the production version.
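One possible shape for the profile, using a double-underscore in .names so names_sep can split cleanly afterwards:

```r
library(dplyr)
library(tidyr)

mtcars |>
  summarise(across(
    where(is.numeric),
    list(min  = ~ min(.x, na.rm = TRUE),
         mean = ~ mean(.x, na.rm = TRUE),
         max  = ~ max(.x, na.rm = TRUE),
         n_na = ~ sum(is.na(.x))),
    .names = "{.col}__{.fn}"
  )) |>
  pivot_longer(everything(),
               names_to = c("column", "stat"), names_sep = "__") |>
  pivot_wider(names_from = stat, values_from = value)
```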
Section 2. Cleaning (10 problems)
Exercise 2.1: Drop rows with NA in target column
Scenario: Drop rows where Ozone is NA from airquality.
Difficulty: Beginner
Click to reveal solution
Explanation: drop_na with no args drops any-NA rows; with named columns, only those are checked.
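The one-liner could look like:

```r
library(tidyr)

airquality |> drop_na(Ozone)  # 116 of 153 rows remain
```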
Exercise 2.2: Replace NAs with column mean
Scenario: Impute NA values in Ozone with the column mean.
Difficulty: Intermediate
Click to reveal solution
Explanation: Compute the mean with na.rm = TRUE, then swap it in via if_else(). Be aware mean imputation distorts variance; use only for quick analyses.
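One sketch; as.numeric() avoids an integer-vs-double type clash inside if_else(), since Ozone is stored as integer:

```r
library(dplyr)

airquality |>
  mutate(Ozone = if_else(is.na(Ozone),
                         mean(Ozone, na.rm = TRUE),
                         as.numeric(Ozone)))
```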
Exercise 2.3: Coerce string to numeric, handling errors
Scenario: A column has mostly numbers but some entries like "n/a", "x". Convert to numeric; non-numeric become NA.
Difficulty: Intermediate
Click to reveal solution
Explanation: as.numeric returns NA for non-numeric with a warning. Wrap in suppressWarnings if you've already accepted the conversion. readr::parse_number is more flexible (extracts numbers from "$1,234.50").
Exercise 2.4: Trim whitespace and standardize case
Scenario: A name column has trailing spaces and mixed case. Standardize to title case, trimmed.
Difficulty: Intermediate
Click to reveal solution
Explanation: str_trim removes leading/trailing whitespace; str_to_title capitalizes first letters. Always normalize text columns before grouping or joining.
Exercise 2.5: Deduplicate by key
Scenario: A customer table has duplicate emails. Keep the first row per email.
Difficulty: Beginner
Click to reveal solution
Explanation: distinct(col, .keep_all = TRUE) keeps the first occurrence of each unique value, preserving all columns. Without .keep_all, only the deduping column is returned.
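With a placeholder customers tibble:

```r
library(dplyr)

customers |> distinct(email, .keep_all = TRUE)
```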
Exercise 2.6: Standardize categorical labels
Scenario: A region column has variants: "USA", "us", "United States", "U.S.A.". Map all to "US".
Difficulty: Intermediate
Click to reveal solution
Explanation: case_when handles multi-value mappings cleanly. For more complex matching, build a lookup tibble and join.
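One possible mapping (df and region are illustrative names):

```r
library(dplyr)

df |>
  mutate(region = case_when(
    region %in% c("USA", "us", "United States", "U.S.A.") ~ "US",
    TRUE ~ region  # everything else passes through unchanged
  ))
```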
Exercise 2.7: Detect and flag outliers
Scenario: Add an is_outlier flag to mtcars: TRUE if mpg is outside [Q1 - 1.5IQR, Q3 + 1.5IQR].
Difficulty: Intermediate
Click to reveal solution
Explanation: Tukey's fences. A braced block inside mutate() computes intermediate values and returns the last expression. For per-group fences, add group_by() first.
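The flag could be computed like this:

```r
library(dplyr)

mtcars |>
  mutate(is_outlier = {
    q   <- quantile(mpg, c(0.25, 0.75))
    iqr <- q[2] - q[1]
    mpg < q[1] - 1.5 * iqr | mpg > q[2] + 1.5 * iqr
  })
```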
Exercise 2.8: Drop rows with too many NAs
Scenario: Drop rows where MORE THAN half the columns are NA.
Difficulty: Advanced
Click to reveal solution
Explanation: is.na on a tibble returns a logical tibble; rowSums counts TRUEs per row; filter keeps rows where the count is within threshold. Standard sparsity filter.
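One sketch of the sparsity filter (df is a placeholder):

```r
library(dplyr)

df |>
  filter(rowSums(is.na(df)) <= ncol(df) / 2)
```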
Exercise 2.9: Cap extreme values (winsorize)
Scenario: Replace mpg values above the 95th percentile with the 95th percentile value.
Difficulty: Intermediate
Click to reveal solution
Explanation: Winsorizing caps extremes without removing rows. pmin(x, cap) takes elementwise minimum. For both tails: combine with pmax.
Exercise 2.10: Validate against a schema
Scenario: Verify that age is between 0 and 120, that email contains "@", and amount is non-negative. Return rows failing any check.
Difficulty: Advanced
Click to reveal solution
Explanation: Compute per-rule flags, then filter rows failing ANY rule. For production, the validate or pointblank packages structure this with named rules and reports.
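A sketch assuming columns named age, email, and amount:

```r
library(dplyr)
library(stringr)

failures <- df |>
  mutate(
    ok_age    = between(age, 0, 120),
    ok_email  = str_detect(email, fixed("@")),
    ok_amount = amount >= 0
  ) |>
  filter(!ok_age | !ok_email | !ok_amount)
```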
Section 3. Reshape (8 problems)
Exercise 3.1: Wide to long
Scenario: Pivot a wide quarterly table to long format.
Difficulty: Beginner
Click to reveal solution
Explanation: pivot_longer collapses many columns into name/value pairs. cols accepts tidyselect.
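With a toy wide table:

```r
library(tidyr)

wide <- tibble::tibble(product = c("A", "B"),
                       q1 = c(10, 20), q2 = c(15, 25))

wide |>
  pivot_longer(cols = starts_with("q"),
               names_to = "quarter", values_to = "sales")
```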
Exercise 3.2: Long to wide
Scenario: Pivot a long sales table to wide format.
Difficulty: Beginner
Click to reveal solution
Explanation: pivot_wider creates one column per distinct names_from value and fills it with the matching values_from entries.
Exercise 3.3: Pivot longer with names_pattern
Scenario: Columns are named revenue_2023, revenue_2024. Pivot to long with separate metric and year columns.
Difficulty: Intermediate
Click to reveal solution
Explanation: names_pattern with capture groups splits column names into multiple new columns. Useful when wide column names encode multiple variables.
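One possible call, assuming columns like revenue_2023 and revenue_2024:

```r
library(tidyr)

df |>
  pivot_longer(
    cols = starts_with("revenue_"),
    names_to = c("metric", "year"),
    names_pattern = "([a-z]+)_(\\d{4})"
  )
```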
Exercise 3.4: Separate a single column
Scenario: A column "address" contains "123 Main St, Boston, MA". Split into number, street, city, state.
Difficulty: Intermediate
Click to reveal solution
Explanation: separate_wider_regex (tidyr 1.3+) specifies a regex per output column, with literal separators in between. Cleaner than chained separate() calls.
Exercise 3.5: Unite columns
Scenario: Combine year, month, day columns into a single ISO date string.
Difficulty: Beginner
Click to reveal solution
Explanation: unite is the inverse of separate. sep is the joining character. By default removes the source columns; use remove = FALSE to keep.
Exercise 3.6: Pivot wider with multiple value columns
Scenario: A long table has columns id, metric, mean, sd. Pivot wide so each metric becomes two columns: <metric>_mean, <metric>_sd.
Difficulty: Advanced
Click to reveal solution
Explanation: Pass a vector to values_from for multi-value pivots. By default the value name comes first (mean_a, mean_b, sd_a, sd_b); use names_glue to get the <metric>_<stat> order the scenario asks for.
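A sketch; names_glue forces the <metric>_<stat> naming from the scenario:

```r
library(tidyr)

df |>
  pivot_wider(
    names_from  = metric,
    values_from = c(mean, sd),
    names_glue  = "{metric}_{.value}"
  )
```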
Exercise 3.7: Complete missing combinations
Scenario: Fill in missing month entries per region with sales = 0.
Difficulty: Intermediate
Click to reveal solution
Explanation: complete generates the full grid of region x month, filling missing with the supplied default. Crucial for time-series and panel data integrity.
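Assuming a long sales tibble with region, month, and sales columns:

```r
library(tidyr)

sales |>
  complete(region, month, fill = list(sales = 0))
```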
Exercise 3.8: Nest and unnest
Scenario: Group iris by Species and nest the data, then unnest back.
Difficulty: Intermediate
Click to reveal solution
Explanation: nest creates a list-column per group; unnest expands it back. Round-trip is lossless. Foundation of "many models" workflows.
Section 4. Combine, joins, bind (8 problems)
Exercise 4.1: Left join
Scenario: Customers and orders. Left join orders to customers.
Difficulty: Beginner
Click to reveal solution
Explanation: left_join keeps all left rows. The named-vector by handles different key names.
Exercise 4.2: Anti-join for orphans
Scenario: Find orders with no matching customer.
Difficulty: Intermediate
Click to reveal solution
Explanation: anti_join returns rows of x with NO match in y. Standard data-quality check.
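With placeholder key names (orders.customer_id vs customers.id):

```r
library(dplyr)

orphans <- orders |>
  anti_join(customers, by = join_by(customer_id == id))
```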
Exercise 4.3: Multi-key join
Scenario: Match sales to prices on (region, product).
Difficulty: Intermediate
Click to reveal solution
Explanation: A multi-column by (e.g. join_by(region, product)) matches on all listed keys.
Exercise 4.4: Bind rows of multiple tibbles
Scenario: Three monthly reports with the same columns. Combine into one.
Difficulty: Beginner
Click to reveal solution
Explanation: bind_rows concatenates row-wise. With .id, adds a column tagging which input each row came from.
Exercise 4.5: Bind columns side by side
Scenario: Two tibbles with the same number of rows. Side-by-side combine.
Difficulty: Beginner
Click to reveal solution
Explanation: bind_cols requires matching row counts. Be cautious: it does NOT align by key, just by row position.
Exercise 4.6: Full join with coalesce
Scenario: Two snapshots of customer scores. Take all customers, prefer current scores when present, else previous.
Difficulty: Advanced
Click to reveal solution
Explanation: full_join gives every id; coalesce prefers the first non-NA argument. Standard "upsert" pattern.
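One sketch, assuming both snapshots carry id and score columns:

```r
library(dplyr)

curr |>
  full_join(prev, by = "id", suffix = c("_curr", "_prev")) |>
  mutate(score = coalesce(score_curr, score_prev)) |>
  select(id, score)
```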
Exercise 4.7: Inequality (rolling) join
Scenario: Map each transaction amount to the highest tier whose threshold is below or equal to it.
Difficulty: Advanced
Click to reveal solution
Explanation: dplyr 1.1+ join_by with closest supports inequality joins. Same pattern works for time-window matches.
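A sketch with illustrative column names (amount on the transactions side, threshold on the tiers side):

```r
library(dplyr)

transactions |>
  left_join(tiers, join_by(closest(amount >= threshold)))
```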
Exercise 4.8: Find common and divergent rows
Scenario: Compare two snapshots of a customer table. Report which IDs are added (in curr only), removed (in prev only), and unchanged.
Difficulty: Advanced
Click to reveal solution
Explanation: Three operations: added/removed via anti_join in both directions; unchanged via inner_join plus a filter for equal values (negate the filter to get changed rows). Standard reconciliation pattern.
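The three pieces could look like this (score stands in for whatever column you compare):

```r
library(dplyr)

added     <- curr |> anti_join(prev, by = "id")
removed   <- prev |> anti_join(curr, by = "id")
unchanged <- curr |>
  inner_join(prev, by = "id", suffix = c("_curr", "_prev")) |>
  filter(score_curr == score_prev)
```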
Section 5. Aggregate and summarise (8 problems)
Exercise 5.1: Mean per group
Scenario: Mean mpg per cyl in mtcars.
Difficulty: Beginner
Click to reveal solution
Explanation: Standard group + summarise. summarise collapses each group to one row.
Exercise 5.2: Multiple stats per group
Scenario: Per cyl: count, mean, SD, min, max of mpg.
Difficulty: Intermediate
Click to reveal solution
Explanation: Multiple aggregations in one summarise.
Exercise 5.3: Top N per group
Scenario: Top 2 highest-mpg cars per cyl.
Difficulty: Intermediate
Click to reveal solution
Explanation: slice_max() with by = is the dplyr 1.1+ shortcut; older code uses group_by() followed by slice_max().
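In one line:

```r
library(dplyr)

mtcars |> slice_max(mpg, n = 2, by = cyl)
```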
Exercise 5.4: Count and percentage
Scenario: Count diamonds per cut, with percentage column.
Difficulty: Intermediate
Click to reveal solution
Explanation: count gives n; mutate adds pct from total. For per-group percentages, group_by before mutate.
Exercise 5.5: Group filter (HAVING-style)
Scenario: Keep only cyl groups with at least 10 cars; report mean mpg.
Difficulty: Intermediate
Click to reveal solution
Explanation: filter() on grouped data can use n() for the group size. Equivalent to SQL HAVING.
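One way it could read:

```r
library(dplyr)

mtcars |>
  group_by(cyl) |>
  filter(n() >= 10) |>
  summarise(mean_mpg = mean(mpg))  # only the 4- and 8-cylinder groups survive
```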
Exercise 5.6: Within-group proportion
Scenario: For each cut, the proportion of each clarity (cells sum to 1 within cut).
Difficulty: Intermediate
Click to reveal solution
Explanation: Group then mutate gives per-group denominators.
Exercise 5.7: Weighted mean
Scenario: Weighted mean price by carat per cut.
Difficulty: Intermediate
Click to reveal solution
Explanation: weighted.mean(x, w) gives sum(x*w)/sum(w).
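For example:

```r
library(dplyr)
library(ggplot2)  # for the diamonds dataset

diamonds |>
  group_by(cut) |>
  summarise(wmean_price = weighted.mean(price, carat))
```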
Exercise 5.8: Multi-level grouping
Scenario: Mean price per (cut, color) on diamonds.
Difficulty: Intermediate
Click to reveal solution
Explanation: Multiple group columns; .groups = "drop" releases grouping after summarise.
Section 6. End-to-end ETL (8 problems)
Exercise 6.1: Read, clean, aggregate
Scenario: Raw data has a string price like "$1,234.50". Clean and compute the mean.
Difficulty: Intermediate
Click to reveal solution
Explanation: readr::parse_number extracts numeric content from a string, ignoring currency symbols and thousands separators.
Exercise 6.2: Detect and remove duplicates with priority
Scenario: A customer table has dupes. Keep the one with the most recent updated_at per email.
Difficulty: Intermediate
Click to reveal solution
Explanation: arrange() by recency first; distinct(email, .keep_all = TRUE) then keeps the first row per email, i.e. the most recent.
Exercise 6.3: Parse dates from mixed formats
Scenario: Date column has mixed "01/15/2024" and "2024-02-20" formats. Parse all to Date.
Difficulty: Advanced
Click to reveal solution
Explanation: parse_date_time tries each format. Wrap with as.Date if you don't need the time component.
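One sketch with lubridate:

```r
library(lubridate)

dates <- c("01/15/2024", "2024-02-20")
as.Date(parse_date_time(dates, orders = c("mdy", "ymd")))
# "2024-01-15" "2024-02-20"
```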
Exercise 6.4: Wide to long with name parsing
Scenario: A wide table has columns q1_2023, q2_2023, etc. Pivot to long format with separate quarter and year columns.
Difficulty: Advanced
Click to reveal solution
Explanation: names_pattern with capture groups extracts both pieces of metadata in one pivot. Saves a separate step.
Exercise 6.5: Join + aggregate + sort
Scenario: Customers + orders + products. Compute total revenue per product category, sorted descending.
Difficulty: Advanced
Click to reveal solution
Explanation: Standard ETL: join product info, derive revenue, aggregate, sort.
Exercise 6.6: Reshape and join
Scenario: Wide quarterly sales table needs to be joined to a region info table. Pivot first.
Difficulty: Advanced
Click to reveal solution
Explanation: Pivot first (long is the join-friendly shape), then join.
Exercise 6.7: Profile a CSV programmatically
Scenario: Write a function that takes a tibble and returns: row count, col count, NA count, dup count, numeric col count.
Difficulty: Advanced
Click to reveal solution
Explanation: Reusable profiler in 5 lines. Foundation of audit reports.
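One possible five-line profiler (the function name is illustrative):

```r
profile_tbl <- function(df) {
  tibble::tibble(
    rows      = nrow(df),
    cols      = ncol(df),
    n_na      = sum(is.na(df)),
    n_dup     = sum(duplicated(df)),
    n_numeric = sum(vapply(df, is.numeric, logical(1)))
  )
}

profile_tbl(airquality)  # 153 rows, 6 cols, 44 NAs, 0 dups, 6 numeric
```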
Exercise 6.8: Full ETL slice
Scenario: Read messy raw data, clean, derive, join with lookup, aggregate. End-to-end.
Difficulty: Advanced
Click to reveal solution
Explanation: Five-package pipeline: stringr (clean strings), lubridate (mdy), readr (parse_number), dplyr (join + summarise + arrange), tibble. Recurs across every real ETL job.
What to do next
After 50 problems, the wrangling lifecycle should feel automatic. Natural follow-ups:
- Single-package hubs: dplyr-Exercises, ggplot2-Exercises, tidyverse-Exercises (shipped) for deeper drilling.
- Data-Cleaning-Exercises (coming): focused on dirty-data patterns.
- EDA-Exercises (coming): wrangling plus the analysis layer that follows it.