Data Cleaning Exercises in R: 50 Real Practice Problems

Fifty practice problems on data cleaning in R: handling missing values, duplicates, type coercion, outliers, validation, and end-to-end cleanup pipelines.

Run this once before any exercise:
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(tibble)
library(readr)
library(ggplot2)  # provides the diamonds dataset used in Sections 2 and 5

  

Section 1. Missing values (8 problems)

Exercise 1.1: Total NAs

Difficulty: Beginner.

Show solution
sum(is.na(airquality))

  

Exercise 1.2: NAs per column

Difficulty: Intermediate.

Show solution
airquality |> summarise(across(everything(), ~ sum(is.na(.x))))

  

Exercise 1.3: Drop rows with any NA

Difficulty: Beginner.

Show solution
drop_na(airquality)

  

Exercise 1.4: Drop rows with NA in target

Difficulty: Intermediate.

Show solution
drop_na(airquality, Ozone)

  

Exercise 1.5: Replace NA with 0

Difficulty: Beginner.

Show solution
airquality |> mutate(Ozone = replace_na(Ozone, 0))

  

Exercise 1.6: Mean impute

Difficulty: Intermediate.

Show solution
# Ozone is stored as integer; as.numeric() keeps both branches of if_else() the same type
airquality |>
  mutate(Ozone = if_else(is.na(Ozone), mean(Ozone, na.rm = TRUE), as.numeric(Ozone)))
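
Mean imputation leaves the column mean unchanged but shrinks its spread, because every filled value sits exactly at the centre. The following is a minimal sketch of that check on the built-in airquality data; the names `ozone_raw` and `ozone_filled` are illustrative and not part of the exercise.

```r
# Compare Ozone before and after mean imputation.
ozone_raw <- as.numeric(airquality$Ozone)  # integer column, converted to double
ozone_filled <- ifelse(is.na(ozone_raw),
                       mean(ozone_raw, na.rm = TRUE),
                       ozone_raw)

mean_before <- mean(ozone_raw, na.rm = TRUE)
mean_after  <- mean(ozone_filled)
sd_before   <- sd(ozone_raw, na.rm = TRUE)
sd_after    <- sd(ozone_filled)

# The two means agree; the standard deviation after imputation is smaller.
c(mean_before = mean_before, mean_after = mean_after,
  sd_before = sd_before, sd_after = sd_after)
```

If the reduced spread matters for later modelling, the per-group median in Exercise 1.7 is one way to soften the effect.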

  

Exercise 1.7: Median impute per group

Difficulty: Advanced.

Show solution
airquality |>
  group_by(Month) |>
  mutate(
    Ozone = as.numeric(Ozone),  # median() of integers may return a double
    Ozone = if_else(is.na(Ozone), median(Ozone, na.rm = TRUE), Ozone)
  ) |>
  ungroup()

  

Exercise 1.8: Forward fill

Difficulty: Intermediate.

Show solution
tibble(x = c(1, NA, NA, 4, NA, 6)) |> fill(x)
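
By default `fill()` carries values downward, so a missing value at the very top stays missing. The sketch below contrasts the default with `.direction = "downup"`, which fills down first and then up for anything still missing. The `gaps` tibble is made up for illustration.

```r
library(tidyr)
library(tibble)

# Illustrative data: note the leading NA.
gaps <- tibble(x = c(NA, 1, NA, NA, 4, NA))

filled_down   <- fill(gaps, x, .direction = "down")    # leading NA has nothing above it
filled_downup <- fill(gaps, x, .direction = "downup")  # then fills upward for what is left

filled_down
filled_downup
```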

  

Section 2. Duplicates (6 problems)

Exercise 2.1: Count duplicate rows

Difficulty: Beginner.

Show solution
# diamonds ships with ggplot2, so reference it explicitly
sum(duplicated(ggplot2::diamonds))

  

Exercise 2.2: Drop full duplicates

Difficulty: Beginner.

Show solution
# diamonds ships with ggplot2, so reference it explicitly
ggplot2::diamonds |> distinct()

  

Exercise 2.3: Drop dupes by key

Difficulty: Intermediate.

Show solution
df <- tibble(email = c("a@x", "b@x", "a@x"), name = c("A", "B", "A2"))
df |> distinct(email, .keep_all = TRUE)  # keeps the first row seen for each email

  

Exercise 2.4: Detect duplicate keys

Difficulty: Intermediate.

Show solution
df <- tibble(email = c("a@x", "b@x", "a@x", "c@x"), n = 1:4)
df |> group_by(email) |> filter(n() > 1) |> ungroup()

  

Exercise 2.5: Dedupe with priority rule

Difficulty: Advanced. Keep most recent per email.

Show solution
df <- tibble(
  email = c("a@x", "b@x", "a@x"),
  date = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01"))
)
# Sort newest first so distinct() keeps the most recent row per email
df |> arrange(desc(date)) |> distinct(email, .keep_all = TRUE)
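
An alternative that states the priority rule more directly is a grouped `slice_max()`. This is a sketch of one possible approach, not necessarily the intended solution; `latest` is an illustrative name.

```r
library(dplyr)
library(tibble)

df <- tibble(
  email = c("a@x", "b@x", "a@x"),
  date  = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01"))
)

latest <- df |>
  group_by(email) |>
  slice_max(date, n = 1, with_ties = FALSE) |>  # exactly one row per email, even if dates tie
  ungroup()

latest
```

Unlike the sort-then-`distinct()` version, this returns rows ordered by group rather than by date.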

  

Exercise 2.6: Fuzzy duplicates by normalized key

Difficulty: Advanced.

Show solution
df <- tibble(name = c(" Alice ", "alice", "BOB", "bob "))
df |>
  mutate(key = str_to_lower(str_trim(name))) |>
  distinct(key, .keep_all = TRUE)

  

Section 3. Type coercion (8 problems)

Exercise 3.1: Character to numeric

Difficulty: Beginner.

Show solution
as.numeric(c("1.5","2.7","3"))

  

Exercise 3.2: Strip currency before parsing

Difficulty: Intermediate.

Show solution
readr::parse_number("$1,234.50")

  

Exercise 3.3: Logical from yes/no

Difficulty: Intermediate.

Show solution
v <- c("yes", "no", "y", "n")
v %in% c("yes", "y")  # any other value, including NA, becomes FALSE
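
The `%in%` test treats every unrecognised value as FALSE, which can hide bad input. A hedged sketch of a stricter mapping follows; `to_logical()` is a hypothetical helper, not a function from any package, and it returns NA for anything that is neither a yes-like nor a no-like value.

```r
library(dplyr)
library(stringr)

# Hypothetical helper: yes/y -> TRUE, no/n -> FALSE, everything else -> NA.
to_logical <- function(x) {
  key <- str_to_lower(str_trim(x))
  case_when(
    key %in% c("yes", "y") ~ TRUE,
    key %in% c("no", "n")  ~ FALSE,
    TRUE                   ~ NA
  )
}

answers <- to_logical(c("Yes", " n", "maybe", NA))
answers
```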

  

Exercise 3.4: Date from string

Difficulty: Intermediate.

Show solution
as.Date(c("2024-01-15","2024-02-20"))

  

Exercise 3.5: Mixed-format dates

Difficulty: Advanced.

Show solution
parse_date_time(c("2024-01-15","01/15/2024"), orders = c("ymd","mdy"))

  

Exercise 3.6: Factor from character

Difficulty: Beginner.

Show solution
factor(c("low","high","med"), levels = c("low","med","high"))

  

Exercise 3.7: Cleanup a column with mixed garbage

Difficulty: Advanced.

Show solution
v <- c("1", "2.5", "abc", "NA", "")
suppressWarnings(as.numeric(v))  # non-numeric entries become NA

  

Exercise 3.8: Coerce all numeric-like in a tibble

Difficulty: Advanced.

Show solution
df <- tibble(a = c("1", "2", "3"), b = c("x", "y", "z"), c = c("1.5", "2.5", "NA"))
df |> mutate(across(c(a, c), ~ suppressWarnings(as.numeric(.x))))
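
The solution above names the numeric-like columns by hand. When the columns are not known in advance, one option is to detect them with a predicate. The sketch below assumes that "numeric-like" means every entry other than "", "NA", or a true NA parses with `as.numeric()`; `is_numeric_like()` is a hypothetical helper written for this example.

```r
library(dplyr)
library(tibble)

# Hypothetical predicate: TRUE when a character column holds only parseable numbers
# (ignoring "", "NA", and missing values).
is_numeric_like <- function(x) {
  if (!is.character(x)) return(FALSE)
  x <- x[!is.na(x) & !x %in% c("", "NA")]
  length(x) > 0 && !anyNA(suppressWarnings(as.numeric(x)))
}

df <- tibble(a = c("1", "2", "3"), b = c("x", "y", "z"), c = c("1.5", "2.5", "NA"))

converted <- df |>
  mutate(across(where(is_numeric_like), ~ suppressWarnings(as.numeric(.x))))

converted
```

Column `b` is left as character because none of its entries parse as numbers.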

  

Section 4. Strings cleanup (8 problems)

Exercise 4.1: Trim whitespace

Difficulty: Beginner.

Show solution
str_trim(c(" Alice ","Bob "))

  

Exercise 4.2: Squish multiple spaces

Difficulty: Intermediate.

Show solution
# Runs of internal spaces collapse to one; leading and trailing spaces are removed
str_squish("  hello    world  ")

  

Exercise 4.3: Standardize case

Difficulty: Beginner.

Show solution
str_to_lower(c("ALICE","Bob","carol"))

  

Exercise 4.4: Remove punctuation

Difficulty: Intermediate.

Show solution
str_replace_all("Hello, world!", "[[:punct:]]", "")

  

Exercise 4.5: Standardize categorical

Difficulty: Intermediate. Map "USA","us","United States" -> "US".

Show solution
v <- c("USA", "us", "United States", "Canada")
case_when(
  v %in% c("USA", "us", "United States", "U.S.A.") ~ "US",
  TRUE ~ v
)

  

Exercise 4.6: Remove stopwords (basic)

Difficulty: Advanced.

Show solution
# Named stopwords rather than stop to avoid shadowing base::stop()
stopwords <- c("the", "a", "is", "to", "and")
clean <- function(s) {
  words <- str_split(s, " ", simplify = TRUE)
  paste(words[!words %in% stopwords], collapse = " ")
}
clean("the cat is on the mat")

  

Exercise 4.7: Detect non-ASCII

Difficulty: Advanced.

Show solution
str_detect("café", "[^[:ascii:]]")

  

Exercise 4.8: Normalize encoding

Difficulty: Advanced.

Show solution
# The exact ASCII substitute depends on the platform's iconv implementation
iconv("café", from = "UTF-8", to = "ASCII//TRANSLIT")

  

Section 5. Outliers (6 problems)

Exercise 5.1: IQR rule

Difficulty: Intermediate.

Show solution
mtcars |> mutate(out = {
  q <- quantile(mpg, c(0.25, 0.75))
  mpg < q[1] - 1.5 * IQR(mpg) | mpg > q[2] + 1.5 * IQR(mpg)
})
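
The same fence arithmetic appears again in Exercise 5.6, so it can be worth factoring out. The following is a minimal sketch; `iqr_fences()` and its `k` parameter are hypothetical names chosen for this example.

```r
# Hypothetical helper: lower and upper fences at k * IQR beyond the quartiles.
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

fences  <- iqr_fences(mtcars$mpg)
flagged <- mtcars$mpg < fences["lower"] | mtcars$mpg > fences["upper"]

fences
table(flagged)
```

Passing a larger `k` (for example 3) gives a looser rule that flags only extreme points.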

  

Exercise 5.2: Z-score rule

Difficulty: Intermediate.

Show solution
mtcars |> mutate(z = scale(mpg)[,1], out = abs(z) > 3)

  

Exercise 5.3: Per-group outliers

Difficulty: Advanced.

Show solution
mtcars |> group_by(cyl) |> mutate(z = scale(mpg)[,1], out = abs(z) > 2) |> ungroup()

  

Exercise 5.4: Winsorize 5/95

Difficulty: Intermediate.

Show solution
q <- quantile(mtcars$mpg, c(0.05, 0.95))
mtcars |> mutate(mpg = pmin(pmax(mpg, q[1]), q[2]))  # clamp to the 5th-95th percentile range

  

Exercise 5.5: Cap at 99th percentile

Difficulty: Intermediate.

Show solution
# diamonds ships with ggplot2, so reference it explicitly
cap <- quantile(ggplot2::diamonds$price, 0.99)
ggplot2::diamonds |> mutate(price = pmin(price, cap))

  

Exercise 5.6: Drop outliers in target column

Difficulty: Advanced.

Show solution
mtcars |> filter({
  q <- quantile(mpg, c(0.25, 0.75))
  mpg >= q[1] - 1.5 * IQR(mpg) & mpg <= q[2] + 1.5 * IQR(mpg)
})

  

Section 6. Validation (6 problems)

Exercise 6.1: Range check

Difficulty: Beginner. Age 0-120.

Show solution
df <- tibble(age = c(25, -5, 130, 40))
df |> mutate(valid_age = age >= 0 & age <= 120)

  

Exercise 6.2: Email contains "@"

Difficulty: Beginner.

Show solution
df <- tibble(email = c("a@x.com", "not_an_email", "b@y.com"))
df |> mutate(valid = str_detect(email, "@"))

  

Exercise 6.3: Multi-rule validation

Difficulty: Intermediate.

Show solution
df <- tibble(age = c(25, -5, 30), email = c("a@x", "b", "c@y"))
df |> mutate(valid = age >= 0 & age <= 120 & str_detect(email, "@"))

  

Exercise 6.4: Required-non-NA check

Difficulty: Intermediate.

Show solution
df <- tibble(id = c(1, 2, NA), name = c("A", "B", "C"))
df |> mutate(valid = !is.na(id))

  

Exercise 6.5: Cross-column rule

Difficulty: Advanced. start <= end.

Show solution
df <- tibble(
  start = as.Date(c("2024-01-01", "2024-03-01")),
  end = as.Date(c("2024-02-01", "2024-02-15"))
)
df |> mutate(valid = start <= end)

  

Exercise 6.6: Schema-style validation report

Difficulty: Advanced.

Show solution
df <- tibble(age = c(25, -5, 130), email = c("a@x", "b", "c@y"))
report <- df |>
  mutate(
    invalid_age = age < 0 | age > 120,
    invalid_email = !str_detect(email, "@")
  ) |>
  filter(invalid_age | invalid_email)
report
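
A report with one row per failed rule is often easier to summarise than one flag column per rule. The sketch below is one possible layout, not a standard schema format; the rule names `age_in_range` and `email_has_at` and the `row` column are made up for this example.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

df <- tibble(age = c(25, -5, 130), email = c("a@x", "b", "c@y"))

failures <- df |>
  mutate(
    row = row_number(),                      # remember the original row position
    age_in_range = age >= 0 & age <= 120,
    email_has_at = str_detect(email, "@")
  ) |>
  pivot_longer(c(age_in_range, email_has_at),
               names_to = "rule", values_to = "passed") |>
  filter(!passed) |>
  select(row, rule)

failures
count(failures, rule)  # failures per rule
```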

  

Section 7. End-to-end cleaning (8 problems)

Exercise 7.1: Clean phone numbers

Difficulty: Intermediate.

Show solution
phones <- c("(415) 555-1234", "415.555.1234", "415 555 1234")
str_replace_all(phones, "\\D", "")  # drop every non-digit character

  

Exercise 7.2: Standardize country names

Difficulty: Intermediate.

Show solution
v <- c("USA", "us", "United States", "UK", "United Kingdom")
case_when(
  v %in% c("USA", "us", "United States") ~ "US",
  v %in% c("UK", "United Kingdom") ~ "GB",
  TRUE ~ v
)

  

Exercise 7.3: Parse currency strings

Difficulty: Intermediate.

Show solution
readr::parse_number(c("$1,234.50","€999.99","£12.34"))

  

Exercise 7.4: Pivot then clean

Difficulty: Advanced. Wide -> long -> drop NAs.

Show solution
wide <- tibble(id = 1:2, a = c(1, NA), b = c(2, 3))
wide |> pivot_longer(-id, values_drop_na = TRUE)

  

Exercise 7.5: Trim and lowercase a key column

Difficulty: Beginner.

Show solution
df <- tibble(name = c(" Alice ", "BOB", "carol"))
df |> mutate(name = str_to_lower(str_trim(name)))

  

Exercise 7.6: Multi-step pipeline

Difficulty: Advanced.

Show solution
raw <- tibble(
  name = c(" Alice ", "BOB", "alice"),
  date = c("01/15/2024", "02/20/2024", "03/05/2024"),
  amount = c("$50", "$80", "$30")
)
raw |>
  mutate(
    name = str_to_lower(str_trim(name)),
    date = mdy(date),
    amount = readr::parse_number(amount)
  ) |>
  distinct(name, .keep_all = TRUE)

  

Exercise 7.7: Validate then split valid/invalid

Difficulty: Advanced.

Show solution
df <- tibble(age = c(25, -5, 30, 200), email = c("a@x", "b", "c@y", "d@z"))
df <- df |> mutate(valid = age >= 0 & age <= 120 & str_detect(email, "@"))
list(valid = filter(df, valid), invalid = filter(df, !valid))

  

Exercise 7.8: Reusable cleaning function

Difficulty: Advanced.

Show solution
clean_text <- function(x) {
  x |> str_trim() |> str_squish() |> str_to_lower()
}
clean_text(c(" Alice ", "BOB ", " carol "))
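
A function like this pays off when it is applied to every text column at once. The following sketch does that with `across(where(is.character), ...)`; the `people` tibble is invented for illustration, and `clean_text()` is repeated so the block runs on its own.

```r
library(dplyr)
library(stringr)
library(tibble)

clean_text <- function(x) {
  x |> str_trim() |> str_squish() |> str_to_lower()
}

# Illustrative data: two messy text columns and one numeric column.
people <- tibble(
  name = c("  Alice ", "BOB  "),
  city = c(" New   York", "PARIS "),
  age  = c(30, 41)
)

cleaned <- people |>
  mutate(across(where(is.character), clean_text))  # numeric columns are left untouched

cleaned
```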

  

What to do next

  • Data-Wrangling-Exercises: the broader wrangling lifecycle.
  • EDA-Exercises: explore the now-clean data.