Data Cleaning Exercises in R: 50 Real Practice Problems
Fifty practice problems on data cleaning in R: handling missing values, duplicates, type coercion, string cleanup, outliers, validation, and end-to-end cleanup pipelines.
By Selva Prabhakaran · Published May 11, 2026 · Last updated May 11, 2026
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(tibble)
library(readr)
library(ggplot2)  # provides the diamonds dataset used in Sections 2 and 5
Section 1. Missing values (8 problems)
Exercise 1.1: Total NAs
Difficulty: Beginner.
Solution:
sum(is.na(airquality))
Exercise 1.2: NAs per column
Difficulty: Intermediate.
Solution:
airquality |>
  summarise(across(everything(), ~ sum(is.na(.x))))
Exercise 1.3: Drop rows with any NA
Difficulty: Beginner.
Solution:
drop_na(airquality)
Exercise 1.4: Drop rows with NA in target
Difficulty: Intermediate.
Solution:
drop_na(airquality, Ozone)
Exercise 1.5: Replace NA with 0
Difficulty: Beginner.
Solution:
airquality |> mutate(Ozone = replace_na(Ozone, 0))
Exercise 1.6: Mean impute
Difficulty: Intermediate.
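Solution:
# Ozone is stored as integer and the mean is a double, so cast explicitly
# to keep both branches of if_else() the same type.
airquality |>
  mutate(Ozone = if_else(is.na(Ozone), mean(Ozone, na.rm = TRUE), as.numeric(Ozone)))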
Exercise 1.7: Median impute per group
Difficulty: Advanced.
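Solution:
# median() of an integer vector can return integer or double depending on
# group size, so cast both branches to numeric for a consistent column type.
airquality |>
  group_by(Month) |>
  mutate(Ozone = if_else(is.na(Ozone),
                         as.numeric(median(Ozone, na.rm = TRUE)),
                         as.numeric(Ozone))) |>
  ungroup()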
Exercise 1.8: Forward fill
Difficulty: Intermediate.
Solution:
tibble(x = c(1, NA, NA, 4, NA, 6)) |> fill(x)
Section 2. Duplicates (6 problems)
Exercise 2.1: Count duplicate rows
Difficulty: Beginner.
Solution:
sum(duplicated(diamonds))
Exercise 2.2: Drop full duplicates
Difficulty: Beginner.
Solution:
diamonds |> distinct()
Exercise 2.3: Drop dupes by key
Difficulty: Intermediate.
Solution:
df <- tibble(email = c("a@x", "b@x", "a@x"), name = c("A", "B", "A2"))
df |> distinct(email, .keep_all = TRUE)
Exercise 2.4: Detect duplicate keys
Difficulty: Intermediate.
Solution:
df <- tibble(email = c("a@x", "b@x", "a@x", "c@x"), n = 1:4)
df |> group_by(email) |> filter(n() > 1)
Exercise 2.5: Dedupe with priority rule
Difficulty: Advanced. Keep most recent per email.
Solution:
df <- tibble(email = c("a@x", "b@x", "a@x"),
             date = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")))
# Sort newest first; distinct() then keeps the first (most recent) row per email.
df |> arrange(desc(date)) |> distinct(email, .keep_all = TRUE)
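An alternative for the "keep most recent per email" rule is a grouped `slice_max()`, which states the priority directly instead of relying on row order. This is only a sketch on the same toy data; `with_ties = FALSE` is an assumption that exactly one row per email is wanted even when dates tie, and `latest` is a name introduced here for illustration.

```r
library(dplyr)
library(tibble)

df <- tibble(
  email = c("a@x", "b@x", "a@x"),
  date  = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01"))
)

# Within each email, keep the single row with the latest date.
latest <- df |>
  group_by(email) |>
  slice_max(date, n = 1, with_ties = FALSE) |>
  ungroup()

latest
```

Either approach should give one row per email; the grouped version may be easier to read when the priority rule involves more than one column.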
Exercise 2.6: Fuzzy duplicates by normalized key
Difficulty: Advanced.
Solution:
df <- tibble(name = c(" Alice ", "alice", "BOB", "bob "))
df |>
  mutate(key = str_to_lower(str_trim(name))) |>
  distinct(key, .keep_all = TRUE)
Section 3. Type coercion (8 problems)
Exercise 3.1: Character to numeric
Difficulty: Beginner.
Solution:
as.numeric(c("1.5", "2.7", "3"))
Exercise 3.2: Strip currency before parsing
Difficulty: Intermediate.
Solution:
readr::parse_number("$1,234.50")
Exercise 3.3: Logical from yes/no
Difficulty: Intermediate.
Solution:
v <- c("yes", "no", "y", "n")
v %in% c("yes", "y")
Exercise 3.4: Date from string
Difficulty: Intermediate.
Solution:
as.Date(c("2024-01-15", "2024-02-20"))
Exercise 3.5: Mixed-format dates
Difficulty: Advanced.
Solution:
parse_date_time(c("2024-01-15", "01/15/2024"), orders = c("ymd", "mdy"))
Exercise 3.6: Factor from character
Difficulty: Beginner.
Solution:
factor(c("low", "high", "med"), levels = c("low", "med", "high"))
Exercise 3.7: Cleanup a column with mixed garbage
Difficulty: Advanced.
Solution:
v <- c("1", "2.5", "abc", "NA", "")
suppressWarnings(as.numeric(v))  # entries that cannot be parsed become NA
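Suppressing the warning also hides which entries failed to parse. The sketch below separates genuinely missing inputs from parse failures, under the assumption that empty strings and the literal "NA" count as missing; `is_blank` and `failed` are names introduced here for illustration.

```r
v <- c("1", "2.5", "abc", "NA", "")

parsed <- suppressWarnings(as.numeric(v))

# Assumption: "" and "NA" mean "no value supplied", not "bad value".
is_blank <- is.na(v) | v %in% c("", "NA")

# A parse failure is an NA result that did not start out blank.
failed <- is.na(parsed) & !is_blank

v[failed]
```

Inspecting `v[failed]` before discarding anything makes it easier to spot systematic problems such as stray units or thousands separators.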
Exercise 3.8: Coerce all numeric-like in a tibble
Difficulty: Advanced.
Solution:
df <- tibble(a = c("1", "2", "3"), b = c("x", "y", "z"), c = c("1.5", "2.5", "NA"))
df |> mutate(across(c(a, c), ~ suppressWarnings(as.numeric(.x))))
Section 4. Strings cleanup (8 problems)
Exercise 4.1: Trim whitespace
Difficulty: Beginner.
Solution:
str_trim(c(" Alice ", "Bob "))
Exercise 4.2: Squish multiple spaces
Difficulty: Intermediate.
Solution:
# Repeated inner spaces collapse to one; leading and trailing spaces are removed.
str_squish("  hello    world  ")
Exercise 4.3: Standardize case
Difficulty: Beginner.
Solution:
str_to_lower(c("ALICE", "Bob", "carol"))
Exercise 4.4: Remove punctuation
Difficulty: Intermediate.
Solution:
str_replace_all("Hello, world!", "[[:punct:]]", "")
Exercise 4.5: Standardize categorical
Difficulty: Intermediate. Map "USA","us","United States" -> "US".
Solution:
v <- c("USA", "us", "United States", "Canada")
case_when(v %in% c("USA", "us", "United States", "U.S.A.") ~ "US",
          TRUE ~ v)
Exercise 4.6: Remove stopwords (basic)
Difficulty: Advanced.
Solution:
# Named stopwords rather than stop to avoid masking base::stop().
stopwords <- c("the", "a", "is", "to", "and")
clean <- function(s) {
  words <- str_split(s, " ", simplify = TRUE)
  paste(words[!words %in% stopwords], collapse = " ")
}
clean("the cat is on the mat")
Exercise 4.7: Detect non-ASCII
Difficulty: Advanced.
Solution:
str_detect("café", "[^[:ascii:]]")
Exercise 4.8: Normalize encoding
Difficulty: Advanced.
Solution:
# The //TRANSLIT result is platform-dependent, so check the output on your system.
iconv("café", from = "UTF-8", to = "ASCII//TRANSLIT")
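Because `iconv()` transliteration varies across operating systems, one commonly used alternative is `stri_trans_general()` from the stringi package with the "Latin-ASCII" transform. This is a sketch that assumes stringi is installed (it is a dependency of stringr); `ascii` is a name introduced here for illustration.

```r
library(stringi)

# "café", written with a Unicode escape so the example does not depend on file encoding.
x <- "caf\u00e9"

# Strip accents by transliterating Latin characters to their ASCII equivalents.
ascii <- stri_trans_general(x, "Latin-ASCII")
ascii
```

This approach relies on ICU rather than the system's iconv implementation, which tends to make results more consistent between machines.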
Section 5. Outliers (6 problems)
Exercise 5.1: IQR rule
Difficulty: Intermediate.
Solution:
mtcars |>
  mutate(out = {
    q <- quantile(mpg, c(0.25, 0.75))
    mpg < q[1] - 1.5 * IQR(mpg) | mpg > q[2] + 1.5 * IQR(mpg)
  })
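Since the same IQR fences reappear in Exercise 5.6, the rule can be wrapped in a small helper. This is a sketch rather than a standard function: `is_iqr_outlier` and its multiplier argument `k` are hypothetical names, and quantiles use R's default method.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR].
is_iqr_outlier <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

x <- c(1, 2, 3, 4, 100)
is_iqr_outlier(x)
```

Inside a pipeline this could be used as `mutate(out = is_iqr_outlier(mpg))` to flag rows, or `filter(!is_iqr_outlier(mpg))` to drop them.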
Exercise 5.2: Z-score rule
Difficulty: Intermediate.
Solution:
mtcars |> mutate(z = scale(mpg)[, 1], out = abs(z) > 3)
Exercise 5.3: Per-group outliers
Difficulty: Advanced.
Solution:
mtcars |>
  group_by(cyl) |>
  mutate(z = scale(mpg)[, 1], out = abs(z) > 2) |>
  ungroup()
Exercise 5.4: Winsorize 5/95
Difficulty: Intermediate.
Solution:
q <- quantile(mtcars$mpg, c(0.05, 0.95))
mtcars |> mutate(mpg = pmin(pmax(mpg, q[1]), q[2]))
Exercise 5.5: Cap at 99th percentile
Difficulty: Intermediate.
Solution:
cap <- quantile(diamonds$price, 0.99)
diamonds |> mutate(price = pmin(price, cap))
Exercise 5.6: Drop outliers in target column
Difficulty: Advanced.
Solution:
mtcars |> filter({
  q <- quantile(mpg, c(0.25, 0.75))
  mpg >= q[1] - 1.5 * IQR(mpg) & mpg <= q[2] + 1.5 * IQR(mpg)
})
Section 6. Validation (6 problems)
Exercise 6.1: Range check
Difficulty: Beginner. Age 0-120.
Solution:
df <- tibble(age = c(25, -5, 130, 40))
df |> mutate(valid_age = age >= 0 & age <= 120)
Exercise 6.2: Email contains "@"
Difficulty: Beginner.
Solution:
df <- tibble(email = c("a@x.com", "not_an_email", "b@y.com"))
df |> mutate(valid = str_detect(email, "@"))
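Checking only for "@" accepts strings such as "a@@x.com". A slightly stricter sketch is shown below; the pattern is an assumption chosen for illustration (one "@", no whitespace, and a dot in the domain part), not a full address validator, and `email_pattern` is a name introduced here.

```r
library(stringr)

emails <- c("a@x.com", "not_an_email", "b@y.com", "a@@x.com", "a@x")

# local part, a single "@", then a domain containing at least one dot
email_pattern <- "^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$"

valid <- str_detect(emails, email_pattern)
valid
```

How strict to be depends on the data: the looser "@" check suits toy addresses like "a@x", while a pattern like this one rejects them.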
Exercise 6.3: Multi-rule validation
Difficulty: Intermediate.
Solution:
df <- tibble(age = c(25, -5, 30), email = c("a@x", "b", "c@y"))
df |>
  mutate(valid = age >= 0 & age <= 120 & str_detect(email, "@"))
Exercise 6.4: Required-non-NA check
Difficulty: Intermediate.
Solution:
df <- tibble(id = c(1, 2, NA), name = c("A", "B", "C"))
df |> mutate(valid = !is.na(id))
Exercise 6.5: Cross-column rule
Difficulty: Advanced. start <= end.
Solution:
df <- tibble(start = as.Date(c("2024-01-01", "2024-03-01")),
             end = as.Date(c("2024-02-01", "2024-02-15")))
df |> mutate(valid = start <= end)
Exercise 6.6: Schema-style validation report
Difficulty: Advanced.
Solution:
df <- tibble(age = c(25, -5, 130), email = c("a@x", "b", "c@y"))
report <- df |>
  mutate(invalid_age = age < 0 | age > 120,
         invalid_email = !str_detect(email, "@")) |>
  filter(invalid_age | invalid_email)
report
Section 7. End-to-end cleaning (8 problems)
Exercise 7.1: Clean phone numbers
Difficulty: Intermediate.
Solution:
phones <- c("(415) 555-1234", "415.555.1234", "415 555 1234")
str_replace_all(phones, "\\D", "")
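Stripping non-digits does not confirm that a number is complete. As a follow-up sketch, assuming 10-digit North American numbers (an assumption not stated in the exercise), the digits can be checked for length and rewritten in one canonical format; `digits` and `formatted` are names introduced here.

```r
library(stringr)

phones <- c("(415) 555-1234", "415.555.1234", "555-1234")
digits <- str_replace_all(phones, "\\D", "")

# Keep only 10-digit numbers, formatted as NNN-NNN-NNNN; anything else becomes NA.
formatted <- ifelse(
  str_length(digits) == 10,
  str_replace(digits, "^(\\d{3})(\\d{3})(\\d{4})$", "\\1-\\2-\\3"),
  NA_character_
)
formatted
```

Returning `NA` for incomplete numbers keeps them visible for review instead of silently passing them through.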
Exercise 7.2: Standardize country names
Difficulty: Intermediate.
Solution:
v <- c("USA", "us", "United States", "UK", "United Kingdom")
case_when(v %in% c("USA", "us", "United States") ~ "US",
          v %in% c("UK", "United Kingdom") ~ "GB",
          TRUE ~ v)
Exercise 7.3: Parse currency strings
Difficulty: Intermediate.
Solution:
readr::parse_number(c("$1,234.50", "€999.99", "£12.34"))
Exercise 7.4: Pivot then clean
Difficulty: Advanced. Wide -> long -> drop NAs.
Solution:
wide <- tibble(id = 1:2, a = c(1, NA), b = c(2, 3))
wide |> pivot_longer(-id, values_drop_na = TRUE)
Exercise 7.5: Trim and lowercase a key column
Difficulty: Beginner.
Solution:
df <- tibble(name = c(" Alice ", "BOB", "carol"))
df |> mutate(name = str_to_lower(str_trim(name)))
Exercise 7.6: Multi-step pipeline
Difficulty: Advanced.
Solution:
raw <- tibble(name = c(" Alice ", "BOB", "alice"),
              date = c("01/15/2024", "02/20/2024", "03/05/2024"),
              amount = c("$50", "$80", "$30"))
raw |>
  mutate(name = str_to_lower(str_trim(name)),
         date = mdy(date),
         amount = readr::parse_number(amount)) |>
  distinct(name, .keep_all = TRUE)
Exercise 7.7: Validate then split valid/invalid
Difficulty: Advanced.
Solution:
df <- tibble(age = c(25, -5, 30, 200), email = c("a@x", "b", "c@y", "d@z"))
df <- df |> mutate(valid = age >= 0 & age <= 120 & str_detect(email, "@"))
list(valid = filter(df, valid), invalid = filter(df, !valid))
Exercise 7.8: Reusable cleaning function
Difficulty: Advanced.
Solution:
clean_text <- function(x) {
  # str_squish() trims both ends and collapses inner whitespace,
  # so a separate str_trim() step is not needed.
  x |> str_squish() |> str_to_lower()
}
clean_text(c(" Alice ", "BOB ", " carol "))
What to do next
Data-Wrangling-Exercises: covers the broader wrangling lifecycle.
EDA-Exercises: explore the now-clean data.