EDA Exercises in R: 50 Real Practice Problems

Fifty exploratory data analysis exercises spanning inspection, distributions, missing values, outliers, relationships, and full EDA workflows. Hidden solutions, runnable code.

Run this once before any exercise
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)
library(e1071)
library(ggridges)
library(naniar)
# Exercise 5.6 calls GGally::ggpairs, so the GGally package must also be installed.

  

Section 1. Data inspection (8 problems)

Exercise 1.1: Dimensions

Difficulty: Beginner. Get the row and column count of airquality.

Show solution
dim(airquality)

  

Exercise 1.2: Glimpse columns

Difficulty: Beginner. Inspect column types and sample values of mtcars with glimpse.

Show solution
glimpse(mtcars)

  

Exercise 1.3: Summary statistics

Difficulty: Beginner. Print summary of airquality and identify the column with the most NAs.

Show solution
summary(airquality) # Ozone has the most NAs (37)
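To confirm the count in that comment without reading it off the printed summary, one option is to tally NAs per column directly. This is a small base-R sketch, not part of the original solution; the name na_counts is only illustrative.

```r
# Count missing values in each column of airquality, largest first
na_counts <- sort(colSums(is.na(airquality)), decreasing = TRUE)
na_counts

# Name of the column with the most NAs
names(na_counts)[1]
```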

  

Exercise 1.4: First 10 rows

Difficulty: Beginner. Inspect the first 10 rows of diamonds.

Show solution
head(diamonds, 10)

  

Exercise 1.5: Class of each column

Difficulty: Intermediate. Get the class of each column of iris as a named vector.

Show solution
sapply(iris, class)

  

Exercise 1.6: Number of distinct values per column

Difficulty: Intermediate. For diamonds, count distinct values per column.

Show solution
diamonds |>
  summarise(across(everything(), n_distinct)) |>
  pivot_longer(everything())

  

Exercise 1.7: Range of each numeric column

Difficulty: Intermediate. Get min and max for each numeric column of iris.

Show solution
iris |>
  summarise(across(
    where(is.numeric),
    list(min = min, max = max),
    .names = "{.col}_{.fn}"
  ))

  

Exercise 1.8: Build a one-shot data profile

Difficulty: Advanced. For airquality, return a tibble with column name, class, NA count, distinct count.

Show solution
tibble(
  column = names(airquality),
  class = sapply(airquality, function(x) class(x)[1]),
  n_na = sapply(airquality, function(x) sum(is.na(x))),
  n_distinct = sapply(airquality, n_distinct)
)
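If the same profile is needed for more than one dataset, the solution can be wrapped in a function. The sketch below uses base R only; profile_df is a hypothetical helper name, and length(unique(x)) stands in for n_distinct (both count NA as a distinct value).

```r
# Hypothetical helper (name is illustrative): one-row-per-column profile
profile_df <- function(df) {
  data.frame(
    column     = names(df),
    class      = vapply(df, function(x) class(x)[1], character(1)),
    n_na       = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_distinct = vapply(df, function(x) length(unique(x)), integer(1)),
    row.names  = NULL
  )
}

profile_df(airquality)
```

vapply is used instead of sapply so that each summary is guaranteed to come back as a vector of the stated type.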

  

Section 2. Distributions (10 problems)

Exercise 2.1: Histogram of mpg

Difficulty: Beginner. Histogram of mtcars$mpg with 15 bins.

Show solution
ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 15)

  

Exercise 2.2: Density curve

Difficulty: Beginner. Density curve of diamonds$price.

Show solution
ggplot(diamonds, aes(price)) + geom_density()

  

Exercise 2.3: Boxplot per group

Difficulty: Intermediate. Boxplot of Sepal.Length by Species.

Show solution
ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot()

  

Exercise 2.4: Overlapping densities

Difficulty: Intermediate. Overlapping density plot of Sepal.Length, colored by Species, with alpha.

Show solution
ggplot(iris, aes(Sepal.Length, fill = Species)) + geom_density(alpha = 0.5)

  

Exercise 2.5: Quintiles

Difficulty: Intermediate. Compute the 20th, 40th, 60th, 80th percentiles of mtcars$mpg.

Show solution
quantile(mtcars$mpg, c(0.2, 0.4, 0.6, 0.8))

  

Exercise 2.6: Skewness and kurtosis

Difficulty: Advanced. Compute skewness and kurtosis of diamonds$price.

Show solution
e1071::skewness(diamonds$price)
e1071::kurtosis(diamonds$price)
# Right-skewed (skewness > 0); heavy-tailed (e1071 returns excess kurtosis, so the benchmark is > 0, not > 3)
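To see what these two numbers measure, the moment-based definitions can be written out by hand. This is a sketch using the plain moment formulas (what e1071 calls type 1); moment_shape is a hypothetical name, and mtcars$mpg is used so the block runs with base R alone.

```r
# Moment-based skewness and excess kurtosis (plain "type 1" formulas)
moment_shape <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)        # deviations from the mean
  m2 <- mean(d^2)          # second central moment
  m3 <- mean(d^3)          # third central moment
  m4 <- mean(d^4)          # fourth central moment
  c(skewness = m3 / m2^1.5, excess_kurtosis = m4 / m2^2 - 3)
}

moment_shape(mtcars$mpg)
```

Subtracting 3 makes a normal distribution score 0 on kurtosis. e1071's default (type = 3) applies a small-sample adjustment to both statistics, so its values differ slightly from these on small samples.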

  

Exercise 2.7: Log-transform a skewed variable

Difficulty: Intermediate. Plot log(price) histogram and observe the difference.

Show solution
ggplot(diamonds, aes(log(price))) + geom_histogram(bins = 30)

  

Exercise 2.8: Histograms by facet

Difficulty: Intermediate. Histogram of price faceted by cut.

Show solution
ggplot(diamonds, aes(price)) + geom_histogram(bins = 30) + facet_wrap(~ cut)

  

Exercise 2.9: Empirical CDF

Difficulty: Advanced. Plot the empirical CDF of mtcars$mpg.

Show solution
ggplot(mtcars, aes(mpg)) + stat_ecdf()

  

Exercise 2.10: Compare distribution shape across groups

Difficulty: Advanced. Use ridgeline plots (ggridges) for diamond price by cut.

Show solution
ggplot(diamonds, aes(x = price, y = cut, fill = cut)) + ggridges::geom_density_ridges(alpha = 0.6)

  

Section 3. Missing data (6 problems)

Exercise 3.1: Count NAs

Difficulty: Beginner. Total NA count in airquality.

Show solution
sum(is.na(airquality))

  

Exercise 3.2: NA per column

Difficulty: Intermediate. NAs per column, sorted desc.

Show solution
airquality |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything()) |>
  arrange(desc(value))

  

Exercise 3.3: NA per row

Difficulty: Intermediate. Add a n_na column per row to airquality.

Show solution
airquality |> mutate(n_na = rowSums(is.na(across(everything()))))

  

Exercise 3.4: Drop incomplete rows

Difficulty: Beginner. Remove rows with any NA.

Show solution
drop_na(airquality)

  

Exercise 3.5: Visualize NA pattern

Difficulty: Advanced. Use naniar::vis_miss to visualize the missingness pattern.

Show solution
naniar::vis_miss(airquality)

  

Exercise 3.6: Mean impute and document

Difficulty: Intermediate. Impute Ozone NAs with the column mean and add a flag column.

Show solution
airquality |>
  mutate(
    was_na = is.na(Ozone),
    # Ozone is integer and the mean is double; as.numeric() keeps both if_else branches the same type
    Ozone_imp = if_else(is.na(Ozone), mean(Ozone, na.rm = TRUE), as.numeric(Ozone))
  )
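The flag column matters because mean imputation changes the shape of the variable: the mean is preserved but the spread shrinks, since every filled-in value sits exactly at the centre. A base-R sketch of that check, with illustrative variable names:

```r
ozone     <- airquality$Ozone
was_na    <- is.na(ozone)
ozone_imp <- ifelse(was_na, mean(ozone, na.rm = TRUE), ozone)

# The mean is unchanged by construction
c(mean_before = mean(ozone, na.rm = TRUE), mean_after = mean(ozone_imp))

# The standard deviation is smaller after imputation
c(sd_before = sd(ozone, na.rm = TRUE), sd_after = sd(ozone_imp))
```

Keeping was_na lets later steps exclude or down-weight the imputed rows.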

  

Section 4. Outliers (6 problems)

Exercise 4.1: Tukey IQR rule

Difficulty: Intermediate. Flag mpg outliers using the Tukey fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.

Show solution
mtcars |>
  mutate(out = {
    q <- quantile(mpg, c(0.25, 0.75))
    mpg < q[1] - 1.5 * IQR(mpg) | mpg > q[2] + 1.5 * IQR(mpg)
  })
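The same rule reads more clearly as a standalone function that can be reused on any numeric vector. This is a minimal sketch; is_tukey_outlier and its k argument are hypothetical names, and the quartiles come from quantile() with its default settings.

```r
# Hypothetical helper: TRUE where x falls outside the Tukey fences
is_tukey_outlier <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]                      # same value IQR(x) returns
  x < q[1] - k * iqr | x > q[2] + k * iqr
}

# Values of mpg flagged by the rule
mtcars$mpg[is_tukey_outlier(mtcars$mpg)]
```

With the default quantile type, Q1 is 15.425 and Q3 is 22.8 for mpg, so the upper fence is 33.8625 and the only flagged value is 33.9.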

  

Exercise 4.2: Z-score rule

Difficulty: Intermediate. Flag rows where |z| > 3 for mpg.

Show solution
mtcars |>
  mutate(z = (mpg - mean(mpg)) / sd(mpg), out = abs(z) > 3)
# No row is flagged here: the largest |z| for mpg is about 2.3

  

Exercise 4.3: Per-group outliers

Difficulty: Advanced. Flag mpg outliers within each cyl group.

Show solution
mtcars |>
  group_by(cyl) |>
  mutate(out = {
    q <- quantile(mpg, c(0.25, 0.75))
    mpg < q[1] - 1.5 * IQR(mpg) | mpg > q[2] + 1.5 * IQR(mpg)
  }) |>
  ungroup()

  

Exercise 4.4: Visualize outliers in a boxplot

Difficulty: Beginner. Boxplot of diamonds$price.

Show solution
ggplot(diamonds, aes(y = price)) + geom_boxplot()

  

Exercise 4.5: Winsorize

Difficulty: Intermediate. Cap mpg at the 5th and 95th percentiles.

Show solution
q <- quantile(mtcars$mpg, c(0.05, 0.95))
mtcars |>
  mutate(mpg_w = pmin(pmax(mpg, q[1]), q[2]))
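The pmin/pmax pattern clamps each value into the interval between the two cut points. A sketch of the same idea as a reusable function follows; winsorize and probs are hypothetical names chosen for illustration.

```r
# Hypothetical helper: cap x at the given lower and upper percentiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])   # raise low values to q[1], lower high values to q[2]
}

mpg_w <- winsorize(mtcars$mpg)
range(mtcars$mpg)   # original extremes
range(mpg_w)        # extremes after capping
```

Unlike trimming, winsorizing keeps every row; only the extreme values move.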

  

Exercise 4.6: Robust scale alternative

Difficulty: Advanced. Standardize using median + MAD instead of mean + sd.

Show solution
mtcars |> mutate(mpg_robust = (mpg - median(mpg)) / mad(mpg))
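One detail worth knowing about this solution: base R's mad() does not return the raw median absolute deviation. By default it multiplies by 1.4826 so that the result is comparable to a standard deviation for normally distributed data. The sketch below makes that explicit; variable names are illustrative.

```r
x <- mtcars$mpg

# Raw median absolute deviation, computed by hand
raw_mad <- median(abs(x - median(x)))

# mad() applies a default scaling constant of 1.4826
c(raw = raw_mad, scaled = 1.4826 * raw_mad, mad_default = mad(x))

# Robust z-score, as in the solution above
mpg_robust <- (x - median(x)) / mad(x)
```

Passing constant = 1 to mad() returns the unscaled value.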

  

Section 5. Relationships (10 problems)

Exercise 5.1: Pearson correlation

Difficulty: Beginner. Correlation between wt and mpg.

Show solution
cor(mtcars$wt, mtcars$mpg)

  

Exercise 5.2: Correlation matrix

Difficulty: Intermediate. Correlation matrix of mtcars (numeric).

Show solution
cor(mtcars)

  

Exercise 5.3: Visualize correlation matrix

Difficulty: Intermediate. Heatmap of the correlation matrix.

Show solution
cor(mtcars) |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "cor") |>
  ggplot(aes(var1, var2, fill = cor)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0)

  

Exercise 5.4: Spearman vs Pearson

Difficulty: Intermediate. Compare Pearson and Spearman correlation between disp and mpg.

Show solution
cor(mtcars$disp, mtcars$mpg, method = "pearson")
cor(mtcars$disp, mtcars$mpg, method = "spearman")
# Spearman captures any monotonic relationship, linear or not; Pearson measures linear association only
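The reason Spearman handles monotonic curves is that it is simply Pearson's correlation computed on ranks rather than on the raw values. A short base-R check, with illustrative variable names:

```r
# Spearman correlation as reported by cor()
rho_spearman <- cor(mtcars$disp, mtcars$mpg, method = "spearman")

# Pearson correlation of the ranks (ties get average ranks by default)
rho_on_ranks <- cor(rank(mtcars$disp), rank(mtcars$mpg))

c(spearman = rho_spearman, pearson_on_ranks = rho_on_ranks)
```

Because ranks discard spacing between values, a few extreme points also influence Spearman less than Pearson.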

  

Exercise 5.5: Scatter with smoother

Difficulty: Intermediate. Scatter wt vs mpg with linear smoother.

Show solution
ggplot(mtcars, aes(wt, mpg)) + geom_point() + geom_smooth(method = "lm")

  

Exercise 5.6: Pairs plot

Difficulty: Intermediate. Pairs plot of iris numeric columns colored by Species.

Show solution
GGally::ggpairs(iris, columns = 1:4, aes(color = Species))

  

Exercise 5.7: Categorical-categorical

Difficulty: Intermediate. Cross-tabulation of cut and clarity in diamonds.

Show solution
table(diamonds$cut, diamonds$clarity)
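Raw counts are hard to compare when the groups differ in size, so a common next step is to convert the table to row proportions with prop.table. The sketch below uses mtcars (cyl by gear) instead of diamonds purely so that it runs with base R alone; the same call works on the cut by clarity table.

```r
# Cross-tabulate two categorical-like columns
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)

# margin = 1 divides each row by its row total, so every row sums to 1
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 2)
```

Using margin = 2 instead gives column proportions.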

  

Exercise 5.8: Categorical-numeric

Difficulty: Intermediate. Mean price per cut (categorical-numeric exploration).

Show solution
diamonds |> group_by(cut) |> summarise(mean_price = mean(price))

  

Exercise 5.9: Conditional density

Difficulty: Advanced. Density of mpg conditional on factor(cyl).

Show solution
ggplot(mtcars, aes(mpg, fill = factor(cyl))) + geom_density(alpha = 0.5)

  

Exercise 5.10: Mosaic plot

Difficulty: Advanced. Mosaic plot of cut x clarity proportions.

Show solution
table(diamonds$cut, diamonds$clarity) |> mosaicplot()

  

Section 6. End-to-end EDA (10 problems)

Exercise 6.1: Initial profile

Difficulty: Intermediate. Run a 3-step opening EDA on diamonds: dim, glimpse, summary.

Show solution
dim(diamonds); glimpse(diamonds); summary(diamonds)

  

Exercise 6.2: Find a categorical with imbalanced frequencies

Difficulty: Intermediate. In diamonds, identify any categorical column (cut, color, clarity) where the most frequent value accounts for more than 50% of rows.

Show solution
diamonds |>
  summarise(across(c(cut, color, clarity), ~ max(prop.table(table(.x))) > 0.5)) |>
  pivot_longer(everything())

  

Exercise 6.3: Detect a heavily-skewed numeric

Difficulty: Advanced. Find numeric columns with skewness > 1.

Show solution
diamonds |>
  summarise(across(where(is.numeric), e1071::skewness)) |>
  pivot_longer(everything()) |>
  filter(value > 1)

  

Exercise 6.4: Numeric summary by group

Difficulty: Intermediate. Per Species, give n, mean, sd, min, max of Sepal.Length.

Show solution
iris |>
  group_by(Species) |>
  summarise(
    n = n(),
    mean = mean(Sepal.Length),
    sd = sd(Sepal.Length),
    min = min(Sepal.Length),
    max = max(Sepal.Length)
  )

  

Exercise 6.5: Top correlations

Difficulty: Advanced. Find the top 3 most-correlated pairs in mtcars (excluding self).

Show solution
cor(mtcars) |>
  as.data.frame() |>
  rownames_to_column("var1") |>
  pivot_longer(-var1, names_to = "var2", values_to = "cor") |>
  filter(var1 < var2) |>
  arrange(desc(abs(cor))) |>
  head(3)
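The var1 < var2 filter keeps one copy of each pair and drops the diagonal. An alternative sketch that does the same thing in base R is to index the upper triangle of the matrix directly; the names cm, pairs, and top3 are illustrative.

```r
cm <- cor(mtcars)

# Row/column positions of the cells above the diagonal (each pair once, no self-pairs)
idx <- which(upper.tri(cm), arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  cor  = cm[idx]                     # matrix indexing by (row, col) pairs
)

# Three strongest pairs by absolute correlation
top3 <- head(pairs[order(-abs(pairs$cor)), ], 3)
top3
```

Sorting on the absolute value matters: strong negative correlations, such as mpg against weight, would otherwise fall to the bottom.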

  

Exercise 6.6: Detect duplicates

Difficulty: Intermediate. Count fully-duplicate rows in diamonds.

Show solution
sum(duplicated(diamonds))

  

Exercise 6.7: One-way summary

Difficulty: Intermediate. Mean and N per cyl group with arrange.

Show solution
mtcars |>
  group_by(cyl) |>
  summarise(n = n(), mean_mpg = mean(mpg)) |>
  arrange(desc(mean_mpg))

  

Exercise 6.8: Two-way summary

Difficulty: Intermediate. Mean price per (cut, color) in diamonds.

Show solution
diamonds |>
  group_by(cut, color) |>
  summarise(mean_price = mean(price), .groups = "drop")

  

Exercise 6.9: Audit sparse columns

Difficulty: Advanced. List columns where >25% of rows are NA in airquality.

Show solution
airquality |>
  summarise(across(everything(), ~ mean(is.na(.x)))) |>
  pivot_longer(everything()) |>
  filter(value > 0.25)
# Returns zero rows: Ozone, the sparsest column, is only about 24% NA (37 of 153)

  

Exercise 6.10: Decision-quality EDA report

Difficulty: Advanced. Build a one-page EDA: profile + 3 plots (univariate hist, group boxplot, correlation heatmap).

Show solution
# Profile
print(summary(mtcars))
# Plot 1: distribution
print(ggplot(mtcars, aes(mpg)) + geom_histogram(bins = 15))
# Plot 2: group comparison
print(ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot())
# Plot 3: relationship
print(
  cor(mtcars) |>
    as.data.frame() |>
    rownames_to_column("v1") |>
    pivot_longer(-v1, names_to = "v2") |>
    ggplot(aes(v1, v2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", high = "red")
)

  

What to do next

After 50 EDA problems you should be able to walk into a new dataset and have a profile within five minutes. Natural follow-ups:

  • Data-Wrangling-Exercises (shipped), the cleaning that EDA reveals.
  • Linear-Regression-Exercises (shipped), the modeling that EDA precedes.
  • Data-Visualization-Exercises (coming), viz beyond the EDA basics.