Clustering Exercises in R: 20 Practice Problems

Twenty practice problems on clustering in R, covering k-means, hierarchical clustering, DBSCAN, silhouette analysis, the elbow method, and cluster visualization. Each exercise includes a worked solution.

Run this once before any exercise:

library(dplyr)
library(ggplot2)
library(cluster)
library(dbscan)
library(mclust)
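If some of these packages are not yet installed, a small helper can install the missing ones first (the `pkgs` and `missing_pkgs` names are my own, not part of the original setup):

```r
# Optional: install any required package that is missing before the library() calls
pkgs <- c("dplyr", "ggplot2", "cluster", "dbscan", "mclust")
missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```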

  

Exercise 1: k-means basics

Difficulty: Beginner.

Task: Run k-means with 3 clusters on the four numeric columns of iris, then cross-tabulate the cluster assignments against the species labels.

Solution:

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster, iris$Species)

  

Exercise 2: Plot k-means clusters

Difficulty: Intermediate.

Task: Fit k-means with 3 clusters, store the assignment in iris as a factor, and plot Sepal.Length against Petal.Length coloured by cluster.

Solution:

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
iris$cluster <- factor(km$cluster)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = cluster)) +
  geom_point()

  

Exercise 3: nstart parameter

Difficulty: Intermediate.

Task: Refit k-means with nstart = 25. Multiple random starts keep the only best run, making the result less sensitive to the initial centroid placement.

Solution:

set.seed(1)
kmeans(iris[, 1:4], centers = 3, nstart = 25)

  

Exercise 4: Scale before k-means

Difficulty: Intermediate.

Task: Standardise the features with scale() before clustering, so that variables measured on larger scales do not dominate the distance calculation.

Solution:

set.seed(1)
kmeans(scale(iris[, 1:4]), centers = 3)

  

Exercise 5: Elbow method

Difficulty: Advanced.

Task: Compute the total within-cluster sum of squares for k = 1..10 and plot it against k to locate the "elbow".

Solution:

set.seed(1)
wss <- sapply(1:10, function(k)
  kmeans(scale(iris[, 1:4]), centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b")
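Reading the elbow off the plot is subjective. One possible numeric heuristic (my own addition, not part of the exercise) is to stop at the first k where the relative WSS improvement from adding another cluster falls below a threshold:

```r
set.seed(1)
wss <- sapply(1:10, function(k)
  kmeans(scale(iris[, 1:4]), centers = k, nstart = 10)$tot.withinss)
rel_drop <- -diff(wss) / wss[-length(wss)]  # fractional WSS improvement at each step
k_elbow <- which(rel_drop < 0.2)[1]         # first k where the next cluster gains < 20%
k_elbow
```

The 20% cutoff is arbitrary; tighten or loosen it depending on how conservative you want the choice of k to be.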

  

Exercise 6: Silhouette

Difficulty: Advanced.

Task: Compute the average silhouette width of a 3-cluster k-means solution on the scaled data.

Solution:

set.seed(1)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3)
sil <- silhouette(km$cluster, dist(X))
mean(sil[, 3])
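The average silhouette width can also be used to choose k: compute it over a range of candidate k values and take the maximum. A sketch (the 2:6 range and nstart value are arbitrary choices of mine):

```r
set.seed(1)
X <- scale(iris[, 1:4])
d <- dist(X)
# average silhouette width for each candidate k
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(X, centers = k, nstart = 10)
  mean(cluster::silhouette(km$cluster, d)[, 3])
})
best_k <- (2:6)[which.max(avg_sil)]
best_k
```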

  

Exercise 7: Hierarchical clustering

Difficulty: Intermediate.

Task: Run complete-linkage hierarchical clustering on the scaled data and tabulate a 3-cluster cut of the tree.

Solution:

d <- dist(scale(iris[, 1:4]))
hc <- hclust(d, method = "complete")
cutree(hc, k = 3) |> table()

  

Exercise 8: Plot dendrogram

Difficulty: Intermediate.

Task: Plot the dendrogram of a hierarchical clustering of the scaled data.

Solution:

hc <- hclust(dist(scale(iris[, 1:4])))
plot(hc)

  

Exercise 9: Different linkage methods

Difficulty: Advanced.

Task: Fit hierarchical clusterings with complete, Ward (ward.D2), and single linkage on the same distance matrix and compare the resulting trees.

Solution:

d <- dist(scale(iris[, 1:4]))
list(
  complete = hclust(d, method = "complete"),
  ward     = hclust(d, method = "ward.D2"),
  single   = hclust(d, method = "single")
)

  

Exercise 10: DBSCAN

Difficulty: Advanced.

Task: Run DBSCAN on the scaled data with eps = 0.5 and minPts = 5. Points assigned to cluster 0 are noise.

Solution:

set.seed(1)
dbscan::dbscan(scale(iris[, 1:4]), eps = 0.5, minPts = 5)
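The eps = 0.5 above is a guess. A common way to calibrate it is to look at the distribution of k-nearest-neighbour distances (via dbscan::kNNdist) and pick eps near the "knee" of the sorted curve:

```r
X <- scale(iris[, 1:4])
d5 <- dbscan::kNNdist(X, k = 5)    # distance from each point to its 5th nearest neighbour
plot(sort(d5), type = "l", ylab = "5-NN distance")
abline(h = 0.5, lty = 2)           # the eps used in the solution, for comparison
```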

  

Exercise 11: Compare k-means vs hierarchical

Difficulty: Advanced.

Task: Cross-tabulate the 3-cluster k-means solution against a 3-cluster cut of a hierarchical clustering on the same scaled data.

Solution:

set.seed(1)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3)
hc <- cutree(hclust(dist(X)), k = 3)
table(km$cluster, hc)

  

Exercise 12: Cluster centroids

Difficulty: Beginner.

Task: Extract the cluster centroids from a fitted k-means model.

Solution:

set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
km$centers

  

Exercise 13: Within-cluster sum of squares

Difficulty: Beginner.

Task: Report the total within-cluster sum of squares of a 3-cluster k-means fit.

Solution:

set.seed(1)
kmeans(iris[, 1:4], centers = 3)$tot.withinss

  

Exercise 14: PAM (k-medoids)

Difficulty: Advanced.

Task: Fit PAM (k-medoids) with k = 3 on the scaled data and inspect the per-cluster summary.

Solution:

set.seed(1)
pam_fit <- cluster::pam(scale(iris[, 1:4]), k = 3)
pam_fit$clusinfo

  

Exercise 15: Gap statistic

Difficulty: Advanced.

Task: Estimate the number of clusters with the gap statistic, using B = 50 bootstrap samples and k up to 8.

Solution:

set.seed(1)
gap <- cluster::clusGap(scale(iris[, 1:4]), FUN = kmeans, K.max = 8, B = 50)
plot(gap)

  

Exercise 16: Visualize hierarchical clusters via cuts

Difficulty: Intermediate.

Task: Cut a dendrogram at k = 3 and visualise the resulting clusters in a scatter plot.

Solution:

hc <- hclust(dist(scale(iris[, 1:4])))
iris$cluster <- factor(cutree(hc, k = 3))
ggplot(iris, aes(Sepal.Length, Petal.Length, color = cluster)) +
  geom_point()

  

Exercise 17: Predict new point to nearest centroid

Difficulty: Advanced.

Task: Assign a new observation to the cluster whose centroid is nearest. Note that the new point must be scaled with the training data's centring and scaling values; calling scale() on a single row divides by an undefined standard deviation and returns NaN.

Solution:

set.seed(1)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3)
new_point <- X[1, ]  # already on the training scale
# squared Euclidean distance from the new point to each centroid
which.min(rowSums(sweep(km$centers, 2, new_point)^2))

  

Exercise 18: External validation with ARI

Difficulty: Advanced.

Task: Compare k-means cluster assignments to the true species labels with the adjusted Rand index.

Solution:

set.seed(1)
km <- kmeans(scale(iris[, 1:4]), centers = 3)
mclust::adjustedRandIndex(km$cluster, as.integer(iris$Species))

  

Exercise 19: Initialise k-means with k-means++

Difficulty: Advanced.

Task: Use k-means++ initialisation via LICORS::kmeanspp and compare the clusters to the species labels. The LICORS package is not part of the setup block and must be installed separately.

Solution:

set.seed(1)
# requires the LICORS package (install.packages("LICORS"))
kpp <- LICORS::kmeanspp(scale(iris[, 1:4]), k = 3)
table(kpp$cluster, iris$Species)

  

Exercise 20: Mini-batch k-means demo (concept)

Difficulty: Advanced.

Task: Mini-batch k-means updates centroids from small random batches instead of the full data, trading a little accuracy for speed. On data the size of iris it is unnecessary, but it scales to millions of rows.

Solution (conceptual; requires the ClusterR package):

# install.packages("ClusterR")
# m <- ClusterR::MiniBatchKmeans(scale(iris[, 1:4]), clusters = 3, batch_size = 30)
# MiniBatchKmeans is typically much faster than kmeans() on very large data
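If ClusterR is not available, the mini-batch idea can be roughly mimicked in base R by fitting k-means on a random sample and then assigning the remaining rows to the nearest learned centre. This is only a sketch of the concept (the batch size of 50 and the helper logic are my own choices), not the actual mini-batch algorithm, which repeatedly updates centres batch by batch:

```r
set.seed(1)
X <- scale(iris[, 1:4])
batch <- X[sample(nrow(X), 50), ]             # fit on a small random "batch"
km <- kmeans(batch, centers = 3, nstart = 10)
# assign every row of the full data to its nearest learned centre
nearest <- apply(X, 1, function(p) which.min(colSums((t(km$centers) - p)^2)))
table(nearest)
```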

  

What to do next

  • PCA Exercises: dimension reduction before clustering.
  • Machine Learning Exercises: broader ML drills.