Cluster Analysis Exercises in R: 10 k-Means & Hierarchical Problems, Solved Step-by-Step
These 10 cluster analysis exercises in R take you from your first kmeans() fit through scaling decisions, the elbow and silhouette diagnostics, nstart stability, hierarchical clustering with hclust(), linkage method comparisons, cophenetic correlation, and the agreement between k-means and ward.D2 hierarchical labels. Every problem is solved step by step with runnable R code and a click-to-reveal explanation.
How do you fit k-means in R and read the cluster output?
k-means in R lives inside one base function, kmeans(), that returns a list with cluster labels, centroid coordinates, cluster sizes, and the within- and between-cluster sums of squares. Most exercises below pull from those slots, so the first job is to fit a clean k-means and read off what each piece means. We use iris[, 1:4] for the warm-up because four numeric columns and three known species make the result easy to interpret.
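A minimal warm-up fit looks something like this (the km_fit name is the one the rest of this section uses; the exact seed is an assumption, since nstart = 25 makes the result effectively seed-independent on this data):

```r
# Scale the four numeric iris columns so each carries equal weight
iris_scaled <- scale(iris[, 1:4])

# Fit k-means with 3 centroids and 25 random restarts
set.seed(42)   # assumed seed; any value gives essentially the same partition with nstart = 25
km_fit <- kmeans(iris_scaled, centers = 3, nstart = 25)

km_fit$size       # cluster sizes
km_fit$centers    # centroid coordinates in scaled units
km_fit$withinss   # within-cluster SS, one value per cluster
```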
The fit splits 150 flowers into three groups of size 50, 53, and 47. That is suspiciously close to iris's true 50-50-50 species balance, which is exactly the point: with scaled features and three centroids, k-means largely recovers the species partition without ever seeing the labels.
Total SS in scaled data is fixed at $(n - 1) \times p = 149 \times 4 = 596$. k-means cannot change that total, only redistribute it between within and between. Minimising within-cluster SS is mathematically the same as maximising between-cluster SS, which is why a useful summary is the ratio betweenss / totss, also reported by print(km_fit) as a percentage.
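A quick check of that identity on the warm-up fit:

```r
km_fit$totss                              # 596 on scaled iris: (150 - 1) * 4
km_fit$tot.withinss + km_fit$betweenss    # same 596: the total only gets redistributed
km_fit$betweenss / km_fit$totss           # share explained, the percentage print(km_fit) reports
```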
Set nstart to at least 25. k-means picks random initial centroids, so a single run can land in a poor local minimum. With nstart = 25 the function tries 25 random starts and keeps the solution with the lowest within-SS, which costs almost nothing on small data and dramatically improves stability.

Try it: From the warm-up km_fit above, find the size of the smallest cluster and save it to ex_min. One short line is enough.
Click to reveal solution
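One-liner solution, using the km_fit object from the warm-up:

```r
ex_min <- min(km_fit$size)
ex_min   # 47
```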
Explanation: km_fit$size is the named vector of cluster counts, and min() plucks the smallest. The answer 47 matches the third cluster, which absorbed slightly fewer flowers than the others because the versicolor-virginica boundary is fuzzy.
How do you build and cut a hierarchical clustering tree?
Hierarchical clustering in R is a two-step recipe, dist() then hclust(), returning a tree you slice at any number of clusters with cutree(). The default method = "complete" is rarely the right pick on continuous data, so the exercises favour method = "ward.D2", which minimises within-cluster variance the same way k-means does and tends to produce balanced groups.
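A warm-up fit in that style on scaled USArrests (the hc_fit name is the one the tips and Try-it below refer to):

```r
# Euclidean distances on scaled crime data, then Ward's linkage
arrests_scaled <- scale(USArrests)
d_arrests <- dist(arrests_scaled)              # Euclidean by default
hc_fit <- hclust(d_arrests, method = "ward.D2")

# Slice the tree into 4 clusters and count members per cluster
hc_clusters <- cutree(hc_fit, k = 4)
table(hc_clusters)
```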
Four clusters of sizes 8, 11, 21, and 10. Ward's linkage delivers a fairly balanced split because it picks merges that grow within-cluster variance the least at each step. Compare this to single linkage, which famously chains points together and produces one giant cluster plus several singletons.
Reading top-down, the tree shows the 50 states being progressively merged from leaves at height 0 up to one cluster at the top. The four red rectangles mark where cutree(k = 4) slices the tree, and you can read off the membership by tracing each state's leaf up to its enclosing rectangle.
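A plot along those lines, assuming base graphics (rect.hclust() draws the cut rectangles):

```r
plot(hc_fit, cex = 0.6, hang = -1, main = "USArrests, ward.D2 linkage")
rect.hclust(hc_fit, k = 4, border = "red")   # mark the k = 4 cut described above
```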
ward.D applies the Ward update to unsquared distances and only matches Ward's criterion if you square the distances yourself; ward.D2 squares the dissimilarities internally and implements Ward's original 1963 criterion directly (see Murtagh & Legendre, 2014). Use ward.D2 unless you have a specific reason to reproduce older code.

hclust() defaults to method = "complete". Many tutorials skip the method argument and silently use complete linkage, which is driven entirely by the largest pairwise distance between two clusters and is therefore sensitive to outliers. Make method explicit on every fit so future readers know exactly what tree you built.

Try it: Re-cut hc_fit at k = 2 and report the size of the larger of the two clusters, saving it to ex_big.
Click to reveal solution
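One way to do it, reusing hc_fit from the warm-up:

```r
ex_big <- max(table(cutree(hc_fit, k = 2)))
ex_big   # the larger side of the 20-30 split
```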
Explanation: cutree(hc_fit, k = 2) returns a length-50 vector of 1s and 2s, table() counts them, and max() picks the larger count. Cutting at k=2 produces a 20-30 split that maps loosely onto low-crime versus high-crime states, the most basic structure ward.D2 finds in the data.
Practice Exercises
The 10 problems below ramp from a clean first k-means fit through to the agreement between k-means and hierarchical labels. Every exercise uses an ex<N>_ variable prefix so your work does not overwrite the tutorial fits above. Run the starter, attempt the solution, then click to reveal.
Exercise 1: Smallest cluster size for k=3 on iris
Fit k-means on scaled iris[, 1:4] with centers = 3 and nstart = 25 (use set.seed(1)). Save the size of the smallest cluster to ex1_min.
Click to reveal solution
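One possible solution:

```r
set.seed(1)
ex1_km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 25)
ex1_min <- min(ex1_km$size)
ex1_min   # 47
```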
Explanation: The same setup as the warm-up but with a different seed; the answer is still 47 because nstart = 25 makes the result effectively seed-independent on this clean dataset. If you got a different number, check whether you forgot to scale or set nstart.
Exercise 2: Within-SS for k=4 on iris
On the scaled iris features, compare tot.withinss for k = 3 versus k = 4. Save the k=4 within-SS to ex2_within4. Use set.seed(2) for both fits and nstart = 25.
Click to reveal solution
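One possible solution, setting the seed immediately before each fit:

```r
iris_scaled <- scale(iris[, 1:4])

set.seed(2)
ex2_km3 <- kmeans(iris_scaled, centers = 3, nstart = 25)
set.seed(2)
ex2_km4 <- kmeans(iris_scaled, centers = 4, nstart = 25)

ex2_within4 <- ex2_km4$tot.withinss
c(k3 = ex2_km3$tot.withinss, k4 = ex2_within4)
```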
Explanation: Adding a fourth centroid drops within-SS from 139 to 114, but the gain is much smaller than the drop from k=2 to k=3 would be. That diminishing return is exactly what the elbow method (Exercise 3) makes visual: the curve bends sharply where extra clusters stop helping.
Exercise 3: Elbow curve on USArrests
Build the within-SS vector for k = 1 through k = 10 on scaled USArrests, with nstart = 25 and set.seed(3). Save the length-10 numeric vector to ex3_wss. The elbow visible in the result is what factoextra::fviz_nbclust() plots under the hood.
Click to reveal solution
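A sketch of the sweep, seeding once before the loop (with nstart = 25 the exact seeding scheme barely matters):

```r
arrests_scaled <- scale(USArrests)

set.seed(3)
ex3_wss <- sapply(1:10, function(k) {
  kmeans(arrests_scaled, centers = k, nstart = 25)$tot.withinss
})
ex3_wss
plot(1:10, ex3_wss, type = "b", xlab = "k", ylab = "Total within-SS")   # the elbow plot
```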
Explanation: The within-SS plummets from 196 at k=1 to 103 at k=2, then bends sharply at k=3 or k=4. After k=4 the curve flattens, telling you that adding more clusters is mostly memorising noise. Plot ex3_wss against 1:10 to see the textbook elbow shape.
Exercise 4: Best k by mean silhouette on iris
For k in 2 through 5, fit k-means on scaled iris and compute the mean silhouette width using cluster::silhouette(). Save the k that maximises mean silhouette to ex4_best_k. Use set.seed(4) and nstart = 25 for every fit.
Click to reveal solution
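One way to set it up with cluster::silhouette():

```r
library(cluster)

iris_scaled <- scale(iris[, 1:4])
d_iris <- dist(iris_scaled)

ks <- 2:5
mean_sil <- sapply(ks, function(k) {
  set.seed(4)
  km <- kmeans(iris_scaled, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d_iris)[, "sil_width"])   # mean silhouette width at this k
})

ex4_best_k <- ks[which.max(mean_sil)]
ex4_best_k
```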
Explanation: Silhouette picks k=2 even though we know there are three species, because versicolor and virginica overlap heavily in petal/sepal space. This is a famous lesson: silhouette is one diagnostic, not a verdict. Combine it with the elbow plot, domain knowledge, and a quick visual on PC1-PC2 before committing to a final k.
Exercise 5: nstart sensitivity on USArrests
Fit k-means on scaled USArrests at k = 4 twice: once with nstart = 1 and once with nstart = 50. Use set.seed(5) immediately before each fit so both start from the same RNG state. Save the difference (nstart=1 within-SS) - (nstart=50 within-SS) to ex5_diff. A positive value means nstart = 1 got stuck in a worse local minimum.
Click to reveal solution
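One possible solution:

```r
arrests_scaled <- scale(USArrests)

set.seed(5)
km_n1  <- kmeans(arrests_scaled, centers = 4, nstart = 1)
set.seed(5)
km_n50 <- kmeans(arrests_scaled, centers = 4, nstart = 50)

ex5_diff <- km_n1$tot.withinss - km_n50$tot.withinss
ex5_diff   # positive when the single start got stuck in a worse local minimum
```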
Explanation: A single random start lands roughly 2% above the best known minimum on this seed; nstart = 50 finds the better solution. The cost of running k-means 50 times on 50 rows is microscopic, and the payoff is a stable, reproducible answer. This exercise is the empirical justification for the nstart >= 25 tip in the warm-up.
Exercise 6: Scaled vs unscaled k-means on USArrests
Fit k-means twice on USArrests, once on the raw matrix and once on scale(USArrests), both with centers = 4, nstart = 25, and set.seed(6) before each fit. Count how many of the 50 states get a different cluster index between the two fits, and save the count to ex6_diff_count. Treat any change in cluster ID as a difference (cluster numbering is arbitrary, so this is a rough upper bound on disagreement).
Click to reveal solution
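One possible solution:

```r
set.seed(6)
km_raw    <- kmeans(USArrests, centers = 4, nstart = 25)
set.seed(6)
km_scaled <- kmeans(scale(USArrests), centers = 4, nstart = 25)

# Rough upper bound: any change of (arbitrary) cluster ID counts as a difference
ex6_diff_count <- sum(km_raw$cluster != km_scaled$cluster)
ex6_diff_count
```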
Explanation: Around 36 of 50 states get a different cluster ID after scaling. Without scaling, Assault (0-300+) dominates Euclidean distance because its raw variance dwarfs Murder (0-17) and UrbanPop (32-91). With scaling, all four columns carry equal weight and the clusters look qualitatively different. Some of the count comes from cluster-ID relabelling rather than genuine disagreement, but even after relabel-matching, scaling reshuffles roughly a third of states.
sum(a != b) overstates disagreement. For an honest comparison, build a contingency table with table(a, b) and look for one dominant cell per row, exactly the trick used in Exercise 10.

Exercise 7: Hierarchical cluster sizes on USArrests at k=4
On scaled USArrests, fit hclust() with method = "ward.D2" and cut at k = 4. Save the named vector of cluster sizes to ex7_sizes.
Click to reveal solution
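One possible solution:

```r
hc7 <- hclust(dist(scale(USArrests)), method = "ward.D2")
ex7_sizes <- table(cutree(hc7, k = 4))
ex7_sizes
```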
Explanation: Ward.D2 produces a fairly balanced 8-11-21-10 split, meaning no cluster dominates and none is a singleton. Compare this with what k-means at k=4 returned in Exercise 6: the two algorithms tend to agree on broad structure on USArrests, both pulling a large "high-crime urban" cluster out of the middle.
Exercise 8: Linkage method comparison on iris
Fit hclust() on dist(scale(iris[, 1:4])) with each of four linkage methods: single, complete, average, and ward.D2. For each, cut at k = 3 and record the size of the largest cluster. Save the named numeric vector to ex8_max.
Click to reveal solution
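A sketch that loops over the four linkage methods:

```r
d_iris <- dist(scale(iris[, 1:4]))
linkages <- c("single", "complete", "average", "ward.D2")

ex8_max <- sapply(linkages, function(m) {
  max(table(cutree(hclust(d_iris, method = m), k = 3)))
})
ex8_max   # largest-cluster size, named by linkage method
```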
Explanation: Single linkage is catastrophic here, dumping 148 of 150 flowers into one cluster and leaving two singletons. Complete (72) and average (97) are imbalanced. Ward.D2 produces 64 in the largest cluster, which is much closer to the true 50-50-50 species split. This is the empirical case for ward.D2 as a sensible default on continuous data.
Exercise 9: Cophenetic correlation on swiss
For the built-in swiss dataset (47 Swiss provinces, 6 numeric columns), fit hclust() with method = "ward.D2" and method = "average". Use cor(d, cophenetic(hc)) to compute the cophenetic correlation for each, where d is the original distance object. Save the larger of the two correlations to ex9_best_coph.
Click to reveal solution
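One way to set it up. The exercise leaves scaling implicit; the version below scales the columns as in the other exercises, so treat that as an assumption:

```r
d_swiss <- dist(scale(swiss))

ex9_coph <- sapply(c("ward.D2", "average"), function(m) {
  hc <- hclust(d_swiss, method = m)
  cor(d_swiss, cophenetic(hc))   # cophenetic correlation for this linkage
})

ex9_best_coph <- max(ex9_coph)
ex9_coph
```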
Explanation: Average linkage wins on cophenetic correlation (0.75 vs 0.58) because it explicitly preserves average pairwise distances during merging. Ward.D2 sacrifices cophenetic fidelity to chase balanced clusters. Use cophenetic correlation when you care about the dendrogram as a faithful summary of distances, and ward.D2 when you care about the cluster shapes themselves.
Exercise 10: Agreement of k-means and ward.D2 on iris
Fit both k-means (with set.seed(10) and nstart = 25) and hclust(method = "ward.D2") at k = 3 on scaled iris. Build a contingency table of the two label vectors. Sum the maximum count in each row of that table and save the result to ex10_agree. Because cluster IDs are arbitrary, this counts the largest possible match across re-labellings.
Click to reveal solution
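One possible solution:

```r
iris_scaled <- scale(iris[, 1:4])

set.seed(10)
km10 <- kmeans(iris_scaled, centers = 3, nstart = 25)
hc10 <- cutree(hclust(dist(iris_scaled), method = "ward.D2"), k = 3)

tab <- table(km10$cluster, hc10)
ex10_agree <- sum(apply(tab, 1, max))   # best possible match across re-labellings
ex10_agree
```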
Explanation: 145 of 150 flowers fall into the modal cell of their k-means cluster within the ward.D2 partition, an agreement of 96.7%. The two algorithms see the same broad structure on iris because the species are roughly spherical in scaled feature space. On non-spherical data (e.g., the moons example in the parent tutorial) the agreement collapses, which is when DBSCAN starts to look attractive.
Complete Example
The end-to-end pipeline below combines everything from the exercises into a typical clustering workflow on USArrests. We pick k via mean silhouette, fit a final k-means with nstart = 50, profile each cluster on the original (unscaled) units, and cross-check with a hierarchical fit.
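A sketch of that pipeline. The candidate range of k and the seed are assumptions; everything else follows the exercises above:

```r
library(cluster)

# 1. Scale and pick k by mean silhouette width
arrests_scaled <- scale(USArrests)
d_arrests <- dist(arrests_scaled)

ks <- 2:6                                        # assumed candidate range
sil_width <- sapply(ks, function(k) {
  set.seed(100)                                  # assumed seed
  km <- kmeans(arrests_scaled, centers = k, nstart = 50)
  mean(silhouette(km$cluster, d_arrests)[, "sil_width"])
})
best_k <- ks[which.max(sil_width)]

# 2. Final k-means fit at the chosen k
set.seed(100)
km_final <- kmeans(arrests_scaled, centers = best_k, nstart = 50)

# 3. Profile clusters in the original (unscaled) units
aggregate(USArrests, by = list(cluster = km_final$cluster), FUN = mean)

# 4. Cross-check against a ward.D2 hierarchical cut at the same k
hc_final <- cutree(hclust(d_arrests, method = "ward.D2"), k = best_k)
table(kmeans = km_final$cluster, hclust = hc_final)
```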
Silhouette picks k=2 cleanly (0.41, well above k=3's 0.31), separating low-crime states from high-crime states. The cluster profiles confirm the split is dominated by Murder (4.9 vs 12.2) and Assault (114 vs 255). The contingency table shows 46 of 50 states get the same broad label from k-means and ward.D2, a 92% agreement, which is the sanity check most cluster reports should include before publication.
Summary
| Tool | When to use it | Quick reminder |
|---|---|---|
| kmeans(data, k, nstart=25) | Round, similar-sized clusters | Always set nstart; total SS = within + between |
| dist() + hclust(method="ward.D2") | Continuous data, balanced groups | Default method="complete" is rarely best |
| cutree(hc, k) | Slice the dendrogram | Returns an integer label per row |
| silhouette(cluster, dist) | Validate cluster cohesion | Mean width above 0.5 is solid; below 0.25 is weak |
| Elbow on tot.withinss | Choose k visually | Look for the bend, not the minimum |
| cor(dist, cophenetic(hc)) | Dendrogram fidelity | Higher means the tree preserves distances |
| Cross-tab k-means vs hclust | Sanity check | High agreement means the structure is real |
References
- Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning, 2nd ed., Chapter 14: Unsupervised Learning. Free PDF, Stanford. Link
- Kaufman, L., Rousseeuw, P. J. Finding Groups in Data: An Introduction to Cluster Analysis, Wiley (1990). The original silhouette paper.
- Ward, J. H. (1963). "Hierarchical Grouping to Optimize an Objective Function". Journal of the American Statistical Association, 58(301), 236-244. Link
- R Core Team. kmeans() reference, stats package. Link
- R Core Team. hclust() reference, stats package. Link
- Maechler, M. et al. cluster package: silhouette() reference. Link
- Kassambara, A. Practical Guide to Cluster Analysis in R, STHDA. Link
- Murtagh, F., Legendre, P. (2014). "Ward's Hierarchical Agglomerative Clustering Method: Which Algorithms Implement Ward's Criterion?". Journal of Classification, 31(3), 274-295.
Continue Learning
- Cluster Analysis in R: k-Means vs Hierarchical vs DBSCAN, the parent tutorial that compares all three algorithms on the same dataset and shows where each one wins or fails.
- PCA in R: prcomp() Tutorial, the natural pre-step: project to a low-rank space before clustering when you have many correlated features.
- PCA Exercises in R, the companion drill set for the dimensionality-reduction half of unsupervised learning.