PCA Exercises in R: 10 Principal Component Analysis Practice Problems, Solved Step-by-Step
These 10 PCA exercises in R take you from your first prcomp() fit through scaling decisions, scree plots, loadings, biplots, downstream regression on principal scores, reconstruction error, and an SVD equivalence check. Every problem is solved step by step with runnable R code and a click-to-reveal explanation.
How do you fit PCA in R and read variance explained?
PCA in R lives inside one base function, prcomp(), that returns four useful objects in a list: standard deviations of each component, the rotation (loading) matrix, the scores, and the centring/scaling vectors used. Most exercises below pull from those four slots, so the first job is to fit a PCA cleanly and read off the variance each component explains. We use iris[, 1:4] for the warm-up because four numeric columns and three known species make the result easy to interpret.
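A warm-up fit along these lines sets up everything below (iris_pca is the name the later exercises refer back to):

```r
# Fit PCA on the four numeric iris columns, scaling each to unit variance
iris_pca <- prcomp(iris[, 1:4], scale = TRUE)

# The four useful slots: sdev, rotation, x (scores), center/scale
str(iris_pca, max.level = 1)

# Variance explained per component
summary(iris_pca)
```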
PC1 absorbs 73% of the variance, PC2 another 23%, and the cumulative line shows two components already capture 96% of what is going on across all four flower measurements. That is a textbook PCA result: a four-dimensional dataset is genuinely two-dimensional in disguise, which is why every iris scatter you have ever seen uses PC1 and PC2 as the axes.
Each sdev^2 is an eigenvalue of the correlation matrix, and the proportions in summary() are just sdev^2 / sum(sdev^2). The total variance after scaling equals the number of variables, so the eigenvalues divide a known total: they sum to 4 here because we scaled four columns to unit variance. PCA just redistributes that total into uncorrelated directions, ranked largest first.
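A quick numerical check of that identity, reusing the iris_pca fit:

```r
eig <- iris_pca$sdev^2      # eigenvalues of the correlation matrix
eig / sum(eig)              # matches Proportion of Variance in summary()
sum(eig)                    # 4: four scaled columns contribute 1 each
```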
You can also index summary()$importance directly. Row 1 is Standard deviation, row 2 is Proportion of Variance, and row 3 is Cumulative Proportion. So summary(fit)$importance[3, ] gives you the cumulative-variance vector named by component, ready for which() or cumsum() checks.

Try it: From the iris_pca fit above, pull the proportion of variance explained by PC3 alone (one number, not cumulative). Aim for a one-liner.
Click to reveal solution
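One possible one-liner:

```r
summary(iris_pca)$importance[2, "PC3"]   # ~0.037
```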
Explanation: Row 2 of the importance matrix is Proportion of Variance, and indexing by the column name "PC3" plucks PC3's slice. PC3 carries less than 4% of the variance, which is why PCA work on iris stops at PC1 and PC2.
How do you decide how many components to keep?
Three quick decision tools cover most needs. The scree plot is a visual elbow test: plot eigenvalues against component index and look for the kink. The Kaiser rule keeps components with sdev^2 > 1 (only valid on a scaled fit, where each original variable contributes variance 1). The 80% rule keeps the smallest k whose cumulative variance crosses your target threshold. The exercises below use all three.
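A sketch of the fit and scree plot the next paragraph reads from (arr_pca is the name the try-it below expects):

```r
# Scaled fit on USArrests, then eigenvalues against component index
arr_pca <- prcomp(USArrests, scale = TRUE)
plot(arr_pca$sdev^2, type = "b",
     xlab = "Component", ylab = "Eigenvalue", main = "Scree plot: USArrests")
abline(h = 1, lty = 2)   # Kaiser threshold at eigenvalue = 1
```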
The line drops sharply between PC1 and PC2, then flattens. That is the elbow, and it agrees with the Kaiser rule here: only PC1 has eigenvalue greater than 1, so a one-component summary already does most of the work for crime data across states.
For the 80% rule, fit with scale = TRUE first, otherwise the threshold is meaningless. The which(... >= 0.80)[1] pattern returns the first index that crosses your threshold. For USArrests, two components capture 87% of the variance, comfortably past 80%. The same one-liner becomes useful in Exercise 4 below.
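The pattern in code, on the arr_pca fit:

```r
cum_var <- cumsum(arr_pca$sdev^2) / sum(arr_pca$sdev^2)
which(cum_var >= 0.80)[1]   # 2: the second component crosses 80% (~87%)
```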
Try it: Count how many components in arr_pca pass the Kaiser rule (eigenvalue > 1). The answer should match what the scree plot suggested.
Click to reveal solution
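One possible solution:

```r
sum(arr_pca$sdev^2 > 1)   # 1: only PC1's eigenvalue (~2.48) exceeds 1
```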
Explanation: arr_pca$sdev^2 is the eigenvalue vector. Only PC1's eigenvalue (about 2.48) exceeds 1; PC2 sits just under at 0.99. So Kaiser says "keep one", but most analysts would still keep PC2 because the 80% rule disagrees and PC2 explains a meaningful 25%.
Practice Exercises
The 10 problems below ramp from a clean first fit through to outlier detection on principal scores. Every exercise uses an ex<N>_ variable prefix so your work does not overwrite the tutorial fits above. Run the starter, attempt the solution, then click to reveal.
Exercise 1: Cumulative variance at PC2 on mtcars
Fit PCA on mtcars with scaling and report the cumulative variance captured by the first two components. Save the value to ex1_cum2.
Click to reveal solution
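One possible solution (the ex1_pca helper name is ours):

```r
# Scaled PCA on all eleven mtcars columns
ex1_pca  <- prcomp(mtcars, scale = TRUE)

# Row 3 of the importance matrix is Cumulative Proportion; take the PC2 entry
ex1_cum2 <- summary(ex1_pca)$importance[3, "PC2"]
ex1_cum2   # ~0.84
```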
Explanation: Two components capture 84% of the variance across the eleven mtcars columns. That is high enough that a 2D PCA scatter (e.g., colour by cyl) is a faithful summary of the cars in this dataset.
Exercise 2: Scaled vs unscaled loadings on USArrests
Fit two PCAs on USArrests, one with scale = TRUE and one with scale = FALSE. Correlate the absolute PC1 loadings between the two fits. A correlation far below 1 means scaling genuinely changed which variables PC1 cares about. Save the correlation to ex2_cor.
Click to reveal solution
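A sketch of one solution (helper names are ours):

```r
ex2_scaled   <- prcomp(USArrests, scale = TRUE)
ex2_unscaled <- prcomp(USArrests, scale = FALSE)

# Compare absolute PC1 loadings so arbitrary sign flips cannot mask agreement
ex2_cor <- cor(abs(ex2_scaled$rotation[, 1]), abs(ex2_unscaled$rotation[, 1]))
ex2_cor   # far below 1: scaling changed what PC1 measures
```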
Explanation: A correlation far below 1 says the two fits barely agree on what PC1 represents. Without scaling, Assault (range in the hundreds) dominates PC1 because of its raw variance; with scaling, all four crime columns contribute. This is why scale = TRUE is the safe default whenever your columns are on different units.
Exercise 3: Kaiser-passing components on iris
On a scaled PCA of iris[, 1:4], count the components whose eigenvalue exceeds 1. Save the count to ex3_count.
Click to reveal solution
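One possible solution:

```r
ex3_pca   <- prcomp(iris[, 1:4], scale = TRUE)
ex3_count <- sum(ex3_pca$sdev^2 > 1)   # Kaiser rule: eigenvalue > 1
ex3_count   # 1
```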
Explanation: Only PC1's eigenvalue (about 2.92) exceeds 1; PC2 sits at 0.91, just under the threshold. Kaiser would prune iris to a single component. In practice most analysts keep PC2 too, because cumulative variance reaches 96% with two components and the visualisation gain is large.
Exercise 4: Smallest k for the 80 percent rule on swiss
Fit PCA on the built-in swiss dataset (47 Swiss provinces, 6 numeric columns) with scaling, then find the smallest k whose cumulative variance reaches 80%. Save the answer to ex4_k.
Click to reveal solution
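One possible solution, reusing the which() pattern from the tutorial:

```r
ex4_pca <- prcomp(swiss, scale = TRUE)
ex4_cum <- cumsum(ex4_pca$sdev^2) / sum(ex4_pca$sdev^2)
ex4_k   <- which(ex4_cum >= 0.80)[1]
ex4_k   # 3
```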
Explanation: Three components clear 80% on swiss, where the original six demographic and economic columns are moderately but not heavily correlated. Compared to USArrests (k = 2) or iris (k = 2), swiss shows that not every dataset compresses to two components, and the 80% rule is the easiest way to make that call programmatically.
Exercise 5: Top two variables driving PC1 on iris
Using a scaled PCA on iris[, 1:4], identify the two variables with the largest absolute loadings on PC1. Save the named vector of the top two values to ex5_top2.
Click to reveal solution
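One possible solution:

```r
ex5_pca  <- prcomp(iris[, 1:4], scale = TRUE)
ex5_top2 <- sort(abs(ex5_pca$rotation[, "PC1"]), decreasing = TRUE)[1:2]
ex5_top2   # Petal.Length (~0.58) and Petal.Width (~0.57)
```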
Explanation: Petal measurements dominate PC1 because they vary the most across species. Sepal.Length is third (loading 0.52) and Sepal.Width is a distant fourth (0.27, with negative sign). This is the structural reason PC1 alone separates setosa from the other two iris species.
Exercise 6: Biplot interpretation for USArrests
Fit a scaled PCA on USArrests, draw a base R biplot(), and read off which states sit at the high-crime end of PC1. Save the fit to ex6_pca.
Click to reveal solution
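One possible solution:

```r
ex6_pca <- prcomp(USArrests, scale = TRUE)
biplot(ex6_pca, cex = 0.6)   # states plot as points, crime loadings as arrows
```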
Explanation: PC1 in USArrests is essentially a "violent-crime" axis: California, Florida, Nevada and Texas sit at one end (high Murder, Assault, Rape loadings); North Dakota, Vermont and New Hampshire sit at the opposite end. PC2 separates urban from rural by playing UrbanPop against the crime trio. Biplots are the fastest way to read both scores and loadings at once.
Exercise 7: PCA scores as regression inputs
Fit PCA on mtcars predictors only (drop mpg), then regress mpg on the first two principal-component scores. Report the R² of that regression and compare with a plain lm(mpg ~ ., data = mtcars). Save the PCR R² to ex7_r2_pcr and the full-OLS R² to ex7_r2_full.
Click to reveal solution
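A sketch of one solution (the ex7_dat helper frame is ours):

```r
# PCA on the ten predictors only (mpg is column 1 of mtcars)
ex7_pca <- prcomp(mtcars[, -1], scale = TRUE)

# Regress mpg on the first two score columns
ex7_dat <- data.frame(mpg = mtcars$mpg, ex7_pca$x[, 1:2])
ex7_r2_pcr  <- summary(lm(mpg ~ PC1 + PC2, data = ex7_dat))$r.squared

# Full OLS with all ten raw predictors
ex7_r2_full <- summary(lm(mpg ~ ., data = mtcars))$r.squared
c(pcr = ex7_r2_pcr, full = ex7_r2_full)   # ~0.85 vs ~0.87
```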
Explanation: Two principal components recover 85% of the variance in mpg, while ten raw predictors plus an intercept reach 87%. Spending eight extra degrees of freedom for two extra R² points is a poor trade, which is the whole pitch for principal components regression: shrink the predictor matrix, keep almost all of the predictive signal, get a more stable model.
Exercise 8: Frobenius reconstruction error from k = 2
For a scaled PCA of USArrests, reconstruct the data using only the first two components, then compute the Frobenius norm of the difference against the original scaled matrix. Save the error to ex8_err.
The reconstruction formula is the inverse of the rotation:
$$\hat{X}_k = U_k D_k V_k^\top = \mathrm{scores}_{[, 1:k]} \cdot \mathrm{rotation}_{[, 1:k]}^\top$$
where the scores already absorb the centring and scaling, so the product lands back in scaled-data space.
Click to reveal solution
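One possible solution; scale(USArrests) reproduces the centred, scaled matrix prcomp() worked on:

```r
ex8_pca <- prcomp(USArrests, scale = TRUE)

# Rank-2 reconstruction in scaled-data space
ex8_hat <- ex8_pca$x[, 1:2] %*% t(ex8_pca$rotation[, 1:2])
ex8_err <- norm(scale(USArrests) - ex8_hat, type = "F")
ex8_err   # ~5.10
```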
Explanation: A Frobenius error of about 5.10 on a scaled 50-by-4 matrix is small. The squared error equals (n - 1) times the variance dropped, which is 49 * (sdev[3]^2 + sdev[4]^2) ≈ 49 * 0.530 ≈ 25.97, and sqrt(25.97) ≈ 5.10. That algebraic identity is why eigenvalues equal "variance carried", and dropping low-eigenvalue components is provably the smallest-error rank reduction available.
Exercise 9: prcomp() equals svd(scale(X))
Verify numerically that the rotation matrix from prcomp() matches the right singular vectors from svd() applied to the scaled data, up to per-column sign flips. Save the maximum absolute element-wise difference (after sign correction) to ex9_diff.
Click to reveal solution
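A sketch using iris as the test matrix (any numeric matrix works the same way):

```r
ex9_X   <- scale(iris[, 1:4])          # centred and scaled data matrix
ex9_pca <- prcomp(iris[, 1:4], scale = TRUE)
ex9_svd <- svd(ex9_X)

# Align per-column signs first: each singular vector is only defined up to +/-1
ex9_signs <- sign(colSums(ex9_svd$v * ex9_pca$rotation))
ex9_diff  <- max(abs(ex9_svd$v %*% diag(ex9_signs) - ex9_pca$rotation))
ex9_diff   # ~1e-15, i.e. machine precision
```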
Explanation: The two matrices match to machine precision. prcomp() is just SVD on the centred (and optionally scaled) data with the singular values stored as sdev. Knowing this equivalence is useful when you want to compute PCA on data too large for prcomp() directly, because svd() and partial-SVD packages (like irlba) give you the same answer with more control.
Exercise 10: Outlier detection with PC scores
Inject one synthetic outlier c(10, 10, 10, 10) into iris[, 1:4], refit a scaled PCA, and flag observations whose (PC1, PC2) score has a Mahalanobis-style distance above the 99th-percentile chi-squared cutoff with 2 degrees of freedom. Save the flagged row indices to ex10_flag.
Click to reveal solution
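A sketch of one solution (helper names are ours):

```r
# Append one extreme row; it becomes observation 151
ex10_dat <- rbind(iris[, 1:4], c(10, 10, 10, 10))
ex10_pca <- prcomp(ex10_dat, scale = TRUE)
ex10_sc  <- ex10_pca$x[, 1:2]

# Squared Mahalanobis distance of each (PC1, PC2) score from the cloud centre
ex10_d2   <- mahalanobis(ex10_sc, colMeans(ex10_sc), cov(ex10_sc))
ex10_flag <- which(ex10_d2 > qchisq(0.99, df = 2))
ex10_flag   # 151
```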
Explanation: Row 151 (the injected outlier) is the only point flagged, because it sits far from the rest of the cloud in every dimension and lands at extreme (PC1, PC2) scores. PCA-based outlier detection works precisely when you can summarise the bulk of the data in a few components, since "far from the data" then becomes "far in the principal subspace". For real-world use, swap the chi-squared cutoff for a rank-based threshold or Hotelling's T².
Complete Example
The 10 exercises rehearsed individual skills; this end-to-end walkthrough on swiss glues them together into the workflow you would actually run on a new dataset.
We have six numeric columns on different scales (proportions, counts, ratios), so scale = TRUE is mandatory.
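Steps 1 and 2 in code (swiss_pca is our name for the walkthrough fit):

```r
# Step 1: scaled fit on all six swiss columns
swiss_pca <- prcomp(swiss, scale = TRUE)

# Step 2: cumulative variance per component
summary(swiss_pca)$importance[3, ]
```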
Three components clear 80%; we keep them.
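Then the loadings that the next paragraph interprets:

```r
# Step 3: loadings for the three retained components
round(swiss_pca$rotation[, 1:3], 2)
```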
PC1 contrasts traditional rural variables (Fertility, Agriculture, Catholic) against modernising ones (Examination, Education); a province scoring high on PC1 is more agrarian and less educated. PC2 is dominated by Infant.Mortality. PC3 separates Catholic from agricultural provinces.
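Finally, the reconstruction check from Exercise 8, rerun at k = 3:

```r
# Step 4: rank-3 reconstruction error in scaled-data space
swiss_hat <- swiss_pca$x[, 1:3] %*% t(swiss_pca$rotation[, 1:3])
norm(scale(swiss) - swiss_hat, type = "F")
```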
Reconstruction error is small because three components carry 88% of the variance. Add a fourth component and the error roughly halves; keep only two and it roughly doubles. The trade-off curve is exactly the cumulative-variance curve from Step 2 in disguise.
Summary
| Goal | R call | Key field |
|---|---|---|
| Fit PCA with scaling | prcomp(X, scale = TRUE) | $sdev, $rotation, $x |
| Variance explained | summary(fit)$importance | row 2 (proportion), row 3 (cumulative) |
| Eigenvalues | fit$sdev^2 | numeric vector |
| Kaiser rule | sum(fit$sdev^2 > 1) | count of components |
| 80% rule | which(cumsum(fit$sdev^2)/sum(fit$sdev^2) >= 0.8)[1] | k |
| Scores | fit$x[, 1:k] | n x k matrix |
| Loadings | fit$rotation[, 1:k] | p x k matrix |
| Biplot | biplot(fit) | base R figure |
| Reconstruction | scores %*% t(rotation) | reproduces scaled X |
| SVD link | svd(scale(X))$v == fit$rotation | up to sign |
References
- R Core Team. prcomp() reference, stats package.
- Jolliffe, I.T. Principal Component Analysis, 2nd ed., Springer (2002).
- Husson, F., Lê, S., Pagès, J. Exploratory Multivariate Analysis by Example Using R, CRC Press (2017).
- Kassambara, A. factoextra package documentation.
- James, G., Witten, D., Hastie, T., Tibshirani, R. An Introduction to Statistical Learning, 2nd ed., Chapter 12.
- Pearson, K. (1901). "On lines and planes of closest fit to systems of points in space". Philosophical Magazine 2, 559-572.
- Hotelling, H. (1933). "Analysis of a complex of statistical variables into principal components". Journal of Educational Psychology 24, 417-441.
- Wickham, H., Grolemund, G. R for Data Science, 2nd ed.
Continue Learning
- PCA in R: prcomp() Tutorial, the parent tutorial that covers scaling, scree plots, biplots, and downstream use in depth.
- Cluster Analysis in R, the natural follow-up: once PCA hands you a low-rank view, k-means and hierarchical clustering operate on those scores.
- Exploratory Factor Analysis in R, PCA's close cousin; useful when you want latent constructs rather than maximum-variance directions.