broom tidy() for prcomp in R: PCA Loadings and Scores
The broom::tidy() function turns a prcomp object into a long-format tibble of loadings, scores, or variance explained per principal component. It lets you pipe PCA results straight into dplyr and ggplot2 without reaching into the fitted list.
    tidy(pc)                          # scores (default, same as matrix = "u"): row, PC, value
    tidy(pc, matrix = "v")            # loadings: column, PC, value; "rotation" is an alias
    tidy(pc, matrix = "u")            # scores: row, PC, value
    tidy(pc, matrix = "d")            # variance per PC: PC, std.dev, percent, cumulative
    augment(pc, data = iris[, 1:4])   # append .fittedPC1..k to rows
    tidy(pc, matrix = "d") |> dplyr::filter(percent > 0.05)  # keep PCs above 5% variance
    # broom documents no glance() method for prcomp; use matrix = "d" for the variance summary
Need explanation? Read on for examples and pitfalls.
What tidy() does for prcomp in one sentence
A fitted prcomp object is an S3 list that stores the rotation matrix (loadings) in rotation, the principal component scores in x, and the standard deviation of each PC in sdev. The tidier reshapes one of these into a long tibble keyed by component, so dplyr verbs and ggplot geoms work without manual matrix wrangling.
The matrix argument picks the slot. The default, "u", returns scores; pass "v" for loadings or "d" for the variance summary.
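As a minimal sketch of the slot-to-tibble mapping, the snippet below fits a scaled PCA on the four numeric iris columns and tidies each slot in turn. The object names (pc, loadings, scores, variance) are illustrative choices, not part of broom.

```r
library(broom)

# Fit PCA on the four numeric iris columns, scaled to unit variance
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# The raw slots the tidier reads from
dim(pc$rotation)  # loadings matrix: variables x PCs
dim(pc$x)         # scores matrix: observations x PCs
length(pc$sdev)   # one standard deviation per PC

# Each matrix option reshapes one slot into a tibble
loadings <- tidy(pc, matrix = "v")  # one row per (variable, PC)
scores   <- tidy(pc, matrix = "u")  # one row per (observation, PC)
variance <- tidy(pc, matrix = "d")  # one row per PC
```

With 4 variables and 150 observations, the loadings tibble has 16 rows, the scores tibble has 600, and the variance tibble has 4.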
Syntax
tidy.prcomp() is the S3 method broom dispatches when you call tidy() on a prcomp fit. Two arguments matter: the fitted object and the matrix shortcut that selects which slot to tidy.
The two arguments worth knowing:
- x: the fitted prcomp object (required)
- matrix: one of "u" / "x" / "samples" / "scores" (PC scores, default), "v" / "rotation" / "variables" / "loadings" (loadings), or "d" / "pcs" / "eigenvalues" (variance per PC)
The aliases exist for backward compatibility. Pick the one that reads best in your code.
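A quick way to convince yourself that the aliases are interchangeable is to compare their outputs directly. The sketch below assumes a scaled iris fit; the flag names (same_loadings and so on) are made up for illustration.

```r
library(broom)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Aliases within a group should return the same tibble
same_loadings <- isTRUE(all.equal(tidy(pc, matrix = "v"), tidy(pc, matrix = "rotation")))
same_scores   <- isTRUE(all.equal(tidy(pc, matrix = "u"), tidy(pc, matrix = "scores")))
same_variance <- isTRUE(all.equal(tidy(pc, matrix = "d"), tidy(pc, matrix = "eigenvalues")))

# Calling tidy() with no matrix argument falls back to "u" (scores)
default_is_scores <- isTRUE(all.equal(tidy(pc), tidy(pc, matrix = "u")))
```

Because the default is scores, spelling out matrix = in every call keeps the intent obvious to readers of your code.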
Pass scale. = TRUE to prcomp() when variables are measured on different scales. Without scaling, the variable with the largest variance dominates the loadings and the tidy output is misleading. Centering alone (center = TRUE) is not enough.
Common patterns
1. Tidy the loadings (matrix = "v")
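Each row is one (variable, PC) pair, with the variable name stored in the column named column. value is the rotation weight, a number between -1 and 1 indicating how strongly that variable loads onto that component. For the scaled iris fit, PC1 has same-signed loadings of similar size on three of the four variables (Sepal.Length, Petal.Length, Petal.Width), meaning it picks up overall flower size. PC2 is dominated by Sepal.Width, so it mainly separates flowers by sepal width. The overall sign of a component is arbitrary, so read the relative signs within a PC rather than the absolute ones.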
The long format pays off immediately. To rank variables by their absolute contribution to PC1, run tidy(pc, matrix = "v") |> filter(PC == 1) |> arrange(desc(abs(value))), then take the top three rows. The same one-liner is the foundation for variable-importance reporting in unsupervised pipelines.
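The ranking described above can be sketched as follows, assuming a scaled fit on the iris measurements. top_pc1 is an illustrative name.

```r
library(broom)
library(dplyr)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Long loadings tibble with columns: column, PC, value
loadings <- tidy(pc, matrix = "v")

# Rank variables by absolute contribution to PC1 and keep the top three
top_pc1 <- loadings |>
  filter(PC == 1) |>
  arrange(desc(abs(value))) |>
  slice_head(n = 3)
```

Sorting on abs(value) matters because a large negative loading is as informative as a large positive one.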
2. Tidy the scores (matrix = "u")
Each row is one (observation, PC) pair. The row index lines up with the row number of the data you passed to prcomp(), so you can left-join back to the original iris frame on row. For a scatter plot of PC1 versus PC2, pivot wide on PC first, or use augment() (next section) to skip the reshape. The long shape is also right when you want facets across all PCs in a single ggplot call.
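A minimal sketch of the pivot-and-join route, assuming the data passed to prcomp() was iris with its default row order. The helper column row added to iris and the names scores_wide and scores_joined are illustrative.

```r
library(broom)
library(dplyr)
library(tidyr)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Long scores tibble with columns: row, PC, value
scores_long <- tidy(pc, matrix = "u")

# Pivot to one row per observation with columns PC1..PC4
scores_wide <- scores_long |>
  mutate(row = as.integer(row)) |>  # coerce the key so the join types match
  pivot_wider(names_from = PC, values_from = value, names_prefix = "PC")

# Add a matching row key to iris, then join the scores back on
iris_keyed <- iris |> mutate(row = row_number())
scores_joined <- left_join(iris_keyed, scores_wide, by = "row")
```

The as.integer() coercion is a precaution: it only works because the iris row labels are plain integers. With named rows you would join on the label instead.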
3. Tidy the variance summary (matrix = "d")
This is the scree-plot source. percent is the fraction of total variance explained by each PC, and cumulative is the running sum. For the scaled iris fit, PC1 alone captures about 73 percent of the variance and the first two PCs together reach about 96 percent, so a 2D plot of the first two scores carries almost all the signal. Filter on cumulative to pick the smallest PC subset that retains 95 percent of variance, a common cutoff for dimension reduction before downstream modeling.
The std.dev column equals the square root of the eigenvalue of each PC, so squaring it gives the raw eigenvalue if you ever need the Kaiser criterion (drop PCs with eigenvalue below 1 on scaled data).
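The eigenvalue relationship can be checked in a couple of lines. This sketch assumes a scaled iris fit; the eigenvalue column and the kaiser object are names introduced here for illustration.

```r
library(broom)
library(dplyr)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance summary plus the raw eigenvalue (squared standard deviation)
variance <- tidy(pc, matrix = "d") |>
  mutate(eigenvalue = std.dev^2)

# Kaiser criterion on scaled data: keep PCs with eigenvalue above 1
kaiser <- variance |>
  filter(eigenvalue > 1)
```

On scaled data the eigenvalues sum to the number of variables (4 here), which is a handy sanity check. Note that the Kaiser criterion and a 95 percent cumulative cutoff can disagree about how many PCs to keep.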
4. Attach scores to the data with augment()
augment() returns the original frame with .fittedPC1, .fittedPC2, and so on appended. This is the shortcut for the common case: a wide score table joined to grouping columns, ready to feed into ggplot(aug, aes(.fittedPC1, .fittedPC2, color = Species)). Pass the full data frame (not just the numeric columns) to keep grouping variables for faceting.
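As a sketch of that workflow, the snippet below augments the full iris frame and builds the PC1 versus PC2 scatter plot. The names aug and p are illustrative.

```r
library(broom)
library(ggplot2)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Pass the full frame so Species survives alongside the .fitted columns
aug <- augment(pc, data = iris)

# One point per flower, colored by species
p <- ggplot(aug, aes(.fittedPC1, .fittedPC2, color = Species)) +
  geom_point()
```

Printing p draws the plot. The augmented frame keeps one row per observation, so no reshaping is needed first.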
Loadings ("v") tell you what each PC means in terms of the original variables. Scores ("u") tell you where each observation sits in PC space. Variance ("d") tells you how many PCs you actually need. Choosing the right matrix is the whole skill: loadings and scores come back in the same long shape (one row per label and PC), while the variance summary has one row per PC.
tidy.prcomp() vs str(pc) and factoextra
Three tools cover the same job from different angles. Pick by what you do next with the output.
| Tool | Output type | Best for |
|---|---|---|
| str(pc) | nested list summary | Quick console inspection |
| broom::tidy(pc, matrix = ...) | long tibble | dplyr piping, ggplot, joins |
| factoextra::fviz_pca_biplot(pc) | ggplot object | Publication-ready biplot |
Use tidy() whenever the next step is code: filtering loadings, joining scores to grouping variables, building a custom scree plot. Use factoextra for the final figure; it does not return a tibble.
Common pitfalls
Pitfall 1: forgetting to scale before tidying. Without scale. = TRUE, the loadings tibble reports raw covariance directions, not correlation directions. The variable with the largest raw variance hijacks PC1, even when that variance only reflects its units or measurement range.
In the unscaled iris fit, Petal.Length appears to dominate PC1 only because its raw variance is larger. The scaled fit described earlier splits the loading roughly evenly across the three correlated size variables, which is the correct read.
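The contrast is easy to reproduce. The sketch below fits iris with and without scaling and lines up the PC1 loadings side by side. The names pc_raw, pc_scaled, and pc1_compare are illustrative.

```r
library(broom)
library(dplyr)

pc_raw    <- prcomp(iris[, 1:4])                  # centered only
pc_scaled <- prcomp(iris[, 1:4], scale. = TRUE)   # centered and scaled

# Stack the two loadings tibbles and keep PC1 for comparison
pc1_compare <- bind_rows(
  raw    = tidy(pc_raw, matrix = "v"),
  scaled = tidy(pc_scaled, matrix = "v"),
  .id = "fit"
) |>
  filter(PC == 1) |>
  mutate(abs_value = abs(value)) |>
  arrange(fit, desc(abs_value))
```

In the raw fit the largest absolute PC1 loading sits well above the rest, while in the scaled fit the top three loadings are close to each other.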
Pitfall 2: confusing matrix aliases. tidy(pc, matrix = "scores"), matrix = "u", and matrix = "x" all return scores. The slot names in pc itself use a different convention (pc$x for scores, pc$rotation for loadings). Pick one alias and use it consistently.
Do not rely on glance(pc) for a variance summary. broom documents tidy() and augment() methods for prcomp, and unlike glance.kmeans or glance.lm, a glance method for prcomp may not be available in your installed version. For the variance summary, use tidy(pc, matrix = "d").
Pitfall 3: passing a princomp fit by mistake. Base R has both prcomp() (SVD-based) and princomp() (eigendecomposition, legacy), and their fitted objects store results under different names (princomp uses loadings and scores where prcomp uses rotation and x). The tidiers described here are written for prcomp, so stick with prcomp() for consistent broom output.
Try it yourself
Try it: Fit prcomp() on the four numeric columns of iris with scaling, then use tidy() to get the variance summary. Filter to the smallest set of PCs that together explain at least 95 percent of variance. Save the result to ex_var.
Explanation: The lag() trick keeps the first PC that crosses the 95 percent threshold so the cumulative total actually reaches the target. A naive filter(cumulative <= 0.95) would drop PC2 because its cumulative is already 0.958, leaving you below the goal.
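One possible solution along the lines of the explanation, written as a sketch: the intermediate name ex_pc is an assumption, and dplyr::lag() is called with its namespace to avoid confusion with stats::lag().

```r
library(broom)
library(dplyr)

ex_pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Keep every PC whose *previous* cumulative share is still below 95 percent,
# so the PC that crosses the threshold is included
ex_var <- tidy(ex_pc, matrix = "d") |>
  filter(dplyr::lag(cumulative, default = 0) < 0.95)
```

For the scaled iris fit this keeps PC1 and PC2, whose cumulative share is about 0.958.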
Related broom functions for PCA
After mastering tidy() for prcomp, look at:
- augment.prcomp(): original data with .fittedPC1, .fittedPC2, and more appended as columns
- methods(class = "prcomp"): with broom loaded, lists the S3 methods available for prcomp fits, which shows whether your installed version offers anything beyond tidy() and augment()
- factoextra::fviz_pca_var(): ggplot variable-loadings plot, drawn from the fitted PCA object
- factoextra::fviz_eig(): scree plot wrapper that takes the fitted PCA object
For a quick loadings heatmap, pipe the matrix = "v" output into ggplot2::geom_tile(). See the official broom reference for prcomp tidiers for the full column dictionary per matrix option.
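A minimal heatmap sketch, assuming a scaled iris fit. The diverging fill scale and axis labels are styling choices made here, not something the tidier requires.

```r
library(broom)
library(ggplot2)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# One tile per (variable, PC), filled by the loading value
heat <- tidy(pc, matrix = "v") |>
  ggplot(aes(x = factor(PC), y = column, fill = value)) +
  geom_tile() +
  scale_fill_gradient2() +   # diverging scale centered on zero
  labs(x = "PC", y = NULL, fill = "Loading")
```

Converting PC to a factor keeps the x axis discrete, so each component gets its own column of tiles.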
FAQ
How do I get PCA loadings as a data frame in R?
Call broom::tidy(pc, matrix = "v") on a fitted prcomp object. Each row is one (variable, PC) pair with the rotation weight in the value column. This is the long-format equivalent of pc$rotation, which is a matrix. Use the long form for ggplot or dplyr; pivot wide with tidyr::pivot_wider(names_from = PC) if you need the classic loadings matrix back.
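The round trip from long tibble back to a loadings matrix can be sketched like this. loadings_wide and loadings_mat are illustrative names, and names_prefix = "PC" is added so the wide columns match the PC1..PC4 labels in pc$rotation.

```r
library(broom)
library(tidyr)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Long loadings -> one row per variable, one column per PC
loadings_wide <- tidy(pc, matrix = "v") |>
  pivot_wider(names_from = PC, values_from = value, names_prefix = "PC")

# Rebuild a plain numeric matrix with variables as row names
loadings_mat <- as.matrix(loadings_wide[, -1])
rownames(loadings_mat) <- loadings_wide$column
```

The rebuilt matrix should agree with pc$rotation value for value.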
What is the difference between matrix = "u", "v", and "d" in tidy.prcomp?
matrix = "u" returns the principal component scores (default), one row per observation per PC. matrix = "v" returns the loadings. matrix = "d" returns the variance summary per PC. The letters come from the SVD X = U D V', which is how prcomp() computes the PCA internally. Note that the scores prcomp stores are X V = U D, so the "u" output is already scaled by the singular values.
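The link between the SVD and the prcomp slots can be verified numerically. This sketch uses only base R and assumes a scaled fit. Absolute values are compared because the sign of each component is arbitrary.

```r
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Center and scale the data the same way prcomp does with scale. = TRUE
X <- scale(iris[, 1:4])
s <- svd(X)   # X = U D V'

# Scores are U D; standard deviations are singular values over sqrt(n - 1)
scores_from_svd <- s$u %*% diag(s$d)
sdev_from_svd   <- s$d / sqrt(nrow(X) - 1)

scores_match   <- isTRUE(all.equal(abs(unname(pc$x)), abs(scores_from_svd)))
loadings_match <- isTRUE(all.equal(abs(unname(pc$rotation)), abs(s$v)))
sdev_match     <- isTRUE(all.equal(pc$sdev, sdev_from_svd))
```

All three flags should come out TRUE, which ties "u", "v", and "d" back to the score matrix, the rotation matrix, and the standard deviations.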
Can tidy() build a scree plot directly?
Not directly, but the data is one pipe away. Run tidy(pc, matrix = "d") |> ggplot(aes(PC, percent)) + geom_col() for a percent-variance scree plot. Swap percent for cumulative for the cumulative version, or use factoextra::fviz_eig(pc) for a publication-ready figure with reference lines.
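A slightly fuller version of that pipe, as a sketch: the dashed reference line at 5 percent and the axis labels are illustrative additions.

```r
library(broom)
library(ggplot2)

pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Percent-variance scree plot with a 5% reference line
scree <- tidy(pc, matrix = "d") |>
  ggplot(aes(PC, percent)) +
  geom_col() +
  geom_hline(yintercept = 0.05, linetype = "dashed") +
  labs(x = "Principal component", y = "Proportion of variance")
```

Replacing percent with cumulative in aes() gives the cumulative version without any other change.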
Does tidy work with prcomp on tibbles?
Yes. prcomp() accepts both data frames and tibbles, and the broom method handles both identically. The output is always a tibble regardless of input class. Pass the original tibble to augment() to keep grouping columns.
Why does augment() return more rows than I expected?
It does not; augment() returns exactly as many rows as the data you pass. If you see more rows, check whether you accidentally passed tidy(pc, matrix = "u") (rows = observations x PCs) into a downstream step expecting one row per observation. The augmented output preserves your input row count.