recipes step_corr() in R: Drop Correlated Predictors
The recipes step_corr() function in R removes predictors that are highly correlated with another column, judged by a tunable correlation threshold, before the redundancy can destabilize a model.
step_corr(rec, all_numeric_predictors())                        # screen numeric predictors
step_corr(rec, all_numeric_predictors(), threshold = 0.8)       # stricter cutoff
step_corr(rec, x, y, z)                                         # named columns only
step_corr(rec, all_numeric_predictors(), method = "spearman")   # rank correlation
step_corr(rec, all_numeric_predictors(), use = "complete.obs")  # NA handling
bake(prep(rec), new_data = NULL)                                # train, then apply
tidy(prep(rec), number = 1)                                     # list removed columns
Need explanation? Read on for examples and pitfalls.
What step_corr() does
step_corr() deletes columns that duplicate the information already carried by another predictor. When two columns move together, a model gains little from the second one, and linear models in particular return unstable coefficients and inflated standard errors. Screening that redundancy out inside the recipe means the model never sees it.
step_corr() works in two stages, like every recipe step. During prep() it builds the correlation matrix of the columns you selected, finds every pair whose absolute correlation exceeds threshold, and records which columns to drop. bake() then removes those columns from the training data and from any new data you pass through later.
step_corr() syntax and arguments
step_corr() is a selection step controlled by a single threshold plus a correlation method. You name columns directly or, more often, use the tidyselect helper all_numeric_predictors(), since correlation is only defined for numeric data.
The threshold argument is the only knob most pipelines touch: a column pair is flagged when its absolute correlation is above this value, so 0.9 is permissive and 0.75 is aggressive. The removals slot is filled automatically at prep() time with the names of the dropped columns. Full argument detail lives in the recipes step_corr reference.
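As a rough illustration, the call below spells out the arguments discussed above instead of relying on defaults. The values shown for use and method are what I understand the defaults to be in current recipes releases, so treat them as an assumption and check the reference for your installed version; mtcars is only a convenient stand-in dataset.

```r
library(recipes)

# Stand-in recipe: mpg is the outcome, every other column is a predictor
rec <- recipe(mpg ~ ., data = mtcars)

rec_corr <- step_corr(
  rec,
  all_numeric_predictors(),       # columns to screen
  threshold = 0.9,                # flag pairs whose absolute correlation is above this
  use = "pairwise.complete.obs",  # NA handling, passed on to stats::cor() (assumed default)
  method = "pearson"              # or "kendall" / "spearman" (assumed default)
)

rec_corr  # the step is only declared here; nothing is computed until prep()
```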
step_corr() examples
Every example follows the prep-then-bake rhythm. The frame below has a core_metric column and two noisy copies of it, proxy_one and proxy_two, plus an unrelated column. core_metric correlates about 0.93 with each proxy, while the two proxies correlate only about 0.86 with each other.
At threshold = 0.9 only core_metric exceeds the cutoff, and it does so against both proxies. Because it is the most connected column, step_corr() drops it and keeps the two proxies, whose mutual correlation of 0.86 stays under the line. To confirm exactly what left, call tidy() on the prepared recipe.
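The example described above can be sketched as follows. The original frame is not reproduced here, so corr_df, the outcome column, the seed, the row count, and the noise level are all assumptions chosen so that the sample correlations land near the quoted 0.93 and 0.86; your exact figures will differ slightly.

```r
library(recipes)

set.seed(42)  # assumed seed, any value works
n <- 2000     # a large n keeps sample correlations close to their targets

core_metric <- rnorm(n)
corr_df <- data.frame(
  outcome     = rnorm(n),                          # hypothetical outcome column
  core_metric = core_metric,
  proxy_one   = core_metric + rnorm(n, sd = 0.4),  # noisy copy 1
  proxy_two   = core_metric + rnorm(n, sd = 0.4),  # noisy copy 2
  unrelated   = rnorm(n)
)

# Pairwise correlations among the predictors only
round(cor(corr_df[, -1]), 2)

rec <- recipe(outcome ~ ., data = corr_df) |>
  step_corr(all_numeric_predictors(), threshold = 0.9)

prepped <- prep(rec)                      # learn which columns to drop
baked   <- bake(prepped, new_data = NULL) # apply to the training data

names(baked)               # surviving columns
tidy(prepped, number = 1)  # one row per dropped predictor
```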
The terms column names every dropped predictor; an empty tibble means nothing crossed the threshold. Lowering threshold makes the filter stricter and catches milder relationships.
At 0.8 the three correlated columns form one cluster, so step_corr() keeps a single representative and drops the other two. The crucial property is that this decision is learned once during prep() and then applied unchanged to fresh data.
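A sketch of the stricter filter and of reusing it on new rows is shown below. make_frame() is a hypothetical helper that generates data shaped like the example frame; the seeds, sizes, and noise level are assumptions, not the original data.

```r
library(recipes)

# Hypothetical generator for frames shaped like the example data
make_frame <- function(n, seed) {
  set.seed(seed)
  core <- rnorm(n)
  data.frame(
    outcome     = rnorm(n),
    core_metric = core,
    proxy_one   = core + rnorm(n, sd = 0.4),
    proxy_two   = core + rnorm(n, sd = 0.4),
    unrelated   = rnorm(n)
  )
}

train_df <- make_frame(2000, seed = 1)
test_df  <- make_frame(500, seed = 2)

# The drop list is learned from the training data only
prepped <- recipe(outcome ~ ., data = train_df) |>
  step_corr(all_numeric_predictors(), threshold = 0.8) |>
  prep()

dropped <- tidy(prepped, number = 1)$terms
dropped  # two of the three correlated columns

train_baked <- bake(prepped, new_data = NULL)
test_baked  <- bake(prepped, new_data = test_df)  # no correlations recomputed here

identical(names(train_baked), names(test_baked))  # same schema on both sides
```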
The test frame loses core_metric even though its own correlations were never measured. That is the point: the column list is fixed from training, so train and test always end with the same schema and no information leaks across the split. The same logic scales to real data.
mtcars has one predictor pair above 0.9: cylinder count (cyl) and displacement (disp) correlate at about 0.902. step_corr() drops whichever of the two carries the higher average absolute correlation with the other predictors, leaving 10 of the original 11 columns.
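A minimal sketch of the mtcars case, assuming mpg is treated as the outcome so that the other ten columns are the predictors being screened:

```r
library(recipes)

prepped <- recipe(mpg ~ ., data = mtcars) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  prep()

cyl_disp <- cor(mtcars$cyl, mtcars$disp)
round(cyl_disp, 3)  # the one predictor pair above the cutoff

removed <- tidy(prepped, number = 1)$terms
removed             # which of the two was dropped

mtcars_baked <- bake(prepped, new_data = NULL)
ncol(mtcars_baked)  # columns left, outcome included
```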
Coming from Python? scikit-learn's VarianceThreshold only screens low-variance columns; the closest match to step_corr() is the DropCorrelatedFeatures transformer from the feature-engine package, or a manual filter on the correlation matrix.
step_corr() vs step_pca()
Both steps tackle correlated predictors, but they leave very different data behind. Choosing wrong either keeps redundancy or destroys the column names your report depends on.
| Step | What it does | Original columns kept? |
|---|---|---|
| step_corr() | Deletes redundant correlated columns | Yes, survivors are unchanged |
| step_pca() | Replaces a correlated set with orthogonal components | No, creates PC1, PC2, ... |
| step_lincomb() | Removes columns that are exact linear combinations | Yes, survivors are unchanged |
Reach for step_corr() when you want a smaller model that still uses interpretable, named predictors. Choose step_pca() when many predictors are correlated and you accept anonymous components in exchange for compression. Use step_lincomb() for the narrower case of exact redundancy, such as dummy columns that sum to a constant.
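To make the contrast concrete, the sketch below runs both steps on mtcars and compares the column names that come out. The choice of mtcars, the normalization before PCA, and num_comp = 2 are illustrative assumptions rather than recommendations.

```r
library(recipes)

base_rec <- recipe(mpg ~ ., data = mtcars)

# step_corr(): survivors keep their original names
corr_names <- base_rec |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  prep() |>
  bake(new_data = NULL) |>
  names()

# step_pca(): predictors are replaced by components
pca_names <- base_rec |>
  step_normalize(all_numeric_predictors()) |>  # PCA is scale-sensitive
  step_pca(all_numeric_predictors(), num_comp = 2) |>
  prep() |>
  bake(new_data = NULL) |>
  names()

corr_names  # original column names, minus the dropped one
pca_names   # the outcome plus PC1 and PC2
```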
Common pitfalls
Three mistakes account for most step_corr() confusion.
- Selecting non-numeric columns. Correlation is numeric-only. Use all_numeric_predictors() rather than all_predictors(), or step_corr() will fail or quietly ignore factor columns depending on the recipes version.
- Assuming it keeps the predictor you care about. The filter ranks columns by average correlation, never by relationship to the outcome. It can discard the variable you find most interpretable and keep its twin.
- Setting the threshold too high to do anything. The default 0.9 only catches near-duplicates. If you expect a leaner model, drop the threshold toward 0.8 or 0.75 and inspect tidy() to see what changed.
Call tidy() after prep() to confirm exactly what was removed.
Try it yourself
Try it: Build a recipe on the frame below, add step_corr() to the numeric predictors with threshold = 0.9, and bake the training data. The revenue and revenue_copy columns are near-identical, so one should disappear. Save the baked frame to ex_baked.
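The exercise frame is not reproduced here, so the block below builds a stand-in with the same shape: five columns, with revenue_copy a near-identical copy of revenue. The name ex_df, the profit outcome, the other column names, the seed, and the distributions are all assumptions.

```r
set.seed(123)  # assumed seed
n <- 200

revenue <- rnorm(n, mean = 100, sd = 20)
ex_df <- data.frame(
  profit       = rnorm(n, mean = 10, sd = 3),    # hypothetical outcome
  revenue      = revenue,
  revenue_copy = revenue + rnorm(n, sd = 1),     # near-duplicate of revenue
  visits       = rnorm(n, mean = 500, sd = 50),
  discount     = runif(n, min = 0, max = 0.3)
)

# Your turn: build the recipe, add step_corr(), prep it, and bake into ex_baked
```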
Solution:
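One possible solution, written against a stand-in frame because the exercise data is not reproduced here. ex_df, the profit outcome, and the other column names are assumptions; only revenue, revenue_copy, threshold = 0.9, and ex_baked come from the exercise text.

```r
library(recipes)

# Stand-in frame: five columns, revenue_copy nearly identical to revenue
set.seed(123)
n <- 200
revenue <- rnorm(n, mean = 100, sd = 20)
ex_df <- data.frame(
  profit       = rnorm(n, mean = 10, sd = 3),
  revenue      = revenue,
  revenue_copy = revenue + rnorm(n, sd = 1),
  visits       = rnorm(n, mean = 500, sd = 50),
  discount     = runif(n, min = 0, max = 0.3)
)

ex_baked <- recipe(profit ~ ., data = ex_df) |>
  step_corr(all_numeric_predictors(), threshold = 0.9) |>
  prep() |>
  bake(new_data = NULL)

ncol(ex_baked)   # one column fewer than ex_df
names(ex_baked)  # only one of revenue / revenue_copy remains
```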
Explanation: revenue and revenue_copy correlate above 0.99, so step_corr() flags the pair during prep() and bake() drops one column. The frame goes from five columns to four.
Related recipes functions
step_corr() is one filter in a larger column-cleaning toolkit. These steps often appear together in the same recipe:
- step_nzv() removes near-zero-variance columns that are overwhelmingly one value.
- step_zv() removes strictly constant columns with exactly one unique value.
- step_lincomb() removes columns that form exact linear combinations of others.
- step_pca() compresses a correlated set into orthogonal principal components.
- step_normalize() centers and scales numeric predictors. It often sits in the same recipe, although centering and scaling do not change the correlations that step_corr() measures.
FAQ
What does step_corr() do in R?
step_corr() is a recipes step that removes predictors highly correlated with another column. During prep() it computes the correlation matrix of the selected columns and flags every pair whose absolute correlation exceeds threshold, which defaults to 0.9. For each flagged pair it keeps the less redundant column and marks the other for removal. bake() then drops the marked columns from the training data and from any new data passed through the recipe later.
How does step_corr() decide which column to remove?
For each pair above the threshold, step_corr() compares the two columns by their average absolute correlation with all other predictors. It keeps the column with the lower average and removes the one that is more redundant overall. This is the same heuristic used by caret's findCorrelation(). The decision never considers the outcome variable, so removal is based purely on predictor-to-predictor relationships.
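The rule described above can be sketched in base R. This is a simplified illustration of the heuristic, not the source code of recipes or caret, and the variable names are my own; mtcars with mpg set aside as the outcome is an assumed example.

```r
# Predictors only: set the outcome column mpg aside
predictors <- mtcars[, setdiff(names(mtcars), "mpg")]
threshold  <- 0.9

abs_cor <- abs(cor(predictors))
avg_cor <- colMeans(abs_cor)  # average absolute correlation per column

# Visit each pair once (upper triangle) and keep those above the threshold
upper <- abs_cor
upper[lower.tri(upper, diag = TRUE)] <- NA
flagged <- which(upper > threshold, arr.ind = TRUE)

# For every flagged pair, mark the column with the higher average for removal
first   <- rownames(abs_cor)[flagged[, "row"]]
second  <- colnames(abs_cor)[flagged[, "col"]]
to_drop <- unique(ifelse(avg_cor[first] > avg_cor[second], first, second))

to_drop
```

The outcome never enters the calculation, which is why the filter can discard a predictor that happens to be the more useful of the pair.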
What is the difference between step_corr() and step_pca()?
step_corr() deletes redundant columns and keeps the survivors exactly as they were, so your model still uses named, interpretable predictors. step_pca() instead replaces a correlated group with new orthogonal components named PC1, PC2, and so on. Use step_corr() when interpretability matters and step_pca() when you want maximum compression and accept anonymous components in return.
Why did step_corr() not remove any columns?
The most common reason is a threshold set too high for the data. The default 0.9 only catches near-duplicate columns, so moderately correlated predictors survive. Lower threshold toward 0.8 or 0.75 to catch them. The other reason is that no selected columns are numeric: correlation is undefined for factors, so step_corr() has nothing to measure unless you select numeric predictors.