recipes step_pls() in R: PLS Feature Extraction
The recipes step_pls() function adds a supervised partial least squares (PLS) feature extraction step to a tidymodels recipe, replacing many correlated predictors with a few components built to predict your outcome.
step_pls(rec, all_numeric_predictors(), outcome = "mpg")                        # 2 comps (default)
step_pls(rec, all_numeric_predictors(), outcome = "mpg", num_comp = 5)          # n components
step_pls(rec, disp, hp, wt, outcome = "mpg")                                    # name predictors
step_pls(rec, all_numeric_predictors(), outcome = "mpg", predictor_prop = 0.5)  # sparse
step_pls(rec, all_numeric_predictors(), outcome = "mpg", num_comp = tune())     # tunable
rec |> prep() |> tidy(number = 1)                                               # inspect loadings
Need explanation? Read on for examples and pitfalls.
What step_pls() does
step_pls() turns many predictors into a few outcome-aware components. Partial least squares is a supervised cousin of principal component analysis. Where PCA finds directions of maximum predictor variance, PLS finds directions whose scores have maximum covariance with the outcome you supply. Inside a recipe, step_pls() removes the selected predictor columns and writes num_comp new columns named PLS1, PLS2, and so on.
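For the first component, the contrast can be written as an optimization over a unit-length weight vector. The symbols here are introduced for illustration only: $X$ is the centered predictor matrix, $y$ the centered outcome, and $w$ the weight vector.

```latex
\begin{aligned}
\text{PCA:}\quad & w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw) \\
\text{PLS:}\quad & w_1 = \arg\max_{\|w\| = 1} \operatorname{Cov}(Xw,\, y)^2
                 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw)\,\operatorname{Corr}(Xw,\, y)^2
\end{aligned}
```

The second form of the PLS objective drops the constant factor $\operatorname{Var}(y)$, and it shows that PLS trades off predictor variance against correlation with the outcome rather than maximizing either one alone.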
Because the components are steered toward the response, a model usually needs fewer PLS components than PCA components to reach the same accuracy. That makes step_pls() a strong choice when predictors are numerous, correlated, or collinear and you still want a compact, predictive feature set.
step_pca() compresses predictors blindly; step_pls() compresses them toward the outcome. If your reduced features will feed a predictive model, the supervised version usually carries more signal per component.

step_pls() syntax and arguments
The function takes a recipe, a set of predictors, and the outcome to align against. Abbreviated, its signature is step_pls(recipe, ..., role = "predictor", outcome = NULL, num_comp = 2, predictor_prop = 1, options = list(scale = TRUE), skip = FALSE, id = ...). The arguments you set most often are below.
| Argument | Default | Purpose |
|---|---|---|
| ... | none | Predictor columns fed into PLS, set with selectors like all_numeric_predictors(). |
| outcome | NULL | The response variable. Components are built to predict it. Required. |
| num_comp | 2 | Number of PLS components retained as new columns. |
| predictor_prop | 1 | Maximum proportion of predictors allowed nonzero per component (sparse PLS). |
| options | list(scale = TRUE) | Passed to the underlying mixOmics fitter, including scale. |
| skip | FALSE | If TRUE, the step is skipped when bake() runs on new data. |
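step_pls() needs the mixOmics package, which does the actual fitting when the step is estimated in prep(). Install it once with BiocManager::install("mixOmics") before building any recipe that uses this step.

step_pls() examples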
These examples use mtcars, predicting mpg from the other numeric columns. Each block builds on the recipe object from the first.
Start by declaring the recipe and adding the PLS step. Nothing is computed yet; the recipe only records the plan.
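A minimal sketch of that first step, assuming recipes and mixOmics are installed. The object name rec and the choice of num_comp = 3 are illustrative; the default is 2.

```r
library(recipes)

# Declare the outcome and predictors, then append the PLS step.
# num_comp = 3 is an illustrative choice; the default is 2.
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(all_numeric_predictors(), outcome = "mpg", num_comp = 3)

# Printing the recipe lists the planned step; nothing is estimated yet.
rec
```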
Call prep() to estimate the PLS rotation on the training data, then bake() to apply it. The 10 predictors collapse into 3 component columns.
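The prep-then-bake sequence might look like the sketch below. It assumes mixOmics is installed and uses num_comp = 3 so that the result has three component columns; pls_prep and pls_data are illustrative names.

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(all_numeric_predictors(), outcome = "mpg", num_comp = 3)

# Estimate the PLS projection on the training data.
pls_prep <- prep(rec, training = mtcars)

# new_data = NULL returns the processed training set.
pls_data <- bake(pls_prep, new_data = NULL)

# The original predictors are gone; the outcome and the components remain.
names(pls_data)
```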
To see how each original predictor contributes to a component, call tidy() on the prepped recipe. The value column is the loading.
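As a sketch, assuming mixOmics is installed, the call below rebuilds the prepped recipe and pulls the per-predictor values for step number 1. The name pls_loadings is illustrative.

```r
library(recipes)

pls_prep <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(all_numeric_predictors(), outcome = "mpg", num_comp = 3) |>
  prep()

# One row per predictor and component; `terms` names the predictor,
# `component` names the PLS column, and `value` is its contribution.
pls_loadings <- tidy(pls_prep, number = 1)
pls_loadings
```

Sorting the result by the absolute size of value within each component is a quick way to see which predictors drive it.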
The num_comp argument controls how many components you keep. Use a single component for a strong baseline, or more when one direction is not enough.
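The sketch below compares a one-component and a four-component version of the same recipe, assuming mixOmics is installed. The helper bake_pls() is hypothetical and exists only to avoid repeating the pipeline.

```r
library(recipes)

# Hypothetical helper: build, prep, and bake a PLS recipe with k components.
bake_pls <- function(k) {
  recipe(mpg ~ ., data = mtcars) |>
    step_pls(all_numeric_predictors(), outcome = "mpg", num_comp = k) |>
    prep() |>
    bake(new_data = NULL)
}

one_comp  <- bake_pls(1)
four_comp <- bake_pls(4)

# Each result holds the outcome plus k component columns.
names(one_comp)
names(four_comp)
```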
Set num_comp = tune() and let tune_grid() pick the value that minimizes cross-validated error. A common search range starts at 1 and stops well below the number of predictors.

Setting predictor_prop below 1 switches to sparse PLS, which forces some loadings to exactly zero. That yields components built from a subset of predictors, which is easier to interpret.
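A minimal sparse PLS sketch, assuming mixOmics is installed. With predictor_prop = 0.5, roughly half of the 10 predictors are allowed nonzero weights per component; the names sparse_prep and sparse_terms are illustrative.

```r
library(recipes)

sparse_prep <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(
    all_numeric_predictors(),
    outcome = "mpg",
    num_comp = 2,
    predictor_prop = 0.5  # allow about half the predictors per component
  ) |>
  prep()

# Inspect the per-predictor values to see which ones were shrunk to zero.
sparse_terms <- tidy(sparse_prep, number = 1)
sparse_terms

# The baked data has the same shape as in the dense case.
sparse_data <- bake(sparse_prep, new_data = NULL)
names(sparse_data)
```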
step_pls() vs step_pca() and other reduction steps
Pick step_pls() when the reduced features will feed a predictive model. The other reduction steps ignore the outcome, so they optimize a different objective. The table compares the four most common choices.
| Step | Uses the outcome? | Best for | Component basis |
|---|---|---|---|
| step_pls() | Yes (supervised) | Predictive features from correlated predictors | Outcome-aligned directions |
| step_pca() | No (unsupervised) | Variance compression, exploratory analysis | Maximum predictor variance |
| step_kpca() | No | Nonlinear structure in predictors | Kernel-mapped variance |
| step_ica() | No | Separating independent source signals | Statistical independence |
The decision rule is simple. If you have a target variable and you want fewer, stronger features, reach for step_pls(). If you only want to compress predictors or explore them, step_pca() is the lighter, dependency-free option.
The closest Python equivalent of step_pls() is sklearn.cross_decomposition.PLSRegression, applied inside a Pipeline before the estimator.

Common pitfalls
Three mistakes cause most step_pls() failures. Each one below shows the error and the fix.
Omitting outcome is the most frequent. PLS is supervised, so the step cannot run without a response.
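The sketch below triggers the failure inside tryCatch() so the script keeps running, then shows the fix. It assumes mixOmics is installed; the exact error text depends on your recipes version, so it is captured rather than quoted here.

```r
library(recipes)

# Broken: no outcome supplied. Capture the error message instead of stopping.
missing_outcome <- tryCatch(
  recipe(mpg ~ ., data = mtcars) |>
    step_pls(all_numeric_predictors()) |>
    prep(),
  error = function(e) conditionMessage(e)
)
missing_outcome

# Fixed: name the response explicitly.
fixed_prep <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(all_numeric_predictors(), outcome = "mpg") |>
  prep()
```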
The second is the missing Bioconductor dependency. Install mixOmics once and the step works for every recipe afterward.
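One way to guard against the missing dependency is an install-if-absent check like the sketch below. It needs network access the first time it runs.

```r
# Install BiocManager from CRAN if it is not already available.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install mixOmics from Bioconductor only when it is missing.
if (!requireNamespace("mixOmics", quietly = TRUE)) {
  BiocManager::install("mixOmics")
}
```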
The third is data leakage. PLS components are estimated using the outcome, so calling prep() on the full dataset lets test-set responses shape your features. Always prep() on training data only.
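A leakage-safe workflow can be sketched as follows, assuming mixOmics is installed. The split uses base R sampling with an arbitrary seed and a 24/8 split chosen for illustration; rsample::initial_split() would serve the same purpose.

```r
library(recipes)

# Hold out 8 of the 32 rows; the seed and split size are illustrative.
set.seed(42)
train_rows <- sample(nrow(mtcars), size = 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

# Estimate the components from the training rows only.
pls_prep <- recipe(mpg ~ ., data = train) |>
  step_pls(all_numeric_predictors(), outcome = "mpg") |>
  prep(training = train)

# Apply the already-estimated projection to both sets.
train_pls <- bake(pls_prep, new_data = NULL)
test_pls  <- bake(pls_prep, new_data = test)
```

The test-set mpg values never enter prep(), so they cannot influence the component weights.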
Try it yourself
Try it: Build a recipe on mtcars that extracts 2 PLS components for mpg, then count the columns the baked data has. Save the result to ex_pls.
Solution:
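One possible solution, sketched under the assumption that mixOmics is installed:

```r
library(recipes)

# Build the recipe, estimate 2 PLS components, and return the training data.
ex_pls <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(all_numeric_predictors(), outcome = "mpg", num_comp = 2) |>
  prep() |>
  bake(new_data = NULL)

# Count the columns in the baked data.
ncol(ex_pls)
```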
Explanation: step_pls() replaces the 10 numeric predictors with num_comp = 2 component columns. The baked tibble keeps the mpg outcome plus PLS1 and PLS2, which is 3 columns total.
Related recipes functions
These steps pair naturally with step_pls() in a preprocessing pipeline:
- step_pca() for unsupervised variance-based dimension reduction.
- step_kpca() for nonlinear, kernel-based reduction.
- step_ica() to separate independent source signals.
- step_normalize() to center and scale predictors before PLS.
- step_corr() to drop highly correlated predictors as a lighter alternative.
See the official recipes step_pls() reference for the full argument list.
FAQ
What package does step_pls() need? step_pls() relies on the mixOmics package, which lives on Bioconductor rather than CRAN. Installing recipes alone is not enough. Run BiocManager::install("mixOmics") once, and every recipe that uses step_pls() will then prep without error. If you skip this, prep() stops with a message naming the missing package, so the fix is always clear.
What is the difference between step_pls() and step_pca()? Both reduce many predictors into a few components, but step_pca() is unsupervised and step_pls() is supervised. PCA finds directions of maximum predictor variance, ignoring the outcome. PLS finds directions that best predict the outcome you pass through the outcome argument. For modeling, PLS components usually carry more predictive signal per component, so you need fewer of them.
How many components should I use in step_pls()? Start with the default of 2 and tune from there. Set num_comp = tune() and use tune_grid() with cross-validation to pick the value that minimizes prediction error. The useful range runs from 1 up to the number of predictors, but the optimum is usually small because each PLS component is outcome-aligned and information-dense.
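A tuning run could look like the sketch below. It assumes the tidymodels meta-package and mixOmics are installed; the linear model, the 5-fold resampling, the seed, and the 1 to 5 grid are illustrative choices, not recommendations.

```r
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# Mark num_comp as a tuning parameter inside the recipe.
pls_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_pls(all_numeric_predictors(), outcome = "mpg", num_comp = tune())

pls_wf <- workflow() |>
  add_recipe(pls_rec) |>
  add_model(linear_reg())

# Candidate values for num_comp.
pls_grid <- tibble(num_comp = 1:5)

pls_res <- tune_grid(
  pls_wf,
  resamples = folds,
  grid = pls_grid,
  metrics = metric_set(rmse)
)

# The candidate with the lowest cross-validated RMSE.
best <- select_best(pls_res, metric = "rmse")
best
```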
Can step_pls() be used for classification? Yes. When the outcome is a factor, step_pls() performs PLS discriminant analysis (PLS-DA) under the hood, producing components that separate the classes. The recipe syntax is identical; you just point outcome at a categorical column instead of a numeric one.
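As a sketch of the classification case, the example below uses the built-in iris data, which is not part of the examples above, with the factor column Species as the outcome. It assumes mixOmics is installed.

```r
library(recipes)

# Species is a factor, so the step fits a discriminant-style PLS.
cls_prep <- recipe(Species ~ ., data = iris) |>
  step_pls(all_numeric_predictors(), outcome = "Species", num_comp = 2) |>
  prep()

cls_data <- bake(cls_prep, new_data = NULL)
head(cls_data)
```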
Does step_pls() scale the predictors automatically? Yes. By default the options argument is list(scale = TRUE), so predictors are centered and scaled before the PLS fit. This prevents large-range variables from dominating the components. You can still add step_normalize() earlier if other steps in your recipe need scaled inputs too.