recipes step_ica() in R: Independent Component Features
The recipes step_ica() function in R adds independent component analysis to a preprocessing pipeline, replacing numeric predictors with a smaller block of statistically independent components before a model is trained.
    step_ica(rec, all_numeric_predictors())                             # default 5 components
    step_ica(rec, all_numeric_predictors(), num_comp = 3)               # keep 3 components
    step_ica(rec, x, y, z)                                              # named columns only
    step_ica(rec, all_numeric_predictors(), seed = 123)                 # reproducible result
    step_ica(rec, all_numeric_predictors(), prefix = "ic_")             # rename IC1 to ic_1
    step_ica(rec, all_numeric_predictors(), keep_original_cols = TRUE)  # keep input columns
    rec_prep <- prep(rec); bake(rec_prep, new_data = NULL)              # train, then apply
    tidy(prep(rec), number = 1)                                         # inspect the component loadings
Need explanation? Read on for examples and pitfalls.
What step_ica() does
step_ica() converts a group of numeric predictors into independent components. Independent component analysis assumes the observed columns are linear mixtures of hidden source signals and tries to recover those sources. Where principal component analysis only removes correlation, ICA goes further and separates components that are statistically independent, which is a stronger condition.
Like every recipe step, step_ica() works in two stages. During prep() it runs fastICA() on the selected training columns and stores the estimated unmixing matrix. bake() then projects any data, training or new, onto those stored components. The output columns are named IC1, IC2, and so on, and the original predictors are dropped.
step_ica() syntax and arguments
step_ica() is a transformation step sized by a component count. You select columns directly or, far more often, with the tidyselect helper all_numeric_predictors(), since ICA is only defined for numeric data. Unlike step_pca(), there is no variance threshold argument, because ICA does not rank its components by variance.
The num_comp argument fixes how many components survive; if it exceeds the number of selected predictors, recipes falls back to a smaller value. The seed argument matters because ICA starts from a random initialization, so a fixed seed makes the result reproducible. The options list is forwarded to fastICA(). Full argument detail lives in the recipes step_ica reference.
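A minimal sketch of these arguments in use, assuming the recipes and fastICA packages are installed. The object names (rec_args, baked_args) and the seed value are illustrative choices, not taken from the reference.

```r
library(recipes)

# Illustrative recipe: two components, a custom prefix, and the inputs retained
rec_args <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(
    all_numeric_predictors(),
    num_comp = 2,               # number of components to keep
    seed = 42,                  # fixes the random initialization
    prefix = "ic_",             # component columns become ic_1, ic_2
    keep_original_cols = TRUE   # keep the normalized predictors as well
  )

baked_args <- bake(prep(rec_args), new_data = NULL)
names(baked_args)
```

With keep_original_cols = TRUE the component columns are added next to the inputs instead of replacing them, which can help while checking a new recipe.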
step_ica() calls fastICA::fastICA() under the hood, so install the fastICA package before adding the step to a recipe. If the package is missing, prep() stops with a clear error naming the dependency.
step_ica() examples
Every example follows the prep-then-bake rhythm. The first builds a recipe on mtcars, normalizes the ten numeric predictors, and extracts three independent components. Normalizing first is important and is covered in the pitfalls below.
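A sketch of that first example, assuming recipes and fastICA are installed. The object names and the seed value are illustrative.

```r
library(recipes)

# mpg is the outcome; the remaining ten mtcars columns are numeric predictors
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(all_numeric_predictors(), num_comp = 3, seed = 123)

# prep() estimates the components on the training data
rec_prep <- prep(rec, training = mtcars)

# bake() with new_data = NULL returns the processed training set
baked <- bake(rec_prep, new_data = NULL)
names(baked)
```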
The ten predictors collapse into IC1, IC2, and IC3, while the outcome mpg passes through untouched. Calling bake() on any frame with the same input columns reuses the stored unmixing matrix, so train and test always land in the same component space.
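Changing the component count is a one-argument edit. A sketch of the same mtcars recipe keeping five components, again assuming fastICA is installed and using illustrative object names:

```r
library(recipes)

# Same pipeline as before, but keeping five components
rec5 <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(all_numeric_predictors(), num_comp = 5, seed = 123)

baked5 <- bake(prep(rec5), new_data = NULL)
ncol(baked5)
names(baked5)
```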
With num_comp = 5 the baked frame has six columns: five components plus the outcome. Pick a count based on how many independent signals you expect the predictors to contain, or tune it as shown later.
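Applying a prepared recipe to unseen rows might look like the sketch below. Splitting mtcars into 29 training rows and 3 held-out rows is an assumption made for illustration, as are the object names.

```r
library(recipes)

# Illustrative split: first 29 rows for training, last 3 as "new" data
train_cars <- mtcars[1:29, ]
new_cars   <- mtcars[30:32, ]

rec_train <- recipe(mpg ~ ., data = train_cars) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(all_numeric_predictors(), num_comp = 3, seed = 123)

# Components are estimated from the training rows only
prep_train <- prep(rec_train, training = train_cars)

# The stored transformation is applied to the held-out rows
new_baked <- bake(prep_train, new_data = new_cars)
new_baked
```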
Because the unmixing matrix was learned only from the training set, these three new rows are projected with the exact same transformation. No information leaks from new data back into the fitted components.
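To look inside the fitted step, tidy() can be called on the prepared recipe. A sketch under the same assumptions (fastICA installed, illustrative names), where step_normalize() is step 1 and step_ica() is step 2:

```r
library(recipes)

rec_tidy <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(all_numeric_predictors(), num_comp = 3, seed = 123)

prep_tidy <- prep(rec_tidy)

# number = 2 selects the second step in the recipe, step_ica()
ica_loadings <- tidy(prep_tidy, number = 2)
ica_loadings
```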
Calling tidy() on the prepared recipe with number = 2 points at the second step, step_ica(). The table shows how each original column loads onto a component, which is the closest ICA gives you to an interpretable view of the transformation.
step_ica() vs other reduction steps
step_ica() is one of several recipes steps that shrink a wide predictor set, and they leave different data behind. Choosing wrong either keeps redundancy or discards structure the model needs.
| Step | Transformation | Output columns | Best when |
|---|---|---|---|
| step_ica() | Independent components | IC1, IC2, ... | Sources are non-Gaussian and independent |
| step_pca() | Linear orthogonal components | PC1, PC2, ... | Predictors are linearly correlated |
| step_kpca() | Kernel (nonlinear) PCA | kPC1, kPC2, ... | Structure between predictors is nonlinear |
| step_corr() | Drops redundant columns | Original survivors | You need named, interpretable predictors |
Reach for step_ica() when you believe the predictors are blends of independent underlying signals, a common situation with sensor, audio, or financial data. Choose step_pca() when you simply want a compact, decorrelated input ranked by variance. Use step_corr() when interpretability matters and you want to keep real column names.
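When it is unclear how many components to keep, the count can be treated as a tuning parameter. The sketch below assumes the tidymodels meta-package and fastICA are installed. The linear model, the five-fold split, and the grid of 1 to 5 components are illustrative choices, not recommendations.

```r
library(tidymodels)

# num_comp is marked for tuning instead of being fixed
rec_tune <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(all_numeric_predictors(), num_comp = tune(), seed = 123)

# Bundle the recipe with a simple model (illustrative choice)
wf <- workflow() |>
  add_recipe(rec_tune) |>
  add_model(linear_reg())

set.seed(1)
folds <- vfold_cv(mtcars, v = 5)

# Evaluate each candidate component count with cross-validation
tuned <- tune_grid(wf, resamples = folds, grid = data.frame(num_comp = 1:5))
show_best(tuned, metric = "rmse")
```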
To tune the component count, set num_comp = tune() and let a workflow() plus tune_grid() search for the value that maximizes cross-validated performance. ICA gives no variance ranking to lean on, so tuning is the reliable way to size the output.
Common pitfalls
Three mistakes account for most step_ica() confusion.
- Forgetting the seed. ICA starts from a random initialization, so two runs without a fixed seed produce different components. Always pass seed when you need reproducible recipes across sessions.
- Skipping normalization. ICA is sensitive to predictor scale. Add step_normalize() before step_ica() so a large-scale column does not distort the recovered signals.
- Expecting ordered components. Unlike PCA, ICA does not rank components by importance. IC1 is not more significant than IC3, so do not drop later components assuming they carry less signal.
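The first two points can be checked with a short sketch, assuming recipes and fastICA are installed. The helper build_ica() is a hypothetical name introduced only for this illustration.

```r
library(recipes)

# Hypothetical helper: normalize, then extract three components with a given seed
build_ica <- function(seed) {
  recipe(mpg ~ ., data = mtcars) |>
    step_normalize(all_numeric_predictors()) |>  # scale before ICA
    step_ica(all_numeric_predictors(), num_comp = 3, seed = seed) |>
    prep() |>
    bake(new_data = NULL)
}

run_a <- build_ica(seed = 123)
run_b <- build_ica(seed = 123)

# Two recipes built with the same seed should agree
all.equal(run_a, run_b)
```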
By default, step_ica() replaces the selected predictors with IC1, IC2, and so on. Any downstream code that references original column names will break silently. Inspect bake() output once after adding the step.
Try it yourself
Try it: Build a recipe on the built-in USArrests dataset, normalize the numeric predictors, and reduce them to two independent components with step_ica(). Use seed = 7. Save the baked training data to ex_ica.
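One possible solution, sketched under the assumption that recipes and fastICA are installed. The recipe name ex_rec is illustrative; ex_ica is the name the exercise asks for.

```r
library(recipes)

# USArrests has no outcome, so every column is declared a predictor
ex_rec <- recipe(~ ., data = USArrests) |>
  step_normalize(all_numeric_predictors()) |>
  step_ica(all_numeric_predictors(), num_comp = 2, seed = 7)

# Train the recipe and return the processed training data
ex_ica <- bake(prep(ex_rec), new_data = NULL)
head(ex_ica)
```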
Explanation: USArrests has four numeric columns and no outcome. step_normalize() puts them on a common scale, then step_ica() with num_comp = 2 and a fixed seed projects them onto two reproducible independent components.
Related recipes functions
step_ica() rarely appears alone in a recipe. These steps commonly sit alongside it:
- step_normalize() centers and scales predictors, the recommended step before step_ica().
- step_pca() extracts variance-ranked orthogonal components instead of independent ones.
- step_kpca() runs kernel PCA for nonlinear predictor structure.
- step_corr() drops correlated columns while keeping the survivors interpretable.
- step_impute_mean() fills missing values, since ICA cannot handle NA inputs.
FAQ
What does step_ica() do in R?
step_ica() is a recipes step that performs independent component analysis as part of a preprocessing pipeline. During prep() it runs fastICA() on the selected numeric predictors and stores the estimated unmixing matrix. During bake() it projects the data onto those components, replacing the original columns with components named IC1, IC2, and so on. Because the transformation is learned only from training data, the same projection applies cleanly to new data without leakage.
What is the difference between step_ica() and step_pca()?
step_pca() finds orthogonal components ranked by how much variance they capture, so dropping later components is a sensible compression strategy. step_ica() finds components that are statistically independent, a stronger condition, and does not rank them by importance. Use step_pca() for general decorrelation and compression, and step_ica() when the predictors are believed to be mixtures of independent, non-Gaussian source signals.
Why do I need to set a seed in step_ica()?
Independent component analysis starts from a random initialization and iterates to a solution, which makes it a stochastic algorithm. Without a fixed seed, two recipes built on the same data can produce different components and column values. Passing seed makes the recipe reproducible across sessions and machines, which matters when you rebuild the same preprocessing later and expect identical components.
Should I normalize before step_ica()?
Yes. step_ica() is sensitive to the scale of its inputs, so a predictor measured in large units can dominate the recovered components. Place step_normalize() before step_ica() in the recipe so every column contributes on a comparable scale. Without it, the independent components reflect measurement units rather than genuine shared structure.