recipes step_isomap() in R: Nonlinear Manifold Step
The recipes step_isomap() function in R adds Isomap manifold embedding to a preprocessing pipeline, replacing numeric predictors with components that follow the curved low-dimensional surface the data lies on rather than its straight-line directions.
step_isomap(rec, all_numeric_predictors())                    # default 5 comps, 50 neighbors
step_isomap(rec, all_numeric_predictors(), num_terms = 3)     # keep 3 components
step_isomap(rec, x, y, z)                                     # named columns only
step_isomap(rec, all_numeric_predictors(), neighbors = 10)    # smaller k for the graph
step_isomap(rec, all_numeric_predictors(), prefix = "iso_")   # rename Isomap1 to iso_1
trained <- prep(rec); bake(trained, new_data = NULL)          # train, then apply
tidy(prep(rec), number = 1)                                   # list transformed columns
Need explanation? Read on for examples and pitfalls.
What step_isomap() does
step_isomap() runs Isomap manifold embedding as a recipe step. Linear PCA flattens predictors onto straight axes through the variance, which works when the data sits roughly inside a hyperplane. Isomap handles the harder case where rows trace a curved surface, like points wrapped around a Swiss roll. It first builds a k-nearest-neighbor graph on the training rows, measures shortest path distances along that graph, then runs classical multidimensional scaling on those geodesic distances to keep the manifold's true shape in a low-dimensional embedding.
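The three stages can be sketched in base R on toy data. This illustrates the algorithm only and is not the code step_isomap() runs internally; the spiral data, the choice k = 4, and every variable name are assumptions made for the example.

```r
# Toy data: 60 points along a 2-D spiral (a 1-D curve bent through two columns).
t_par <- seq(1, 3 * pi, length.out = 60)
X <- cbind(x = t_par * cos(t_par), y = t_par * sin(t_par))

k <- 4                                 # neighbors per point (illustrative choice)
D <- as.matrix(dist(X))                # straight-line distances between rows
n <- nrow(D)

# 1. k-NN graph: keep an edge only to each row's k nearest neighbors.
G <- matrix(Inf, n, n)
for (i in seq_len(n)) {
  nn <- order(D[i, ])[2:(k + 1)]       # position 1 is the point itself
  G[i, nn] <- D[i, nn]
  G[nn, i] <- D[nn, i]                 # make the graph undirected
}
diag(G) <- 0

# 2. Geodesic distances: Floyd-Warshall shortest paths along the graph.
for (m in seq_len(n)) {
  G <- pmin(G, outer(G[, m], G[m, ], "+"))
}
if (any(is.infinite(G))) stop("k-NN graph is disconnected; increase k")

# 3. Classical MDS on the geodesic distances, keeping one dimension.
embedding <- cmdscale(G, k = 1)

# The two ends of the spiral are far apart along the curve
# but comparatively close in a straight line.
c(straight_line = D[1, n], geodesic = G[1, n])
```

The single embedding column orders the points along the spiral, which is the structure a straight-line projection of x and y would not recover.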
Like every recipes step, step_isomap() works in two stages. During prep() it builds the k-NN graph, computes geodesic distances, and stores the learned embedding. bake() then projects training data, or new data, onto the same low-dimensional space. The output columns are named Isomap1, Isomap2, and so on, and the original predictors are dropped.
step_isomap() syntax and arguments
step_isomap() is a transformation step controlled mainly by the component count and the neighborhood size. You pick columns directly or, more often, with the tidyselect helper all_numeric_predictors(), since Isomap is defined only for numeric data.
The num_terms argument sets how many embedding dimensions survive in the baked output, and neighbors sets the k that builds the graph. The neighbors value must be smaller than the number of training rows, so it is the argument you tune first on small frames. Full argument detail lives in the recipes step_isomap reference.
step_isomap() examples
Every example follows the prep-then-bake rhythm. The first builds a recipe on mtcars, normalizes the ten numeric predictors, and compresses them to three Isomap components. Normalizing first is essential and is covered in the pitfalls below.
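A minimal sketch of the recipe just described, assuming recipes is installed along with the packages step_isomap() relies on (dimRed, RSpectra, igraph, and RANN). The object names iso_rec and iso_baked are choices made for this example.

```r
library(recipes)

# Outcome: mpg. Predictors: the ten remaining numeric columns of mtcars.
iso_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>          # common scale first
  step_isomap(all_numeric_predictors(),
              num_terms = 3, neighbors = 5) |>         # 3 components, k = 5
  prep()                                               # build graph, learn embedding

# new_data = NULL returns the processed training rows.
iso_baked <- bake(iso_rec, new_data = NULL)

names(iso_baked)
dim(iso_baked)
```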
The ten predictors collapse into Isomap1, Isomap2, and Isomap3, while the outcome mpg passes through untouched. With only 32 rows in mtcars the default neighbors = 50 cannot build the graph, so we set it to 5. Calling bake(iso_rec, new_data = test_df) on any frame with the same input columns reuses the stored embedding, so train and test land in the same Isomap space.
With num_terms = 2 the baked frame has three columns: two Isomap components and the outcome. Unlike linear PCA, Isomap does not decompose variance, so you select the dimension count by intent or by tuning rather than by a percent-variance budget.
Eight neighbors on 150 iris rows keeps the graph dense enough to connect all three species without short-circuiting setosa to virginica. The two Isomap columns can then feed a downstream classifier or be plotted for visual inspection.
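Inspecting a trained step with tidy() might look like the sketch below. It rebuilds a two-step recipe on mtcars so the block stands alone, and it assumes recipes, dimRed, RSpectra, igraph, and RANN are installed; iso_prep and iso_terms are names chosen for this example.

```r
library(recipes)

iso_prep <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>          # step 1
  step_isomap(all_numeric_predictors(),                # step 2
              num_terms = 3, neighbors = 5) |>
  prep()

tidy(iso_prep)                            # one row per step in the recipe
iso_terms <- tidy(iso_prep, number = 2)   # columns that fed step_isomap()
iso_terms
```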
The number = 2 argument points at the second step in the recipe, step_isomap(). The tidy() table lists the predictors that fed the step rather than a variance breakdown, because manifold components do not decompose variance the way linear ones do.
The Python counterpart is sklearn.manifold.Isomap inside a Pipeline. The recipes equivalent of Isomap(n_components=3, n_neighbors=5) is step_isomap(num_terms = 3, neighbors = 5).
step_isomap() vs other reduction steps
step_isomap() is one of several recipes steps that shrink a wide predictor set, and they leave different geometries behind. Choosing wrong either discards curved structure or pays an interpretability cost you did not need.
| Step | Transformation | Output columns | Best when |
|---|---|---|---|
| step_isomap() | Geodesic distance MDS | Isomap1, ... | Data lies on a curved low-dimensional manifold |
| step_pca() | Linear orthogonal components | PC1, ... | Predictors are linearly correlated |
| step_kpca() | Kernel (nonlinear) PCA | kPC1, ... | Structure is nonlinear but not manifold-shaped |
| step_umap() | Topology-preserving embedding | UMAP1, ... | Visualization or downstream clustering |
Reach for step_isomap() when the predictors clearly trace a curved surface, such as pose data, sensor trajectories, or any geometry where geodesic distance matters. Stay with step_pca() when the predictors are merely correlated, since it is faster, reproducible, and easier to explain. Choose step_kpca() when the structure is nonlinear but not laid out as a manifold.
Tune neighbors before you tune num_terms. Too few neighbors disconnect the k-NN graph and produce NaN embeddings; too many flatten the manifold by short-circuiting between distant points. A practical starting range is 5 to 15 for small data and 20 to 50 for larger frames.
Common pitfalls
Three mistakes account for most step_isomap() confusion.
- Leaving neighbors at the default 50. On small training frames the k-NN graph cannot even be built. Set neighbors smaller than the row count and small enough to keep neighborhoods local.
- Skipping normalization. Isomap measures distances on the predictors, so a column on a large scale dominates the graph. Place step_normalize() before step_isomap() so every predictor contributes evenly.
- Applying Isomap to truly linear data. When the predictors are simply correlated in a straight-line sense, Isomap adds cost without improving structure. The k-NN graph plus geodesic MDS solve a problem the data does not have. Use step_pca() or step_pls() instead.
Try it yourself
Try it: Build a recipe on the built-in USArrests dataset, normalize the numeric predictors, and reduce them to two Isomap components with neighbors = 8. Save the baked training data to ex_iso.
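One possible solution, sketched under the assumption that recipes, dimRed, RSpectra, igraph, and RANN are installed. Because USArrests has no outcome, the formula leaves the left-hand side empty so that every column is treated as a predictor.

```r
library(recipes)

ex_iso <- recipe(~ ., data = USArrests) |>
  step_normalize(all_numeric_predictors()) |>          # common scale first
  step_isomap(all_numeric_predictors(),
              num_terms = 2, neighbors = 8) |>         # 2 components, k = 8
  prep() |>
  bake(new_data = NULL)                                # baked training data

names(ex_iso)
dim(ex_iso)
```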
Explanation: USArrests has four numeric columns and no outcome. step_normalize() puts them on a common scale, then step_isomap() with num_terms = 2 and neighbors = 8 projects them onto two manifold components, leaving a two-column frame.
Related recipes functions
step_isomap() rarely appears alone in a recipe. These steps commonly sit alongside it:
- step_normalize() centers and scales predictors, the required step before step_isomap().
- step_pca() runs linear PCA, the faster choice when structure is linear.
- step_kpca() runs kernel PCA, the alternative when structure is nonlinear but not manifold-shaped.
- step_umap() from the embed package produces topology-preserving embeddings tuned for visualization.
- step_impute_mean() fills missing values, since Isomap cannot handle NA inputs.
FAQ
What does step_isomap() do in R?
step_isomap() is a recipes step that performs Isomap manifold embedding as part of a preprocessing pipeline. During prep() it builds a k-nearest-neighbor graph on the selected numeric predictors, computes geodesic shortest path distances, and runs classical multidimensional scaling on those distances to learn a low-dimensional embedding. During bake() it projects data onto that embedding, producing columns named Isomap1, Isomap2, and so on, while dropping the original predictors.
When should I use step_isomap() instead of step_pca()?
Use step_isomap() when the predictors trace a curved low-dimensional manifold that linear PCA would flatten incorrectly. A classic case is data whose true structure is one-dimensional but wraps through several columns, such as poses, trajectories, or sensor readings along a moving system. If the predictors are merely correlated in a straight-line sense, stay with step_pca(): it is faster, deterministic, and produces interpretable loadings.
How do I choose the neighbors argument in step_isomap()?
The neighbors value sets the k in the k-NN graph that defines the manifold. Too small a value disconnects the graph and yields NaN embeddings; too large a value short-circuits the geodesic distances and collapses the manifold to a straight-line approximation. A practical starting range is 5 to 15 on small frames and 20 to 50 on larger ones. Tune it through tune() if Isomap sits inside a workflow.
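The lower bound on neighbors can be explored by checking whether the k-NN graph is connected for a few candidate values. The base R sketch below does that on made-up data; knn_connected() is a hypothetical helper written for this example, not a recipes function, and it mirrors the idea of the graph rather than the step's internals.

```r
# Hypothetical helper: TRUE if the undirected k-NN graph on the rows of X
# forms a single connected component.
knn_connected <- function(X, k) {
  D <- as.matrix(dist(X))
  n <- nrow(D)
  adj <- matrix(FALSE, n, n)
  for (i in seq_len(n)) {
    nn <- order(D[i, ])[2:(k + 1)]   # skip position 1, the point itself
    adj[i, nn] <- TRUE
    adj[nn, i] <- TRUE               # undirected edge
  }
  # Breadth-first search starting from row 1.
  seen <- rep(FALSE, n)
  seen[1] <- TRUE
  queue <- 1L
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    nxt <- which(adj[node, ] & !seen)
    seen[nxt] <- TRUE
    queue <- c(queue, nxt)
  }
  all(seen)
}

# Two tight groups of 10 points each, far apart on the first axis.
X <- rbind(cbind(seq(0, 0.9, by = 0.1), 0),
           cbind(seq(10, 10.9, by = 0.1), 0))

candidates <- c(3, 5, 9, 10)
connected <- sapply(candidates, function(k) knn_connected(X, k))
setNames(connected, candidates)
# 3 = FALSE, 5 = FALSE, 9 = FALSE, 10 = TRUE
```

Each group has only nine other members, so the graph stays split until k reaches 10 and every point is forced to reach across the gap. On real data the smallest connected k is a floor for tuning, not a recommended value.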
Do I need to normalize before step_isomap()?
Yes. Isomap measures Euclidean distance between rows to build its k-NN graph, so a predictor with a large numeric range will dominate that distance and pull the graph toward its own scale. Place step_normalize() before step_isomap() so every predictor contributes evenly. Without it, the components mostly reflect measurement units rather than genuine manifold structure.
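The effect of scale on nearest neighbors can be seen without running Isomap at all. The sketch below uses three made-up rows and base R's scale(), which centers each column and divides by its standard deviation; the data values and the helper nearest_to_first() are assumptions for illustration only.

```r
# Three illustrative rows: income varies by hundreds, rating by a few units.
df <- data.frame(
  income = c(30000, 30400, 30700),
  rating = c(1, 5, 1.1),
  row.names = c("A", "B", "C")
)

# Hypothetical helper: label of the row closest to the first row.
nearest_to_first <- function(m) {
  d <- as.matrix(dist(m))[1, -1]   # distances from row A to B and C
  names(d)[which.min(d)]
}

nearest_to_first(df)          # "B": income alone decides, rating is ignored
nearest_to_first(scale(df))   # "C": after scaling, A's near-identical rating counts
```

On the raw columns, A attaches to B purely because their incomes differ by 400 rather than 700. Once both columns are on a common scale, the four-point rating gap between A and B outweighs that, and the k-NN graph changes accordingly.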