recipes step_center() in R: Center Numeric Predictors
The recipes step_center() function in R centers numeric predictors by subtracting each column's training-set mean, shifting every column's average to zero. You add it to a recipe(), estimate the means with prep(), and apply them with bake().
step_center(rec, all_numeric_predictors())      # center all numeric predictors
step_center(rec, mpg, hp)                       # center named columns
step_center(rec, all_numeric())                 # center every numeric column
step_center(rec, contains("score"))             # center by name pattern
step_center(rec, all_numeric(), na_rm = TRUE)   # ignore NA in the mean
prep(rec) |> bake(new_data = NULL)              # estimate means, then apply
tidy(prep(rec), number = 1)                     # inspect the estimated means
Need explanation? Read on for examples and pitfalls.
What step_center() does in R
step_center() subtracts the column mean from every value. For a numeric column, it computes the mean during prep() and then, during bake(), returns value - mean. The transformed column has a mean of zero but keeps its original spread and units.
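The value - mean description can be checked against a manual calculation. The sketch below is illustrative only: the formula mpg ~ hp + wt and the object names rec, baked, and manual are assumptions, not taken from the original examples.

```r
library(recipes)

# Assumed recipe for illustration: center hp only
rec <- recipe(mpg ~ hp + wt, data = mtcars) |>
  step_center(hp) |>
  prep()

# Apply the step to the training data retained by prep()
baked <- bake(rec, new_data = NULL)

# Manual equivalent: each value minus the training mean
manual <- mtcars$hp - mean(mtcars$hp)

# Compare the recipe output with the manual calculation
all.equal(baked$hp, manual)
```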
Centering matters because many models behave better when predictors are anchored at zero. Regularized regression, principal components, and gradient-based learners tend to converge faster, and their intercepts are easier to interpret, when predictors are centered. step_center() is the recipes way to do this inside a modeling pipeline rather than by hand.
step_center() stores the training means inside the prepped recipe. When you bake() new data, it reuses those stored means, so test rows are transformed with training statistics and no information leaks across the split.
step_center() syntax and arguments
step_center() attaches a centering operation to a recipe. You pass the recipe first, then a set of columns selected with tidyselect helpers.
The arguments you will actually touch:
| Argument | Purpose |
|---|---|
| recipe | The recipe object the step is added to. |
| ... | Columns to center, chosen with selectors like all_numeric_predictors(). |
| na_rm | If TRUE (default), missing values are dropped when computing the mean. |
| means | Filled in by prep(); holds the estimated mean per column. |
| skip | If TRUE, the step is ignored when baking new data. Leave FALSE for centering. |
Centering predictors: worked examples
Build the recipe, prep it, then bake. A recipe is just a plan until prep() estimates the statistics from data. The first example centers every numeric predictor in mtcars.
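A minimal sketch of that first example follows. The object names rec, rec_prepped, and centered are assumptions chosen for illustration, and the formula mpg ~ . treats mpg as the outcome and every other mtcars column as a predictor.

```r
library(recipes)

# Plan: center every numeric predictor; mpg is the outcome
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors())

# prep() estimates the column means from the training data
rec_prepped <- prep(rec, training = mtcars)

# bake(new_data = NULL) returns the processed training data
centered <- bake(rec_prepped, new_data = NULL)

head(centered)
```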
The outcome mpg is untouched because all_numeric_predictors() excludes it. To confirm the centering worked, check the column means of the result.
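One way to run that check is sketched below. It rebuilds the centered data so the snippet stands alone; the names centered and predictor_means are assumptions.

```r
library(recipes)

# Rebuild the centered training data
centered <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Column means of the predictors only (mpg is left out)
predictor_means <- colMeans(centered[, names(centered) != "mpg"])

# Rounding hides floating-point noise
round(predictor_means, 10)
```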
Every mean is zero apart from floating-point dust. To see the actual values subtracted, call tidy() on the prepped recipe with the step number.
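A sketch of that call, assuming the centering step is the first (and only) step in the recipe. The names rec_prepped and center_stats are illustrative.

```r
library(recipes)

rec_prepped <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  prep()

# number = 1 selects the first step in the recipe
center_stats <- tidy(rec_prepped, number = 1)

# One row per centered column; the value column holds the mean subtracted
center_stats
```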
Centering also works across a train and test split. Estimate the recipe on training rows, then bake the held-out rows.
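The sketch below uses a base R split to avoid extra dependencies. The seed, the 24-row training size, and the object names are arbitrary assumptions, not the original article's values.

```r
library(recipes)

set.seed(123)  # arbitrary seed, for reproducibility only
train_idx <- sample(nrow(mtcars), 24)
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Estimate the means on the training rows only
rec_prepped <- recipe(mpg ~ ., data = train) |>
  step_center(all_numeric_predictors()) |>
  prep()

# Apply the stored training means to the held-out rows
test_centered <- bake(rec_prepped, new_data = test)

# The centered test mean equals the gap between test and training means
mean(test_centered$hp)
mean(test$hp) - mean(train$hp)
```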
The test mean is not zero, and that is correct. The held-out rows are shifted by the training mean, so their average reflects the genuine difference between the two samples.
step_center() vs step_scale() vs step_normalize()
Pick the step that matches the transformation you need. Centering, scaling, and normalizing are related but distinct, and recipes gives each its own step.
| Step | What it does | Resulting column |
|---|---|---|
| step_center() | Subtracts the mean | Mean 0, original spread |
| step_scale() | Divides by the standard deviation | SD 1, original center |
| step_normalize() | Centers and scales together | Mean 0, SD 1 |
| step_range() | Rescales to a fixed interval | Bounded, default 0 to 1 |
If you want both mean zero and unit variance, use step_normalize() rather than chaining step_center() and step_scale(). It is shorter, and one tidy() call returns both statistics.
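As a rough check, the two routes can be compared directly. This is a sketch with assumed object names (chained, norm_prepped, normalized).

```r
library(recipes)

# Route 1: center, then scale, as two separate steps
chained <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Route 2: a single step_normalize()
norm_prepped <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
normalized <- bake(norm_prepped, new_data = NULL)

# Both routes should give the same columns
all.equal(as.data.frame(chained), as.data.frame(normalized))

# The statistics estimated by the single step
tidy(norm_prepped, number = 1)
```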
Use step_YeoJohnson() or step_BoxCox() first when predictors are skewed, then center. Centering a skewed column does not fix the skew; it only relocates it.
Common pitfalls with step_center()
Watch what you select. The most frequent mistakes come from choosing the wrong columns or skipping prep().
- Centering the outcome. all_numeric() includes the response variable. Use all_numeric_predictors() so the model still trains on the original target scale.
- Forgetting to prep. Calling bake() on a recipe that was never prepped throws an error, because the means have not been estimated yet.
- Centering categorical dummies after the fact. If step_dummy() runs before step_center(), the 0/1 indicator columns get centered too, which is rarely what you want.
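The first pitfall is easy to see side by side. The sketch below uses an assumed formula, mpg ~ hp + wt, and illustrative object names.

```r
library(recipes)

# all_numeric() also selects the outcome mpg
with_outcome <- recipe(mpg ~ hp + wt, data = mtcars) |>
  step_center(all_numeric()) |>
  prep() |>
  bake(new_data = NULL)

# all_numeric_predictors() leaves the outcome alone
predictors_only <- recipe(mpg ~ hp + wt, data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

mean(with_outcome$mpg)     # near zero: the outcome was centered
mean(predictors_only$mpg)  # the original mean of mpg
```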
Call prep() on training data only.
Try it yourself
Try it: Center only the hp and wt columns of mtcars in a recipe, prep it, and save the baked result to ex_centered.
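One possible solution to the exercise, as a sketch. The formula mpg ~ . is an assumption, since the exercise does not name an outcome.

```r
library(recipes)

# Center only hp and wt; every other column is left as-is
ex_centered <- recipe(mpg ~ ., data = mtcars) |>
  step_center(hp, wt) |>
  prep() |>
  bake(new_data = NULL)

# Should be zero up to floating-point error
mean(ex_centered$hp)
```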
Explanation: Passing bare column names to step_center() limits the step to just hp and wt. After prep() estimates their means and bake() applies them, the hp column averages to zero.
Related recipes steps
step_center() is one of several recipes preprocessing steps. These pair naturally with it in a tidymodels workflow:
- step_scale() divides predictors by their standard deviation.
- step_normalize() centers and scales in a single step.
- step_range() rescales predictors to a fixed interval.
- step_YeoJohnson() reduces skew before centering.
- step_zv() drops zero-variance columns that cannot be centered meaningfully.
In Python, the equivalent of step_center() is df - df.mean() in pandas, or scikit-learn's StandardScaler(with_std=False). Unlike the bare pandas expression, and like a fitted StandardScaler, the recipes version learns the mean on training data and reapplies it automatically to new data.
FAQ
Does step_center() change the outcome variable?
Not when you select columns with all_numeric_predictors(), which is the recommended selector. That helper excludes the variable on the left of your recipe formula. If you instead use all_numeric(), the outcome is included and gets centered, which shifts your target away from its real scale. For almost all modeling work, keep the outcome on its original units and center predictors only.
What is the difference between step_center() and scale() in R?
Base R's scale() centers and, by default, also divides by the standard deviation, returning a matrix with attributes. step_center() only subtracts the mean, returns a data frame, and crucially stores the training mean inside a recipe. That means the same transformation is reapplied to new data automatically, which scale() cannot do on its own.
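The difference in return types can be seen in a short comparison. This is a sketch with assumed object names; scale = FALSE is set so that scale() only centers.

```r
library(recipes)

# Base R: center only; the result is a matrix with a "scaled:center" attribute
base_centered <- scale(mtcars[, c("hp", "wt")], center = TRUE, scale = FALSE)
is.matrix(base_centered)
attr(base_centered, "scaled:center")

# recipes: same arithmetic, but the means live in the prepped recipe
rec_centered <- recipe(mpg ~ hp + wt, data = mtcars) |>
  step_center(hp, wt) |>
  prep() |>
  bake(new_data = NULL)
is.data.frame(rec_centered)

# The centered values agree
all.equal(as.numeric(base_centered[, "hp"]), rec_centered$hp)
```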
Do I need to center predictors before every model?
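No. Tree-based models such as random forests and boosted trees are invariant to centering, so the step adds nothing. Centering helps regularized regression, principal component analysis, and neural networks, where predictor location affects the fit or how the intercept reads. Distance-based models such as k-nearest neighbors are unaffected by a constant shift; what matters there is scale, so reach for step_scale() or step_normalize() instead. Add step_center() when your model is penalized, gradient-trained, or built on principal components.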
How does step_center() handle missing values?
By default na_rm = TRUE, so missing values are ignored when the mean is computed during prep(). The mean reflects only the observed values in each column. The NA cells themselves remain NA after baking, because centering shifts existing numbers but cannot invent a value. Impute first with a step such as step_impute_mean() if you need complete columns.
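The built-in airquality data has missing Ozone values, which makes it a convenient illustration. The dataset choice, the formula, and the object names below are assumptions made for this sketch.

```r
library(recipes)

# Ozone contains NA values; na_rm = TRUE is spelled out for clarity
rec_prepped <- recipe(Temp ~ Ozone + Wind, data = airquality) |>
  step_center(Ozone, na_rm = TRUE) |>
  prep()

baked <- bake(rec_prepped, new_data = NULL)

# The stored mean matches the mean of the observed values only
stored_mean <- tidy(rec_prepped, number = 1)$value
stored_mean
mean(airquality$Ozone, na.rm = TRUE)

# Missing cells stay missing after centering
sum(is.na(baked$Ozone))
sum(is.na(airquality$Ozone))
```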