recipes step_normalize() in R: Center and Scale Predictors
The recipes step_normalize() function in R centers and scales numeric predictors in one step, leaving each column with a mean of zero and a standard deviation of one. You add it to a recipe(), learn the statistics with prep(), and apply them with bake().
    step_normalize(rec, all_numeric_predictors())  # normalize all numeric predictors
    step_normalize(rec, mpg, hp)                   # normalize named columns
    step_normalize(rec, all_numeric())             # normalize every numeric column
    step_normalize(rec, contains("score"))         # normalize by name pattern
    step_normalize(rec, -all_outcomes())           # normalize all but the outcome
    prep(rec) |> bake(new_data = NULL)             # estimate stats, then apply
    tidy(prep(rec), number = 1)                    # inspect the means and SDs
Need explanation? Read on for examples and pitfalls.
What step_normalize() does in R
step_normalize() converts each predictor into a z-score. During prep() it computes the mean and standard deviation of every selected column. During bake() it returns (value - mean) / sd. The result is a column centered at zero with unit variance, which is exactly the transformation that step_center() followed by step_scale() would produce.
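As a minimal base R sketch of the same arithmetic (no recipes involved, and using the mtcars column hp purely as an illustration), the z-score can be computed by hand and compared with scale():

```r
# Manual z-score for a single column, base R only
x <- mtcars$hp
z <- (x - mean(x)) / sd(x)

# scale() centers and divides by the sample SD by default,
# so it should agree with the manual calculation
all.equal(z, as.numeric(scale(x)))
```

step_normalize() applies this calculation to every selected column and, unlike the manual version, remembers the mean and SD for later use.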
Normalization matters because many models compare predictors by magnitude. In mtcars, disp runs into the hundreds while drat stays between about 3 and 5, so distance-based and penalized models let disp dominate the fit. step_normalize() puts every predictor on the same scale as part of the modeling pipeline instead of a manual calculation, and it keeps the transformation reproducible on new data.
step_normalize() stores both statistics inside the prepped recipe. When you bake() new rows, it reuses those stored values, so test data is normalized with training statistics and no information leaks across the split.
step_normalize() syntax and arguments
step_normalize() attaches a centering-and-scaling operation to a recipe. You pass the recipe first, then the columns to transform, selected with tidyselect helpers.
The arguments you will actually touch:
| Argument | Purpose |
|---|---|
| recipe | The recipe object the step is added to. |
| ... | Columns to normalize, chosen with selectors like all_numeric_predictors(). |
| na_rm | If TRUE (default), missing values are dropped when computing the mean and SD. |
| means | Filled in by prep(); holds the estimated mean per column. |
| sds | Filled in by prep(); holds the estimated standard deviation per column. |
| skip | If TRUE, the step is ignored when baking new data. Leave FALSE for normalization. |
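A small sketch of how na_rm behaves, assuming a made-up four-row data frame (the names df, x, and y are hypothetical and chosen only for this illustration). With na_rm = TRUE the mean and SD are estimated from the non-missing values, and the missing value stays missing after baking:

```r
library(recipes)

# Toy data with one missing predictor value (illustrative only)
df <- data.frame(y = c(1, 2, 3, 4), x = c(10, NA, 30, 50))

rec <- recipe(y ~ x, data = df) |>
  step_normalize(x, na_rm = TRUE)  # TRUE is already the default

baked <- prep(rec) |> bake(new_data = NULL)

# Mean and SD come from 10, 30, 50 only; the NA row is carried through
baked$x
```

If missing values should be filled in rather than carried through, an imputation step would need to run before the normalization step.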
Normalizing predictors: worked examples
Build the recipe, prep it, then bake. A recipe is only a plan until prep() estimates the statistics from data. This first example normalizes every numeric predictor in mtcars.
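A minimal sketch of that first example, assuming mpg is the outcome and every other mtcars column is a predictor (the object names rec, prepped, and baked are illustrative):

```r
library(recipes)

# Plan: normalize every numeric predictor; mpg is the outcome
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the means and SDs from the training data
prepped <- prep(rec, training = mtcars)

# bake(new_data = NULL) returns the processed training set
baked <- bake(prepped, new_data = NULL)
head(baked)
```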
The outcome mpg is untouched because all_numeric_predictors() excludes it. Values now sit around zero, with negative numbers for below-average cars and positive numbers for above-average ones. To confirm the step worked, check the mean and standard deviation of each result column.
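One way to run that check, sketched with base R's sapply() on the baked predictors (the recipe is rebuilt here so the block runs on its own; object names are illustrative):

```r
library(recipes)

baked <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Drop the outcome, then summarize each normalized predictor
predictors <- baked[, setdiff(names(baked), "mpg")]

round(sapply(predictors, mean), 10)  # means, rounded to hide floating-point noise
sapply(predictors, sd)               # standard deviations
```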
Every column now has a mean of zero and a standard deviation of one. To see the actual statistics the recipe learned, call tidy() on the prepped recipe with the step number.
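A short sketch of that tidy() call, assuming step_normalize() is the first and only step in the recipe so that number = 1 points at it:

```r
library(recipes)

prepped <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

# One row per predictor per statistic
stats <- tidy(prepped, number = 1)
stats
```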
A single tidy() call returns both the means and the standard deviations, tagged in the statistic column. This is the practical advantage of step_normalize() over chaining two separate steps: one step, one summary, both statistics.
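The stored statistics are also what bake() uses on rows the recipe has never seen. The sketch below assumes an arbitrary split of mtcars into the first 24 rows for training and the last 8 for testing (the split and the object names are illustrative, not a recommended resampling scheme), and checks the baked test values against a manual calculation with the training mean and SD:

```r
library(recipes)

# Arbitrary illustrative split
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

prepped <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors()) |>
  prep(training = train)

# Test rows are transformed with the training statistics
test_baked <- bake(prepped, new_data = test)

# Manual version for one column, using training mean and SD
manual_hp <- (test$hp - mean(train$hp)) / sd(train$hp)
all.equal(test_baked$hp, manual_hp)
```

Because the test rows are scaled with training statistics, their own mean and SD will generally not be exactly zero and one.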
step_normalize() vs step_center() vs step_scale()
Pick the step that matches the transformation you need. Centering, scaling, and normalizing are related but distinct, and recipes gives each its own step.
| Step | What it does | Resulting column |
|---|---|---|
| step_normalize() | Centers and scales together | Mean 0, SD 1 |
| step_center() | Subtracts the mean | Mean 0, original spread |
| step_scale() | Divides by the standard deviation | SD 1, original center |
| step_range() | Rescales to a fixed interval | Bounded, default 0 to 1 |
If you want both mean zero and unit variance, reach for step_normalize() instead of adding step_center() and step_scale() separately. It is shorter to write, the tidy() output covers both statistics, and there is no risk of accidentally selecting different columns in each step. Use a bare step_center() or step_scale() only when you specifically want to preserve one of the two original properties.
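To make the equivalence concrete, here is a sketch that bakes mtcars both ways and compares the results (object names are illustrative; the comparison relies on all.equal()'s default numeric tolerance):

```r
library(recipes)

# One combined step
one_step <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Two separate steps selecting the same columns
two_step <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

all.equal(as.data.frame(one_step), as.data.frame(two_step))
```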
Apply step_YeoJohnson() or step_BoxCox() first when predictors are heavily skewed, then normalize. Subtracting the mean and dividing by the SD shifts and shrinks the numbers but leaves the shape of the distribution exactly as lopsided as before.
Common pitfalls with step_normalize()
Watch which columns you select and when the step runs. The most frequent mistakes come from normalizing the wrong columns or placing the step at the wrong point in the recipe.
- Normalizing the outcome. all_numeric() includes the response variable. Use all_numeric_predictors() so the model still trains and predicts on the original target scale.
- Normalizing dummy variables. If step_dummy() runs before step_normalize(), the 0/1 indicator columns get centered and scaled too, which makes them harder to interpret. Normalize before creating dummies, or select numeric columns explicitly.
- Forgetting to prep. Calling bake() on a recipe that was never prepped throws an error, because the means and SDs have not been estimated yet.
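Two of these pitfalls are easy to reproduce. The sketch below (object names are illustrative) shows the outcome being rescaled when all_numeric() is used, and uses try() to capture the error from baking an unprepped recipe:

```r
library(recipes)

# Pitfall: all_numeric() also selects the outcome
with_outcome <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric()) |>
  prep() |>
  bake(new_data = NULL)

predictors_only <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

mean(with_outcome$mpg)     # close to zero: mpg was normalized
mean(predictors_only$mpg)  # original mean of mpg

# Pitfall: baking without prepping
unprepped <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

result <- try(bake(unprepped, new_data = mtcars), silent = TRUE)
inherits(result, "try-error")  # TRUE when bake() refused to run
```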
Make sure prep() uses training data only.
Try it yourself
Try it: Normalize only the hp and wt columns of mtcars in a recipe, prep it, and save the baked result to ex_norm.
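One possible solution, sketched under the assumption that mpg is the outcome and all other columns are predictors:

```r
library(recipes)

ex_norm <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(hp, wt) |>  # bare names limit the step to these two columns
  prep() |>
  bake(new_data = NULL)

# Quick check on the two normalized columns
sapply(ex_norm[, c("hp", "wt")], mean)
sapply(ex_norm[, c("hp", "wt")], sd)
```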
Explanation: Passing bare column names to step_normalize() limits the step to just hp and wt. After prep() estimates their means and SDs and bake() applies the z-score, the hp and wt columns each have a mean of zero and a standard deviation of one, while the other columns keep their original values.
Related recipes steps
step_normalize() is one of several recipes preprocessing steps. These pair naturally with it in a tidymodels workflow:
- step_center() subtracts the mean without touching the spread.
- step_scale() divides by the SD without shifting the center.
- step_range() rescales predictors to a fixed interval.
- step_YeoJohnson() reduces skew before normalizing.
- step_zv() drops zero-variance columns that cannot be normalized.
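As a rough sketch of how several of these steps might be chained, the recipe below drops zero-variance predictors, reduces skew, and then normalizes. The ordering is one reasonable choice under the assumption that skew should be handled before scaling, not the only valid arrangement:

```r
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_zv(all_predictors()) |>                    # remove constant columns first
  step_YeoJohnson(all_numeric_predictors()) |>    # reduce skew
  step_normalize(all_numeric_predictors())        # then center and scale

baked <- prep(rec) |> bake(new_data = NULL)

# Predictors should still end up centered at zero
predictors <- baked[, setdiff(names(baked), "mpg")]
round(sapply(predictors, mean), 10)
```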
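In Python, the rough equivalent of step_normalize() is (df - df.mean()) / df.std() in pandas, or scikit-learn's StandardScaler(). The recipes version differs from the pandas one-liner by learning the mean and SD on training data and reapplying them automatically to new data. Note that StandardScaler() divides by the population SD rather than the sample SD, so its values differ slightly from R's.
FAQ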
What does step_normalize() do in R?
step_normalize() is a recipes step that centers and scales numeric predictors at the same time. During prep() it learns each column's mean and standard deviation from the training data. During bake() it applies the z-score transformation (value - mean) / sd, so every selected column ends with a mean of zero and a standard deviation of one. It is the recipes equivalent of standardizing data before modeling.
What is the difference between step_normalize() and step_scale()?
step_scale() only divides each column by its standard deviation, so the column keeps its original mean. step_normalize() does two things: it subtracts the mean and then divides by the SD, leaving the column with mean zero and SD one. Use step_normalize() when a model needs both, such as regularized regression or principal component analysis. Use step_scale() alone when the original center carries meaning you want to keep.
Do I need step_normalize() before every model?
No. Tree-based models such as random forests and boosted trees are invariant to monotonic rescaling of predictors, so normalization adds nothing. Normalization matters for methods that are sensitive to scale: k-nearest neighbors, support vector machines, principal component analysis, and lasso or ridge regression all let large-magnitude predictors dominate unless predictors share a common scale.
What does tidy() show for a step_normalize() step?
Calling tidy() on a prepped recipe with the step number returns a tibble with one row per statistic per column. The statistic column reads either mean or sd, the value column holds the learned number, and terms names the predictor. This lets you audit exactly which means and standard deviations the recipe will reuse when baking new data.