recipes step_scale() in R: Scale Predictors to Unit SD
The recipes step_scale() function in R divides each numeric predictor by its training-set standard deviation, giving every column a standard deviation of one without centering it. You add it to a recipe(), estimate the standard deviations with prep(), and apply them with bake().
    step_scale(rec, all_numeric_predictors())      # scale all numeric predictors
    step_scale(rec, mpg, hp)                       # scale named columns
    step_scale(rec, all_numeric())                 # scale every numeric column
    step_scale(rec, contains("score"))             # scale by name pattern
    step_scale(rec, all_numeric(), factor = 2)     # divide by two SDs
    prep(rec) |> bake(new_data = NULL)             # estimate SDs, then apply
    tidy(prep(rec), number = 1)                    # inspect the estimated SDs
Need explanation? Read on for examples and pitfalls.
What step_scale() does in R
step_scale() divides every value in a column by that column's standard deviation. During prep() it computes the standard deviation of each selected column. During bake() it returns value / sd. The transformed column ends up with a standard deviation of one, but it is not centered: the mean is divided by the same SD rather than moved to zero, and the sign of every value is unchanged.
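The arithmetic can be sketched in base R without recipes. This is only an illustration of the division described above, not the package's internal code; the variable names are made up for the example.

```r
# Base-R sketch of the arithmetic behind step_scale() (illustrative only)
hp <- mtcars$hp

hp_sd     <- sd(hp)       # the statistic prep() would estimate
hp_scaled <- hp / hp_sd   # the value bake() would return

sd(hp_scaled)             # spread after scaling
mean(hp_scaled)           # equals mean(hp) / hp_sd, so the column is not centered
all(hp_scaled > 0)        # positive values stay positive
```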
Scaling matters because predictors measured on wildly different units can distort models that depend on magnitude. A car's disp runs into the hundreds while drat sits near three, and any distance-based or penalized model will let disp dominate. step_scale() is the recipes way to put predictors on a comparable spread inside a modeling pipeline rather than by hand.
step_scale() stores the training standard deviations inside the prepped recipe. When you bake() new data, it reuses those stored values, so test rows are divided by training statistics and no information leaks across the split.
step_scale() syntax and arguments
step_scale() attaches a scaling operation to a recipe. You pass the recipe first, then a set of columns selected with tidyselect helpers.
The arguments you will actually touch:
| Argument | Purpose |
|---|---|
| recipe | The recipe object the step is added to. |
| ... | Columns to scale, chosen with selectors like all_numeric_predictors(). |
| factor | Divisor multiplier. 1 (default) divides by one SD; 2 divides by two SDs. |
| na_rm | If TRUE (default), missing values are dropped when computing the SD. |
| sds | Filled in by prep(); holds the estimated standard deviation per column. |
| skip | If TRUE, the step is ignored when baking new data. Leave FALSE for scaling. |
Scaling predictors: worked examples
Build the recipe, prep it, then bake. A recipe is just a plan until prep() estimates the statistics from data. The first example scales every numeric predictor in mtcars.
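A minimal sketch of that first example, assuming mpg is treated as the outcome and every other mtcars column as a predictor. The object names rec, rec_prepped, and scaled are illustrative.

```r
library(recipes)

# Plan: scale every numeric predictor; mpg is the outcome
rec <- recipe(mpg ~ ., data = mtcars) |>
  step_scale(all_numeric_predictors())

# prep() estimates the SDs from the training data
rec_prepped <- prep(rec, training = mtcars)

# bake(new_data = NULL) returns the processed training set
scaled <- bake(rec_prepped, new_data = NULL)

head(scaled)
```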
The outcome mpg is untouched because all_numeric_predictors() excludes it. Notice the scaled values stay positive: scaling divides each value but subtracts nothing, so the columns are not centered. To confirm the step worked, check the standard deviation of each result column.
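One way to run that check is sketched here, again assuming mpg as the outcome. The block rebuilds the recipe so it runs on its own.

```r
library(recipes)

scaled <- recipe(mpg ~ ., data = mtcars) |>
  step_scale(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# SD of each predictor after scaling (the outcome column is left out)
predictor_sds <- sapply(scaled[names(scaled) != "mpg"], sd)
round(predictor_sds, 6)
```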
Every column now has a standard deviation of exactly one. To see the actual divisors, call tidy() on the prepped recipe with the step number.
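A sketch of that tidy() call on the same illustrative recipe. With the default factor = 1, the value column should match sd() of each training column.

```r
library(recipes)

rec_prepped <- recipe(mpg ~ ., data = mtcars) |>
  step_scale(all_numeric_predictors()) |>
  prep()

# One row per scaled column: `terms` is the column name, `value` the divisor
divisors <- tidy(rec_prepped, number = 1)
divisors
```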
The factor argument changes the divisor. Setting factor = 2 divides by two standard deviations, a convention recommended for comparing continuous predictors with binary ones on the same footing.
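A short sketch with factor = 2, using the same illustrative mtcars recipe and checking three columns picked arbitrarily:

```r
library(recipes)

scaled_2sd <- recipe(mpg ~ ., data = mtcars) |>
  step_scale(all_numeric_predictors(), factor = 2) |>
  prep() |>
  bake(new_data = NULL)

# SD of a few predictors after dividing by two standard deviations
sds_2sd <- sapply(scaled_2sd[c("hp", "wt", "disp")], sd)
sds_2sd
```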
Dividing by two SDs halves the resulting spread, so each column lands at a standard deviation of 0.5. With factor = 1 the columns would each read 1.
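Because the prepped recipe stores the training standard deviations, the same divisors are applied to any data you bake later. The sketch below assumes an arbitrary split of mtcars (first 24 rows as training, last 8 as new data) purely for illustration; in practice you would use a proper resampling split.

```r
library(recipes)

# Illustrative split only: first 24 rows as training, last 8 as "new" data
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

rec_prepped <- recipe(mpg ~ ., data = train) |>
  step_scale(all_numeric_predictors()) |>
  prep()

test_scaled <- bake(rec_prepped, new_data = test)

# Test rows are divided by the training SD, not by their own SD
all.equal(test_scaled$hp, test$hp / sd(train$hp))
```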
step_scale() vs step_center() vs step_normalize()
Pick the step that matches the transformation you need. Scaling, centering, and normalizing are related but distinct, and recipes gives each its own step.
| Step | What it does | Resulting column |
|---|---|---|
| step_scale() | Divides by the standard deviation | SD 1, not centered |
| step_center() | Subtracts the mean | Mean 0, original spread |
| step_normalize() | Centers and scales together | Mean 0, SD 1 |
| step_range() | Rescales to a fixed interval | Bounded, default 0 to 1 |
If you want both unit variance and mean zero, use step_normalize() rather than chaining step_center() and step_scale(). It is shorter, and one tidy() call returns both statistics. Reach for a bare step_scale() only when you specifically want to leave columns uncentered, so that zero stays at zero and every value keeps its sign.
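The equivalence between step_normalize() and the chained steps can be checked directly. This is a sketch on mtcars with mpg assumed as the outcome; the comparison goes through as.data.frame() only to keep all.equal() on plain data frames.

```r
library(recipes)

normalized <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

chained <- recipe(mpg ~ ., data = mtcars) |>
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Both routes should give the same columns up to floating-point tolerance
same_result <- all.equal(as.data.frame(normalized), as.data.frame(chained))
same_result
```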
Apply step_YeoJohnson() or step_BoxCox() first when predictors are heavily skewed, then scale. Dividing a skewed column by its SD shrinks the numbers but leaves the shape of the distribution exactly as lopsided as before.
Common pitfalls with step_scale()
Watch what you select and when you scale. The most frequent mistakes come from choosing the wrong columns or scaling at the wrong point in the recipe.
- Scaling the outcome. all_numeric() includes the response variable. Use all_numeric_predictors() so the model still trains and predicts on the original target scale.
- Forgetting to prep. Calling bake() on a recipe that was never prepped throws an error, because the standard deviations have not been estimated yet.
- Scaling dummy variables. If step_dummy() runs before step_scale(), the 0/1 indicator columns get divided by their SD too, which distorts their interpretation. Scale before creating dummies, or select numeric columns explicitly.
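The dummy-variable ordering can be sketched as follows. The example assumes cyl is converted to a factor so that step_dummy() has something to encode; the cars copy and the name-based lookup of the dummy columns are illustrative choices.

```r
library(recipes)

cars <- mtcars
cars$cyl <- factor(cars$cyl)  # make one predictor categorical for the demo

# Scale first, then create dummies, so the indicator columns stay 0/1
rec <- recipe(mpg ~ ., data = cars) |>
  step_scale(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  prep()

baked <- bake(rec, new_data = NULL)

# Collect the indicator columns created from cyl and inspect their values
dummy_cols <- grep("^cyl_", names(baked), value = TRUE)
sort(unique(unlist(baked[dummy_cols])))
```

Swapping the two steps would pass the indicator columns to step_scale() as well, since all_numeric_predictors() would then match them.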
prep() should use training data only.
Try it yourself
Try it: Scale only the hp and wt columns of mtcars in a recipe, prep it, and save the baked result to ex_scaled.
Explanation: Passing bare column names to step_scale() limits the step to just hp and wt. After prep() estimates their standard deviations and bake() divides by them, the hp column has a standard deviation of one.
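One possible solution matching that explanation, sketched with mpg assumed as the outcome:

```r
library(recipes)

ex_scaled <- recipe(mpg ~ ., data = mtcars) |>
  step_scale(hp, wt) |>       # bare names limit the step to these two columns
  prep() |>
  bake(new_data = NULL)

sd(ex_scaled$hp)              # spread of a scaled column
head(ex_scaled$disp)          # an unselected column keeps its original values
```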
Related recipes steps
step_scale() is one of several recipes preprocessing steps. These pair naturally with it in a tidymodels workflow:
- step_center() subtracts the mean instead of dividing by the SD.
- step_normalize() centers and scales in a single step.
- step_range() rescales predictors to a fixed interval.
- step_YeoJohnson() reduces skew before scaling.
- step_zv() drops zero-variance columns that cannot be scaled.
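In Python, the equivalent of step_scale() is df / df.std() in pandas, or scikit-learn's StandardScaler(with_mean=False), which uses the population SD and so differs slightly from R's sd() on small samples. Unlike the one-line pandas expression, the recipes version learns the standard deviation on training data and reapplies it automatically to new data.
FAQ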
What is the difference between step_scale() and step_normalize()?
step_scale() only divides each column by its standard deviation, so the column ends with unit variance but is not centered. step_normalize() does two things: it subtracts the mean and divides by the SD, leaving the column with mean zero and SD one. Use step_normalize() when a model needs both, such as regularized regression or principal component analysis. Use step_scale() alone when zero carries meaning you want to keep.
Does step_scale() center the data?
No. step_scale() is a pure division step. It rescales every value but never subtracts anything, so zero stays at zero and a column of positive values stays positive after scaling. If you need the mean shifted to zero as well, add step_center() to the recipe or use step_normalize(), which combines both operations into one step and one tidy() summary.
What is the factor argument in step_scale()?
The factor argument controls how many standard deviations the divisor represents. With the default factor = 1, each value is divided by one SD and the result has unit variance. With factor = 2, values are divided by two SDs, which is a convention from Andrew Gelman for putting continuous predictors on a scale comparable to centered binary predictors. The documented values are 1 and 2.
Should I scale predictors before every model?
No. Tree-based models such as random forests and boosted trees are invariant to scaling, so the step adds nothing. Scaling matters for distance-based and penalized methods: k-nearest neighbors, support vector machines, principal component analysis, and regularized regression all let large-magnitude predictors dominate unless you put columns on a common spread first.