recipes step_BoxCox() in R: Normalize Skewed Predictors
The recipes step_BoxCox() function applies a Box-Cox transformation that reshapes skewed, strictly positive predictors toward a more normal distribution. It estimates the best power transform per column during prep() and reuses it on new data.
step_BoxCox(rec, all_numeric_predictors())   # transform every numeric predictor
step_BoxCox(rec, income, latency)            # transform named columns only
step_BoxCox(rec, limits = c(-3, 3))          # narrow the lambda search range
step_BoxCox(rec, num_unique = 10)            # require 10+ unique values
prep(rec) |> tidy(number = 1)                # view the estimated lambdas
recipe(y ~ ., data = df) |> step_BoxCox(...) # add the step inside a recipe
Need explanation? Read on for examples and pitfalls.
What step_BoxCox() does
step_BoxCox() adds a Box-Cox transformation step to a recipe. It does not transform data on its own. The step records your intent, and the actual lambda values are estimated later when you call prep() on the recipe with training data.
The Box-Cox transformation raises each value to a power, lambda, chosen to make the column as close to normal as possible. A lambda near 0 acts like a log transform, a lambda near 1 leaves the shape of the data essentially unchanged, and values in between compress the right tail by an intermediate amount. The recipe estimates one lambda per selected column by maximizing a profile likelihood.
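The effect of lambda is easier to see when the transform is written out. The following is a minimal base-R sketch using the standard Box-Cox definition; `box_cox()` is a hypothetical helper written for illustration and is not a recipes function.

```r
# Hypothetical helper for illustration; not part of recipes.
# Standard Box-Cox: (x^lambda - 1) / lambda, with log(x) as the case lambda = 0.
box_cox <- function(x, lambda, eps = 1e-8) {
  stopifnot(all(x > 0))  # only defined for strictly positive values
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 10, 100, 1000)

box_cox(x, lambda = 1)      # subtracts 1 from every value; shape is unchanged
box_cox(x, lambda = 0.5)    # square-root-like compression of the right tail
box_cox(x, lambda = 0)      # natural log
box_cox(x, lambda = 0.001)  # already very close to the log result
```

Subtracting 1 and dividing by lambda is what makes the power transform approach the log smoothly as lambda goes to 0.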
Because the lambda is learned from the training set and then frozen, bake() applies the exact same transformation to test data or new observations. That separation is what keeps a tidymodels workflow free of data leakage.
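A minimal sketch of that train/test separation follows. The data are simulated, and `latency` is a hypothetical right-skewed predictor chosen for illustration.

```r
library(recipes)

set.seed(123)
# Simulated data; `latency` is a hypothetical strictly positive predictor.
df <- data.frame(
  y = rnorm(300),
  latency = rexp(300, rate = 1 / 120)
)
train <- df[1:200, ]
test  <- df[201:300, ]

# The lambda is estimated from the training rows only.
prepped <- recipe(y ~ latency, data = train) |>
  step_BoxCox(latency) |>
  prep()

# The frozen lambda is then applied to rows prep() never saw.
test_baked <- bake(prepped, new_data = test)

# Baking a single row gives the same value as baking it with the rest of the
# test set, because nothing is re-estimated at bake time.
one_row <- bake(prepped, new_data = test[1, ])
all.equal(one_row$latency, test_baked$latency[1])
```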
step_BoxCox() only schedules the transformation. prep() learns the lambdas from training data, and bake() applies them. Keeping estimation and application apart is what makes the same transform reproducible across resamples.

If a selected column contains zero or negative values, prep() cannot estimate a lambda for it, leaves the column untransformed, and emits a warning. Use step_YeoJohnson() for data that includes non-positive values.

step_BoxCox() syntax and arguments
step_BoxCox() takes column selectors plus a few tuning arguments. The selectors choose which columns to transform, and the remaining arguments control how the lambda search runs.
The ... argument accepts any tidyselect selector, so all_numeric_predictors() or bare column names both work. The limits argument bounds the lambda search, and num_unique protects against transforming near-constant or discrete columns. You rarely set lambdas or trained by hand, because prep() fills them in.
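As a sketch of how these arguments fit together, the call below sets the selectors and both tuning arguments explicitly. The data frame and its column names are simulated assumptions. The defaults noted in the comments are those listed in the recipes reference.

```r
library(recipes)

set.seed(1)
# Simulated data for illustration; column names are hypothetical.
df <- data.frame(
  y = rnorm(200),
  income = rgamma(200, shape = 2, scale = 20),
  latency = rexp(200, rate = 1 / 120)
)

rec <- recipe(y ~ ., data = df) |>
  step_BoxCox(
    income, latency,     # `...`: bare column names or any tidyselect selector
    limits = c(-2, 2),   # lambda search range (default: c(-5, 5))
    num_unique = 10      # minimum distinct values required (default: 5)
  )

# `lambdas` and `trained` are left alone; prep() fills them in.
lambdas <- tidy(prep(rec), number = 1)
lambdas
```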
Transform skewed predictors: four examples
Every example below uses data generated inside the code block. No downloads are needed, and the skewed columns make the effect of the transformation easy to see.
Example 1: Measure the skew before transforming
Start by confirming the predictors really are skewed. A simple skewness helper shows how far each column leans before any step runs.
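A minimal sketch of this check, under stated assumptions: the data frame `df` is simulated, with two right-skewed predictors (`income`, `latency`) and an outcome `y`. `skewness()` is a hypothetical helper that computes the moment-based sample skewness.

```r
set.seed(42)
n <- 500

# Simulated data: both predictors are strictly positive with long right tails.
df <- data.frame(
  y = rnorm(n),
  income = rgamma(n, shape = 2, scale = 20),
  latency = rexp(n, rate = 1 / 120)
)

# Hypothetical helper: third central moment divided by the variance^1.5.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_before <- sapply(df[c("income", "latency")], skewness)
skew_before
```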
Both predictors have a strong positive skew, with long right tails. A skewness near 0 would mean a symmetric, roughly normal shape, so these columns are good candidates for a Box-Cox transformation.
Example 2: Add step_BoxCox() to a recipe
Build a recipe and attach the step with a selector. The all_numeric_predictors() selector picks every numeric predictor, leaving the outcome y alone.
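A sketch of this step, assuming the same kind of simulated data frame (outcome `y` plus the hypothetical predictors `income` and `latency`):

```r
library(recipes)

set.seed(42)
n <- 500
# Simulated data for illustration; column names are hypothetical.
df <- data.frame(
  y = rnorm(n),
  income = rgamma(n, shape = 2, scale = 20),
  latency = rexp(n, rate = 1 / 120)
)

# Declare roles with the formula, then queue the Box-Cox step.
rec <- recipe(y ~ ., data = df) |>
  step_BoxCox(all_numeric_predictors())

rec            # prints the inputs and the queued operation
summary(rec)   # one row per variable, with its role
```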
The printout confirms the recipe has one outcome and two predictors, with a single Box-Cox operation queued. No lambdas exist yet, because the recipe has not been prepped.
Example 3: Prep, bake, and check the result
prep() estimates the lambdas and bake() applies them. Re-measuring skewness on the baked data shows how far the transformation moved each column.
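A sketch of the prep-and-bake round trip, again on simulated data. The column names and the `skewness()` helper are assumptions made for illustration.

```r
library(recipes)

set.seed(42)
n <- 500
# Simulated data for illustration; column names are hypothetical.
df <- data.frame(
  y = rnorm(n),
  income = rgamma(n, shape = 2, scale = 20),
  latency = rexp(n, rate = 1 / 120)
)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

rec <- recipe(y ~ ., data = df) |>
  step_BoxCox(all_numeric_predictors())

prepped <- prep(rec)                      # estimate one lambda per column
baked   <- bake(prepped, new_data = NULL) # transformed training data

cols <- c("income", "latency")
skew_before <- sapply(df[cols], skewness)
skew_after  <- sapply(baked[cols], skewness)

rbind(before = skew_before, after = skew_after)
```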
Both columns now have a skewness near 0, meaning the long right tails are gone and the distributions are close to symmetric. The new_data = NULL argument tells bake() to return the already-prepped training data.
Example 4: Inspect the estimated lambdas
tidy() reveals the lambda chosen for each column. Pass the step number to pull its estimated parameters.
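A sketch of the lookup on simulated data (column names are hypothetical). The step number is 1 because the Box-Cox step is the first step in this recipe.

```r
library(recipes)

set.seed(42)
n <- 500
# Simulated data for illustration; column names are hypothetical.
df <- data.frame(
  y = rnorm(n),
  income = rgamma(n, shape = 2, scale = 20),
  latency = rexp(n, rate = 1 / 120)
)

prepped <- recipe(y ~ ., data = df) |>
  step_BoxCox(all_numeric_predictors()) |>
  prep()

# One row per transformed column; `value` holds the estimated lambda.
lambdas <- tidy(prepped, number = 1)
lambdas
```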
Each value is the lambda fitted for that column. Both lie between 0 and 1, so the transformation behaves between a log transform and the identity. These frozen lambdas are what bake() reuses on any future data.
step_BoxCox() vs other transformation steps
recipes ships several transformation steps that overlap with Box-Cox. Each one suits a different data condition, so the right choice depends on your column.
| Step | What it does | Use when |
|---|---|---|
| step_BoxCox() | Learns the best power transform | Predictors are strictly positive and skewed |
| step_YeoJohnson() | Power transform that allows 0 and negatives | Data includes zeros or negative values |
| step_log() | Fixed log transform | You want a simple, fixed compression |
| step_normalize() | Centers and scales to z-scores | You need standardized, not reshaped, data |
| step_range() | Rescales to a fixed interval | A model needs inputs bounded in 0 to 1 |
The decision rule is short. Reach for step_BoxCox() when positive predictors are skewed, switch to step_YeoJohnson() the moment a column can be zero or negative, and use step_normalize() when the shape is fine but the scale is not.
Common pitfalls
Three mistakes catch most newcomers to step_BoxCox(). Each one below shows the problem and the fix.
The most common is feeding the step a column with non-positive values. Box-Cox is undefined at or below zero, so prep() fails to estimate a lambda and leaves the column as-is with a warning.
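A sketch of this pitfall and its fix, on simulated data. `profit` is a hypothetical column that dips below zero. The warnings are suppressed only to keep the output short, so the check does not depend on their exact wording.

```r
library(recipes)

set.seed(7)
# Simulated data; column names are hypothetical.
df <- data.frame(
  y = rnorm(200),
  latency = rexp(200, rate = 1 / 120),     # strictly positive
  profit  = rexp(200, rate = 1 / 50) - 20  # skewed, includes negatives
)

# Problem: Box-Cox cannot handle `profit`.
prepped_bc <- suppressWarnings(
  recipe(y ~ ., data = df) |>
    step_BoxCox(all_numeric_predictors()) |>
    prep()
)
baked_bc <- suppressWarnings(bake(prepped_bc, new_data = NULL))

# `profit` comes back exactly as it went in.
isTRUE(all.equal(baked_bc$profit, df$profit))

# Fix: Yeo-Johnson accepts zeros and negatives.
prepped_yj <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  prep()

yj_lambdas <- tidy(prepped_yj, number = 1)
yj_lambdas
```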
The second pitfall is forgetting to prep() before bake(). A recipe with a raw, unprepped step_BoxCox() has no lambdas, so bake() stops with an error instead of transforming anything. The third is selecting discrete or near-constant columns: if a column has fewer distinct values than num_unique, the step skips it rather than fitting an unstable lambda.
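Both of these can be reproduced in a few lines. This is a sketch on simulated data, where `rating` is a hypothetical column with only three distinct values. Warnings are suppressed so the example behaves the same whether or not the installed recipes version warns about the skipped column.

```r
library(recipes)

set.seed(11)
# Simulated data; column names are hypothetical.
df <- data.frame(
  y = rnorm(200),
  latency = rexp(200, rate = 1 / 120),
  rating  = sample(1:3, 200, replace = TRUE)  # only 3 distinct values
)

rec <- recipe(y ~ ., data = df) |>
  step_BoxCox(all_numeric_predictors())

# Pitfall: bake() on an unprepped recipe raises an error.
unprepped <- tryCatch(bake(rec, new_data = df), error = function(e) e)
inherits(unprepped, "error")

# Fix: prep() first, then bake().
prepped <- suppressWarnings(prep(rec))
baked   <- bake(prepped, new_data = NULL)

# Pitfall: `rating` has fewer distinct values than num_unique (default 5),
# so it passes through untransformed while `latency` is reshaped.
isTRUE(all.equal(as.numeric(baked$rating), as.numeric(df$rating)))
```

Naming only the continuous columns in the selector, for example step_BoxCox(latency), makes that skip explicit rather than implicit.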
step_BoxCox() is part of the recipes package, so a library(recipes) call makes it available, and library(tidymodels) loads it too. Unlike modeling steps, it needs no extra engine package, because the lambda search runs inside recipes itself.

Try it yourself
Try it: Add step_BoxCox() to a recipe for the mtcars data, transforming only the disp and hp columns, then prep it. Save the prepped recipe to ex_prep.
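One possible solution, as a sketch. It assumes `mpg` as the outcome, which the exercise does not specify. `ex_rec` and `ex_lambdas` are names chosen here for illustration.

```r
library(recipes)

# mtcars ships with base R; `mpg` as the outcome is an assumption.
ex_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_BoxCox(disp, hp)

ex_prep <- prep(ex_rec)

# Read back the estimated lambdas for the two named columns.
ex_lambdas <- tidy(ex_prep, number = 1)
ex_lambdas
```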
Explanation: Naming disp and hp in step_BoxCox() limits the transformation to those two predictors. prep() estimates one lambda per column on the mtcars training data, and tidy() reads them back as a tibble.
Related recipes functions
step_BoxCox() works alongside the rest of the recipes preprocessing family. These steps cover the neighboring tasks in a feature-engineering pipeline.
- step_YeoJohnson() applies a power transform that allows zero and negative values.
- step_log() applies a fixed log transform without estimating a parameter.
- step_normalize() centers and scales numeric predictors to z-scores.
- step_range() rescales numeric predictors into a fixed interval.
- prep() estimates every step's parameters from training data.
FAQ
What is the difference between step_BoxCox() and step_YeoJohnson()?
Both estimate a power transformation that reshapes a skewed column toward normality. The difference is the input range they accept. step_BoxCox() requires every value to be strictly positive, because the Box-Cox formula is undefined at or below zero. step_YeoJohnson() uses an extended formula that handles zeros and negative values, so it is the safer default when a column can dip below zero, such as a profit or temperature measurement.
Does step_BoxCox() work with negative values?
No. A Box-Cox transformation is only defined for strictly positive data. If a selected column contains a zero or a negative number, prep() cannot estimate a lambda for it, leaves that column untransformed, and prints a warning. Either filter or shift the column so all values are positive, or switch to step_YeoJohnson(), which is built to accept the full real line.
How do I see the lambda values step_BoxCox() chose?
Call tidy() on the prepped recipe and pass the step number, as in tidy(prepped_recipe, number = 1). The result is a tibble with one row per transformed column, where the value column holds the estimated lambda. A lambda near 0 means the transform behaves like a log, while a lambda near 1 means the column was barely changed.
Should step_BoxCox() run before or after step_normalize()?
Run step_BoxCox() first. Box-Cox reshapes a skewed distribution, and step_normalize() then centers and scales the reshaped values to mean 0 and standard deviation 1. Normalizing first would only shift and scale the original skewed shape, leaving the long tail in place, so the ordering matters for models that assume well-behaved predictors.
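A sketch of the ordering on simulated data. `latency` is a hypothetical skewed predictor, and `skewness()` is a hypothetical helper computing the moment-based sample skewness.

```r
library(recipes)

set.seed(99)
# Simulated data; column names are hypothetical.
df <- data.frame(
  y = rnorm(300),
  latency = rexp(300, rate = 1 / 120)
)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Reshape first, then standardize.
bc_then_norm <- recipe(y ~ ., data = df) |>
  step_BoxCox(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

# Standardizing alone changes location and scale but not shape.
norm_only <- recipe(y ~ ., data = df) |>
  step_normalize(all_numeric_predictors()) |>
  prep() |>
  bake(new_data = NULL)

c(mean = mean(bc_then_norm$latency), sd = sd(bc_then_norm$latency))

c(
  raw          = skewness(df$latency),
  norm_only    = skewness(norm_only$latency),
  bc_then_norm = skewness(bc_then_norm$latency)
)
```

Standardized values are centered on zero, so a Box-Cox step placed after step_normalize() would receive negative inputs and skip the column.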
For the full argument reference, see the recipes step_BoxCox() documentation.