recipes step_YeoJohnson() in R: Transform Skewed Data
The recipes step_YeoJohnson() function applies a Yeo-Johnson power transformation that reshapes skewed predictors toward a more normal distribution. Unlike Box-Cox, it accepts zeros and negative values, so it works on the full real line.
    step_YeoJohnson(rec, all_numeric_predictors())     # transform every numeric predictor
    step_YeoJohnson(rec, profit, sensor)               # transform named columns only
    step_YeoJohnson(rec, limits = c(-3, 3))            # narrow the lambda search range
    step_YeoJohnson(rec, num_unique = 10)              # require 10+ unique values
    step_YeoJohnson(rec, na_rm = TRUE)                 # drop NAs before estimating lambda
    prep(rec) |> tidy(number = 1)                      # view the estimated lambdas
    recipe(y ~ ., data = df) |> step_YeoJohnson(...)   # add the step inside a recipe
Need explanation? Read on for examples and pitfalls.
What step_YeoJohnson() does
step_YeoJohnson() adds a Yeo-Johnson transformation step to a recipe. It does not change any data at the moment you call it. The step records which columns to transform, and the per-column power parameters are estimated later, when prep() runs on the training data.
The transformation applies a power transform governed by a parameter, lambda, chosen by maximum likelihood to make the column as close to normal as possible. It extends Box-Cox by shifting non-negative values by one, so zero is well-defined, and by adding a separate formula branch for negative values. One column can therefore mix negatives, zeros, and positives and still receive a consistent transform.
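The two branches can be written out in a few lines of base R. This is a minimal sketch of the standard Yeo-Johnson formula for a fixed lambda, not the code recipes runs internally; the function name `yeo_johnson` and the `eps` tolerance are illustrative choices, and the input is assumed to contain no missing values.

```r
# Hand-written Yeo-Johnson transform for one fixed lambda (illustrative only).
# Assumes x is numeric with no NA values.
yeo_johnson <- function(x, lambda, eps = 1e-8) {
  out <- numeric(length(x))
  pos <- x >= 0

  # Non-negative branch: Box-Cox applied to x + 1, so zero maps to zero
  if (abs(lambda) < eps) {
    out[pos] <- log1p(x[pos])
  } else {
    out[pos] <- ((x[pos] + 1)^lambda - 1) / lambda
  }

  # Negative branch: mirrored transform that uses the power 2 - lambda
  if (abs(lambda - 2) < eps) {
    out[!pos] <- -log1p(-x[!pos])
  } else {
    out[!pos] <- -((1 - x[!pos])^(2 - lambda) - 1) / (2 - lambda)
  }

  out
}

x <- c(-3, -1, 0, 1, 3)
yeo_johnson(x, lambda = 1)    # lambda = 1 returns the input unchanged
yeo_johnson(x, lambda = 0.5)  # lambda < 1 compresses the positive side
```

With lambda equal to 1 both branches reduce to the identity, which is a quick sanity check that the formula is wired correctly.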
Because the lambda is learned once on the training set and then frozen, bake() applies the exact same transformation to test data or new observations. That separation is what keeps a tidymodels workflow free of data leakage.
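The train-then-freeze behavior can be sketched with a simple split. The data frame, the column names, and the 150/50 split below are assumptions made for illustration; only the recipes calls come from the package.

```r
library(recipes)

set.seed(42)
# Hypothetical data: one right-skewed predictor that dips below zero
df <- data.frame(
  y      = rnorm(200),
  profit = rlnorm(200, meanlog = 2, sdlog = 0.9) - 4
)
train <- df[1:150, ]
test  <- df[151:200, ]

rec <- recipe(y ~ ., data = train) |>
  step_YeoJohnson(all_numeric_predictors())

# prep() estimates lambda from the training rows only
prepped <- prep(rec, training = train)

# bake() reuses that stored lambda on rows the recipe has never seen
test_baked <- bake(prepped, new_data = test)

tidy(prepped, number = 1)
```

The test rows never influence the estimate, because the only data prep() sees is `train`.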
If a column might ever contain zeros or negatives, choose step_YeoJohnson() over step_BoxCox().
step_YeoJohnson() syntax and arguments
step_YeoJohnson() takes column selectors plus a few tuning arguments. The selectors decide which columns to transform, and the remaining arguments control how the lambda search behaves.
The ... argument accepts any tidyselect selector, so all_numeric_predictors() or bare column names work. The limits argument bounds the lambda search, and num_unique guards against transforming near-constant or discrete columns. You rarely set lambdas or trained by hand, because prep() fills them in.
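A call that spells out the tuning arguments might look like the sketch below. The values shown are the defaults as documented in recent recipes releases, to the best of my knowledge, so check them against your installed version. The data frame is made up for the example.

```r
library(recipes)

set.seed(1)
# Hypothetical data with two skewed predictors
df <- data.frame(
  y      = rnorm(100),
  profit = rlnorm(100, meanlog = 2, sdlog = 0.9) - 4,
  sensor = rexp(100, rate = 0.5) - 1
)

rec <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(
    profit, sensor,          # ... : tidyselect selectors or bare column names
    limits     = c(-5, 5),   # lower and upper bound of the lambda search
    num_unique = 5,          # minimum number of unique values required
    na_rm      = TRUE        # ignore missing values when estimating lambda
  )

rec
```

Narrowing `limits` or raising `num_unique` makes the step more conservative: more columns are left alone, and no lambda outside the stated range is considered.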
Transform skewed data: four examples
Every example builds its data inside the code block. The columns are deliberately skewed and include negative values, so the Yeo-Johnson effect is easy to see.
Example 1: Measure skew on data that has negatives
Start by confirming the predictors are skewed and dip below zero. A small skewness helper and range() together show why Box-Cox is not an option here.
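The original code block did not survive in this copy, so the following is a stand-in sketch. The simulated data frame `df`, the seed, and the `skewness()` helper are assumptions; the exact numbers it prints will differ from those the article originally showed.

```r
set.seed(123)
n <- 500

# Simulated data: both predictors are right-skewed and shifted below zero
df <- data.frame(
  y      = rnorm(n),
  profit = rlnorm(n, meanlog = 2, sdlog = 0.9) - 4,
  sensor = rexp(n, rate = 0.5) - 1
)

# Moment-based sample skewness: 0 is symmetric, above 0 is a long right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sapply(df[c("profit", "sensor")], skewness)
sapply(df[c("profit", "sensor")], range)
```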
Both predictors have a strong positive skew, and profit runs well into negative territory. A Box-Cox transformation cannot touch a column with negative values, which makes this a natural job for step_YeoJohnson().
Example 2: Add step_YeoJohnson() to a recipe
Build a recipe and attach the step with a selector. The all_numeric_predictors() selector picks every numeric predictor and leaves the outcome y untouched.
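A sketch of that recipe, using the same simulated data as an assumption (the article's original data is not recoverable here):

```r
library(recipes)

set.seed(123)
n <- 500
# Simulated data: one outcome and two skewed predictors
df <- data.frame(
  y      = rnorm(n),
  profit = rlnorm(n, meanlog = 2, sdlog = 0.9) - 4,
  sensor = rexp(n, rate = 0.5) - 1
)

rec <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(all_numeric_predictors())

rec            # prints the roles and the queued operation
summary(rec)   # one row per variable with its role
```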
The printout confirms one outcome and two predictors, with a single Yeo-Johnson operation queued. No lambdas exist yet, because the recipe has not been prepped.
Example 3: Prep, bake, and recheck the skew
prep() estimates the lambdas and bake() applies them. Re-measuring skewness on the baked data shows how far the transformation moved each column.
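The prep, bake, and recheck sequence could look like this. The data and the `skewness()` helper are the same illustrative assumptions as before, and no particular output values are implied.

```r
library(recipes)

set.seed(123)
n <- 500
df <- data.frame(
  y      = rnorm(n),
  profit = rlnorm(n, meanlog = 2, sdlog = 0.9) - 4,
  sensor = rexp(n, rate = 0.5) - 1
)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

rec <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(all_numeric_predictors())

prepped <- prep(rec, training = df)       # estimate one lambda per column
baked   <- bake(prepped, new_data = NULL) # return the transformed training set

cols <- c("profit", "sensor")
rbind(
  before = sapply(df[cols], skewness),
  after  = sapply(baked[cols], skewness)
)
```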
Both columns now have a skewness near 0, meaning the long right tails are gone and the distributions are close to symmetric. The new_data = NULL argument tells bake() to return the already-prepped training data.
Example 4: Inspect the estimated lambdas
tidy() reveals the lambda chosen for each column. Pass the step number to pull its estimated parameters.
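A sketch of the lookup, again on simulated data that stands in for the article's lost example:

```r
library(recipes)

set.seed(123)
n <- 500
df <- data.frame(
  y      = rnorm(n),
  profit = rlnorm(n, meanlog = 2, sdlog = 0.9) - 4,
  sensor = rexp(n, rate = 0.5) - 1
)

prepped <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(all_numeric_predictors()) |>
  prep()

# number = 1 refers to the first (and only) step in the recipe
lambdas <- tidy(prepped, number = 1)
lambdas
```

The result is a tibble with one row per transformed column: `terms` holds the column name and `value` holds its lambda.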
Each value is the lambda fitted for that column. Both lie below 1, so the transformation compresses the long right tail rather than stretching it. These frozen lambdas are exactly what bake() reuses on any future data.
step_YeoJohnson() vs other transformation steps
recipes ships several transformation steps that overlap with Yeo-Johnson. Each one suits a different data condition, so the right choice depends on the column in front of you.
| Step | What it does | Use when |
|---|---|---|
| step_YeoJohnson() | Power transform across the full real line | Skewed data with zeros or negatives |
| step_BoxCox() | Power transform for positive data only | Predictors are strictly positive and skewed |
| step_log() | Fixed log transform | A simple, fixed compression is enough |
| step_sqrt() | Fixed square-root transform | Mild right skew in non-negative counts |
| step_normalize() | Centers and scales to z-scores | The shape is fine but the scale is not |
The decision rule is short. Use step_YeoJohnson() when a skewed column can be zero or negative, step_BoxCox() when every value is strictly positive, and step_normalize() when only the scale needs fixing.
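In Python's scikit-learn, the counterpart of step_YeoJohnson() is PowerTransformer(method="yeo-johnson"). Both estimate one lambda per feature by maximum likelihood, and Yeo-Johnson is PowerTransformer's default method, so a recipe step ports across the two ecosystems with little change in behavior. One difference to keep in mind: PowerTransformer also standardizes its output by default (standardize=True), which in recipes corresponds to adding step_normalize() after the transform.
Common pitfalls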
Three mistakes catch most newcomers to step_YeoJohnson(). Each one below shows the problem and the fix.
The most common is selecting a non-numeric column. Yeo-Johnson is a numeric transformation, so a factor or character column passed to the step makes prep() fail rather than silently skip.
The second pitfall is forgetting to prep() before bake(). A recipe holding a raw, unprepped step_YeoJohnson() has no lambdas, so bake() has nothing to apply and stops with an error.
The third is treating the output as interpretable. The transformed column sits on a power scale, not in its original units, so report model results on a back-transformed scale.
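The first two pitfalls are easy to reproduce. In this sketch the data frame and the `region` factor are invented for illustration, and tryCatch() captures each error so the script keeps running. The exact error wording depends on the installed recipes version.

```r
library(recipes)

set.seed(99)
# Hypothetical data with one numeric and one factor predictor
df <- data.frame(
  y      = rnorm(50),
  profit = rlnorm(50) - 1,
  region = factor(sample(c("north", "south"), 50, replace = TRUE))
)

# Pitfall 1: a factor column is selected, so prep() raises an error
bad_rec <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(region)
err_type <- tryCatch(prep(bad_rec), error = function(e) e)
conditionMessage(err_type)

# Fix: let the selector keep only numeric predictors
good_rec <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(all_numeric_predictors())

# Pitfall 2: bake() on a recipe that has not been prepped
err_unprepped <- tryCatch(bake(good_rec, new_data = df), error = function(e) e)
conditionMessage(err_unprepped)

# Fix: prep() first, then bake()
baked <- bake(prep(good_rec), new_data = df)
```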
step_mutate() or filtering before you reach for step_YeoJohnson().
Try it yourself
Try it: Build a recipe for a data frame ex_df whose score column contains negative values. Add step_YeoJohnson() to transform score, prep it, and read back the estimated lambda. Save the prepped recipe to ex_prep.
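One possible solution is sketched below. The exercise does not specify how ex_df is built, so the simulated `score` column and the `outcome` column are assumptions made to keep the example runnable.

```r
library(recipes)

set.seed(7)
# Assumed exercise data: a skewed score column that includes negative values
ex_df <- data.frame(
  outcome = rnorm(200),
  score   = rlnorm(200, meanlog = 1, sdlog = 0.8) - 2
)

ex_prep <- recipe(outcome ~ score, data = ex_df) |>
  step_YeoJohnson(score) |>
  prep()

# Read back the estimated lambda for score
tidy(ex_prep, number = 1)
```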
Explanation: The score column includes negative values, so Box-Cox could not transform it. Yeo-Johnson estimates a single lambda that reshapes the column across zero, and tidy() reads that lambda back as a tibble.
Related recipes functions
step_YeoJohnson() works alongside the rest of the recipes preprocessing family. These steps cover the neighboring tasks in a feature-engineering pipeline.
- step_BoxCox() applies a power transform for strictly positive predictors.
- step_log() applies a fixed log transform without estimating a parameter.
- step_normalize() centers and scales numeric predictors to z-scores.
- step_range() rescales numeric predictors into a fixed interval.
- prep() estimates every step's parameters from training data.
FAQ
Can step_YeoJohnson() handle negative and zero values?
Yes, and that is its defining feature. The Yeo-Johnson transformation shifts non-negative values by one before applying the power and uses a separate, mirrored branch for negative values, so a column can mix negative, zero, and positive numbers and still receive one consistent power transform. This is the key difference from step_BoxCox(), which is undefined for any value that is not strictly positive. For predictors like profit or temperature that routinely dip below zero, step_YeoJohnson() is the safe default.
What is the difference between step_YeoJohnson() and step_BoxCox()?
Both estimate a power transformation that reshapes a skewed column toward normality, and for strictly positive data they behave almost identically. The difference is the input range they accept. step_BoxCox() requires every value to be greater than zero, while step_YeoJohnson() works across the full real line. If you are unsure whether a column will ever contain zeros or negatives, choosing Yeo-Johnson avoids a skipped transformation later.
Should step_YeoJohnson() run before or after step_normalize()?
Run step_YeoJohnson() first. Yeo-Johnson reshapes a skewed distribution, and step_normalize() then centers and scales the reshaped values to mean 0 and standard deviation 1. Normalizing first would only shift and scale the original skewed shape, leaving the long tail intact, so the order matters for any model that assumes well-behaved predictors.
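The ordering can be sketched as a two-step recipe. The simulated data frame is an assumption. After baking, each predictor should have mean 0 and standard deviation 1 on the training set.

```r
library(recipes)

set.seed(123)
n <- 500
df <- data.frame(
  y      = rnorm(n),
  profit = rlnorm(n, meanlog = 2, sdlog = 0.9) - 4,
  sensor = rexp(n, rate = 0.5) - 1
)

rec <- recipe(y ~ ., data = df) |>
  step_YeoJohnson(all_numeric_predictors()) |>  # 1. reshape the distribution
  step_normalize(all_numeric_predictors())      # 2. then center and scale

baked <- prep(rec) |> bake(new_data = NULL)

# Check the centering and scaling on the transformed columns
sapply(baked[c("profit", "sensor")], function(x) c(mean = mean(x), sd = sd(x)))
```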
How do I reverse a Yeo-Johnson transformation?
recipes does not ship a built-in inverse for step_YeoJohnson(), because the step is meant for predictors that feed a model rather than values you report directly. You rarely need to invert a predictor. If you transformed an outcome variable and need predictions on the original scale, apply the inverse Yeo-Johnson formula by hand using the lambda from tidy(), or keep the outcome untransformed and let the model handle the skew.
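A hand-written inverse might look like the sketch below. `yj_inverse()` is a hypothetical helper, not part of recipes. It inverts the standard Yeo-Johnson formula and relies on the fact that the transform preserves sign. The round trip uses a predictor column on simulated data purely to show the mechanics, and it assumes the estimated lambda is not within rounding distance of 0 or 2.

```r
library(recipes)

# Hypothetical helper: inverse of the standard Yeo-Johnson transform
yj_inverse <- function(z, lambda, eps = 1e-8) {
  out <- numeric(length(z))
  pos <- z >= 0  # transformed values keep the sign of the original values

  if (abs(lambda) < eps) {
    out[pos] <- expm1(z[pos])
  } else {
    out[pos] <- (lambda * z[pos] + 1)^(1 / lambda) - 1
  }

  if (abs(lambda - 2) < eps) {
    out[!pos] <- -expm1(-z[!pos])
  } else {
    out[!pos] <- 1 - (1 - (2 - lambda) * z[!pos])^(1 / (2 - lambda))
  }

  out
}

set.seed(123)
df <- data.frame(
  y      = rnorm(300),
  profit = rlnorm(300, meanlog = 2, sdlog = 0.9) - 4
)

prepped <- recipe(y ~ profit, data = df) |>
  step_YeoJohnson(profit) |>
  prep()

lambda   <- tidy(prepped, number = 1)$value
baked    <- bake(prepped, new_data = NULL)
restored <- yj_inverse(baked$profit, lambda)

all.equal(restored, df$profit)  # compare the round trip with the original
```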
For the full argument reference, see the step_YeoJohnson() documentation.