recipes step_log() in R: Log-Transform Skewed Predictors
The recipes step_log() function in R applies a log transformation to numeric predictors, compressing right-skewed columns so a model sees a more symmetric spread. You add it to a recipe(), finalize it with prep(), and apply it with bake().
```r
step_log(rec, all_numeric_predictors())             # natural log of all predictors
step_log(rec, disp, hp)                             # log named columns
step_log(rec, all_numeric_predictors(), base = 10)  # log base 10
step_log(rec, all_numeric_predictors(), base = 2)   # log base 2
step_log(rec, units, offset = 1)                    # add 1 first, handles zeros
step_log(rec, amt, signed = TRUE)                   # signed log, handles negatives
prep(rec) |> bake(new_data = NULL)                  # apply the transform
```
Need explanation? Read on for examples and pitfalls.
What step_log() does in R
step_log() replaces each selected column with its logarithm. During bake() it applies log(x, base) to every value, optionally adding an offset first. The default base is the natural constant e, so a plain step_log() call produces natural logs. The outcome variable is left alone when you select columns with all_numeric_predictors().
Logging is the standard fix for a right-skewed predictor, where most values cluster low and a long tail stretches high. The transformation pulls that tail in, turning multiplicative spacing into additive spacing. Counts, prices, incomes, and population sizes are common candidates. Models that assume roughly symmetric inputs, such as linear regression, often fit better once a heavy tail is logged.
step_log() is a static transformation. Unlike step_normalize() or step_BoxCox(), it estimates nothing from the training set. The formula log(x, base) is fixed, so prep() only records which columns to touch. That makes the step fast and perfectly reproducible, but it also means a zero or negative value in new data will produce -Inf or NaN instead of an error.
step_log() syntax and arguments
step_log() attaches a logarithm operation to a recipe. You pass the recipe first, then the columns to transform, chosen with tidyselect helpers or bare names.
The arguments you will actually touch:
| Argument | Purpose |
|---|---|
| recipe | The recipe object the step is added to. |
| ... | Columns to transform, chosen with selectors or bare names. |
| base | Base of the logarithm. Default exp(1), the natural log. |
| offset | Constant added to each value before the log. Default 0. |
| signed | If TRUE, uses sign(x) * log(abs(x)); values in (-1, 1) map to 0. |
| columns | Filled in by prep(); the resolved column names. |
| skip | If TRUE, the step is ignored when baking new data. Leave FALSE. |
Log-transforming predictors: worked examples
Build the recipe, prep it, then bake. A recipe is a plan until prep() finalizes it. The first example takes the natural log of every numeric predictor in mtcars.
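The recipe, prep, and bake sequence described above can be sketched as follows. This is a minimal illustration rather than the article's original listing. It assumes the recipes package is installed, and it restricts mtcars to three strictly positive predictors, because the full data set has 0/1 columns (such as vs and am) whose log would be -Inf. The names cars, rec, prepped, and logged are illustrative.

```r
library(recipes)

# Assumption of this sketch: keep the outcome plus three strictly
# positive predictors, so every log is finite.
cars <- mtcars[, c("mpg", "disp", "hp", "wt")]

# The recipe is only a plan at this point
rec <- recipe(mpg ~ ., data = cars) |>
  step_log(all_numeric_predictors())

# prep() finalizes the plan; bake(new_data = NULL) returns the
# transformed training set
prepped <- prep(rec, training = cars)
logged <- bake(prepped, new_data = NULL)

head(logged, 3)
```

The disp, hp, and wt columns now hold natural logs, while mpg is returned unchanged.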
The outcome mpg keeps its original scale because all_numeric_predictors() excludes it. To log specific columns in a different base, name them and set base. Base 10 maps each order of magnitude to one unit, which many readers find easier to interpret.
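A hedged sketch of the named-column, base-10 variant, again assuming recipes is installed. The object names rec10 and logged10 are illustrative, and the choice of disp and hp as the logged columns is an assumption of this example.

```r
library(recipes)

# Log only disp and hp, in base 10; wt is a predictor but is left alone
rec10 <- recipe(mpg ~ disp + hp + wt, data = mtcars) |>
  step_log(disp, hp, base = 10)

logged10 <- prep(rec10) |> bake(new_data = NULL)

head(logged10, 3)
```

A disp value of 100 maps to 2 and a value of 1000 would map to 3, which is the "one unit per order of magnitude" reading.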
The offset argument adds a constant before the log, the standard trick for columns that contain zeros. In R, log(0) returns -Inf, so step_log(units, offset = 1) shifts every value up by 1 first and keeps every row finite. With the default natural-log base, this gives the same result as the log1p() function.
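To make the offset behavior concrete, here is a small sketch on made-up data. The sales data frame and its revenue and units columns are hypothetical, invented only for this illustration.

```r
library(recipes)

# Hypothetical data: units contains a zero
sales <- data.frame(
  revenue = c(10, 25, 40, 80),
  units   = c(0, 3, 12, 150)
)

rec_off <- recipe(revenue ~ units, data = sales) |>
  step_log(units, offset = 1)   # computes log(units + 1)

baked_off <- prep(rec_off) |> bake(new_data = NULL)

baked_off$units
# With the default base, the result should match log1p()
all.equal(baked_off$units, log1p(sales$units))
```

The zero row becomes log(1) = 0 rather than -Inf.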
For columns that also hold negative values, set signed = TRUE. The signed log takes sign(x) * log(abs(x)) and maps any value between -1 and 1 to 0, so the result stays finite across the whole real line.
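The signed variant can be sketched the same way. The ledger data frame and its amt column are hypothetical, chosen to cover large negatives, values inside (-1, 1), and large positives.

```r
library(recipes)

# Hypothetical data: amt spans negative, near-zero, and positive values
ledger <- data.frame(
  y   = c(1, 2, 3, 4, 5),
  amt = c(-100, -0.5, 0, 0.5, 100)
)

rec_signed <- recipe(y ~ amt, data = ledger) |>
  step_log(amt, signed = TRUE)

baked_signed <- prep(rec_signed) |> bake(new_data = NULL)

# Large values keep their sign; anything inside (-1, 1) becomes 0
baked_signed$amt
```

The result is symmetric around zero: -100 and 100 map to -log(100) and log(100), and the three middle values all collapse to 0.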
step_log() vs step_BoxCox() vs step_YeoJohnson()
Pick the step that matches your data and your goal. All three reduce skew, but they differ in how much they estimate and what input values they tolerate.
| Step | What it does | Handles zeros or negatives |
|---|---|---|
| step_log() | Applies a fixed log(x, base) | Only with offset or signed |
| step_BoxCox() | Estimates the best power transform | No, needs strictly positive data |
| step_YeoJohnson() | Estimates a power transform | Yes, defined for all real values |
| step_sqrt() | Applies a fixed square root | Needs non-negative data |
Use step_log() when you already know a log makes sense and you want a transformation that is fast, fixed, and easy to explain. Reach for step_BoxCox() or step_YeoJohnson() when you would rather let the data choose the exponent, and step_YeoJohnson() specifically when the column has zeros or negatives. The estimated steps are more flexible but harder to communicate, since the fitted power is rarely a round number.
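The difference between a fixed step and an estimated one shows up in what prep() stores. The sketch below puts one of each in the same recipe and inspects them with tidy(). Pairing step_log() on disp with step_YeoJohnson() on hp is an arbitrary choice for illustration, and the object names are assumptions of this example.

```r
library(recipes)

rec_cmp <- recipe(mpg ~ disp + hp, data = mtcars) |>
  step_log(disp) |>        # fixed transform
  step_YeoJohnson(hp)      # power estimated from the training data

prepped_cmp <- prep(rec_cmp)

# Step 1: nothing estimated, only the column name and the base
log_info <- tidy(prepped_cmp, number = 1)
log_info

# Step 2: one estimated lambda per transformed column
yj_info <- tidy(prepped_cmp, number = 2)
yj_info
```

The lambda reported for hp depends on the training data, which is why the estimated steps are harder to describe in a sentence than "we took the log".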
Common pitfalls with step_log()
Most step_log() bugs trace back to invalid inputs or the wrong column set. Because the transformation is static, recipes will not raise an error before producing a broken column.
- Zeros and negatives. log(0) is -Inf and log(-1) is NaN. A predictor with either value needs offset to shift it positive or signed = TRUE to use the signed variant.
- Transforming the outcome by accident. all_numeric() includes the response variable. Use all_numeric_predictors() so the model still trains and predicts on the untransformed target.
- New data with unexpected values. Training data may be all positive while a later batch contains a zero. step_log() will not error; it will quietly emit -Inf and corrupt the prediction. Validate new data or build the offset in from the start.
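The new-data pitfall is easy to reproduce. In this sketch the train and new_batch data frames are hypothetical. The training column is strictly positive, and the later batch contains a single zero.

```r
library(recipes)

# Hypothetical data: training is all positive, the new batch has a zero
train     <- data.frame(y = c(1, 2, 3, 4), x = c(1, 10, 100, 1000))
new_batch <- data.frame(y = c(5, 6),       x = c(5, 0))

prepped_x <- recipe(y ~ x, data = train) |>
  step_log(x) |>
  prep()

# No error is raised here, even though one value is invalid
baked_new <- bake(prepped_x, new_data = new_batch)
baked_new$x

# A simple guard: flag non-finite values before they reach the model
has_bad_values <- any(!is.finite(baked_new$x))
has_bad_values
```

A check like has_bad_values is one option. Adding offset to the step from the start is another, at the cost of changing the transform for every row.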
-Inf will not stop the pipeline. bake() returns the infinite value, the model receives it, and the failure surfaces far downstream as an unexplained NA prediction. Inspect the range of every logged predictor right after prep().
Try it yourself
Try it: Log-transform the disp column of mtcars using base 10 in a recipe, prep it, and save the baked result to ex_logged.
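One possible solution, sketched under the assumption that mpg is the outcome and all other mtcars columns are predictors. Only disp is named in step_log(), so the other columns pass through unchanged.

```r
library(recipes)

ex_logged <- recipe(mpg ~ ., data = mtcars) |>
  step_log(disp, base = 10) |>   # only disp, in base 10
  prep() |>
  bake(new_data = NULL)

# First row of mtcars has disp = 160
ex_logged$disp[1]
```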
Explanation: Passing disp and base = 10 limits the step to one column and switches the logarithm base. After prep() and bake(), the disp values are replaced by their base-10 logs, so the first entry of 160 becomes about 2.204.
Related recipes steps
step_log() is one of several recipes transformation steps. These pair naturally with it in a tidymodels workflow:
- step_BoxCox() estimates the best power transform for positive data.
- step_YeoJohnson() estimates a power transform that allows zeros and negatives.
- step_sqrt() applies a fixed square-root transform.
- step_normalize() centers and scales to mean 0 and SD 1.
- recipe() creates the preprocessing object every step attaches to.
In Python, the rough equivalent of step_log() is np.log(df), or np.log1p(df) when the column contains zeros. The recipes version differs by attaching the transform to a reusable recipe, so the same logic runs on training and new data without copy-pasting.
FAQ
What does step_log() do in the recipes package?
step_log() adds a log transformation to a recipe. When you bake() the prepped recipe, every value in the selected columns is replaced by log(x, base), optionally after adding an offset. The default base is e, the natural log. It is most often used to compress right-skewed numeric predictors so models that assume symmetric inputs fit more reliably. Because it is a static transform, it estimates nothing from the data and runs identically on training and new rows.
How do I take a log transform of zeros with step_log()?
The log of zero is -Inf, so a column with zeros needs the offset argument. Setting step_log(units, offset = 1) adds 1 to every value before the log, which keeps zero rows finite at log(1) = 0. With the default natural-log base, the result matches the log1p() function. If the column also has negative values, use signed = TRUE instead, which applies sign(x) * log(abs(x)) and maps values between -1 and 1 to 0.
Should I use step_log() or step_BoxCox()?
Use step_log() when you already know a log is appropriate and want a fixed, transparent transformation that is easy to explain to others. Use step_BoxCox() when you would rather let the data choose the exponent automatically. Box-Cox estimates a power that best symmetrizes each column, which can fit better but produces a non-round exponent that is harder to communicate. Box-Cox also requires strictly positive data, while step_log() can handle zeros and negatives through offset and signed.
Does step_log() transform the outcome variable?
Not if you select columns with all_numeric_predictors(), which excludes the response. If you use all_numeric() instead, the outcome is included and gets logged too, which is rarely what you want. To log the response on purpose, name it explicitly and remember that predictions then come back on the log scale and must be exponentiated to return to the original units.
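Logging the response on purpose can be sketched like this. The example names mpg explicitly in step_log(), fits a plain lm() on the baked data, and exponentiates the predictions. Using lm() and predicting on the training rows are simplifications chosen for this illustration, not a recommended modeling workflow.

```r
library(recipes)

# Name the outcome explicitly so it is logged on purpose
rec_out <- recipe(mpg ~ disp + hp, data = mtcars) |>
  step_log(mpg)

train_baked <- prep(rec_out) |> bake(new_data = NULL)

# The model is trained on log(mpg)
fit <- lm(mpg ~ disp + hp, data = train_baked)

# Predictions come back on the log scale
pred_log <- predict(fit, newdata = train_baked)

# exp() returns them to the original mpg units
pred_mpg <- exp(pred_log)
head(pred_mpg, 3)
```

Because the step uses the default natural-log base, exp() is the matching inverse. A different base would need base^pred_log instead.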