recipes recipe() in R: Build a Preprocessing Blueprint
The recipes recipe() function in R defines a preprocessing blueprint: a reusable plan for turning raw data into model-ready features. It records which columns are outcomes and predictors, then collects the transformation steps you apply later.
    recipe(y ~ ., data = train)                     # formula interface
    recipe(train)                                   # bare data, roles assigned later
    rec |> update_role(id, new_role = "ID")         # tag a non-predictor
    rec |> step_normalize(all_numeric_predictors()) # add a step
    rec |> step_dummy(all_nominal_predictors())     # encode factors
    prep(rec, training = train)                     # estimate from training data
    bake(prepped, new_data = NULL)                  # get processed training set
Need explanation? Read on for examples and pitfalls.
What recipe() does
recipe() declares structure, not transformations. When you call recipe(), R does not change a single value in your data. It builds a recipe object that records each column name, its data type, and its role: outcome, predictor, or something custom. The actual transformations come later, as step_*() functions chained onto the recipe.
That separation is the core idea. A recipe is a plan estimated on training data and then applied identically to new data, which stops information from the test set leaking into preprocessing.
recipe() syntax and arguments
recipe() accepts a formula or a bare data frame. The most common call passes a formula and the training data. The left side of the formula is the outcome; the right side lists predictors, with . meaning "all remaining columns".
Key arguments:
- formula: a model formula such as Species ~ . with the outcome on the left and the predictors on the right.
- data / x: a data frame or tibble. Only column names and types are read, so a zero-row slice works fine.
- vars: an optional character vector of column names, used instead of a formula.
- roles: an optional character vector assigning a role to each entry in vars.
The full argument list lives in the recipes reference documentation.
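As a rough illustration of the vars and roles arguments, the sketch below declares roles without a formula. It assumes the recipes package is installed and uses the built-in iris data; the object name rec_vars is only illustrative.

```r
library(recipes)

# Data-frame method: list the columns in `vars` and give one role per entry.
rec_vars <- recipe(
  iris,
  vars  = names(iris),
  roles = c("predictor", "predictor", "predictor", "predictor", "outcome")
)

# summary() shows the role recorded for each column.
summary(rec_vars)
```

The roles vector must be the same length as vars, since each role is matched to the column name in the same position.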
recipe() examples
Start with the formula interface on a built-in dataset. Passing Species ~ . marks Species as the outcome and the four measurement columns as predictors.
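A minimal sketch of that call, assuming the recipes package is installed and taking iris as the built-in dataset (the Species column suggests it, but the choice is an assumption here):

```r
library(recipes)

# Formula interface: Species is the outcome, every other column a predictor.
rec <- recipe(Species ~ ., data = iris)

# summary() returns one row per column with its type and role.
summary(rec)
```

Nothing in iris changes at this point; the recipe only records names, types, and roles.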
Chain step_*() functions to add transformations. Each step is recorded in order. Here two steps center-and-scale the numeric predictors and convert any factor predictors to dummy variables.
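The chaining described here might look like the following sketch. It assumes recipes is installed and R 4.1 or later for the native pipe, and again uses iris as a stand-in dataset.

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale numeric predictors
  step_dummy(all_nominal_predictors())         # dummy-encode any factor predictors

# Printing the recipe lists the recorded steps; nothing has been estimated yet.
rec

# The steps are stored in order on the recipe object.
length(rec$steps)
```

In iris every predictor is numeric, so the dummy step has nothing to select. It is kept only to show how steps stack.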
Call prep() then bake() to run the recipe. prep() estimates each step's parameters, here the mean and standard deviation, from the training data. bake(new_data = NULL) returns the processed training set.
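A short sketch of the prep-then-bake sequence, assuming recipes is installed and using iris as an illustrative dataset. Only the normalization step is included to keep the example small.

```r
library(recipes)

rec <- recipe(Species ~ ., data = iris) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the mean and standard deviation of each numeric predictor.
prepped <- prep(rec, training = iris)

# bake(new_data = NULL) returns the processed training set.
baked <- bake(prepped, new_data = NULL)

# Each numeric predictor should now have mean ~0 and standard deviation 1.
num_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
sapply(baked[num_cols], mean)
sapply(baked[num_cols], sd)
```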
Use the data-frame interface when there is no clean formula. Passing the data alone leaves every column undeclared; update_role() then tags outcomes and predictors explicitly.
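The data-frame interface could be sketched as follows, assuming recipes is installed and using iris purely for illustration:

```r
library(recipes)

# No formula: every column starts with an undeclared (NA) role.
rec_df <- recipe(iris)

# Assign roles explicitly with update_role().
rec_df <- rec_df |>
  update_role(Species, new_role = "outcome") |>
  update_role(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width,
              new_role = "predictor")

summary(rec_df)
```

This form is handy when the column list is built programmatically or is too long to write as a formula.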
recipe() vs other preprocessing tools
recipe() wins when preprocessing must be reproducible across datasets. Base R and dplyr apply transformations immediately and store no parameters, which is fine for exploration but risks applying different parameters to training and test sets.
| Tool | Best for | Reusable on new data |
|---|---|---|
| recipe() + steps | Model preprocessing pipelines | Yes, via prep() and bake() |
| dplyr::mutate() | One-off, exploratory transforms | No, recomputed each time |
| model.matrix() | Quick design matrix for lm() | Partly, no stored parameters |
| caret::preProcess() | Older caret workflows | Yes, but caret-only |
The decision rule is simple. If a transformation must use the same parameters on training and test data, use a recipe. For a throwaway column tweak, reach for mutate().
In scikit-learn terms, the closest equivalent is a Pipeline built from ColumnTransformer steps. prep() maps to fit() and bake() maps to transform().
Common pitfalls
A bare recipe transforms nothing until it is prepped. Three mistakes cause most early confusion.
- Skipping prep(). A recipe object on its own is inert. Calling bake() on an un-prepped recipe errors. Always run prep() first to estimate the steps.
- Selecting the outcome by accident. all_numeric() includes a numeric outcome, so a step can transform your target. Use all_numeric_predictors() to exclude outcomes.
- Prepping on the full dataset. Calling prep() on combined train-and-test data learns means and standard deviations from test rows too, leaking information. Prep on the training split only.
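The second pitfall is easy to see in a small comparison. The sketch below assumes recipes is installed and uses mtcars with mpg as a numeric outcome; the object names are illustrative.

```r
library(recipes)

# all_numeric() also selects the numeric outcome mpg.
rec_all <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric())

# all_numeric_predictors() leaves the outcome alone.
rec_pred <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

baked_all  <- bake(prep(rec_all,  training = mtcars), new_data = NULL)
baked_pred <- bake(prep(rec_pred, training = mtcars), new_data = NULL)

mean(baked_all$mpg)   # close to 0: the target was rescaled
mean(baked_pred$mpg)  # same as mean(mtcars$mpg): the target is untouched
```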
Tip: prefer the _predictors() selectors, such as all_numeric_predictors(), inside steps.
Try it yourself
Try it: Build a recipe on mtcars predicting mpg, add a step that normalizes all numeric predictors, then prep and bake it. Save the baked training data to ex_baked.
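One possible solution, assuming recipes is installed. The intermediate names ex_rec and ex_prepped are illustrative; only ex_baked is required by the exercise.

```r
library(recipes)

# Declare mpg as the outcome and normalize every numeric predictor.
ex_rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Estimate the step on the training data, then return the processed set.
ex_prepped <- prep(ex_rec, training = mtcars)
ex_baked <- bake(ex_prepped, new_data = NULL)

head(ex_baked)
```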
Explanation: recipe() declares mpg as the outcome, step_normalize() centers and scales every numeric predictor, and prep() plus bake(new_data = NULL) returns the transformed training set. The outcome mpg stays untouched because the selector targets predictors only.
Related recipes functions
recipe() is the entry point to a family of preprocessing functions. These pair with it most often:
- step_normalize(): center and scale numeric predictors to mean 0, standard deviation 1.
- step_dummy(): convert factor predictors into indicator columns.
- prep(): estimate step parameters from training data.
- bake(): apply a prepped recipe to any dataset.
- workflow() and add_recipe(): bundle a recipe with a model for one-call fitting.
FAQ
What is a recipe in R?
A recipe is an object created by the recipes package that describes how to preprocess a dataset before modeling. It stores variable roles and an ordered list of transformation steps. The recipe itself holds no transformed data; it is a plan that prep() estimates and bake() executes. Recipes are part of the tidymodels framework and make preprocessing reproducible across training and test sets.
What is the difference between prep() and bake()?
prep() estimates the parameters each step needs, such as column means for centering, using the training data. bake() applies those estimated parameters to produce transformed data. You prep once and bake many times: bake(new_data = NULL) returns the processed training set, while bake(new_data = test) applies the identical transformation to new data without re-estimating anything.
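To make "prep once, bake many times" concrete, here is a sketch that assumes recipes is installed. The row-based split of mtcars into train and test is arbitrary and only for illustration.

```r
library(recipes)

# Illustrative split; a real project would sample rows at random.
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

prepped <- recipe(mpg ~ ., data = train) |>
  step_normalize(all_numeric_predictors()) |>
  prep(training = train)

# The means and standard deviations stored by the first step.
tidy(prepped, number = 1)

# Test rows are scaled with the training statistics, not their own.
test_baked <- bake(prepped, new_data = test)
manual_hp  <- (test$hp - mean(train$hp)) / sd(train$hp)
all.equal(test_baked$hp, manual_hp)
```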
Do I have to use recipe() with workflows?
No. A recipe works on its own through prep() and bake(). Pairing it with a workflow is still recommended, because workflow() plus add_recipe() bundles preprocessing and the model together, so a single fit() call handles both and predictions automatically apply the recipe to new data.
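A sketch of that bundling, assuming the recipes, parsnip, and workflows packages are installed. The choice of linear_reg() and add_model() is an assumption made for illustration; any parsnip model specification could take its place.

```r
library(recipes)
library(parsnip)
library(workflows)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors())

# Bundle the recipe with a model specification (linear regression here).
wf <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg())

# One fit() call preps the recipe and fits the model.
wf_fit <- fit(wf, data = mtcars)

# predict() applies the prepped recipe to new data before predicting.
preds <- predict(wf_fit, new_data = mtcars)
head(preds)
```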
What is the difference between recipe() and a formula in lm()?
A formula in lm() both selects variables and applies inline transformations at fit time, recomputed for every dataset. recipe() separates the plan from execution: it records steps once, estimates them on training data with prep(), and reuses the stored parameters. This prevents test data from influencing preprocessing.
Can I use recipe() without the full tidymodels?
Yes. Install and load only the recipes package; it does not require parsnip, workflows, or the rest of tidymodels. You can prep() and bake() a recipe and feed the result into any modeling function, including base R lm() or glm().
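A minimal sketch of that standalone use, assuming only recipes is installed. The predictors wt and hp are chosen arbitrarily for illustration.

```r
library(recipes)

rec <- recipe(mpg ~ wt + hp, data = mtcars) |>
  step_normalize(all_numeric_predictors())

prepped     <- prep(rec, training = mtcars)
train_baked <- bake(prepped, new_data = NULL)

# Feed the baked data straight into base R.
fit <- lm(mpg ~ ., data = train_baked)
coef(fit)
```

Because both predictors are centered, the intercept of this fit equals the mean of mpg in the training data.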