workflows add_formula() in R: Attach a Formula to a Workflow
The workflows add_formula() function in R attaches a model formula like y ~ x1 + x2 to a tidymodels workflow as the preprocessor. The workflow stores the formula and replays it inside fit() and predict(), so you get the formula syntax of base R with the resampling and tuning machinery of tidymodels.
    workflow() |> add_formula(mpg ~ wt + hp)             # basic formula preprocessor
    workflow() |> add_formula(am ~ .)                    # all other columns as predictors
    workflow() |> add_formula(mpg ~ wt * hp)             # main effects plus interaction
    workflow() |> add_formula(mpg ~ poly(wt, 2) + hp)    # polynomial term, expanded by stats
    add_formula(wf, mpg ~ wt, blueprint = bp)            # custom hardhat blueprint
    update_formula(wf, mpg ~ wt + hp + disp)             # swap the formula in place
    remove_formula(wf)                                   # detach the formula
Need explanation? Read on for examples and pitfalls.
What add_formula() does
add_formula() registers a model formula as the preprocessor of a workflow. It does not run the formula. The function records the formula in the workflow's preprocessor slot, the same way add_recipe() records a recipes object and add_model() records a parsnip spec. The actual model.matrix() expansion happens later, when fit(), fit_resamples(), or tune_grid() is called on the workflow.
The formula is held as an unevaluated R expression. When the workflow is fit, the formula and the training data are passed through the hardhat preprocessor, which builds the predictor matrix and outcome vector. Factor columns become dummy variables, interaction terms multiply out, and inline transformations like log() or poly() are evaluated. The resulting design matrix goes to the engine that the parsnip spec selected.
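The deferred evaluation described above can be sketched as follows. This is a minimal illustration, assuming the tidymodels meta-package is installed; the object names `wf`, `wf_fit`, and `molded` are placeholders chosen for this example, not names from the original article.

```r
library(tidymodels)

# Store the formula and a model spec; nothing is expanded yet.
wf <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm"))

# The preprocessor slot still holds the raw formula.
extract_preprocessor(wf)

# Fitting is what triggers the hardhat preprocessing step.
wf_fit <- fit(wf, data = mtcars)

# extract_mold() exposes the processed predictors and outcome.
molded <- extract_mold(wf_fit)
names(molded$predictors)
head(molded$outcomes)
```

Inspecting the mold after fitting is a convenient way to check what the engine actually received.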
Write mpg ~ wt * hp and the workflow expands it the same way lm() would. The win is that the formula travels with the workflow, so resampling, tuning, and prediction all see the same predictor expansion without you re-typing the formula in three places.

add_formula() syntax and arguments
add_formula() takes a workflow, a formula, and an optional blueprint. The signature is short and most users only ever pass the first two arguments.
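A short sketch of the signature and the two common call shapes, assuming tidymodels is installed. The names `wf`, `bp`, and `wf_bp` are illustrative, and the choice of `intercept = TRUE` in the blueprint is only an example setting.

```r
library(tidymodels)

# Print the argument list without calling the function:
# x, formula, ..., and blueprint (which defaults to NULL).
args(workflows::add_formula)

# Typical call: only the workflow and the formula are supplied.
wf <- workflow() |>
  add_formula(mpg ~ wt + hp)

# Optional: pass a hardhat blueprint, here one that adds an intercept column.
bp <- hardhat::default_formula_blueprint(intercept = TRUE)
wf_bp <- workflow() |>
  add_formula(mpg ~ wt + hp, blueprint = bp)
```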
The x argument must be a workflow object, usually piped in from workflow(). The formula argument is any valid model formula; the left side is the outcome and the right side is the predictor expression. The blueprint argument takes a hardhat blueprint, which controls indicator (dummy variable) coding, intercept handling, and whether novel factor levels are allowed at prediction time. When blueprint is left as NULL, workflows uses hardhat::default_formula_blueprint() with settings chosen to suit the model in the workflow, so you rarely need to override it.
add_formula() returns a new workflow. The function is pure and does not mutate its input. Assign the result back to a variable or chain it into another pipe step.
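The point about not mutating the input can be checked directly. A small sketch, assuming tidymodels is installed; `wf_empty`, `wf_formula`, and `empty_result` are names invented for this illustration.

```r
library(tidymodels)

wf_empty <- workflow()

# add_formula() returns a modified copy; wf_empty itself is untouched.
wf_formula <- add_formula(wf_empty, mpg ~ wt + hp)

# The new object carries the formula.
extract_preprocessor(wf_formula)

# The original still has no preprocessor, so extraction fails.
empty_result <- tryCatch(
  extract_preprocessor(wf_empty),
  error = function(e) "no preprocessor"
)
empty_result
```

Forgetting to assign the result is therefore a silent no-op: the formula is simply not there when you later call fit().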
workflows::add_formula() is cheap because nothing is computed yet. The formula is held alongside the model spec until the workflow is fit. To preview the expansion, call model.matrix(formula, data) directly, outside the workflow.

add_formula() is one of three preprocessor verbs in the workflows package. Picking the right verb upfront avoids restructuring later.
| Verb | Preprocessor input | When to use |
|---|---|---|
| add_formula() | A model formula y ~ x1 + x2 | Lightweight model, base R formula expansion is enough |
| add_recipe() | A recipe() with explicit steps | Need imputation, scaling, or custom transformations |
| add_variables() | Outcome and predictor column names | Engine refuses formulas (XGBoost matrix input) |
Reach for add_formula() when the formula syntax is enough and you do not need preprocessing steps to be re-estimated on each resample.
Build workflows with add_formula(): four examples
Every example below uses the built-in mtcars and airquality datasets so the focus stays on the formula plumbing rather than on the data.
Example 1: A two-predictor linear regression
The smallest useful workflow has a formula and a model. This is the form most tidymodels tutorials open with.
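A minimal sketch of such a workflow, assuming tidymodels is installed. The name `wf_lin` follows the surrounding text; `fit_lin` is an illustrative name, and the choice of linear_reg() with the "lm" engine is an assumption consistent with the discussion.

```r
library(tidymodels)

# Formula preprocessor plus a parsnip linear regression spec.
wf_lin <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm"))

# Printing shows the preprocessor and the model; nothing is fitted yet.
wf_lin

# Fit on the full data and predict for the first three cars.
fit_lin <- fit(wf_lin, data = mtcars)
predict(fit_lin, new_data = head(mtcars, 3))
```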
The workflow now holds both the formula and the parsnip spec. Calling fit(wf_lin, data = mtcars) expands mpg ~ wt + hp against the training data, hands the result to lm(), and returns a fitted workflow you can pass to predict(). For resampling or tuning, pass the unfitted wf_lin to tune::fit_resamples() or tune::tune_grid() instead.
Example 2: All-other-columns shorthand and dot expansion
Formulas accept the . shorthand for "every other column in data". This works inside a workflow the same way it does in lm().
The . is expanded at fit time using the column names present in the training data. New data passed to predict() must contain those same columns; missing predictors raise an error from hardhat::forge(). This is one of the rare moments the workflow surfaces a base R behaviour rather than smoothing it over.
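The dot shorthand and the missing-column behaviour can be sketched as below, assuming tidymodels is installed. The names `wf_dot`, `fit_dot`, `incomplete`, and `res` are illustrative, and the exact wording of the error message is not reproduced here because it may differ between hardhat versions.

```r
library(tidymodels)

# Use every column other than mpg as a predictor.
wf_dot <- workflow() |>
  add_formula(mpg ~ .) |>
  add_model(linear_reg() |> set_engine("lm"))

fit_dot <- fit(wf_dot, data = mtcars)
predict(fit_dot, new_data = head(mtcars, 3))

# Drop one predictor from the new data to see the failure mode.
incomplete <- head(mtcars, 3)[, setdiff(names(mtcars), "hp")]
res <- tryCatch(
  predict(fit_dot, new_data = incomplete),
  error = function(e) conditionMessage(e)
)
res
```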
Example 3: Interactions and polynomial terms
Formula syntax expands interactions and polynomial expressions in place. You get the same x * z and poly(x, 2) shortcuts that base R supplies.
The wt * hp term added the two main effects plus the wt:hp interaction without you naming them. The poly(disp, 2) term created two orthogonal polynomial columns. Both expansions happen inside model.matrix() when the workflow is fit, so new data only needs the raw wt, hp, and disp columns at predict time.
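A sketch of a workflow matching that description, assuming tidymodels is installed. The formula mpg ~ wt * hp + poly(disp, 2) is reconstructed from the terms named in the text, and `wf_int`, `fit_int`, and `new_cars` are illustrative names.

```r
library(tidymodels)

# Interaction via * and a degree-2 polynomial via poly().
wf_int <- workflow() |>
  add_formula(mpg ~ wt * hp + poly(disp, 2)) |>
  add_model(linear_reg() |> set_engine("lm"))

fit_int <- fit(wf_int, data = mtcars)

# One row per coefficient, including the expanded terms.
tidy(fit_int)

# New data only needs the raw columns that appear in the formula.
new_cars <- mtcars[1:3, c("wt", "hp", "disp")]
predict(fit_int, new_data = new_cars)
```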
Example 4: When add_formula() is not enough
Formula transformations like log() are evaluated on the data passed in, not re-estimated. That is fine for fixed transformations and wrong for anything that learns from the data.
For learned transformations, swap add_formula() for add_recipe() and use step_normalize() so the standardization is re-estimated on the training half of every resample. The formula path is correct for log, sqrt, poly, and similar deterministic expressions.
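The recipe-based alternative recommended above might look like the following sketch. It assumes tidymodels is installed; the predictors Solar.R and Wind, the five folds, the seed, and the names `aq`, `rec`, `wf_rec`, `folds`, and `res` are all choices made for this illustration.

```r
library(tidymodels)

# Drop incomplete rows so the example focuses on normalization.
aq <- na.omit(airquality)

# step_normalize() learns the mean and sd when the recipe is prepped.
rec <- recipe(Ozone ~ Solar.R + Wind, data = aq) |>
  step_normalize(all_numeric_predictors())

wf_rec <- workflow() |>
  add_recipe(rec) |>
  add_model(linear_reg() |> set_engine("lm"))

# Each resample preps the recipe on its own analysis set.
set.seed(123)
folds <- vfold_cv(aq, v = 5)
res <- fit_resamples(wf_rec, resamples = folds)

collect_metrics(res)
```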
Common pitfalls
A workflow holds exactly one preprocessor and one model. Watch for the following while learning add_formula().
scale() leak across resamples. A formula such as Ozone ~ scale(Solar.R) will fit, but the mean and standard deviation are computed on whatever data the current fit() call receives. Under cross-validation that means summary statistics drift from fold to fold and can mix train and test rows. Move the standardization into a recipe with step_normalize() whenever the transform learns from the data.

Try it yourself
Try it: Build a workflow that uses workflows::add_formula() to fit a linear regression of Petal.Length on Sepal.Length, Sepal.Width, and their interaction in the iris dataset. Save the fitted workflow to ex_fit.
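One possible solution is sketched here, assuming tidymodels is installed and that a linear_reg() spec with the "lm" engine is an acceptable model choice for the exercise.

```r
library(tidymodels)

# Sepal.Length * Sepal.Width covers both main effects and the interaction.
ex_fit <- workflow() |>
  add_formula(Petal.Length ~ Sepal.Length * Sepal.Width) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = iris)

# Inspect the estimated coefficients.
tidy(ex_fit)
```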
Explanation: The Sepal.Length * Sepal.Width term expands inside model.matrix() into the two main effects plus their interaction, exactly as lm() would. Wrapping the formula in add_formula() lets you reuse the same workflow object for resampling or tuning later, without redeclaring the formula.
Related workflows functions
add_formula() is one verb in a small family. These functions appear next to it in almost every tidymodels script.
- workflow() creates the empty workflow that add_formula() updates.
- add_model() attaches the parsnip spec that pairs with the formula.
- add_recipe() swaps the formula for a full recipes preprocessor.
- add_variables() attaches bare column names when even a formula is too much structure.
- update_formula() replaces the formula inside an existing workflow.
- remove_formula() detaches the formula entirely.
See the workflows package reference for the full verb family.
FAQ
Does add_formula() do any preprocessing?
No. workflows::add_formula() only stores the formula in the workflow's preprocessor slot. The formula is expanded into a design matrix later, inside fit(), by hardhat::mold(). That expansion runs whatever inline functions appear in the formula, such as log() or poly(), but it does not learn parameters from the data. For learned preprocessing like normalization or imputation, use add_recipe() so the transformations are re-estimated on each resample.
Can I use add_formula() and add_recipe() in the same workflow?
No. A workflow holds exactly one preprocessor. Calling both add_formula() and add_recipe() on the same workflow raises an error. Choose add_formula() when a base R formula is enough and the engine accepts formulas. Choose add_recipe() when the preprocessing needs explicit steps that must be re-estimated on each fold of cross-validation.
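The one-preprocessor rule can be seen in a short sketch, assuming tidymodels is installed. The names `wf`, `rec`, and `msg` are illustrative, and the error text is captured rather than quoted because its wording may vary by version.

```r
library(tidymodels)

wf <- workflow() |>
  add_formula(mpg ~ wt + hp)

rec <- recipe(mpg ~ wt + hp, data = mtcars)

# Adding a recipe on top of a formula is rejected.
msg <- tryCatch(
  {
    add_recipe(wf, rec)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

To switch preprocessors, call remove_formula() first and then add_recipe().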
How is add_formula() different from writing the formula directly in lm()?
The formula is identical, but the workflow keeps it bundled with the model spec and any tuning grid. That means resampling, tuning, and prediction all reuse the same formula without you typing it again. The workflow object also makes the pipeline easier to swap engines on; flipping set_engine("lm") to set_engine("glmnet") does not touch the formula at all.
Why does scale() inside a formula leak across resamples?
scale() computes the mean and standard deviation of its argument at the moment the formula is evaluated. Inline transformations cannot remember what they saw on a previous fold, so the statistics drift fold by fold. A recipe step like step_normalize() records training-fold statistics inside the workflow, so prediction on the held-out fold uses the matching values.
Can I update the formula after adding it?
Yes. Use update_formula(wf, new_formula). It replaces the formula in place without touching the model spec or refitting anything. Calling add_formula() twice on the same workflow fails because the preprocessor slot is already filled.
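A brief sketch of updating versus re-adding, assuming tidymodels is installed; `wf`, `wf2`, `wf3`, and `dup` are illustrative names.

```r
library(tidymodels)

wf <- workflow() |>
  add_formula(mpg ~ wt) |>
  add_model(linear_reg() |> set_engine("lm"))

# update_formula() swaps the stored formula and keeps the model spec.
wf2 <- update_formula(wf, mpg ~ wt + hp + disp)
extract_preprocessor(wf2)

# A second add_formula() on the same workflow is rejected.
dup <- tryCatch(
  {
    add_formula(wf, mpg ~ hp)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
dup

# remove_formula() clears the slot so another preprocessor can be added.
wf3 <- remove_formula(wf2)
```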