recipes step_impute_bag() in R: Bagged Tree Imputation

The recipes step_impute_bag() function fills missing values in R by training bagged decision trees on the remaining variables, making it a strong model-based choice when your data mixes numeric and categorical columns.

By Selva Prabhakaran · Published May 19, 2026 · Last updated May 19, 2026

⚡ Quick Answer

step_impute_bag(rec, Ozone)                            # impute one column
step_impute_bag(rec, Ozone, Solar.R)                   # impute several
step_impute_bag(rec, all_numeric_predictors())         # all numeric predictors
step_impute_bag(rec, Ozone, trees = 50)                # more trees, steadier
step_impute_bag(rec, Ozone, impute_with = imp_vars(Wind, Temp))  # pick predictors
step_impute_bag(rec, Ozone, seed_val = 42)             # reproducible result

Need explanation? Read on for examples and pitfalls.

What step_impute_bag() does

step_impute_bag() imputes missing values with an ensemble of bagged decision trees. It is a step in the recipes package, the preprocessing engine behind tidymodels. For every column you select, it trains a bagged CART model that uses the other variables as predictors, then predicts the missing entries from that model.

Because trees handle both numeric and categorical predictors and capture nonlinear relationships, bagged tree imputation is more flexible than mean, median, or linear imputation. The cost is speed: each column gets its own ensemble, so the step is slower than simpler alternatives.

Key Insight

Bagged tree imputation treats every gap as a prediction problem. Instead of substituting one fixed number, it learns a model from the rows that are complete, then predicts the missing cell from each row's other values. That is why it adapts to interactions a single mean could never capture.
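
To make the idea concrete, here is a minimal sketch of the same kind of imputation done by hand with ipred::ipredbagg(). It is an illustration under assumptions, not the code recipes runs internally: the seed, the choice of predictor columns, and the names obs, X, fit, and filled are all made up for this example.

```r
library(ipred)

# Rows where Ozone is observed form the training set (illustrative split)
obs <- !is.na(airquality$Ozone)

# Predictor columns chosen for this sketch
X <- airquality[, c("Solar.R", "Wind", "Temp", "Month", "Day")]

# Fit 25 bagged regression trees on the rows with a known Ozone value
set.seed(42)
fit <- ipredbagg(y = as.numeric(airquality$Ozone[obs]), X = X[obs, ], nbagg = 25)

# Predict Ozone for the rows where it is missing and fill the gaps
filled <- airquality
filled$Ozone[!obs] <- predict(fit, newdata = X[!obs, ])

sum(is.na(filled$Ozone))
```

Each filled value comes from that row's own Solar.R, Wind, Temp, Month, and Day, which is the difference from a constant fill.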

Syntax and arguments

The step takes a recipe, one or more column selectors, and arguments that control how the trees are built. You normally add it to a recipe with the native pipe (|>, available in R 4.1 and later).

R: Basic step_impute_bag() signature

step_impute_bag(
  recipe,
  ...,
  impute_with = imp_vars(all_predictors()),
  trees = 25,
  options = list(keepX = FALSE),
  seed_val = sample.int(10^4, 1)
)

Argument	Purpose
`...`	Columns to impute, chosen with tidyselect helpers like `all_numeric_predictors()`.
`impute_with`	Predictors used to build each model, wrapped in `imp_vars()`. Defaults to all other predictors.
`trees`	Number of bagged trees per column. More trees give steadier imputations.
`options`	List of extra arguments passed to `ipred::ipredbagg()`. The step already sets `nbagg` (from `trees`) and `keepX`; do not pass `X` or `y` here.
`seed_val`	Integer seed so the bootstrap sampling is reproducible.
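
As a quick check of what seed_val buys you, the sketch below runs the same recipe twice with the same seed and compares the imputed column. The helper impute_ozone() is hypothetical and exists only for this example, and trees = 10 is an arbitrary choice to keep it fast.

```r
library(recipes)

# Hypothetical helper for this sketch: build, prep, and bake in one go
impute_ozone <- function(seed) {
  recipe(Temp ~ ., data = airquality) |>
    step_impute_bag(Ozone, trees = 10, seed_val = seed) |>
    prep(training = airquality) |>
    bake(new_data = airquality)
}

run_a <- impute_ozone(42)
run_b <- impute_ozone(42)

# Same seed_val and same data: compare the imputed column across runs
identical(run_a$Ozone, run_b$Ozone)
```

Leaving seed_val at its default draws a new seed each time the step is created, so two otherwise identical recipes can give slightly different imputations.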

Warning

step_impute_bag() requires the ipred package. The step fits its trees with ipred::ipredbagg(), so it stops with an error saying ipred is needed if the package is not installed. Run install.packages("ipred") once before using the step.

step_impute_bag() examples

These examples use airquality, a built-in dataset with real missing values. Its Ozone and Solar.R columns contain NA entries, which makes it ideal for imputation practice.

R: Load recipes and inspect missing data

library(recipes)

colSums(is.na(airquality))
#>   Ozone Solar.R    Wind    Temp   Month     Day
#>      37       7       0       0       0       0

Add step_impute_bag() to a recipe, then prep() learns the tree models and bake() applies them.

R: Impute Ozone and Solar.R with bagged trees

rec <- recipe(Temp ~ ., data = airquality) |>
  step_impute_bag(Ozone, Solar.R)

prepped <- prep(rec, training = airquality)
imputed <- bake(prepped, new_data = airquality)

# bake() returns predictors first and the outcome (Temp) last
colSums(is.na(imputed))
#>   Ozone Solar.R    Wind   Month     Day    Temp
#>       0       0       0       0       0       0
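
The prep() and bake() split matters most when the data you impute is not the data you trained on. The following sketch assumes an arbitrary split of airquality into the first 100 rows and the rest; the names train_aq, new_aq, prepped_split, and new_imputed are illustrative only.

```r
library(recipes)

# Illustrative split: first 100 rows for training, the rest as "new" data
train_aq <- airquality[1:100, ]
new_aq   <- airquality[101:153, ]

# prep() fits the bagged trees on the training rows only
prepped_split <- recipe(Temp ~ ., data = train_aq) |>
  step_impute_bag(Ozone, Solar.R, seed_val = 42) |>
  prep(training = train_aq)

# bake() reuses those fitted trees to fill gaps in the unseen rows
new_imputed <- bake(prepped_split, new_data = new_aq)

colSums(is.na(new_imputed))
```

Because the trees are fitted once in prep(), nothing is learned from new_aq, which keeps information from held-out rows out of the imputation models.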

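Use impute_with and imp_vars() to restrict which variables build the model. This helps when some predictors are noisy or leak information. The example below includes Temp, the recipe's outcome, among the imputation predictors, so Temp must also be available in any data you later bake.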

R: Choose predictors and tune the tree count

rec_tuned <- recipe(Temp ~ ., data = airquality) |>
  step_impute_bag(
    Ozone, Solar.R,
    impute_with = imp_vars(Wind, Temp, Month),
    trees = 50,
    seed_val = 42
  )

prepped_tuned <- prep(rec_tuned, training = airquality)

sum(is.na(bake(prepped_tuned, new_data = airquality)$Ozone))
#> [1] 0

Call tidy() on the prepped recipe to confirm which columns were imputed.

R: Inspect the fitted imputation models

# The id suffix is randomly generated, so yours will differ
tidy(prepped, number = 1)
#> # A tibble: 2 x 3
#>   terms   model     id
#>   <chr>   <list>    <chr>
#> 1 Ozone   <regbagg> impute_bag_a1b2c
#> 2 Solar.R <regbagg> impute_bag_a1b2c

step_impute_bag() vs other imputation steps

Choose the step that matches your data types and speed budget. Bagged trees are the most general option, but simpler steps win when relationships are linear or you only need a fast baseline.

Step	Method	Handles	Speed	Best for
`step_impute_bag()`	Bagged CART trees	Numeric + nominal	Slow	Mixed types, nonlinear patterns
`step_impute_knn()`	K nearest neighbors	Numeric + nominal	Medium	Local similarity between rows
`step_impute_linear()`	Linear regression	Numeric only	Fast	Strong linear relationships
`step_impute_mean()`	Column mean	Numeric only	Fastest	Quick numeric baseline
`step_impute_mode()`	Most frequent value	Nominal only	Fastest	Quick categorical baseline

Decision rule: reach for step_impute_bag() when columns interact in nonlinear ways or you must impute factors and numbers in one step. If every column is numeric and you need speed, step_impute_knn() or step_impute_linear() are lighter.
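
One way to see the difference between a constant fill and a model-based fill is to count the distinct values each step writes into the gaps. This is a rough sketch, not a benchmark: fill_ozone() is a hypothetical helper, and the seed is arbitrary.

```r
library(recipes)

missing_ozone <- is.na(airquality$Ozone)

# Hypothetical helper: prep and bake a recipe, return the filled-in Ozone values
fill_ozone <- function(rec) {
  baked <- bake(prep(rec, training = airquality), new_data = airquality)
  baked$Ozone[missing_ozone]
}

base_rec <- recipe(Temp ~ ., data = airquality)

mean_vals <- fill_ozone(step_impute_mean(base_rec, Ozone))
bag_vals  <- fill_ozone(step_impute_bag(base_rec, Ozone, seed_val = 42))

length(unique(mean_vals))  # mean imputation writes one repeated value
length(unique(bag_vals))   # bagged trees write row-specific values
```

More distinct values does not by itself mean better imputations; it only shows that the bagged step conditions on each row's other columns.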

Note

Coming from Python scikit-learn? The closest equivalent is IterativeImputer with a tree-based estimator from sklearn.ensemble, such as BaggingRegressor or ExtraTreesRegressor. Like step_impute_bag(), it predicts each missing column from the others rather than filling a constant.

Common pitfalls

Three mistakes account for most step_impute_bag() trouble. Each has a clear fix.

Missing the ipred dependency. The step stops with an error saying ipred is needed when the package is absent. Install it once with install.packages("ipred").
Imputing the outcome variable. Selecting your response column leaks model targets into preprocessing. Restrict selectors to predictors, for example step_impute_bag(all_numeric_predictors()), so the outcome is never touched.
Expecting it to be fast. Twenty-five trees per column is heavy on wide or large data. Lower trees, narrow impute_with, or switch to step_impute_knn() if prep() runs too long.
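
For the second pitfall, a quick way to confirm the outcome is left alone is to list the columns the trained step will impute. The sketch below assumes the same airquality recipe used earlier; rec_safe, prepped_safe, and imputed_cols are illustrative names, and trees = 10 is chosen only to keep prep() short.

```r
library(recipes)

rec_safe <- recipe(Temp ~ ., data = airquality) |>
  step_impute_bag(all_numeric_predictors(), trees = 10, seed_val = 42)

prepped_safe <- prep(rec_safe, training = airquality)

# Columns the trained step will impute
imputed_cols <- tidy(prepped_safe, number = 1)$terms
imputed_cols

# The outcome should not be among them
"Temp" %in% imputed_cols
```

The same check is useful after changing selectors, since a broad selector can pull in more columns than intended and lengthen prep().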

Try it yourself

Try it: Impute only the Ozone column of airquality using Wind and Temp as predictors, with 30 trees. Save the baked result to ex_imputed.

R: Your turn: impute Ozone with bagged trees

# Try it: impute Ozone only
ex_rec <- # your code here
ex_imputed <- # prep then bake

sum(is.na(ex_imputed$Ozone))
#> Expected: 0

R: Solution

ex_rec <- recipe(Temp ~ ., data = airquality) |>
  step_impute_bag(Ozone, impute_with = imp_vars(Wind, Temp), trees = 30)

ex_imputed <- bake(prep(ex_rec, training = airquality), new_data = airquality)

sum(is.na(ex_imputed$Ozone))
#> [1] 0

Explanation: imp_vars() limits the predictors to Wind and Temp, trees = 30 sets the ensemble size, and prep() then bake() learns and applies the imputation models.

Related recipes steps

These steps are alternatives to, or simpler baselines for, bagged tree imputation in a recipe.

step_impute_knn() for distance-based imputation of mixed columns.
step_impute_linear() for fast linear-model imputation of numeric columns.
step_impute_mean() for a quick numeric baseline fill.
step_impute_median() for a skew-resistant central-tendency fill.
step_impute_mode() for imputing categorical columns by frequency.

FAQ

How does step_impute_bag() handle missing values? For each selected column, the step trains a bagged ensemble of decision trees using the other predictors as inputs and the complete rows as training data. It then predicts the missing cells from each row's observed values. Because the model learns from the data, the imputed values reflect relationships between variables rather than a single constant.

Does step_impute_bag() work on categorical variables? Yes. Bagged CART trees handle both numeric and nominal outcomes, so the step can impute factor columns as well as numeric ones. This is its main advantage over step_impute_linear() and step_impute_mean(), which are numeric-only. The same imp_vars() predictors can mix numeric and categorical inputs.
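
A small sketch of factor imputation, using iris because airquality has no categorical columns. The missing values are introduced artificially for this example; the seeds, the number of blanked rows, and the names iris_na, rec_fct, and iris_filled are all assumptions made for illustration.

```r
library(recipes)

# Blank out 15 Species values at random to create something to impute
set.seed(1)
iris_na <- iris
iris_na$Species[sample(nrow(iris_na), 15)] <- NA

rec_fct <- recipe(Sepal.Length ~ ., data = iris_na) |>
  step_impute_bag(Species, seed_val = 42)

iris_filled <- bake(prep(rec_fct, training = iris_na), new_data = iris_na)

sum(is.na(iris_filled$Species))  # remaining gaps in the factor column
class(iris_filled$Species)       # column type after imputation
```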

How many trees should I use in step_impute_bag()? The default of 25 trees is a reasonable starting point. More trees produce steadier, lower-variance imputations but increase prep() time roughly linearly. For large datasets, start low and raise trees only if imputed values look unstable across runs. Set seed_val so results are reproducible while you compare settings.

Why do I get an error saying ipred is required? step_impute_bag() builds its trees with ipred::ipredbagg(), so the ipred package must be installed. Run install.packages("ipred") once, then re-run prep(). Unlike mean or median imputation, the bagged step has this extra dependency because it fits a real model.

Is step_impute_bag() better than mean imputation? It is usually more accurate because it conditions on other variables instead of using one fixed number, which preserves relationships in the data. The trade-off is speed and an extra dependency. For exploratory work or very large data, step_impute_mean() is faster; for modeling pipelines where imputation quality matters, bagged trees are the stronger default.
