recipes step_impute_bag() in R: Bagged Tree Imputation
The recipes step_impute_bag() function fills missing values in R by training bagged decision trees on the remaining variables, making it a strong model-based choice when your data mixes numeric and categorical columns.
step_impute_bag(rec, Ozone)                                      # impute one column
step_impute_bag(rec, Ozone, Solar.R)                             # impute several
step_impute_bag(rec, all_numeric_predictors())                   # all numeric predictors
step_impute_bag(rec, Ozone, trees = 50)                          # more trees, steadier
step_impute_bag(rec, Ozone, impute_with = imp_vars(Wind, Temp))  # pick predictors
step_impute_bag(rec, Ozone, seed_val = 42)                       # reproducible result
Need explanation? Read on for examples and pitfalls.
What step_impute_bag() does
step_impute_bag() imputes missing values with an ensemble of bagged decision trees. It is a step in the recipes package, the preprocessing engine behind tidymodels. For every column you select, it trains a bagged CART model that uses the other variables as predictors, then predicts the missing entries from that model.
Because trees handle both numeric and categorical predictors and capture nonlinear relationships, bagged tree imputation is more flexible than mean, median, or linear imputation. The cost is speed: each column gets its own ensemble, so the step is slower than simpler alternatives.
Syntax and arguments
The step takes a recipe, column selectors, and tuning arguments. You add it to a recipe with the native pipe and tune how the trees are built.
| Argument | Purpose |
|---|---|
| ... | Columns to impute, chosen with tidyselect helpers like all_numeric_predictors(). |
| impute_with | Predictors used to build each model, wrapped in imp_vars(). Defaults to all other predictors. |
| trees | Number of bagged trees per column (default 25). More trees give steadier imputations. |
| options | List of extra arguments passed to ipred::ipredbagg(), for example nbagg or keepX. |
| seed_val | Integer seed so the bootstrap sampling is reproducible. |
Note: step_impute_bag() calls ipred::ipredbagg() under the hood, so prep() fails with an error saying the ipred package is required if ipred is not installed. Run install.packages("ipred") once before using the step.
step_impute_bag() examples
These examples use airquality, a built-in dataset with real missing values. Its Ozone and Solar.R columns contain NA entries, which makes it ideal for imputation practice.
Add step_impute_bag() to a recipe, then prep() learns the tree models and bake() applies them.
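A minimal sketch of that prep-then-bake flow is shown here. It assumes the recipes and ipred packages are installed, and it treats Temp as the outcome purely for illustration, so that Ozone and Solar.R count as predictors. The object names (rec, rec_prepped, imputed) are placeholders.

```r
library(recipes)

# Assumption for illustration: Temp is the outcome, so Ozone and Solar.R
# are predictors and can be selected for imputation.
rec <- recipe(Temp ~ ., data = airquality) |>
  step_impute_bag(Ozone, Solar.R, seed_val = 42)

# prep() fits one bagged tree model per selected column
rec_prepped <- prep(rec, training = airquality)

# bake(new_data = NULL) returns the processed training data
imputed <- bake(rec_prepped, new_data = NULL)

# Compare missing-value counts before and after imputation
colSums(is.na(airquality[, c("Ozone", "Solar.R")]))
colSums(is.na(imputed[, c("Ozone", "Solar.R")]))
```

To impute new data with the models learned on the training set, pass that data to bake() through new_data instead of NULL.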
Use impute_with and imp_vars() to restrict which variables build the model. This helps when some predictors are noisy or leak information.
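As a hedged sketch of restricting the predictors, the recipe below imputes Ozone from Wind and Temp only. It uses a formula with no outcome (~ .) so that every column is a predictor, which is an assumption made for this example. The object names are placeholders.

```r
library(recipes)

# All columns are predictors here; only Wind and Temp feed the Ozone model
rec_restricted <- recipe(~ ., data = airquality) |>
  step_impute_bag(Ozone, impute_with = imp_vars(Wind, Temp), seed_val = 42)

imputed_restricted <- rec_restricted |>
  prep() |>
  bake(new_data = NULL)

# Ozone is filled in; Solar.R was not selected, so its NAs remain
sum(is.na(imputed_restricted$Ozone))
sum(is.na(imputed_restricted$Solar.R))
```

Wind and Temp have no missing values in airquality, so every row with a missing Ozone value has complete inputs for its prediction.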
Call tidy() on the prepped recipe to confirm which columns were imputed.
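The check might look like the sketch below. It rebuilds a small prepped recipe so the example is self-contained, again treating Temp as the outcome as an illustrative assumption. The step is expected to report one row per imputed column, with the column name in terms.

```r
library(recipes)

rec_prepped <- recipe(Temp ~ ., data = airquality) |>
  step_impute_bag(Ozone, Solar.R, seed_val = 42) |>
  prep()

# Overview: one row per step in the recipe
tidy(rec_prepped)

# Details for the first (and only) step: one row per imputed column
bag_info <- tidy(rec_prepped, number = 1)
bag_info$terms
```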
step_impute_bag() vs other imputation steps
Choose the step that matches your data types and speed budget. Bagged trees are the most general option, but simpler steps win when relationships are linear or you only need a fast baseline.
| Step | Method | Handles | Speed | Best for |
|---|---|---|---|---|
| step_impute_bag() | Bagged CART trees | Numeric + nominal | Slow | Mixed types, nonlinear patterns |
| step_impute_knn() | K nearest neighbors | Numeric + nominal | Medium | Local similarity between rows |
| step_impute_linear() | Linear regression | Numeric only | Fast | Strong linear relationships |
| step_impute_mean() | Column mean | Numeric only | Fastest | Quick numeric baseline |
| step_impute_mode() | Most frequent value | Nominal only | Fastest | Quick categorical baseline |
Decision rule: reach for step_impute_bag() when columns interact in nonlinear ways or you must impute factors and numbers in one step. If every column is numeric and you need speed, step_impute_knn() or step_impute_linear() are lighter.
In Python, the closest equivalent is scikit-learn's IterativeImputer with a tree estimator, or sklearn.ensemble-based imputation. Both predict each missing column from the others rather than filling a constant.
Common pitfalls
Three mistakes account for most step_impute_bag() trouble. Each has a clear fix.
- Missing the ipred dependency. prep() aborts with package "ipred" is required when the package is absent. Install it once with install.packages("ipred").
- Imputing the outcome variable. Selecting your response column leaks model targets into preprocessing. Restrict selectors to predictors, for example step_impute_bag(all_numeric_predictors()), so the outcome is never touched.
- Expecting it to be fast. Twenty-five trees per column is heavy on wide or large data. Lower trees, narrow impute_with, or switch to step_impute_knn() if prep() runs too long.
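The last two fixes can be combined in one recipe, sketched below under the assumption that Ozone is the modeling outcome. The selector all_numeric_predictors() skips the outcome, and trees = 10 is an arbitrary lower value chosen only to illustrate the speed trade-off.

```r
library(recipes)

# Assumption for illustration: Ozone is the outcome and must not be imputed
rec_safe <- recipe(Ozone ~ ., data = airquality) |>
  step_impute_bag(all_numeric_predictors(), trees = 10, seed_val = 42)

baked_safe <- rec_safe |>
  prep() |>
  bake(new_data = NULL)

# Predictor NAs are filled; outcome NAs are left exactly as they were
sum(is.na(baked_safe$Solar.R))
sum(is.na(baked_safe$Ozone))
```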
Try it yourself
Try it: Impute only the Ozone column of airquality using Wind and Temp as predictors, with 30 trees. Save the baked result to ex_imputed.
Solution
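One possible solution is sketched here, not necessarily the original answer. It assumes recipes and ipred are installed and uses a formula with no outcome so that Ozone, Wind, and Temp are all predictors.

```r
library(recipes)

ex_imputed <- recipe(~ ., data = airquality) |>
  step_impute_bag(Ozone, impute_with = imp_vars(Wind, Temp), trees = 30) |>
  prep() |>
  bake(new_data = NULL)

# Confirm that no missing Ozone values remain
sum(is.na(ex_imputed$Ozone))
```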
Explanation: imp_vars() limits the predictors to Wind and Temp, trees = 30 sets the ensemble size, and prep() then bake() learns and applies the imputation models.
Related recipes steps
These steps pair naturally with bagged tree imputation in a recipe.
- step_impute_knn() for distance-based imputation of mixed columns.
- step_impute_linear() for fast linear-model imputation of numeric columns.
- step_impute_mean() for a quick numeric baseline fill.
- step_impute_median() for a skew-resistant central-tendency fill.
- step_impute_mode() for imputing categorical columns by frequency.
FAQ
How does step_impute_bag() handle missing values? For each selected column, the step trains a bagged ensemble of decision trees using the other predictors as inputs and the complete rows as training data. It then predicts the missing cells from each row's observed values. Because the model learns from the data, the imputed values reflect relationships between variables rather than a single constant.
Does step_impute_bag() work on categorical variables? Yes. Bagged CART trees handle both numeric and nominal outcomes, so the step can impute factor columns as well as numeric ones. This is its main advantage over step_impute_linear() and step_impute_mean(), which are numeric-only. The same imp_vars() predictors can mix numeric and categorical inputs.
How many trees should I use in step_impute_bag()? The default of 25 trees is a reasonable starting point. More trees produce steadier, lower-variance imputations but increase prep() time roughly linearly. For large datasets, start low and raise trees only if imputed values look unstable across runs. Set seed_val so results are reproducible while you compare settings.
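A quick way to check reproducibility is to run the same recipe twice with the same seed_val and compare the results. The helper impute_ozone() below is a hypothetical function written for this sketch, not part of recipes.

```r
library(recipes)

# Hypothetical helper: impute Ozone with a given seed and tree count
impute_ozone <- function(seed, trees = 10) {
  baked <- recipe(~ ., data = airquality) |>
    step_impute_bag(
      Ozone,
      impute_with = imp_vars(Wind, Temp),
      trees = trees,
      seed_val = seed
    ) |>
    prep() |>
    bake(new_data = NULL)
  baked$Ozone
}

run_a <- impute_ozone(seed = 42)
run_b <- impute_ozone(seed = 42)

# The same seed should give the same imputed values
identical(run_a, run_b)
```

Calling the helper with different seeds and comparing the imputed values is one informal way to judge whether the current trees setting is stable enough.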
Why do I get a "package ipred is required" error? step_impute_bag() builds its trees with ipred::ipredbagg(), so the ipred package must be installed. Run install.packages("ipred") once, then re-run prep(). Unlike mean or median imputation, the bagged step has this extra dependency because it fits a real model.
Is step_impute_bag() better than mean imputation? It is usually more accurate because it conditions on other variables instead of using one fixed number, which preserves relationships in the data. The trade-off is speed and an extra dependency. For exploratory work or very large data, step_impute_mean() is faster; for modeling pipelines where imputation quality matters, bagged trees are the stronger default.