recipes step_impute_mean() in R: Mean Imputation
recipes step_impute_mean() fills missing numeric values with the column mean, learned from your training data during prep() and applied during bake(). It is the simplest imputation step in the tidymodels recipes package.
step_impute_mean(rec, x) # impute one column
step_impute_mean(rec, x, y) # impute several columns
step_impute_mean(rec, all_numeric_predictors()) # all numeric predictors
step_impute_mean(rec, x, trim = 0.1) # trimmed mean (robust)
prep(rec) |> bake(new_data = NULL) # apply to training data
tidy(prepped, number = 1) # see the learned means
Need explanation? Read on for examples and pitfalls.
What step_impute_mean() does
step_impute_mean() is a recipe step that replaces missing numeric values with the arithmetic mean of each column. It is part of the recipes package, the preprocessing engine of tidymodels. You add it to a recipe object and it becomes one stage in a reproducible feature-engineering pipeline.
The step has two phases. When you call prep(), it computes the mean of each selected column from the training data and stores it. When you call bake(), it uses those stored means to fill NA values in any dataset you pass in. This split is what makes mean imputation in a recipe safe for modelling.
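The two phases can be sketched as follows. This is a minimal illustration, not the only way to split data: the row-based split, the object names (`train`, `test`, `rec`, `prepped`, `test_baked`), and the artificially blanked test value are all assumptions made for the example.

```r
library(recipes)

# Hypothetical split for illustration: first 100 rows train, the rest test
train <- airquality[1:100, ]
test  <- airquality[101:153, ]

# Blank out one test value so there is something to impute (artificial)
test$Solar.R[1] <- NA

rec <- recipe(Ozone ~ ., data = train) |>
  step_impute_mean(Solar.R)

# prep() learns the Solar.R mean from the training rows only
prepped <- prep(rec, training = train)

# bake() reuses that stored mean to fill the gap in the test rows
test_baked <- bake(prepped, new_data = test)

tidy(prepped, number = 1)  # the stored training mean
test_baked$Solar.R[1]      # the value filled into the test row
```

The test rows never contribute to the stored mean, which is the property the prep/bake split is meant to guarantee.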
Syntax and arguments
The function signature is short, and only a few arguments matter for real work. Here is the call with its defaults:
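The signature in the comment below is reproduced from the recipes reference as I understand it, so treat the exact defaults as an assumption. The `args()` call prints what your installed version actually exposes.

```r
library(recipes)

# Signature as documented (assumption: defaults may differ across versions):
# step_impute_mean(
#   recipe, ...,
#   role = NA, trained = FALSE, means = NULL,
#   trim = 0, skip = FALSE,
#   id = rand_id("impute_mean")
# )

# Authoritative check: print the arguments of the installed function
args(step_impute_mean)
```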
The arguments you will actually set:
- recipe: the recipe object you are adding the step to.
- ...: one or more selectors naming the columns to impute, such as Ozone or all_numeric_predictors().
- trim: fraction (0 to 0.5) trimmed from each end of the sorted values before the mean is computed. trim = 0 is the plain mean.
- skip: if TRUE, the step is skipped when baking new data. Leave it FALSE for imputation.
The means argument is filled automatically during prep() and you should not set it by hand.
step_impute_mean() examples
Start by confirming where the missing values are. The built-in airquality dataset has gaps in two columns, which makes it a clean test case.
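One simple way to run this check uses base R only. The name `na_counts` is just a label chosen for this sketch.

```r
# Count missing values per column in the built-in airquality data
na_counts <- colSums(is.na(airquality))
na_counts
```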
Ozone has 37 missing values and Solar.R has 7. Now build a recipe that imputes both columns and inspect the result.
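A minimal sketch of such a recipe is shown here. It assumes every column is treated as a predictor (the `~ .` formula), and the names `rec`, `prepped`, and `baked` are illustrative.

```r
library(recipes)

# Treat all columns as predictors and impute the two columns with gaps
rec <- recipe(~ ., data = airquality) |>
  step_impute_mean(Ozone, Solar.R)

# Learn the means from the training data, then apply them to it
prepped <- prep(rec)
baked <- bake(prepped, new_data = NULL)

# Recount missing values after imputation
colSums(is.na(baked))
```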
Every gap is gone. To see the exact values that were inserted, call tidy() on the prepped recipe with the step number.
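The call might look like the sketch below, which rebuilds the recipe so the block runs on its own. The object names are assumptions, and the exact columns of the returned tibble can vary between recipes versions.

```r
library(recipes)

rec <- recipe(~ ., data = airquality) |>
  step_impute_mean(Ozone, Solar.R)
prepped <- prep(rec)

# number = 1 refers to the first (and here only) step in the recipe
learned <- tidy(prepped, number = 1)
learned
```

The result has one row per imputed column, so two rows here.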
You rarely want to list columns by hand. The all_numeric_predictors() selector imputes every numeric predictor at once, and trim makes the estimate resistant to outliers.
With trim = 0.1, the most extreme 10% of values at each end are dropped before averaging. For a skewed column like Ozone, the trimmed mean sits below the plain mean and resists a few large readings.
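The sketch below combines the selector with `trim = 0.1`. The base R `mean()` calls are included only to illustrate the effect of trimming on Ozone; the values recipes stores may differ slightly, for example through rounding on integer columns. Object names are illustrative.

```r
library(recipes)

# Impute every numeric predictor with a 10% trimmed mean
rec_trim <- recipe(~ ., data = airquality) |>
  step_impute_mean(all_numeric_predictors(), trim = 0.1)

prepped_trim <- prep(rec_trim)
baked_trim <- bake(prepped_trim, new_data = NULL)

# Base R comparison of the plain and trimmed mean of Ozone
plain_mean   <- mean(airquality$Ozone, na.rm = TRUE)
trimmed_mean <- mean(airquality$Ozone, na.rm = TRUE, trim = 0.1)
c(plain = plain_mean, trimmed = trimmed_mean)

# Values learned by the step, one row per selected column
tidy(prepped_trim, number = 1)
```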
bake(new_data = NULL) returns the already-processed training set without re-running prep(), which is faster than baking the original frame again.
step_impute_mean() vs other imputation steps
Mean imputation is the fastest option, but not always the right one. The recipes package ships a family of imputation steps for different data types and distributions.
| Step | Best for | Handles factors? |
|---|---|---|
| step_impute_mean() | Numeric columns, roughly symmetric | No |
| step_impute_median() | Numeric columns, skewed or with outliers | No |
| step_impute_mode() | Categorical or integer columns | Yes |
| step_impute_knn() | Columns where neighbouring rows are informative | Yes |
| step_impute_linear() | Numeric columns with a trend or strong predictor | No |
Decision rule: use step_impute_mean() for quick baselines and symmetric numeric features. Switch to step_impute_median() when a column is skewed, and to step_impute_knn() when you can afford a slower, more accurate fill.
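To see why the rule matters for a skewed column, the sketch below fills Ozone once with the mean and once with the median and compares the inserted values. The object names are assumptions made for illustration.

```r
library(recipes)

# Rows where Ozone is missing in the raw data
na_rows <- is.na(airquality$Ozone)

rec_mean <- recipe(~ ., data = airquality) |>
  step_impute_mean(Ozone)
rec_median <- recipe(~ ., data = airquality) |>
  step_impute_median(Ozone)

baked_mean   <- prep(rec_mean) |> bake(new_data = NULL)
baked_median <- prep(rec_median) |> bake(new_data = NULL)

# The value each step inserted into the first missing row
mean_fill   <- baked_mean$Ozone[na_rows][1]
median_fill <- baked_median$Ozone[na_rows][1]
c(mean = mean_fill, median = median_fill)
```

Because Ozone is right-skewed, the median fill comes out lower than the mean fill.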
Common pitfalls
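The most common surprise is that non-numeric columns are never imputed. step_impute_mean() only handles numeric columns. A selector such as all_numeric_predictors() passes over factors without any message, so their NA values stay untouched. Naming a factor explicitly should instead make prep() stop with a column-type error in current recipes releases.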
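A toy example makes this concrete. The data frame `toy` and its columns `x` and `grp` are made up for illustration, and the sketch uses all_numeric_predictors(), which never selects the factor.

```r
library(recipes)

# Hypothetical toy data: one numeric and one factor column, each with an NA
toy <- data.frame(
  x   = c(1, NA, 3),
  grp = factor(c("a", NA, "b"))
)

rec_toy <- recipe(~ ., data = toy) |>
  step_impute_mean(all_numeric_predictors())

baked_toy <- prep(rec_toy) |> bake(new_data = NULL)
baked_toy
```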
The numeric x is filled with 2, but the factor grp keeps its NA. Use step_impute_mode() for categorical columns.
Two more traps to avoid:
- Imputing before splitting. Always build the recipe on the training split only. Computing the mean on the full dataset leaks test information into the model.
- Imputing the outcome. Selectors like all_predictors() exclude the response, but a bare column name does not. Never impute the variable you are predicting.
If a column is entirely missing, its mean is also NA and the step fills nothing. Drop such columns with step_zv() or remove them before the recipe.
Try it yourself
Try it: Build a recipe on airquality that imputes only the Solar.R column with its mean, then confirm no missing values remain. Save the baked data to ex_imputed.
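One possible solution is sketched below. It assumes all columns are treated as predictors; `ex_rec` is an illustrative name, while `ex_imputed` is the name the exercise asks for.

```r
library(recipes)

# Impute only Solar.R; other columns are left as they are
ex_rec <- recipe(~ ., data = airquality) |>
  step_impute_mean(Solar.R)

ex_imputed <- prep(ex_rec) |> bake(new_data = NULL)

# Confirm that Solar.R has no missing values left
sum(is.na(ex_imputed$Solar.R))
```

Ozone still contains its original gaps, because the recipe did not select it.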
Explanation: The recipe selects only Solar.R, learns its mean during prep(), and bake(new_data = NULL) returns the processed training data with the gaps filled.
Related recipes steps
These steps pair naturally with mean imputation in a preprocessing pipeline.
- step_impute_median(): median fill, robust to skew and outliers.
- step_impute_mode(): most-frequent-value fill for categorical columns.
- step_impute_knn(): imputes from k nearest neighbours.
- step_impute_linear(): imputes from a linear model of other predictors.
- step_normalize(): centre and scale numeric columns after imputing.
FAQ
What is the difference between step_impute_mean() and step_meanimpute()?
They do the same thing. step_meanimpute() was the original name and was deprecated in recipes 0.1.16 in favour of step_impute_mean(). The new name groups all imputation steps under a common step_impute_* prefix. Releases that still ship the old name raise a deprecation warning when it is called, and later releases may drop it altogether, so update calls to the current spelling when you can.
Does step_impute_mean() work on factor or character columns?
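No. step_impute_mean() only operates on numeric columns. Numeric selectors such as all_numeric_predictors() pass over factor and character columns, so their missing values are left in place with no error or warning. Selecting such a column explicitly should raise a type error at prep() in current recipes releases. For categorical data use step_impute_mode(), which fills with the most frequent level.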
When should I use mean versus median imputation?
Use mean imputation when a column is roughly symmetric and free of extreme outliers. Use step_impute_median() when the column is skewed or has outliers, because the median is not pulled toward extreme values. The trim argument of step_impute_mean() is a middle ground that drops the tails before averaging.
Does step_impute_mean() cause data leakage?
Not when used correctly. The mean is computed only during prep(), which you run on the training set. When you bake() validation or test data, recipes reuses the stored training mean. Leakage happens only if you prep the recipe on the full dataset before splitting, so always split first.