caret preProcess() in R: Scale, Center and Impute Data
The preProcess() function in caret estimates a data transformation, such as centering, scaling, or imputation, from a training set. You apply it to any data with predict(), so the same recipe runs consistently on train and test sets.
preProcess(df, method = c("center","scale")) # standardize columns
preProcess(df, method = "range") # scale to 0 to 1
preProcess(df, method = "knnImpute") # impute + standardize
preProcess(df, method = "medianImpute") # fast NA fill
preProcess(df, method = "BoxCox") # correct skew
preProcess(df, method = "nzv") # drop near-zero-var cols
preProcess(df, method = "pca", thresh = 0.95) # reduce dimensions
predict(pp, newdata) # apply the recipe
Need explanation? Read on for examples and pitfalls.
What preProcess() does in one sentence
preProcess() learns a transformation; it does not apply one. You hand it a data frame of predictors and a character vector of methods, and it returns a preProcess object that stores the parameters it estimated, such as each column's mean and standard deviation. The data itself is left untouched; nothing has been transformed yet.
To actually transform data, you call predict() on that object. This two-step design is deliberate. A model that was trained on standardized data must see test data standardized with the same mean and standard deviation, not values recomputed from the test set. Storing the recipe once and replaying it everywhere is what keeps train and test consistent.
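The two-step pattern described above can be sketched as follows. This is a minimal illustration, assuming caret is installed; the 100-row split of iris and the names train_idx, train_x, test_x, train_z, and test_z are made up for the example.

```r
library(caret)

set.seed(1)
train_idx <- sample(nrow(iris), 100)   # hypothetical 100-row training split
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]

# Step 1: estimate. Only parameters are stored; train_x is not modified.
pp <- preProcess(train_x, method = c("center", "scale"))
pp$mean   # column means estimated from the training rows
pp$std    # column standard deviations estimated from the training rows

# Step 2: apply the same stored parameters to both sets.
train_z <- predict(pp, train_x)
test_z  <- predict(pp, test_x)
```

Because test_z is computed from the training means and standard deviations, its columns will generally not have mean 0 and standard deviation 1, which is expected.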
preProcess() syntax and arguments
preProcess() takes a data frame or matrix of predictors plus a vector of method names. Non-numeric columns are skipped during estimation and passed through unchanged by predict(), so you can hand it a mixed data frame without filtering first.
The core signature is short:
preProcess(x, method = c("center", "scale"), ...)
- x: a data frame or matrix of predictors. Numeric columns are processed; others are ignored.
- method: a character vector of transformations. caret applies them in a fixed internal order, not the order you list.
- thresh: cumulative variance kept by "pca" (default 0.95).
- k: number of neighbours used by "knnImpute" (default 5).
- na.remove: drop NA values before estimating parameters (default TRUE).
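To see one of these arguments in action, here is a small sketch that passes thresh alongside method = "pca". It assumes caret is installed; the names pp_pca and pcs are illustrative.

```r
library(caret)

# Keep enough principal components to reach 95% cumulative variance
pp_pca <- preProcess(iris[, 1:4], method = "pca", thresh = 0.95)
pp_pca$numComp                      # number of components retained

pcs <- predict(pp_pca, iris[, 1:4])
head(pcs)                           # columns are named PC1, PC2, ...
```

Raising thresh keeps more components and lowering it keeps fewer, so it is worth checking numComp before relying on a fixed number of columns downstream.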
The method argument accepts many values, grouped by purpose:
- Centering and scaling: "center", "scale", "range"
- Skew correction: "BoxCox", "YeoJohnson", "expoTrans"
- Imputation: "knnImpute", "bagImpute", "medianImpute"
- Filtering: "zv", "nzv", "corr"
- Transformation: "pca", "ica", "spatialSign"
Build the preProcess object from the training set and call predict() with it on every later batch. Re-estimating on the test set leaks test statistics into your evaluation.
preProcess() examples by use case
1. Center and scale (standardize)
The most common recipe subtracts each column's mean and divides by its standard deviation.
The object reports what it estimated but holds no transformed data. Apply it with predict().
Notice the NA in row three. Centering and scaling never fill missing values; they only shift and rescale the ones present. For gaps you need an imputation method.
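A minimal sketch of this recipe follows, assuming caret is installed. The data frame is made up for illustration (the values and the names df, pp, and scaled are not from the original example), with a missing value deliberately placed in row three.

```r
library(caret)

# Toy data (hypothetical values); row three has a missing height
df <- data.frame(
  height = c(150, 160, NA, 180, 170),
  weight = c(50, 60, 70, 90, 80)
)

pp <- preProcess(df, method = c("center", "scale"))
pp         # prints a summary of what was estimated
pp$mean    # stored column means, computed from the non-missing values
pp$std     # stored column standard deviations

scaled <- predict(pp, df)
scaled     # height in row three is still NA
```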
2. Scale to a 0 to 1 range
Use "range" when an algorithm expects bounded inputs, such as neural networks or distance metrics on mixed units.
Every value now lands between 0 and 1, with the smallest observation mapped to 0 and the largest to 1.
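As a rough sketch of range scaling, assuming caret is installed, the example below fits the recipe on part of mtcars and replays it on the rest. The three chosen columns and the 24/8 row split are arbitrary choices for illustration.

```r
library(caret)

x <- mtcars[, c("mpg", "hp", "wt")]
train_x <- x[1:24, ]    # hypothetical split, for illustration only
test_x  <- x[25:32, ]

pp_range <- preProcess(train_x, method = "range")
train_r <- predict(pp_range, train_x)
test_r  <- predict(pp_range, test_x)

sapply(train_r, range)  # each training column spans 0 to 1
sapply(test_r, range)   # test values can fall outside 0 to 1
```

The 0 to 1 guarantee holds only for the data the recipe was estimated on. New rows that exceed the training minimum or maximum map below 0 or above 1.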
3. Impute missing values
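A hedged sketch of nearest-neighbour imputation is shown below. It assumes caret is installed along with the RANN package, which caret relies on for the neighbour search; the rows set to NA and the names x, pp_knn, and filled are made up for the example.

```r
library(caret)   # "knnImpute" also needs the RANN package installed

x <- iris[, 1:4]
x[c(3, 50, 120), "Sepal.Length"] <- NA   # punch a few hypothetical holes

pp_knn <- preProcess(x, method = "knnImpute", k = 5)
filled <- predict(pp_knn, x)

anyNA(filled)    # FALSE: the gaps have been filled
head(filled)     # values are on the standardized scale, not original units
```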
The "knnImpute" method fills gaps using the five nearest complete rows.
"knnImpute" also centers and scales the data, so its output is standardized. If you want imputation without standardization, use "medianImpute" or "bagImpute" instead.
4. Use preProcess inside train()
You rarely call preProcess() by hand in a real workflow. The train() function takes a preProcess argument and builds the recipe inside each resampling fold for you.
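A small sketch of that workflow, assuming caret is installed. The formula, the k-nearest-neighbours regression model, and the 5-fold cross-validation setup are illustrative choices, not a recommendation.

```r
library(caret)

set.seed(42)
fit <- train(
  mpg ~ hp + wt + disp, data = mtcars,
  method = "knn",                         # distance-based, so scaling matters
  preProcess = c("center", "scale"),      # re-estimated inside each fold
  trControl = trainControl(method = "cv", number = 5),
  tuneLength = 3
)

fit$preProcess                        # recipe stored with the final model
predict(fit, newdata = head(mtcars))  # new rows are preprocessed automatically
```

Because the recipe travels with the fitted object, predict() on the model applies the stored centering and scaling before scoring, so there is no separate predict() call on a preProcess object.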
A plain scale() call recomputes statistics from whatever data you give it, so test data gets standardized against itself. preProcess() freezes the training parameters into an object, and predict() replays them, so a model never sees test-derived statistics.
preProcess() vs scale() and recipes
preProcess() is the caret-native choice; scale() and recipes cover the cases around it. Base R scale() is fine for a quick, throwaway transform. The recipes package is the modern tidymodels equivalent with a richer step API.
| Tool | Best for | Reusable on new data | Imputation |
|---|---|---|---|
| preProcess() + predict() | caret modeling workflows | Yes, recipe stored in an object | Yes (knn, bag, median) |
| scale() | quick one-off standardizing | No, recomputes every call | No |
| recipes package | tidymodels pipelines | Yes, via prep() and bake() | Yes |
Choose preProcess() when your model is trained with caret::train(). Choose recipes when you have moved to tidymodels. Reach for scale() only for exploratory work that never touches a test set.
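The difference between the first two rows of the table can be sketched as follows, assuming caret is installed. The split is deliberately non-random (the last 50 iris rows are a single species) so the gap is easy to see; the names via_caret and via_scale are illustrative.

```r
library(caret)

train_x <- iris[1:100, 1:4]
test_x  <- iris[101:150, 1:4]   # deliberately different from the training rows

pp <- preProcess(train_x, method = c("center", "scale"))
via_caret <- predict(pp, test_x)            # uses training mean and SD
via_scale <- as.data.frame(scale(test_x))   # uses the test set's own mean and SD

round(colMeans(via_scale), 10)  # zero: test data standardized against itself
colMeans(via_caret)             # not zero: measured relative to the training rows
```

scale() can replay training statistics if you pass them through its center and scale arguments by hand, but nothing stores them for you, which is the bookkeeping preProcess() takes over.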
Common pitfalls
Pitfall 1: expecting preProcess() to return transformed data. It returns a recipe object, not a data frame. You must call predict() to get values back. Skipping that step is the single most common mistake.
Pitfall 2: standardizing with center/scale but leaving NA values. These methods ignore missing values; they do not fill them. If your data has gaps, add an imputation method to the same method vector.
Pitfall 3: fitting the recipe on the full dataset. Estimating means and standard deviations from train and test combined leaks information and inflates your accuracy estimate.
Never run preProcess() on data that includes your test set. Build the recipe from training rows only, then apply it to test rows with predict(). Otherwise test statistics bleed into the transformation and your evaluation is optimistic.
Try it yourself
Try it: Build a preProcess recipe on iris[, 1:4] that scales every numeric column to a 0 to 1 range, apply it, and save the result to ex_scaled.
Explanation: preProcess() with method = "range" estimates the minimum and maximum of each column. predict() then rescales every value so the smallest becomes 0 and the largest becomes 1.
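One possible solution, sketched under the assumption that caret is installed (the name ex_pp is illustrative; ex_scaled comes from the exercise):

```r
library(caret)

ex_pp <- preProcess(iris[, 1:4], method = "range")  # estimate min and max per column
ex_scaled <- predict(ex_pp, iris[, 1:4])            # rescale with the stored values

summary(ex_scaled)  # every column should run from 0 to 1
```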
Related caret functions
After preProcess(), these caret functions round out a preprocessing pipeline:
- predict.preProcess(): applies a fitted recipe to new data
- dummyVars(): one-hot encodes factor columns into numeric indicators
- nearZeroVar(): flags predictors with near-zero variance for removal
- findCorrelation(): identifies highly correlated columns to drop
- train(): trains and tunes a model, accepting a preProcess argument directly
FAQ
Does caret preProcess() change the data immediately?
No. preProcess() only estimates parameters and returns a preProcess object that stores them. The original data is left unchanged and is not returned. To get transformed values you call predict() on the object with the data you want processed. This separation lets you fit the recipe once on training data and replay it on any number of later batches.
What is the difference between preProcess() and scale()?
scale() returns a standardized copy of a matrix and, by default, recomputes the mean and standard deviation every time it runs. preProcess() freezes those statistics into a reusable object. Because a model trained on standardized data must see test data standardized with the training statistics, preProcess() plus predict() is the correct choice for any workflow that splits data.
In what order does preProcess() apply methods?
caret uses a fixed internal order regardless of how you list methods. It runs variance filters first, then skew corrections such as Box-Cox, then centering, scaling, and range, then imputation, and finally PCA, ICA, or spatial sign. So method = c("scale", "center") and method = c("center", "scale") produce identical results.
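A quick way to check the ordering claim yourself, assuming caret is installed (the names a and b are illustrative):

```r
library(caret)

x <- iris[, 1:4]
a <- predict(preProcess(x, method = c("center", "scale")), x)
b <- predict(preProcess(x, method = c("scale", "center")), x)

all.equal(a, b)   # TRUE if the listed order makes no difference
```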
How does preProcess() handle missing values?
For "center", "scale", and "range", missing values are ignored when estimating parameters and remain NA in the output. To fill gaps, add an imputation method: "knnImpute", "bagImpute", or "medianImpute". Imputation runs after scaling in caret's internal order.
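As a small illustration of filling gaps without standardizing, here is a sketch with "medianImpute". It assumes caret is installed; the data frame and the names df_na, pp_med, and filled_med are made up for the example.

```r
library(caret)

# Toy data (hypothetical values) with one gap per column
df_na <- data.frame(
  a = c(1, 2, NA, 4, 100),
  b = c(10, NA, 30, 40, 50)
)

pp_med <- preProcess(df_na, method = "medianImpute")
filled_med <- predict(pp_med, df_na)
filled_med   # gaps replaced by each column's median; other values unchanged
```

Here the missing entry in a becomes 3 (the median of 1, 2, 4, 100) and the missing entry in b becomes 35 (the median of 10, 30, 40, 50), while the observed values keep their original units.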
Can I use preProcess() outside of caret modeling?
Yes. preProcess() works as a standalone transformer on any numeric data frame, even if you never call train(). Many people use it purely to standardize or impute data before feeding it to a non-caret model, because the stored-recipe pattern is still the safe way to keep train and test consistent.