caret preProcess() in R: Scale, Center and Impute Data

The preProcess() function in caret estimates a data transformation, such as centering, scaling, or imputation, from a training set. You apply it to any data with predict(), so the same recipe runs consistently on train and test sets.

⚡ Quick Answer
preProcess(df, method = c("center","scale"))   # standardize columns
preProcess(df, method = "range")                # scale to 0 to 1
preProcess(df, method = "knnImpute")            # impute + standardize
preProcess(df, method = "medianImpute")         # fast NA fill
preProcess(df, method = "BoxCox")               # correct skew
preProcess(df, method = "nzv")                  # drop near-zero-var cols
preProcess(df, method = "pca", thresh = 0.95)   # reduce dimensions
predict(pp, newdata)                            # apply the recipe

Need explanation? Read on for examples and pitfalls.

📊 Is preProcess() the right tool?
Start from the task, then pick the function:

  • transform numeric predictors before modeling: preProcess(df, method = ...)
  • one-hot encode factor columns: dummyVars(~ ., data = df)
  • drop near-zero-variance columns: nearZeroVar(df, names = TRUE)
  • drop highly correlated columns: findCorrelation(cor(df))
  • split data into train and test: createDataPartition(y, p = 0.7)
  • tune a model with resampling: train(y ~ ., data = df, method = "rf")

What preProcess() does in one sentence

preProcess() learns a transformation; it does not apply one. You hand it a data frame of numeric predictors and a vector of method names, and it returns a preProcess object that stores the parameters it estimated, such as each column's mean and standard deviation. Your data is not modified, and no transformed copy is returned.

To actually transform data, you call predict() on that object. This two-step design is deliberate. A model that was trained on standardized data must see test data standardized with the same mean and standard deviation, not values recomputed from the test set. Storing the recipe once and replaying it everywhere is what keeps train and test consistent.
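A minimal sketch of the two steps, assuming an arbitrary fixed split of mtcars (rows 1 to 20 as training, the rest as test; the split and the variable names are illustrative only). It inspects the stored parameters through the object's mean and std components, then checks that predict() matches a manual standardization with the training statistics.

```r
library(caret)

# Illustrative split: first 20 rows as "training", remaining 12 as "test"
train_df <- mtcars[1:20, c("mpg", "wt")]
test_df  <- mtcars[21:32, c("mpg", "wt")]

# Step 1: estimate the parameters from the training rows only
pp <- preProcess(train_df, method = c("center", "scale"))
pp$mean  # training means, one per column
pp$std   # training standard deviations, one per column

# Step 2: replay those parameters on the test rows
test_cs <- predict(pp, test_df)

# The result matches standardizing by hand with the training statistics
manual_wt <- (test_df$wt - mean(train_df$wt)) / sd(train_df$wt)
all.equal(test_cs$wt, manual_wt)
#> [1] TRUE
```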

preProcess() syntax and arguments

preProcess() takes a data frame or matrix of predictors plus a vector of method names. Non-numeric columns are ignored and passed through unchanged, so you can hand it a mixed data frame without filtering first.

R: Load caret and build a dataset
library(caret)

# mtcars columns with two missing horsepower values
df <- mtcars[, c("mpg", "hp", "wt", "disp")]
df$hp[c(3, 8)] <- NA
summary(df$hp)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
#>    52.0   106.0   136.5   151.3   180.0   335.0       2

  

The core signature is short:

preProcess(x, method = c("center", "scale"), ...)
  • x: a data frame or matrix of predictors. Numeric columns are processed; others are ignored.
  • method: a character vector of transformations. caret applies them in a fixed internal order, not the order you list.
  • thresh: cumulative variance kept by "pca" (default 0.95).
  • k: number of neighbours used by "knnImpute" (default 5).
  • na.remove: drop NA values before estimating parameters (default TRUE).
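As a hedged illustration of thresh, the sketch below asks "pca" to keep enough components to cover 80% of the variance in four complete mtcars columns. The column choice and the 0.80 value are arbitrary. The numComp component of the fitted object reports how many components were retained, and predict() returns one column per retained component.

```r
library(caret)

# Four complete numeric columns; the selection is arbitrary
x <- mtcars[, c("mpg", "hp", "wt", "disp")]

# Keep enough principal components to cover 80% of the variance.
# "pca" centers and scales the columns first, even if not requested.
pp_pca <- preProcess(x, method = "pca", thresh = 0.80)
pp_pca$numComp   # number of components retained

pcs <- predict(pp_pca, x)
colnames(pcs)    # component columns: PC1, PC2, ...
```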

The method argument accepts many values, grouped by purpose:

  • Centering and scaling: "center", "scale", "range"
  • Skew correction: "BoxCox", "YeoJohnson", "expoTrans"
  • Imputation: "knnImpute", "bagImpute", "medianImpute"
  • Filtering: "zv", "nzv", "corr"
  • Transformation: "pca", "ica", "spatialSign"
Tip
Estimate the recipe on training data, then reuse it. Build one preProcess object from the training set and call predict() with it on every later batch. Re-estimating on the test set leaks test statistics into your evaluation.

preProcess() examples by use case

1. Center and scale (standardize)

The most common recipe subtracts each column's mean and divides by its standard deviation.

R: Estimate centering and scaling
pp <- preProcess(df, method = c("center", "scale"))
pp
#> Created from 30 samples and 4 variables
#>
#> Pre-processing:
#>   - centered (4)
#>   - ignored (0)
#>   - scaled (4)

  

The object reports what it estimated but holds no transformed data. Apply it with predict().

R: Apply the recipe with predict
df_cs <- predict(pp, df)
round(head(df_cs, 3), 2)
#>                mpg    hp    wt  disp
#> Mazda RX4     0.15 -0.54 -0.61 -0.57
#> Mazda RX4 Wag 0.15 -0.54 -0.35 -0.57
#> Datsun 710    0.45    NA -0.92 -0.99

  

Notice the NA in row three. Centering and scaling never fill missing values; they only shift and rescale the values that are present. For gaps you need an imputation method.

2. Scale to a 0 to 1 range

Use "range" when an algorithm expects bounded inputs, as many neural networks do, or when a distance metric has to compare predictors measured in different units.

R: Scale predictors to a 0 to 1 range
pp_range <- preProcess(df, method = "range")
df_range <- predict(pp_range, df)
summary(df_range$mpg)
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.0000  0.2138  0.3745  0.4124  0.5277  1.0000

  

Every value now lands between 0 and 1, with the smallest observation mapped to 0 and the largest to 1.
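That guarantee holds for the data the ranges were estimated from. As a sketch under an artificial split (fitting on cars below 25 mpg, then scoring the rest; the threshold is only for illustration), new values beyond the training maximum map above 1, because predict() reuses the stored minimum and maximum rather than recomputing them.

```r
library(caret)

# Artificial split so the new rows exceed the training range
low_mpg  <- mtcars[mtcars$mpg < 25, c("mpg", "wt")]
high_mpg <- mtcars[mtcars$mpg >= 25, c("mpg", "wt")]

pp_rng <- preProcess(low_mpg, method = "range")
scaled_new <- predict(pp_rng, high_mpg)

# Same formula by hand: (x - training min) / (training max - training min)
by_hand <- (high_mpg$mpg - min(low_mpg$mpg)) /
  (max(low_mpg$mpg) - min(low_mpg$mpg))
all.equal(scaled_new$mpg, by_hand)

range(scaled_new$mpg)  # both ends lie above 1
```

If a downstream model strictly requires inputs in the 0 to 1 interval, that is worth checking on each new batch.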

3. Impute missing values

The "knnImpute" method fills each gap with the average of that column across the nearest rows in the training data, five of them by default (k = 5).

R: Impute missing values with knnImpute
pp_knn <- preProcess(df, method = "knnImpute")
df_knn <- predict(pp_knn, df)
sum(is.na(df_knn$hp))
#> [1] 0

  
Note
knnImpute also centers and scales. Distance-based imputation needs comparable units, so caret standardizes every column as part of "knnImpute". If you want imputation without standardization, use "medianImpute" or "bagImpute" instead.
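A minimal sketch of the alternative the note mentions, using the same four mtcars columns (rebuilt here so the block runs on its own). With "medianImpute" the two missing horsepower values are filled with the column median, and the other columns keep their original units.

```r
library(caret)

# Same setup as earlier: two horsepower values set to missing
df <- mtcars[, c("mpg", "hp", "wt", "disp")]
df$hp[c(3, 8)] <- NA

pp_med <- preProcess(df, method = "medianImpute")
df_med <- predict(pp_med, df)

df_med$hp[c(3, 8)]             # the two filled values
median(df$hp, na.rm = TRUE)    # the median they were filled with
all.equal(df_med$mpg, df$mpg)  # untouched columns are unchanged
```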

4. Use preProcess inside train()

You rarely call preProcess() by hand in a real workflow. The train() function takes a preProcess argument and builds the recipe inside each resampling fold for you.

R: Pass preProcess straight to train()
set.seed(1)
model <- train(
  mpg ~ hp + wt + disp,
  data = mtcars,
  method = "lm",
  preProcess = c("center", "scale")
)
model$preProcess
#> Created from 32 samples and 3 variables
#>
#> Pre-processing:
#>   - centered (3)
#>   - scaled (3)
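The stored recipe also travels with the fitted model. The sketch below refits the same model and predicts on raw, unscaled rows. The new_cars and plain_lm names are illustrative. Centering and scaling do not change a linear model's predictions, so agreement with a plain lm() fit is a quick sanity check that the new rows went through the same recipe.

```r
library(caret)

set.seed(1)
model <- train(
  mpg ~ hp + wt + disp,
  data = mtcars,
  method = "lm",
  preProcess = c("center", "scale")
)

# Raw rows go in; predict() applies the stored recipe before predicting
new_cars <- mtcars[1:3, c("hp", "wt", "disp")]
pred_caret <- predict(model, newdata = new_cars)

# Reference fit without any preprocessing
plain_lm <- lm(mpg ~ hp + wt + disp, data = mtcars)
pred_lm <- predict(plain_lm, newdata = new_cars)

all.equal(unname(pred_caret), unname(pred_lm))
```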

  
Key Insight
The learn-then-apply split is the whole point of preProcess(). A plain scale() call recomputes statistics from whatever data you give it, so test data gets standardized against itself. preProcess() freezes the training parameters into an object, and predict() replays them, so a model never sees test-derived statistics.
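To make the contrast concrete, here is a small sketch with an arbitrary fixed split (rows 1 to 20 as training, the rest as test). scale() centers the test rows on their own mean, while predict() centers them on the training mean, so the two results differ.

```r
library(caret)

# Illustrative fixed split of two predictors
train_df <- mtcars[1:20, c("wt", "hp")]
test_df  <- mtcars[21:32, c("wt", "hp")]

# scale() standardizes the test rows against their own mean and sd
self_scaled <- scale(test_df$wt)
mean(self_scaled)        # numerically zero

# preProcess() + predict() standardizes them against the training rows
pp <- preProcess(train_df, method = c("center", "scale"))
train_scaled <- predict(pp, test_df)
mean(train_scaled$wt)    # negative: these cars are lighter than the training average
```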

preProcess() vs scale() and recipes

preProcess() is the caret-native choice; scale() and recipes cover the cases around it. Base R scale() is fine for a quick, throwaway transform. The recipes package is the modern tidymodels equivalent with a richer step API.

| Tool | Best for | Reusable on new data | Imputation |
| --- | --- | --- | --- |
| preProcess() + predict() | caret modeling workflows | Yes, recipe stored in an object | Yes (knn, bag, median) |
| scale() | quick one-off standardizing | No, recomputes every call | No |
| recipes package | tidymodels pipelines | Yes, via prep() and bake() | Yes |

Choose preProcess() when your model is trained with caret::train(). Choose recipes when you have moved to tidymodels. Reach for scale() only for exploratory work that never touches a test set.
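For comparison, this is roughly what reusing training statistics looks like with base R alone. It is a sketch on an arbitrary split, not a recommended workflow. scale() records the centers and scales it used as attributes on its result, and passing them back in for new data is a manual step that preProcess() and predict() handle for you.

```r
# Base R only; the split is illustrative
train_m <- as.matrix(mtcars[1:20, c("hp", "wt")])
test_m  <- as.matrix(mtcars[21:32, c("hp", "wt")])

train_s <- scale(train_m)

# scale() stores what it used as attributes on the returned matrix
ctr <- attr(train_s, "scaled:center")
scl <- attr(train_s, "scaled:scale")

# Reusing them on the test rows has to be done by hand
test_s <- scale(test_m, center = ctr, scale = scl)
head(test_s, 3)
```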

Common pitfalls

Pitfall 1: expecting preProcess() to return transformed data. It returns a recipe object, not a data frame. You must call predict() to get values back. Skipping that step is the single most common mistake.

Pitfall 2: standardizing with center/scale but leaving NA values. These methods skip missing values when estimating parameters; they do not fill them. If your data has gaps, add an imputation method to the same method vector.

Pitfall 3: fitting the recipe on the full dataset. Estimating means and standard deviations from train and test combined leaks information and inflates your accuracy estimate.
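The first two pitfalls are easy to check directly. A minimal sketch, using the same mtcars columns with two horsepower values blanked out, and "medianImpute" as one possible imputation choice:

```r
library(caret)

df <- mtcars[, c("mpg", "hp", "wt", "disp")]
df$hp[c(3, 8)] <- NA

# Pitfall 1: the fitted object is a recipe, not a data frame
pp <- preProcess(df, method = c("center", "scale", "medianImpute"))
class(pp)
#> [1] "preProcess"

# predict() is what returns transformed data
out <- predict(pp, df)
is.data.frame(out)

# Pitfall 2: with an imputation method in the vector, no gaps remain
anyNA(out)
```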

Warning
Never estimate preProcess() on data that includes your test set. Build the recipe from training rows only, then apply it to test rows with predict(). Otherwise test statistics bleed into the transformation and your evaluation is optimistic.
R: Estimate on training data only
set.seed(42)
idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train <- mtcars[idx, c("mpg", "hp", "wt")]
test <- mtcars[-idx, c("mpg", "hp", "wt")]

# Fit the recipe on training rows, then apply it to the held-out rows
pp_train <- preProcess(train, method = c("center", "scale"))
test_cs <- predict(pp_train, test)
nrow(test_cs)
#> [1] 8

  

Try it yourself

Try it: Build a preProcess recipe on iris[, 1:4] that scales every numeric column to a 0 to 1 range, apply it, and save the result to ex_scaled.

R: Your turn: range-scale iris
# Try it: range-scale the iris predictors
ex_scaled <- # your code here

summary(ex_scaled$Sepal.Length)
#> Expected: Min. near 0.0, Max. near 1.0

  
R: Solution
pp_iris <- preProcess(iris[, 1:4], method = "range")
ex_scaled <- predict(pp_iris, iris[, 1:4])
summary(ex_scaled$Sepal.Length)
#>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#> 0.0000  0.2222  0.4167  0.4287  0.5833  1.0000

  

Explanation: preProcess() with method = "range" estimates the minimum and maximum of each column. predict() then rescales every value so the smallest becomes 0 and the largest becomes 1.

After preProcess(), these caret functions round out a preprocessing pipeline:

  • predict.preProcess(): applies a fitted recipe to new data
  • dummyVars(): one-hot encodes factor columns into numeric indicators
  • nearZeroVar(): flags predictors with near-zero variance for removal
  • findCorrelation(): identifies highly correlated columns to drop
  • train(): trains and tunes a model, accepting a preProcess argument directly
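A short sketch of three of these helpers on iris, to show the shape of what each returns. The cutoff of 0.9 and the variable names are illustrative choices.

```r
library(caret)

# One-hot encode the Species factor alongside the numeric columns
dv <- dummyVars(~ ., data = iris)
iris_num <- predict(dv, newdata = iris)
colnames(iris_num)

# Flag near-zero-variance predictors (returns column names)
nzv_cols <- nearZeroVar(iris[, 1:4], names = TRUE)
nzv_cols

# Suggest columns to drop because of high pairwise correlation
# (returns column indices into the correlation matrix)
high_cor <- findCorrelation(cor(iris[, 1:4]), cutoff = 0.9)
high_cor
```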

FAQ

Does caret preProcess() change the data immediately?

No. preProcess() only estimates parameters and returns a preProcess object that stores them. The original data is left unchanged. To get transformed values you call predict() on the object with the data you want processed. This separation lets you fit the recipe once on training data and replay it on any number of later batches.

What is the difference between preProcess() and scale()?

scale() returns a standardized copy of a matrix and, unless you pass the center and scale values yourself, recomputes the mean and standard deviation from whatever data it receives. preProcess() freezes those statistics into a reusable object. Because a model trained on standardized data must see test data standardized with the training statistics, preProcess() plus predict() is the safer choice for any workflow that splits data.

In what order does preProcess() apply methods?

caret uses a fixed internal order regardless of how you list methods. It runs the filters first (zero variance, near-zero variance, then correlation), then skew corrections such as Box-Cox, then centering, scaling, and range, then imputation, and finally PCA, ICA, and spatial sign. So method = c("scale", "center") and method = c("center", "scale") produce identical results.
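A quick check of that claim, as a sketch on four mtcars columns:

```r
library(caret)

x <- mtcars[, c("mpg", "hp", "wt", "disp")]

# Same methods, listed in opposite order
pp_a <- preProcess(x, method = c("scale", "center"))
pp_b <- preProcess(x, method = c("center", "scale"))

all.equal(predict(pp_a, x), predict(pp_b, x))
#> [1] TRUE
```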

How does preProcess() handle missing values?

For "center", "scale", and "range", missing values are ignored when estimating parameters and remain NA in the output. To fill gaps, add an imputation method: "knnImpute", "bagImpute", or "medianImpute". Imputation runs after scaling in caret's internal order.

Can I use preProcess() outside of caret modeling?

Yes. preProcess() works as a standalone transformer on any numeric data frame, even if you never call train(). Many people use it purely to standardize or impute data before feeding it to a non-caret model, because the stored-recipe pattern is still the safe way to keep train and test consistent.
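As one hedged example of standalone use, the sketch below standardizes iris predictors with preProcess() and passes them to knn() from the class package, which is not part of caret. The split proportion, k = 5, and the variable names are illustrative.

```r
library(caret)
library(class)  # provides knn(); a separate package from caret

set.seed(42)
idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_x <- iris[idx, 1:4]
test_x  <- iris[-idx, 1:4]

# Fit the recipe on training predictors, then apply it to both sets
pp <- preProcess(train_x, method = c("center", "scale"))
train_s <- predict(pp, train_x)
test_s  <- predict(pp, test_x)

# A non-caret model consumes the transformed predictors
pred <- knn(train = train_s, test = test_s, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])  # share of test rows classified correctly
```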