caret predict.train() in R: Score New Data From Models
The caret predict() method scores new data from a model fitted by train(). It dispatches to predict.train(), applies any preprocessing recorded during training, and returns numeric predictions, class labels, or class probabilities depending on the model type and the type argument.
    predict(fit, newdata = test_df)                          # default, raw output
    predict(fit, newdata = test_df, type = "raw")            # explicit class or value
    predict(fit, newdata = test_df, type = "prob")           # class probabilities
    predict(fit, newdata = test_df, na.action = na.pass)     # keep NA rows aligned
    predict(fit, newdata = test_df[, predictors(fit)])       # only required columns
    predict(list(rf = fit1, gbm = fit2), newdata = test_df)  # list of train objects
Need explanation? Read on for examples and pitfalls.
What predict.train() does in one sentence
predict.train() is the S3 method invoked when you call predict() on a caret train object. It takes the fitted model, validates the new data against the recorded predictors, applies the same preprocessing (centering, scaling, dummy variables, imputation) used during training, and returns predictions in the format you request through type.
You almost never type predict.train directly. You call predict(fit, newdata = ...) and R dispatches the right method based on class(fit), which is "train" for caret models. The method matters because it transparently runs the preprocessing pipeline, so the new data does not need to be transformed by hand before scoring.
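The dispatch described above can be checked directly. This is a minimal sketch, assuming caret is installed; the model (a linear regression on mtcars) and the names fit, via_generic, and via_method are illustrative choices, not taken from the original article.

```r
library(caret)

# Illustrative model: linear regression on mtcars, fitted without resampling
fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             trControl = trainControl(method = "none"))

class(fit)               # includes "train", which drives S3 dispatch
inherits(fit, "train")   # TRUE

# Calling the generic and calling the registered method give the same result
via_generic <- predict(fit, newdata = head(mtcars))
via_method  <- getS3method("predict", "train")(fit, newdata = head(mtcars))
identical(via_generic, via_method)
```

In everyday code only the predict(fit, newdata = ...) form is needed; the getS3method() call is there just to make the dispatch visible.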
predict.train() syntax and arguments
The signature is small, but every argument matters. Most users only set newdata and type; na.action controls how rows with missing values are handled, and ... passes extra arguments through.
If newdata is NULL, the method returns predictions on the training data, which is rarely what you want for evaluation. The type = "prob" option only works for classification models whose underlying engine can produce class probabilities.
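Rather than relying on memory, you can inspect the method's arguments in your installed caret version. The sketch below assumes caret is installed and uses an illustrative lm fit on mtcars; it also shows the newdata = NULL behaviour described above.

```r
library(caret)

# Look up the formal arguments of the method registered for class "train"
sig <- formals(getS3method("predict", "train"))
names(sig)

# Illustrative model, fitted without resampling
fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             trControl = trainControl(method = "none"))

# With newdata omitted, predictions come back for the training rows
train_pred <- predict(fit)
length(train_pred) == nrow(mtcars)
```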
| Argument | Default | When to change it |
|---|---|---|
| newdata | NULL (training data) | Always pass a held-out frame for honest scoring |
| type | "raw" | Set to "prob" for ROC, lift, or custom thresholds |
| na.action | na.omit | Use na.pass to keep row alignment with newdata |
| ... | (engine-specific) | Forward arguments like n.trees to the fitted model |
predict.train() examples by use case
1. Score a regression model on a held-out test set
Splitting first, then scoring, gives an honest estimate of new-data error.
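A minimal sketch of this workflow, assuming caret is installed. The dataset (mtcars), the formula, the 80/20 split, and the object names are illustrative assumptions rather than the article's original example.

```r
library(caret)

set.seed(123)

# Hold out roughly 20% of the rows before any fitting
idx      <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_df <- mtcars[idx, ]
test_df  <- mtcars[-idx, ]

# Fit a linear regression with 5-fold cross-validation on the training rows
fit <- train(mpg ~ wt + hp + disp, data = train_df, method = "lm",
             trControl = trainControl(method = "cv", number = 5))

# Score the held-out rows; dispatches to predict.train()
pred <- predict(fit, newdata = test_df)

# RMSE, R-squared, and MAE on data the model never saw
metrics <- postResample(pred = pred, obs = test_df$mpg)
metrics
```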
The predict() call uses no extra preprocessing here because the formula is simple. With centering or dummy variables, caret would apply them inside predict() without you touching the test frame.
2. Get class probabilities for a classification model
Class probabilities are what you need for ROC curves, lift charts, and custom thresholds.
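The following sketch shows one way to request probabilities, assuming caret is installed. The knn model on iris, the centering and scaling step, and the object names are illustrative assumptions.

```r
library(caret)

set.seed(123)

# Stratified 70/30 split on the outcome
idx      <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_df <- iris[idx, ]
test_df  <- iris[-idx, ]

# Ask for class probabilities up front
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)

fit <- train(Species ~ ., data = train_df, method = "knn",
             preProcess = c("center", "scale"),
             trControl = ctrl)

# One column per class, one row per test observation
probs <- predict(fit, newdata = test_df, type = "prob")
head(probs)

# Raw labels remain available from the same fit
labels <- predict(fit, newdata = test_df)
```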
If the model was trained without classProbs = TRUE in trainControl(), predict() with type = "prob" can error out. Set it once at the start and you can always fall back to raw labels.
3. Apply a custom probability threshold
For two-class models, the default label is the more probable class, which amounts to a 0.5 cutoff and is rarely optimal for imbalanced classes.
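A sketch of thresholding by hand, assuming caret is installed. The two-class subset of iris, the logistic regression, the positive class ("virginica"), and the 0.7 cutoff are all illustrative assumptions; in practice the cutoff would come from a validation set.

```r
library(caret)

set.seed(42)

# Two-class problem: drop setosa and the now-empty factor level
two      <- droplevels(subset(iris, Species != "setosa"))
idx      <- createDataPartition(two$Species, p = 0.7, list = FALSE)
train_df <- two[idx, ]
test_df  <- two[-idx, ]

fit <- train(Species ~ Sepal.Length + Sepal.Width, data = train_df,
             method = "glm",
             trControl = trainControl(method = "cv", number = 5,
                                      classProbs = TRUE))

# Probability of the class treated as positive, selected by name
p_virginica <- predict(fit, newdata = test_df, type = "prob")[, "virginica"]

# Illustrative cutoff: label as virginica only when fairly confident
threshold <- 0.7
pred <- factor(ifelse(p_virginica >= threshold, "virginica", "versicolor"),
               levels = levels(test_df$Species))

table(predicted = pred, actual = test_df$Species)
```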
Raise the threshold to be conservative on the positive class; lower it to catch more positives at the cost of false alarms. Pick the threshold from a validation set, not the test set.
4. Predict only on the columns the model needs
predictors() returns the columns the train object actually uses, so you can pass a wider frame and let caret subset.
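A short sketch of subsetting with predictors(), assuming caret is installed. The model uses numeric predictors only, so the names returned by predictors() match column names in the data; with factor predictors the returned names may refer to dummy columns instead. The id and scored_at columns are hypothetical extras.

```r
library(caret)

fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             trControl = trainControl(method = "none"))

# Names of the predictors the fitted model uses
needed <- predictors(fit)
needed

# A wider "production" frame with hypothetical extra columns
wide_df           <- head(mtcars)
wide_df$id        <- paste0("car_", seq_len(nrow(wide_df)))
wide_df$scored_at <- Sys.time()

# Explicit subset versus passing the wide frame as-is
pred_subset <- predict(fit, newdata = wide_df[, needed, drop = FALSE])
pred_wide   <- predict(fit, newdata = wide_df)

all.equal(unname(pred_subset), unname(pred_wide))
```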
This is handy when the production frame carries id columns, timestamps, or auxiliary fields that were not used during training.
predict.train() vs predict() on the raw model
predict.train() preserves the preprocessing recipe; predict() on the unwrapped model does not. When you call predict(fit, ...) on a caret train object, caret runs the same preProcess and dummy-variable steps it ran during training. If you extract fit$finalModel and call predict() on that, those steps disappear and the predictions silently misalign with what the model expects.
Use the train-object method as the default. Reach into finalModel only when you need engine-specific arguments that caret does not expose.
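The difference is easy to see with a model trained on centered and scaled inputs. This is a sketch under assumptions: caret is installed, and the lm model, split, and object names are illustrative.

```r
library(caret)

set.seed(123)
idx      <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_df <- mtcars[idx, ]
test_df  <- mtcars[-idx, ]

fit <- train(mpg ~ wt + hp, data = train_df, method = "lm",
             preProcess = c("center", "scale"),
             trControl = trainControl(method = "none"))

# Through the train object: centering and scaling are re-applied
via_train <- predict(fit, newdata = test_df)

# Through the unwrapped model: raw wt and hp go into a model fit on scaled inputs
via_final <- predict(fit$finalModel, newdata = test_df)

# The two disagree (expected FALSE here)
isTRUE(all.equal(unname(via_train), unname(via_final)))

# Applying the stored preProcess object by hand recovers the train-object result
scaled     <- predict(fit$preProcess, test_df[, c("wt", "hp")])
via_manual <- predict(fit$finalModel, newdata = scaled)
all.equal(unname(via_train), unname(via_manual))
```

The manual route works, but it duplicates what predict.train() already does and is easy to get wrong once dummy variables or imputation are involved.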
Common pitfalls
Three mistakes account for most predict.train() failures in practice.
- Factor levels that differ between newdata and the training data. If a factor in your test frame contains levels the training frame never saw, predict() errors with "factor has new levels" or returns NA. Cast test factors to the training levels: test_df$x <- factor(test_df$x, levels = levels(train_df$x)). Values outside the training levels become NA.
- type = "prob" without classProbs = TRUE. Set classProbs = TRUE in trainControl() before you train if you ever need probabilities.
- Dropped rows from na.action = na.omit. The default silently removes rows containing NA, so your prediction vector is shorter than your test frame. Use na.action = na.pass and impute upstream, or impute with preProcess(method = "medianImpute").
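The row-dropping pitfall can be reproduced in a few lines. A minimal sketch, assuming caret is installed; the lm model and the injected NA are illustrative.

```r
library(caret)

fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             trControl = trainControl(method = "none"))

# Five rows to score, one with a missing predictor
test_df       <- head(mtcars, 5)
test_df$wt[2] <- NA

# Default na.action = na.omit: the incomplete row is dropped
p_omit <- predict(fit, newdata = test_df)
length(p_omit)

# na.pass keeps one prediction per input row, with NA where inputs were missing
p_pass <- predict(fit, newdata = test_df, na.action = na.pass)
length(p_pass)

# Safe to attach back to the frame because the lengths match
test_df$pred <- p_pass
```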
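Index probability columns by name, prob$<level> or prob[, "<level>"], instead of positional indexing; the column order follows the factor levels of the outcome and can change if those levels are reordered between training runs.
Try it yourself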
Try it: Fit a knn classification model on the iris dataset, predict class probabilities and class labels for a 30% held-out test split, and compute the accuracy of the predicted labels.
Solution
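One possible solution sketch, assuming caret is installed. The seed, the centering and scaling step, tuneLength, and 5-fold cross-validation are illustrative choices; the names ex_fit and ex_test follow the explanation, and the remaining names are hypothetical.

```r
library(caret)

set.seed(2024)

# Stratified 70/30 split
ex_idx   <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
ex_train <- iris[ex_idx, ]
ex_test  <- iris[-ex_idx, ]

# Tune k by 5-fold cross-validation, keeping class probabilities available
ex_fit <- train(Species ~ ., data = ex_train, method = "knn",
                preProcess = c("center", "scale"),
                tuneLength = 5,
                trControl = trainControl(method = "cv", number = 5,
                                         classProbs = TRUE))

# Class probabilities and class labels for the held-out rows
ex_prob <- predict(ex_fit, ex_test, type = "prob")
ex_pred <- predict(ex_fit, ex_test)

# Share of held-out rows labelled correctly
ex_acc <- mean(ex_pred == ex_test$Species)
ex_acc
```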
Explanation: train() cross-validates the knn neighbours hyperparameter k, then predict(ex_fit, ex_test) dispatches to predict.train(), which applies the chosen k and returns class labels. Accuracy on the held-out 30% of iris is typically above 0.94.
Related caret functions
predict.train() pairs with these caret helpers.
- train(): fits the model whose train object predict() consumes.
- trainControl(): set classProbs = TRUE here so type = "prob" works.
- confusionMatrix(): compare predicted labels to true labels and get accuracy, sensitivity, kappa.
- createDataPartition(): split data before training so predict() has a test set to score.
- varImp(): rank features the trained model relies on; complements probability inspection.
For the formal reference, see the caret model documentation maintained by Max Kuhn.
FAQ
What is the difference between predict() and predict.train()?
predict() is a generic R function. When you call it on a caret train object, R dispatches to predict.train(), which is the method registered for the train class. You almost never write predict.train() directly; calling predict(fit, ...) is idiomatic and forwards to the right method. The benefit of going through the generic is that caret applies any preprocessing recorded by train() (through its preProcess argument) before scoring.
Why does predict() return NA for some rows?
The default na.action = na.omit drops rows containing NA in any predictor, then returns predictions only for the surviving rows, so the output vector is shorter than the input frame. Pass na.action = na.pass to keep every row; rows with NA predictors then typically yield NA predictions unless you impute first, for example with preProcess = "medianImpute" inside train().
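A sketch of the imputation route, assuming caret is installed. The lm model and the injected NA are illustrative; the medians used for imputation are the ones computed from the training data.

```r
library(caret)

# Median imputation is recorded as part of the model's preprocessing
fit <- train(mpg ~ wt + hp, data = mtcars, method = "lm",
             preProcess = "medianImpute",
             trControl = trainControl(method = "none"))

# New data with a missing predictor value
test_df       <- head(mtcars, 5)
test_df$wt[2] <- NA

# na.pass lets the NA row reach the preprocessing step, where it is imputed
pred <- predict(fit, newdata = test_df, na.action = na.pass)

length(pred) == nrow(test_df)
anyNA(pred)
```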
How do I get a probability for the positive class only?
Use predict(fit, newdata, type = "prob")[, "<positive_level>"] where <positive_level> is the name of the class you treat as positive. Avoid positional indexing like [, 2]; if you re-train with a different factor ordering, position 2 may no longer be the class you expect.
Can I predict from a list of train objects at once?
Yes. Pass a list of train objects to predict() and you get a list of predictions, one element per model, which you can bind into a data frame with one column per model. This is the basis for simple model averaging or for building a stacking layer downstream.
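A sketch of scoring several models at once, assuming caret is installed. The two regression models (lm and knn on mtcars), the list names, and the simple average are illustrative assumptions.

```r
library(caret)

set.seed(123)
idx      <- createDataPartition(mtcars$mpg, p = 0.8, list = FALSE)
train_df <- mtcars[idx, ]
test_df  <- mtcars[-idx, ]

ctrl <- trainControl(method = "cv", number = 5)

fit_lm  <- train(mpg ~ wt + hp, data = train_df, method = "lm",
                 trControl = ctrl)
fit_knn <- train(mpg ~ wt + hp, data = train_df, method = "knn",
                 preProcess = c("center", "scale"), trControl = ctrl)

# One call scores every model in the list
preds <- predict(list(lm = fit_lm, knn = fit_knn), newdata = test_df)
str(preds)

# Bind into one column per model, then take a simple average
pred_df         <- as.data.frame(preds)
pred_df$average <- rowMeans(pred_df)
head(pred_df)
```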
Does predict.train() apply the same preprocessing used during train()?
Yes, and that is the main reason to prefer it over predict(fit$finalModel, ...). Centering, scaling, dummy encoding, and median imputation specified in preProcess are re-applied to newdata so the model sees inputs in the same form it was trained on.