caret resamples() in R: Compare Cross-Validated Models
The resamples() function in caret bundles two or more caret::train objects fit on identical resampling folds, then exposes a unified API to summarize, plot, and statistically test their performance. It is the canonical way to settle which model is best when every candidate has been tuned with the same cross-validation plan.
    res <- resamples(list(rf = fit_rf, gbm = fit_gbm, glm = fit_glm))  # bundle models
    summary(res)        # mean/min/max per metric
    bwplot(res)         # box-whisker by metric
    dotplot(res)        # mean and CI by metric
    summary(diff(res))  # paired t-tests
    res$values          # per-fold scores
    modelCor(res)       # between-model correlation
Need explanation? Read on for examples and pitfalls.
What resamples() does in one sentence
resamples() is caret's model-comparison constructor. You pass a named list of fitted train, rfe, or sbf objects and it returns a resamples object holding the per-fold metric matrix for every model side by side. From there, generic methods like summary(), bwplot(), dotplot(), diff(), and modelCor() operate on that bundle.
The function never refits anything. It only collects the cross-validation scores that each train() call already produced and aligns them by fold. Because the alignment is per fold, the comparison is paired, which gives tighter confidence intervals than independent comparisons.
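One way to guarantee that per-fold alignment is to build the fold indices once and hand the same indices to every fit. The sketch below assumes the `Boston` data from MASS and uses `lm` and `knn` as stand-in models; the object names are illustrative, not taken from the article.

```r
library(caret)
data(Boston, package = "MASS")   # example data (assumption); any regression set works

set.seed(42)
# Build the training-row indices for 5 folds once ...
folds <- createFolds(Boston$medv, k = 5, returnTrain = TRUE)

# ... and pass the same indices to every model through trainControl().
# indexOut is omitted, so each fold is scored on the rows left out of its index.
ctrl <- trainControl(method = "cv", index = folds)

fit_lm  <- train(medv ~ ., data = Boston, method = "lm",  trControl = ctrl)
fit_knn <- train(medv ~ ., data = Boston, method = "knn", trControl = ctrl)

# Check that both fits were resampled on the same rows
identical(fit_lm$control$index, fit_knn$control$index)
```

Because the indices live in the control object rather than in the RNG state, the folds stay fixed no matter how many models are trained in between.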
Pass index = / indexOut = to trainControl() to enforce fold alignment.

resamples() syntax and arguments
The signature is short and the input does the heavy lifting. Most invocations are one line with a named list of fits.
The two positional arguments are x (the named list of fits) and modelNames (defaults to names(x)). Anything else is forwarded. The names you give in list(...) become the model labels in every downstream summary and plot, so pick short, distinct labels.
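A minimal sketch of both labeling routes, assuming the `Boston` data from MASS and two stand-in models. The labels `linear` and `neighbors` are made up for illustration.

```r
library(caret)
data(Boston, package = "MASS")   # example data (assumption)

set.seed(1)
ctrl <- trainControl(
  method = "cv",
  index  = createFolds(Boston$medv, k = 5, returnTrain = TRUE)
)
fit_lm  <- train(medv ~ ., data = Boston, method = "lm",  trControl = ctrl)
fit_knn <- train(medv ~ ., data = Boston, method = "knn", trControl = ctrl)

# Labels taken from the list names
res_named <- resamples(list(lm = fit_lm, knn = fit_knn))
res_named$models

# Labels supplied through modelNames instead
res_relabeled <- resamples(list(lm = fit_lm, knn = fit_knn),
                           modelNames = c("linear", "neighbors"))
res_relabeled$models
```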
Internally, resamples() pulls the per-fold metric table that each fit stored during training (the resample element of a train object) and checks that the fits were resampled compatibly. If they were not, for example when the number of resamples differs, it throws an error rather than silently averaging unequal folds.
resamples(list(rf = fit_rf, gbm = fit_gbm)) labels rows and columns by rf and gbm everywhere downstream. An unnamed list falls back to Model1, Model2, which is hard to skim in plots.

resamples() examples by use case
Three patterns cover almost every comparison job: summary, paired test, and per-fold inspection. Each operates on the same resamples object.
The summary reports a five-number summary plus the mean for each metric and each model. Eyeball the medians first: gbm edges out rf on both RMSE and Rsquared, and glm trails on both. That gap may or may not be statistically meaningful, which is what diff() answers.
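The summary and per-fold inspection patterns can be sketched as follows. This uses the `Boston` data from MASS with `lm`, `knn`, and a fixed-`cp` `rpart` tree as stand-ins (assumptions chosen to keep dependencies small), so the numbers will not match the rf/gbm/glm comparison discussed in the text.

```r
library(caret)
data(Boston, package = "MASS")   # example data (assumption)

set.seed(10)
ctrl <- trainControl(
  method = "cv",
  index  = createFolds(Boston$medv, k = 5, returnTrain = TRUE)
)
fit_lm    <- train(medv ~ ., data = Boston, method = "lm",  trControl = ctrl)
fit_knn   <- train(medv ~ ., data = Boston, method = "knn", trControl = ctrl)
fit_rpart <- train(medv ~ ., data = Boston, method = "rpart", trControl = ctrl,
                   tuneGrid = data.frame(cp = 0.01))  # single cp value, no tuning

res <- resamples(list(lm = fit_lm, knn = fit_knn, rpart = fit_rpart))

summary(res)                   # distribution of every metric, per model
summary(res, metric = "RMSE")  # restrict the report to one metric
head(res$values)               # per-fold scores, columns named model~metric
modelCor(res)                  # correlation of per-fold scores between models
```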
diff() runs paired t-tests on the per-fold metric vectors, then summary() prints estimates above the diagonal and Bonferroni-adjusted p-values below. Above, glm is significantly worse than both tree models, while rf versus gbm is a coin flip.
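A hedged sketch of the paired test, again on the `Boston` data with two stand-in models (assumptions, not the article's fits). With only one pair of models the Bonferroni adjustment has nothing to correct, so the hand-rolled paired t-test at the end should agree with what `diff()` reports for RMSE.

```r
library(caret)
data(Boston, package = "MASS")   # example data (assumption)

set.seed(20)
ctrl <- trainControl(
  method = "cv",
  index  = createFolds(Boston$medv, k = 5, returnTrain = TRUE)
)
fit_lm  <- train(medv ~ ., data = Boston, method = "lm",  trControl = ctrl)
fit_knn <- train(medv ~ ., data = Boston, method = "knn", trControl = ctrl)

res <- resamples(list(lm = fit_lm, knn = fit_knn))

d <- diff(res)   # paired per-fold differences for every metric
summary(d)       # estimates and adjusted p-values

# The same comparison by hand for one metric
t.test(res$values[["lm~RMSE"]], res$values[["knn~RMSE"]], paired = TRUE)
```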
Compare resamples() with alternatives
Pick the function that matches the comparison granularity you need.
| Goal | Function | Returns |
|---|---|---|
| Compare 2+ caret fits on shared folds | resamples() | resamples object with paired per-fold metrics |
| Score one prediction vector | postResample(pred, obs) | named numeric vector |
| Inspect one model's tuning results | fit$results | data frame, one row per hyperparameter combo |
| Paired statistical test on differences | diff(resamples_obj) | matrix of estimates and p-values |
| Compare tidymodels workflows | tune::collect_metrics() | tibble in tidy long form |
resamples() is the only one that pairs metrics fold-by-fold and ships with plotting and testing methods out of the box. If the candidate models come from different frameworks, fall back to a manual fold loop and combine the metric vectors by hand.
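The manual fallback might look like the sketch below. Here `lm()` and `rpart::rpart()` stand in for models fit outside caret (an assumption for illustration), and caret is used only for its fold and scoring helpers.

```r
library(caret)   # only createFolds() and postResample() are used
data(Boston, package = "MASS")   # example data (assumption)

set.seed(7)
test_folds <- createFolds(Boston$medv, k = 5)   # held-out row indices per fold

# Fit both models on the same training rows and score them on the same held-out rows
rmse_by_fold <- sapply(test_folds, function(idx) {
  train_df <- Boston[-idx, ]
  test_df  <- Boston[idx, ]
  m_lm   <- lm(medv ~ ., data = train_df)
  m_tree <- rpart::rpart(medv ~ ., data = train_df)
  c(lm   = unname(postResample(predict(m_lm,   test_df), test_df$medv)["RMSE"]),
    tree = unname(postResample(predict(m_tree, test_df), test_df$medv)["RMSE"]))
})

rmse_by_fold   # one row per model, one column per fold

# Paired test on the per-fold RMSE values
t.test(rmse_by_fold["lm", ], rmse_by_fold["tree", ], paired = TRUE)
```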
Common pitfalls
Three mistakes break resamples() silently or noisily.
- Different resampling indices. Forgetting to set the same seed before each train() call gives each model its own folds. resamples() throws a clear error in most cases, but if you reuse trainControl() without seeds across an interactive session, the folds drift. Pass index = and indexOut = to trainControl() to lock folds explicitly.
- Missing metric across models. Comparing a regression model to a classification model returns NAs for every cell where one of them lacks the metric. Keep model lists homogeneous in outcome type.
- Forgetting savePredictions = "final". Plotting methods like xyplot(res, models = c("rf","gbm")) need the per-fold predictions, not just the metrics. Set savePredictions = "final" in trainControl() if you plan to do per-fold residual plots.
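The first pitfall can be checked directly by comparing the fold indices stored in each fit. This is a small sketch on the `Boston` data with stand-in models (assumptions); the seed value is arbitrary.

```r
library(caret)
data(Boston, package = "MASS")   # example data (assumption)

ctrl <- trainControl(method = "cv", number = 5)   # no explicit index

set.seed(123)
fit_a <- train(medv ~ ., data = Boston, method = "lm",  trControl = ctrl)

set.seed(123)   # reseeded: folds should match fit_a
fit_b <- train(medv ~ ., data = Boston, method = "knn", trControl = ctrl)

# Not reseeded: folds are drawn from whatever RNG state fit_b left behind
fit_c <- train(medv ~ ., data = Boston, method = "knn", trControl = ctrl)

identical(fit_a$control$index, fit_b$control$index)   # reseeded pair
identical(fit_a$control$index, fit_c$control$index)   # unseeded fit
```

If the second comparison comes back FALSE, the two fits were scored on different folds and should not be treated as a paired comparison.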
Set the seed before every train() call, not just once at the top. Caret consumes random numbers internally during tuning, so a single top-of-script seed leaves later models with shifted RNG state and different folds. The fix is set.seed(N) on the line above every train().

Try it yourself
Try it: Train a knn model and a rpart model on iris with the same five-fold trainControl, bundle them with resamples(), then print the Accuracy summary.
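One possible solution, written as a sketch rather than the article's original code. The seed value and object names are arbitrary choices.

```r
library(caret)

# One shared five-fold plan for both models
ctrl <- trainControl(method = "cv", number = 5)

set.seed(2024)   # seed immediately before each train() call
fit_knn <- train(Species ~ ., data = iris, method = "knn", trControl = ctrl)

set.seed(2024)
fit_rpart <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

res <- resamples(list(knn = fit_knn, rpart = fit_rpart))
summary(res)   # reports Accuracy and Kappa for both models
```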
Explanation: Seeding immediately before each train() makes the five-fold splits identical, so resamples() can pair scores fold by fold. The summary then reports a fair head-to-head for both Accuracy and Kappa.
Related caret functions
Pair resamples() with these for a full model-selection workflow.
- caret::train() builds each candidate model.
- caret::trainControl() defines the shared resampling plan.
- caret::postResample() scores a single prediction vector outside the train loop.
- caret::confusionMatrix() gives a richer classification report on test predictions.
- caret::varImp() ranks predictors after the winning model is chosen.
FAQ
Do all models passed to resamples() need to use the same trainControl?
They need the same resampling indices, which is most reliably done by reusing the same trainControl object and seeding identically before each train() call. Different summaryFunction settings are fine as long as the resulting metric names overlap; non-overlapping metrics get NAs. The strict requirement is fold alignment, not the entire trainControl being byte-identical.
Can resamples() compare regression and classification models together?
Not usefully. Regression fits report RMSE, Rsquared, and MAE; classification fits report Accuracy and Kappa. The combined object has those columns side by side with NAs in the off-diagonal cells, and diff() cannot run paired tests across metric sets. Keep each comparison to one outcome family.
What does diff() actually test?
diff() runs paired t-tests on the per-fold metric differences for every pair of models, then summary() prints Bonferroni-adjusted p-values under the diagonal and effect-size estimates above it. The null hypothesis is that the mean per-fold difference equals zero. Significance depends on both effect size and how many folds and repeats the resampling plan used.
How many models can I put in resamples()?
There is no hard cap, but readability of plots and the Bonferroni penalty on diff() both degrade past 5 to 6 models. For wider sweeps, narrow to the top performers with train() first, then bundle the finalists for the formal comparison.
Does resamples() work with rfe and sbf objects?
Yes. resamples() accepts any combination of train, rfe, and sbf objects in the input list, as long as the resampling indices match. The summary, plot, and diff methods all behave the same way regardless of which class produced the underlying scores.