caret twoClassSummary() in R: ROC, Sensitivity, Specificity
The twoClassSummary() function in caret is the binary-classification summaryFunction: once you pass it to trainControl(), train() calls it on every resample of a two-level factor outcome. It accepts a data frame with obs, pred, and one probability column per class, then returns ROC (AUC), Sensitivity, and Specificity as a named numeric vector. Wiring it in is the standard way to optimise a caret model on AUC instead of accuracy.
twoClassSummary(df, lev = levels(df$obs))                            # direct call
trainControl(summaryFunction = twoClassSummary, classProbs = TRUE)   # wire-in
train(..., metric = "ROC")                                           # optimise on AUC
train(..., metric = "Sens", maximize = TRUE)                         # optimise on Sensitivity
twoClassSummary(data, lev = NULL, model = NULL)                      # full signature
fit$resample                                                         # ROC/Sens/Spec per fold
levels(factor)[1]                                                    # the "event" class
Need explanation? Read on for examples and pitfalls.
What twoClassSummary() does in one sentence
twoClassSummary() is caret's binary-classifier scoring contract. It is the function train() calls on each fold's held-out predictions when the outcome is a two-level factor and trainControl() sets summaryFunction = twoClassSummary with classProbs = TRUE, and it is the function any custom binary summaryFunction must imitate in shape. The body computes ROC AUC using pROC::roc() against the first level's probability column, plus Sensitivity and Specificity from the predicted class labels.
Because the metric is computed per resample and then averaged, ROC reported in fit$results is the mean of fold-wise AUCs, not a single AUC pooled across all out-of-fold predictions. Both numbers are commonly cited; the mean-of-folds version is what metric = "ROC" tunes on.
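To make the distinction between the two numbers concrete, here is a minimal sketch that computes both on the same fit. The binary_iris subset, the seed, the sepal-only formula, and the use of savePredictions = "final" to keep the out-of-fold predictions are choices made for this illustration, not part of the function itself. The pooled AUC is computed by hand with pROC, using direction = ">" so that a higher first-level probability counts toward the first level.

```r
library(caret)

set.seed(1)
# Illustrative two-class data: drop setosa and its unused factor level
binary_iris <- droplevels(iris[iris$Species != "setosa", ])

ctrl <- trainControl(
  method = "cv", number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"   # keep held-out predictions in fit$pred
)

fit <- train(
  Species ~ Sepal.Length + Sepal.Width,
  data = binary_iris, method = "glm",
  trControl = ctrl, metric = "ROC"
)

# Mean of the five fold-wise AUCs: the number train() tunes on
mean_of_folds <- mean(fit$resample$ROC)

# One AUC pooled over every out-of-fold prediction
lev <- levels(binary_iris$Species)
pooled_roc <- pROC::roc(fit$pred$obs, fit$pred[[lev[1]]],
                        direction = ">", quiet = TRUE)
pooled <- as.numeric(pROC::auc(pooled_roc))

c(mean_of_folds = mean_of_folds, pooled = pooled)
```

The two values are usually close but rarely identical, because pooling compares scores across folds that came from different fitted models.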
twoClassSummary() treats levels(data$obs)[1] as the "event" label and computes Sensitivity against it. If your outcome's levels fall in the default alphabetical order, c("No", "Yes"), "No" becomes the positive class, which is almost never what you want. Always put the desired event level first with factor(x, levels = c("Yes", "No")).
twoClassSummary() syntax and arguments
The signature has three arguments and the data frame must carry four columns minimum. caret fixes the shape so any function plugged into summaryFunction is interchangeable across regression and classification.
The required argument is data, a data frame with obs (truth factor), pred (predicted factor), and one numeric column per factor level holding class probabilities. The probability column names must match the factor levels exactly. The lev argument is the vector of level names; caret passes it automatically, but you pass it yourself when calling outside train().
The return value is a length-three named numeric vector: ROC, Sens, Spec. train() rbinds one row per resample into fit$resample and averages into fit$results. Whichever name matches metric = in train() is used for tuning.
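A direct call outside train() shows the input shape and the returned vector. The sketch below uses a made-up six-row data frame; the level names "yes" and "no", the probabilities, and the 0.5 cut-off used to derive pred are all invented for illustration.

```r
library(caret)

lev <- c("yes", "no")   # event level first

# Toy scored data: truth plus the probability of the event class
df <- data.frame(
  obs = factor(c("yes", "yes", "yes", "no", "no", "no"), levels = lev),
  yes = c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1)
)
df$no   <- 1 - df$yes   # one probability column per level, named after it
df$pred <- factor(ifelse(df$yes >= 0.5, "yes", "no"), levels = lev)

res <- twoClassSummary(df, lev = lev)
round(res, 3)
# ROC is 8/9 (about 0.889); Sens and Spec are both 2/3 (about 0.667)
```

Eight of the nine event/non-event pairs are ranked correctly, which gives the AUC of 8/9, and each class has one of its three rows misclassified at the 0.5 cut-off.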
twoClassSummary() requires classProbs = TRUE in trainControl(). Without probabilities, the function has no way to compute ROC AUC. caret throws an error during train() setup if you wire summaryFunction = twoClassSummary while classProbs is left at its default of FALSE.
twoClassSummary() examples by use case
Four patterns cover almost every call: direct scoring, wiring into trainControl, optimising on ROC, and wrapping for added metrics. Each reuses the same function with slightly different framing.
Each row of fit$resample comes from one call to twoClassSummary() on that fold's held-out predictions. Optimising on ROC works because metric = "ROC" matches one of the names in the vector the function returns.
The ROCSD, SensSD, SpecSD columns are the standard deviations across folds, useful for spotting unstable models. caret picks the row with the maximum ROC as the best tune.
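The wiring and the ROC-based tuning can be sketched together on a model that actually has something to tune. This example assumes a k-nearest-neighbours model (method = "knn") with a hand-picked grid of k values on a two-class subset of iris; the data, seed, and grid are illustrative choices only.

```r
library(caret)

set.seed(42)
# Illustrative two-class data: versicolor vs virginica
binary_iris <- droplevels(iris[iris$Species != "setosa", ])

ctrl <- trainControl(
  method = "cv", number = 5,
  classProbs = TRUE,                  # required for the ROC column
  summaryFunction = twoClassSummary
)

fit <- train(
  Species ~ Sepal.Length + Sepal.Width,
  data = binary_iris, method = "knn",
  tuneGrid = data.frame(k = c(3, 5, 7)),
  trControl = ctrl,
  metric = "ROC"                      # must match a name the summary returns
)

fit$results    # one row per k: ROC, Sens, Spec plus ROCSD, SensSD, SpecSD
fit$bestTune   # the k with the highest mean ROC
fit$resample   # per-fold ROC/Sens/Spec for the selected k
```

By default fit$resample keeps only the folds of the selected tuning value, so it has five rows here rather than fifteen.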
A fourth pattern wraps twoClassSummary to add Precision and F1 without rewriting the binary core. Bind the new metrics to the named vector and keep the original three intact.
The wrapper preserves ROC, Sens, Spec and appends Precision and F1. Any function returning a named numeric vector is a valid summaryFunction, so this pattern is how you graft new metrics into caret without forking it.
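One way such a wrapper might look is sketched below. The name twoClassPlus is hypothetical, and Precision and F1 are computed by hand from the pred and obs columns rather than through any particular caret helper; the toy data frame exists only to exercise the function.

```r
library(caret)

# Hypothetical wrapper: twoClassSummary plus Precision and F1 for lev[1]
twoClassPlus <- function(data, lev = NULL, model = NULL) {
  base <- twoClassSummary(data, lev = lev, model = model)

  event <- lev[1]
  tp <- sum(data$pred == event & data$obs == event)
  fp <- sum(data$pred == event & data$obs != event)
  fn <- sum(data$pred != event & data$obs == event)

  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall    <- if (tp + fn == 0) NA_real_ else tp / (tp + fn)
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }

  # Keep the original three metrics and append the new ones by name
  c(base, Precision = precision, F1 = f1)
}

# Quick check on made-up scored data
lev <- c("yes", "no")
toy <- data.frame(
  obs  = factor(c("yes", "yes", "yes", "no", "no", "no"), levels = lev),
  pred = factor(c("yes", "yes", "no", "yes", "no", "no"), levels = lev),
  yes  = c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1),
  no   = c(0.1, 0.2, 0.6, 0.4, 0.8, 0.9)
)
plus <- twoClassPlus(toy, lev = lev)
names(plus)
```

Passing summaryFunction = twoClassPlus to trainControl() and metric = "F1" to train() would then tune on the appended metric.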
Inspect fit$resample at least once during model development. Per-fold ROC variance reveals fragility that the averaged row hides. A model with mean ROC 0.92 and per-fold range 0.78 to 0.99 is much less reliable than one with mean 0.90 and range 0.88 to 0.92.
twoClassSummary() vs alternatives
caret ships five summaryFunctions plus the postResample helper. The right one depends on outcome type, class balance, and which metrics drive the decision.
| summaryFunction | Outcome type | Returned metrics | Pick when |
|---|---|---|---|
| twoClassSummary | Two-class factor | ROC, Sens, Spec | Binary classification, balanced or moderately imbalanced |
| prSummary | Two-class factor | AUC (PR), Precision, Recall, F | Heavily imbalanced binary, precision-recall focus |
| multiClassSummary | 3+ class factor | Accuracy, Kappa, per-class metrics | Multinomial outcomes |
| mnLogLoss | Two- or multi-class | logLoss | Probability calibration matters more than hard label |
| defaultSummary | Numeric (regression) | RMSE, Rsquared, MAE | Regression resampling |
| postResample | Either | RMSE/Rsq/MAE or Accuracy/Kappa | Quick scoring of two vectors outside train() |
Pick twoClassSummary when the outcome is a two-level factor and ROC is the headline metric. If positives are rare (under 10 percent of rows), switch to prSummary so PR-AUC reflects ranking on the minority class. Use mnLogLoss for calibration, and confusionMatrix() for one-off scoring outside the resample loop.
Common pitfalls
Four mistakes account for most twoClassSummary() bugs. Each has a quick fix.
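The first is wiring summaryFunction = twoClassSummary while classProbs is still FALSE. The fix is to add classProbs = TRUE to trainControl(); caret cannot compute ROC AUC without a probability column for each class.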
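The second is calling the function on a data frame whose probability columns have arbitrary names such as prob1 and prob2. The probability columns must be named exactly after the factor levels. Rename prob1 to yes and prob2 to no and the call succeeds.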
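A minimal sketch of that rename, assuming a made-up scored data frame whose levels are "yes" and "no" and whose probability columns arrived as prob1 and prob2:

```r
library(caret)

lev <- c("yes", "no")

# Made-up scored data with badly named probability columns
scored <- data.frame(
  obs   = factor(c("yes", "no", "yes", "no"), levels = lev),
  pred  = factor(c("yes", "no", "no", "no"), levels = lev),
  prob1 = c(0.80, 0.30, 0.45, 0.10),   # probability of "yes"
  prob2 = c(0.20, 0.70, 0.55, 0.90)    # probability of "no"
)

# Rename so each probability column carries its level's name
names(scored)[names(scored) == "prob1"] <- "yes"
names(scored)[names(scored) == "prob2"] <- "no"

# Cheap guard before scoring: every level needs a matching column
stopifnot(all(lev %in% names(scored)))

res <- twoClassSummary(scored, lev = lev)
res
# ROC is 1, Sens is 0.5, Spec is 1 on this toy data
```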
The third is leaving the outcome levels in their default order. caret treats levels(obs)[1] as the positive class. Alphabetical ordering makes "No" the event, which swaps Sensitivity and Specificity relative to what the reader expects. Set the order explicitly: factor(x, levels = c("Yes", "No")).
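The level-order fix needs only base R. The vector x below is invented for illustration; relevel() is shown as an alternative when the factor already exists.

```r
x <- c("No", "Yes", "Yes", "No", "Yes")

# Default: levels are sorted alphabetically, so "No" comes first
default_f <- factor(x)
levels(default_f)

# Explicit order: the event class "Yes" is now levels()[1]
event_first <- factor(x, levels = c("Yes", "No"))
levels(event_first)

# Same result for an existing factor, without retyping the levels
releveled <- relevel(default_f, ref = "Yes")
levels(releveled)
```

Checking levels() on the outcome just before calling train() is a cheap way to confirm which class will be scored as the event.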
The fourth is using level names that are not valid R names. caret uses level names as column names for class probabilities. Spaces, dashes, and numeric starts trigger train() errors when columns are bound to the data frame. Use make.names() or rename levels to syntactic names: levels(truth_bad) <- c("class0", "class1").
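Both fixes can be tried in base R. The factor truth_bad below is a made-up outcome coded as "0" and "1", and the second vector shows how make.names() treats spaces and dashes.

```r
# Made-up outcome with numeric-looking level names
truth_bad <- factor(c("0", "1", "1", "0"))
levels(truth_bad)

# make.names() shows what R would turn them into
make.names(levels(truth_bad))               # "X0" "X1"
make.names(c("high risk", "low-risk"))      # "high.risk" "low.risk"

# Option 1: accept the mechanical fix
truth_auto <- truth_bad
levels(truth_auto) <- make.names(levels(truth_auto))

# Option 2: pick readable names yourself
truth_good <- truth_bad
levels(truth_good) <- c("class0", "class1")

# Names are syntactic when make.names() leaves them unchanged
all(levels(truth_good) == make.names(levels(truth_good)))
```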
Try it yourself
Try it: Build a 5-fold CV pipeline on binary_iris (versicolor vs virginica) predicting Species from Sepal.Length and Sepal.Width with glm. Wire twoClassSummary explicitly into trainControl() and tune on ROC. Save the mean ROC to ex_roc.
Click to reveal solution
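One possible solution is sketched here. The construction of binary_iris and the seed are assumptions made for this sketch, and the exact ROC value will vary with the seed and the fold assignment.

```r
library(caret)

set.seed(2024)
# Assumed definition: versicolor vs virginica, unused level dropped
binary_iris <- droplevels(
  iris[iris$Species %in% c("versicolor", "virginica"), ]
)

ctrl <- trainControl(
  method = "cv", number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE
)

fit <- train(
  Species ~ Sepal.Length + Sepal.Width,
  data = binary_iris, method = "glm",
  trControl = ctrl, metric = "ROC"
)

# Mean of the five fold-wise AUCs
ex_roc <- mean(fit$resample$ROC)
ex_roc
```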
Explanation: Restricting predictors to sepal length and width strips out the highly discriminative petal features, so ROC drops from near 1 (with all four predictors) to around 0.80. Averaging fit$resample$ROC reproduces the mean-of-folds AUC that fit$results$ROC reports.
Related caret functions
The metric machinery sits one call away:
- defaultSummary() for regression resample scoring. See caret defaultSummary() in R.
- postResample() for two-vector scoring outside the resample loop. See caret postResample() in R.
- trainControl() for swapping summaryFunctions and configuring resamples. See caret trainControl() in R.
- train() for the resample-and-tune driver that calls twoClassSummary. See caret train() in R.
- confusionMatrix() for the full classification scorecard with 15+ metrics. See caret confusionMatrix() in R.
For the upstream reference, see the caret package documentation.
FAQ
What does twoClassSummary() return?
For a data frame with a two-level factor obs, a matching pred factor, and one probability column per level, twoClassSummary() returns a length-three named numeric vector: ROC, Sens, and Spec. ROC is the area under the ROC curve computed with pROC::roc(), Sensitivity is the recall on the first factor level (the event), and Specificity is the recall on the second level. caret rbinds one such row per fold into fit$resample and averages into fit$results.
Why does my ROC come out as 0.5 or look reversed?
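The first level of the outcome factor is treated as the positive class. If your factor is c("No", "Yes") alphabetically, "No" is positive and the probability column caret reads is "No", so Sensitivity and Specificity swap roles and the summary looks reversed. The AUC itself should not change with level order when the two probability columns are complementary, so an AUC near 0.5 more likely points to a model with little ranking ability on that fold. The fix for the reversed metrics is to set levels explicitly with factor(x, levels = c("Yes", "No")) so the event class is first. Always inspect levels(data$obs) before training.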
twoClassSummary or prSummary for imbalanced data?
Use prSummary when the positive class is under roughly 10 percent of rows. The ROC curve under twoClassSummary is dominated by the abundant negative class, so a model that ranks the rare positives poorly can still report a high AUC. The precision-recall AUC that prSummary returns is more sensitive to performance on the minority class. For moderately imbalanced data (20 to 40 percent positive), twoClassSummary remains the standard choice.
Can I tune on Sensitivity instead of ROC?
Yes. Pass metric = "Sens" and maximize = TRUE to train(); caret picks the hyperparameter row with the highest mean Sensitivity. Tuning on Sens alone is risky because a model that predicts everything as the event scores Sens = 1. Pair it with a Specificity floor, or define a custom metric that combines both.
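As one example of a combined metric, the sketch below appends Youden's J (Sens + Spec - 1) to the standard output and tunes on it. The wrapper name sensSpecSummary, the knn model, the grid of k values, and the iris subset are all assumptions made for illustration; J is just one of several reasonable ways to combine the two rates.

```r
library(caret)

# Hypothetical wrapper: twoClassSummary plus Youden's J
sensSpecSummary <- function(data, lev = NULL, model = NULL) {
  base <- twoClassSummary(data, lev = lev, model = model)
  c(base, J = unname(base["Sens"] + base["Spec"] - 1))
}

set.seed(7)
binary_iris <- droplevels(iris[iris$Species != "setosa", ])

ctrl <- trainControl(
  method = "cv", number = 5,
  classProbs = TRUE,
  summaryFunction = sensSpecSummary
)

fit <- train(
  Species ~ Sepal.Length + Sepal.Width,
  data = binary_iris, method = "knn",
  tuneGrid = data.frame(k = c(3, 5, 7)),
  trControl = ctrl,
  metric = "J", maximize = TRUE   # tune on the combined metric
)

fit$results[, c("k", "Sens", "Spec", "J")]
```

A model that labels every row as the event scores Sens = 1 but Spec = 0, so its J is 0 and it no longer wins the tuning step.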
How does twoClassSummary differ from confusionMatrix()?
twoClassSummary(data, lev) returns three resample-friendly metrics from a frame with probability columns; confusionMatrix(pred, obs) returns 15-plus metrics from two factor vectors. twoClassSummary is what train() calls per fold; confusionMatrix is what you call once on a held-out test set for the final report.