caret multiClassSummary() in R: Multi-Class Metrics
The multiClassSummary() function in caret is the multinomial summaryFunction: once you pass it to trainControl(), train() calls it on every resample. It is meant for factor outcomes with 3 or more levels. It accepts a data frame with obs, pred, and one probability column per class, then returns 14 metrics including macro-averaged F1, Sensitivity, Specificity, mean one-vs-all AUC, and logLoss. Wiring it in is the standard way to tune a caret model on AUC or logLoss for multi-class problems.
    multiClassSummary(df, lev = levels(df$obs))                           # direct call
    trainControl(summaryFunction = multiClassSummary, classProbs = TRUE)  # wire-in
    train(..., metric = "AUC")                                            # optimise mean one-vs-all AUC
    train(..., metric = "logLoss", maximize = FALSE)                      # optimise logLoss (smaller is better)
    train(..., metric = "Mean_F1")                                        # optimise macro F1
    fit$resample[, c("AUC", "logLoss", "Mean_F1")]                        # per-fold metrics
    levels(factor_outcome)                                                # all class labels
Need explanation? Read on for examples and pitfalls.
What multiClassSummary() does in one sentence
multiClassSummary() is caret's multinomial scoring contract. train() calls it on each fold's held-out predictions when it is set as the summaryFunction, and it is intended for factor outcomes with 3 or more levels. With classProbs = TRUE it returns 14 metrics in a named numeric vector: logLoss, mean one-vs-all AUC, prAUC, Accuracy, Kappa, and nine macro-averaged class metrics.
The Mean_ prefix on most returned names signals macro averaging: caret computes the metric one-vs-all for each class, then takes the unweighted mean across classes. For iris with three species, Mean_F1 is the simple average of three per-class F1 scores.
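The macro average described above can be reproduced by hand. The sketch below uses a made-up 10-row, 3-class example (the labels a, b, c and the predictions are illustrative, not from the article) and compares the manual mean of per-class F1 with what multiClassSummary() reports. No probability columns are supplied, so only the confusion-matrix metrics are returned.

```r
library(caret)

# Toy truth and predictions, invented purely for illustration
lev  <- c("a", "b", "c")
obs  <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "c"), levels = lev)
pred <- factor(c("a", "a", "b", "b", "b", "c", "c", "c", "c", "a"), levels = lev)

# byClass holds one row per class (one-vs-all) and one column per metric
cm <- confusionMatrix(pred, obs, mode = "everything")
per_class_f1 <- cm$byClass[, "F1"]

# Macro average: unweighted mean of the per-class scores
macro_f1 <- mean(per_class_f1)

# Same number from the summary function (no probability columns,
# so logLoss, AUC and prAUC are not part of this result)
summary_stats <- multiClassSummary(data.frame(obs = obs, pred = pred), lev = lev)

print(per_class_f1)
print(macro_f1)
print(summary_stats["Mean_F1"])
```

Every class contributes equally to macro_f1 regardless of how many rows it has, which is the property the Mean_ prefix signals.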
Macro averaging gives every class equal weight in Mean_F1. For imbalanced multinomials this is usually the right headline because it surfaces failure on rare classes; if you need frequency-weighted averages, compute them from confusionMatrix() directly.
multiClassSummary() syntax and arguments
The signature has three arguments and the data frame must carry obs, pred, and one probability column per class. caret fixes the shape so any summaryFunction is interchangeable across binary, multi-class, and regression.
The required argument is data: a data frame with obs (truth factor), pred (predicted factor), and one numeric column per factor level holding class probabilities. Probability column names must match factor levels exactly. The lev argument is the level vector; train() passes it automatically, but you pass it yourself when calling outside the resample loop.
The return is a named numeric vector of length 14. train() rbinds one row per resample into fit$resample and averages into fit$results. Any returned name is a valid metric = value for tuning.
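A direct call outside the resample loop might look like the sketch below. The probabilities are synthetic (random scores nudged toward the true class), so the metric values mean nothing; the point is the shape of the input data frame and of the returned vector. It assumes the pROC and MLmetrics packages are installed, since the probability-based metrics need them.

```r
library(caret)

set.seed(42)
lev <- levels(iris$Species)

# Synthetic class probabilities (assumption for illustration only):
# uniform noise, boosted for the true class, then row-normalised
raw <- matrix(runif(nrow(iris) * length(lev)),
              ncol = length(lev), dimnames = list(NULL, lev))
true_idx <- cbind(seq_len(nrow(iris)), as.integer(iris$Species))
raw[true_idx] <- raw[true_idx] + 0.5
probs <- raw / rowSums(raw)

# obs, pred, and one probability column named after each factor level
scored <- data.frame(
  obs  = iris$Species,
  pred = factor(lev[max.col(probs, ties.method = "first")], levels = lev),
  probs
)

# lev must be passed by hand when calling outside train()
stats <- multiClassSummary(scored, lev = lev)
print(names(stats))
print(round(stats, 3))
```

Any of the names printed here can be used as the metric argument of train().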
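Confirm the helper package is installed with requireNamespace("MLmetrics", quietly = TRUE) before a long batch run; current caret releases call MLmetrics for prAUC and pROC for AUC.
multiClassSummary() examples by use case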
Three patterns cover the common calls: wire-in for CV, tune on AUC, tune on logLoss for calibration. Each reuses the same function with different metric =.
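The first pattern, wiring the function in and tuning on AUC, could be sketched as follows. The random forest method, the 5-fold setup, the seed, and the mtry grid are assumptions chosen for illustration; the sketch also assumes the randomForest and MLmetrics packages are installed.

```r
library(caret)

set.seed(123)

# Wire-in: probabilities on, multiClassSummary as the scorer
ctrl <- trainControl(
  method          = "cv",
  number          = 5,
  classProbs      = TRUE,
  summaryFunction = multiClassSummary
)

# Tune mtry on mean one-vs-all AUC (higher is better)
fit_auc <- train(
  Species ~ ., data = iris,
  method    = "rf",
  trControl = ctrl,
  metric    = "AUC",
  tuneGrid  = data.frame(mtry = 1:3)
)

# One row per candidate mtry, averaged over the folds
print(fit_auc$results[, c("mtry", "AUC", "logLoss", "Mean_F1")])
print(fit_auc$bestTune)
```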
Each row is the mean across five folds. caret picks the mtry with the highest AUC because metric = "AUC" matches a returned column. The full results frame also carries standard-deviation columns (AUCSD, logLossSD, etc.) for assessing stability.
Pass maximize = FALSE whenever the metric improves as it decreases. logLoss penalises confidently wrong predictions far harder than wrongly ranked ones, so tuning on it produces better-calibrated probabilities than tuning on AUC.
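A minimal sketch of tuning on logLoss, under the same assumptions as before (random forest on iris, 5-fold CV, randomForest and MLmetrics installed; model and grid are illustrative choices):

```r
library(caret)

set.seed(123)

ctrl <- trainControl(
  method          = "cv",
  number          = 5,
  classProbs      = TRUE,
  summaryFunction = multiClassSummary
)

# logLoss improves as it shrinks, so ask caret to minimise it
fit_ll <- train(
  Species ~ ., data = iris,
  method    = "rf",
  trControl = ctrl,
  metric    = "logLoss",
  maximize  = FALSE,
  tuneGrid  = data.frame(mtry = 1:3)
)

print(fit_ll$results[, c("mtry", "logLoss", "AUC")])
print(fit_ll$bestTune)
```

The selected mtry is the row with the smallest mean logLoss, which need not be the row with the highest AUC.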
Per-fold rows reveal fragility the averaged row hides. A logLoss range from 0.08 to 0.20 across folds is wide for iris, hinting that one fold contains the harder versicolor-virginica boundary cases.
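Per-fold inspection could look like the sketch below. The model and the fixed mtry = 2 are illustrative assumptions; the numbers you get will differ from the ranges quoted in the text, and the sketch assumes randomForest and MLmetrics are installed.

```r
library(caret)

set.seed(123)

ctrl <- trainControl(
  method          = "cv",
  number          = 5,
  classProbs      = TRUE,
  summaryFunction = multiClassSummary
)

fit <- train(
  Species ~ ., data = iris,
  method    = "rf",
  trControl = ctrl,
  metric    = "logLoss",
  maximize  = FALSE,
  tuneGrid  = data.frame(mtry = 2)
)

# One row per held-out fold for the selected tuning value
per_fold <- fit$resample[, c("Resample", "AUC", "logLoss", "Mean_F1")]
print(per_fold)

# Spread across folds: range and standard deviation per metric
print(sapply(per_fold[, c("AUC", "logLoss", "Mean_F1")], range))
print(sapply(per_fold[, c("AUC", "logLoss", "Mean_F1")], sd))
```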
Inspect fit$resample before reporting averages. A model with mean Mean_F1 of 0.95 and per-fold range 0.83 to 1.00 is meaningfully less reliable than one with mean 0.94 and range 0.93 to 0.95. Variance matters for stakeholder trust.
multiClassSummary() vs alternatives
caret ships five summaryFunctions; choose by outcome type and which metric drives the decision. Each returns a different named vector shape.
| summaryFunction | Outcome | Returned metrics | Pick when |
|---|---|---|---|
| multiClassSummary | 3+ class factor | 14 metrics: AUC, logLoss, prAUC, Mean_F1, ... | Multinomial classification, macro-averaged headlines |
| twoClassSummary | Two-class factor | ROC, Sens, Spec | Binary classification with ROC focus |
| prSummary | Two-class factor | PR AUC, Precision, Recall, F | Heavily imbalanced binary |
| mnLogLoss | Two- or multi-class | logLoss | Probability calibration is the only metric |
| defaultSummary | Numeric | RMSE, Rsquared, MAE | Regression resampling |
Pick multiClassSummary when the outcome has 3 or more levels and a full scorecard at every resample is useful. If only logLoss matters, mnLogLoss is lighter and avoids the MLmetrics and pROC calls that multiClassSummary makes for prAUC and AUC. For one-vs-rest binary scoring on a single class of interest, recode the outcome to a two-level factor and use twoClassSummary.
Common pitfalls
Four mistakes account for nearly every multiClassSummary() bug. Each has a quick fix.
Forgetting classProbs = TRUE. The fix is to set classProbs = TRUE in trainControl(). Without per-class probability columns, AUC, logLoss, and prAUC cannot be computed, so train() cannot tune on them.
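Misnamed probability columns in a direct call. Rename prob1, prob2, prob3 to a, b, c so that each probability column carries the name of its factor level. caret indexes probability columns by level name; mismatched names trigger the missing-columns error.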
Factor levels that are not valid R names. caret reuses factor level names as column names for the probability matrix. Spaces, dashes, and numeric starts break train() when probability columns are bound to the data frame. Call levels(x) <- make.names(levels(x)) to convert to class.1, class.2, class.3 automatically.
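The level clean-up uses only base R. In the sketch below the labels are invented to show one case each of a space, a dash, and a leading digit:

```r
# Hypothetical outcome with labels that are not valid R names
outcome <- factor(c("class 1", "class-2", "3rd", "class 1"))
print(levels(outcome))

# make.names() replaces illegal characters with "." and prefixes
# an "X" to names that start with a digit
levels(outcome) <- make.names(levels(outcome))
print(levels(outcome))

# Every level is now safe to use as a probability column name
all_valid <- all(levels(outcome) == make.names(levels(outcome)))
print(all_valid)
```

Run this before calling train() so that the cleaned labels are the ones caret sees in both the outcome and the probability columns.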
NaN in Mean_F1. When no rows are predicted as a class, that class's precision is 0/0 and F1 inherits the NaN. Mean_F1 then averages a NaN and returns NaN. For tiny held-out samples, switch the tuning metric to Mean_Balanced_Accuracy or Accuracy, which remain finite when a class is missing from predictions.
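The NaN propagation is plain arithmetic and can be traced in base R. The counts and the two healthy F1 values below are made up for illustration:

```r
# One class that the model never predicted on this fold
tp <- 0  # true positives
fp <- 0  # false positives: no rows were predicted as this class
fn <- 4  # false negatives: the class does occur in the truth

precision <- tp / (tp + fp)  # 0 / 0
recall    <- tp / (tp + fn)  # 0 / 4
f1_missing <- 2 * precision * recall / (precision + recall)

# Two well-behaved classes plus the missing one
per_class_f1 <- c(a = 0.90, b = 0.85, c = f1_missing)
macro_f1 <- mean(per_class_f1)

print(precision)
print(f1_missing)
print(macro_f1)
```

A single undefined per-class score is enough to make the macro average undefined, which is why one bad fold can leave Mean_F1 unusable for tuning.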
For per-class values, inspect confusionMatrix()$byClass directly.
Try it yourself
Try it: Train a k-nearest-neighbours classifier on the full 3-class iris with 10-fold CV. Wire multiClassSummary explicitly into trainControl(), tune k over c(3, 5, 7, 9) on logLoss, and save the chosen k to ex_k.
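One possible solution to the exercise, as a sketch rather than the article's original answer. The seed and the object name ex_fit are arbitrary choices, the chosen k can vary with the seed, and the MLmetrics package is assumed to be installed.

```r
library(caret)

set.seed(2024)

# 10-fold CV with multiClassSummary wired in explicitly
ex_ctrl <- trainControl(
  method          = "cv",
  number          = 10,
  classProbs      = TRUE,
  summaryFunction = multiClassSummary
)

# k-nearest neighbours, tuned on logLoss (smaller is better)
ex_fit <- train(
  Species ~ ., data = iris,
  method    = "knn",
  trControl = ex_ctrl,
  metric    = "logLoss",
  maximize  = FALSE,
  tuneGrid  = data.frame(k = c(3, 5, 7, 9))
)

# Save the selected neighbourhood size
ex_k <- ex_fit$bestTune$k
print(ex_fit$results[, c("k", "logLoss", "Accuracy")])
print(ex_k)
```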
Explanation: Iris is small and clean, so larger k smooths the predicted probability vectors, which lowers logLoss even when held-out accuracy is identical. Tuning on logLoss rather than Accuracy rewards better-calibrated probabilities.
Related caret functions
The summaryFunction family lives in caret's resampling layer:
- twoClassSummary() for binary outcomes with ROC, Sens, Spec. See caret twoClassSummary() in R.
- defaultSummary() for regression resample scoring. See caret defaultSummary() in R.
- postResample() for two-vector scoring outside the resample loop. See caret postResample() in R.
- trainControl() for wiring custom summaryFunctions. See caret trainControl() in R.
- confusionMatrix() for the per-class scorecard with 15+ metrics. See caret confusionMatrix() in R.
For implementation details, see the caret performance documentation.
FAQ
What metrics does multiClassSummary() return?
multiClassSummary() returns a named numeric vector with 14 elements: logLoss, AUC (mean one-vs-all), prAUC, Accuracy, Kappa, Mean_F1, Mean_Sensitivity, Mean_Specificity, Mean_Pos_Pred_Value, Mean_Neg_Pred_Value, Mean_Precision, Mean_Recall, Mean_Detection_Rate, and Mean_Balanced_Accuracy. The Mean_ prefix indicates a macro-averaged metric: caret computes the metric one-vs-all for each class, then takes the unweighted mean across classes.
How is multiClassSummary AUC computed for 3+ classes?
caret computes a one-vs-all ROC AUC for each class (treating that class as positive and all others as negative), then returns the unweighted mean. For iris with three species, the reported AUC is the average of three binary AUCs. This one-vs-all average is not the pairwise Hand and Till estimator; if you need that, compute it with pROC::multiclass.roc() on the same probability matrix.
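The two estimators can be compared on one probability matrix. The sketch below fits a multinomial model with nnet::multinom() purely to obtain probabilities (an illustrative choice, scored in-sample, so the values are optimistic). The one-vs-all loop mirrors the description above and is not a copy of caret's internal code.

```r
library(pROC)
library(nnet)

# Any model that yields a probability matrix would do here
fit_mn <- multinom(Species ~ ., data = iris, trace = FALSE)
probs  <- predict(fit_mn, type = "probs")  # columns named by class level

# One-vs-all: one binary AUC per class, then the unweighted mean
ova_auc <- sapply(levels(iris$Species), function(cl) {
  is_class <- as.integer(iris$Species == cl)
  as.numeric(auc(roc(response = is_class, predictor = probs[, cl],
                     levels = c(0, 1), direction = "<", quiet = TRUE)))
})
mean_ova_auc <- mean(ova_auc)

# Pairwise multi-class AUC from pROC
pairwise     <- multiclass.roc(response = iris$Species, predictor = probs)
pairwise_auc <- as.numeric(pairwise$auc)

print(ova_auc)
print(mean_ova_auc)
print(pairwise_auc)
```

The two numbers are usually close on well-separated data but answer different questions: one class against the rest versus every pair of classes against each other.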
Should I tune on AUC or logLoss for multi-class?
Tune on AUC when ranking matters and the downstream consumer thresholds probabilities. Tune on logLoss when calibration matters, because logLoss penalises confidently wrong predictions much harder than wrongly ranked ones. Both metrics require classProbs = TRUE. logLoss scores the probability each row assigns to its true class; AUC scores how well each class's probability column ranks that class against the rest. For most projects, train one model on each metric and compare on a held-out test set.
Why is Mean_F1 sometimes NaN in my results?
Mean_F1 averages per-class F1 scores, and F1 = 2 × Precision × Recall / (Precision + Recall). If a class has zero predicted instances on a fold, Precision is 0/0 = NaN, which propagates through F1 and the mean. This usually means the held-out set is too small or the model collapses to a single class on that fold. Switch to Mean_Balanced_Accuracy for a more stable headline on tiny samples.
Does multiClassSummary work for 2-class outcomes?
Yes, but it is rarely the best choice. With a two-level factor, multiClassSummary still returns 14 metrics, but the class-wise ones are no longer macro averages: caret drops the Mean_ prefix (F1, Sensitivity, Specificity, and so on) and reports each metric once, with the first factor level as the positive class. twoClassSummary returns the three classical metrics directly and is cheaper. Reserve multiClassSummary for factors with 3 or more levels.