caret confusionMatrix() in R: Evaluate Classification Models
The confusionMatrix() function in caret takes predicted and observed class labels and returns the confusion matrix plus a battery of classification metrics: accuracy, Kappa, sensitivity, specificity, PPV, NPV, F1, balanced accuracy, and prevalence. It works for two-class and multi-class outcomes from a single call.
confusionMatrix(pred, ref)                           # default, picks first level as positive
confusionMatrix(pred, ref, positive = "yes")         # set positive class
confusionMatrix(table(pred, ref))                    # pass a precomputed table
confusionMatrix(pred, ref, mode = "prec_recall")     # F1, precision, recall
confusionMatrix(pred, ref, mode = "everything")      # print all per-class metrics
confusionMatrix(pred, ref)$byClass["F1"]             # extract one metric
cm <- confusionMatrix(pred, ref); cm$table           # the raw 2x2 (or k x k) matrix
Need explanation? Read on for examples and pitfalls.
What confusionMatrix() does in one sentence
confusionMatrix() is caret's classifier scorecard. You hand it a vector of predicted class labels and a vector of reference (true) labels, and it cross-tabulates them, then derives every standard classification metric from that table in one call. The return object holds the raw table, overall metrics (accuracy with its confidence interval, plus Kappa), and per-class metrics (sensitivity, specificity, PPV, NPV, prevalence, balanced accuracy).
The function does not need a fitted model. It only sees labels, so it works with any classifier in R, not just caret-trained ones. Pass the class labels predicted by glm, randomForest, xgboost, or nnet and you get the same scorecard. The same call covers two-class and multi-class problems: for two classes byClass is a named vector for the positive class, and for k classes it becomes a matrix with k rows, one per class treated as the positive label.
confusionMatrix() syntax and arguments
The signature accepts predictions in two shapes: vectors of labels, or a precomputed contingency table. Both produce the same output object.
The two main calling styles are:
confusionMatrix(data, reference, positive = NULL, dnn = c("Prediction", "Reference"),
mode = "sens_spec", ...)
confusionMatrix(table, positive = NULL, prevalence = NULL, mode = "sens_spec", ...)
- data: predicted class labels (factor). Same length and levels as reference.
- reference: ground-truth class labels (factor).
- positive: which factor level counts as the positive class for two-class metrics. Defaults to the first level, which is rarely what you want.
- table: a precomputed table() of predictions versus reference, passed as the first argument when you already cross-tabbed the labels yourself.
- mode: "sens_spec" (default), "prec_recall", or "everything". Controls which per-class metrics get reported.
- prevalence: optional override of the base rate. Useful when the test set under- or over-samples the positive class.
A prediction factor with three levels (c("a","b","c")) scored against a reference factor with two (c("a","b")) raises the error "the data cannot have more levels than the reference". Coerce both with factor(..., levels = expected_levels) before the call.
confusionMatrix() examples by use case
1. Two-class confusion matrix with default metrics
Build a binary outcome from iris and score a logistic-regression-style classifier. The default mode = "sens_spec" reports accuracy, Kappa, sensitivity, and specificity.
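A minimal sketch of such an example, assuming a hypothetical binary target is_virginica derived from iris and a plain glm() logistic regression on the two sepal measurements. The seed, the 100/50 split, and the 0.5 cutoff are illustrative choices, not anything caret prescribes.

```r
library(caret)

set.seed(42)

# Hypothetical binary outcome: is the flower virginica?
d <- iris[, c("Sepal.Length", "Sepal.Width")]
d$is_virginica <- factor(ifelse(iris$Species == "virginica", "yes", "no"),
                         levels = c("no", "yes"))

# Simple random split: 100 rows for training, 50 for testing
idx     <- sample(nrow(d), 100)
train_d <- d[idx, ]
test_d  <- d[-idx, ]

# With levels c("no", "yes"), glm() models the probability of "yes"
fit  <- glm(is_virginica ~ ., data = train_d, family = binomial)
prob <- predict(fit, newdata = test_d, type = "response")

# Convert probabilities to hard labels that share the reference levels
ref  <- test_d$is_virginica
pred <- factor(ifelse(prob > 0.5, "yes", "no"), levels = levels(ref))

cm <- confusionMatrix(pred, ref, positive = "yes")
cm$overall[c("Accuracy", "Kappa")]
cm$byClass[c("Sensitivity", "Specificity")]
```

Printing cm shows the table followed by the overall and per-class sections.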
The overall section gives accuracy with a 95 percent exact-binomial confidence interval; cm$byClass gives sensitivity, specificity, PPV, NPV, prevalence, and balanced accuracy. Kappa adjusts accuracy for chance agreement, which matters when one class dominates. A Kappa near 1 means agreement well beyond chance; a Kappa near 0 means the model is no better than guessing the majority class.
2. Multi-class confusion matrix
For three or more classes, caret reports metrics per class instead of for one positive class. The same call works; only the output table widens.
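A minimal three-class sketch on iris. MASS::lda() is used here only as a convenient stand-in classifier (an assumption for illustration); any model that returns class labels would do. The seed and split are arbitrary.

```r
library(caret)

set.seed(42)
idx     <- sample(nrow(iris), 100)
train_d <- iris[idx, ]
test_d  <- iris[-idx, ]

# Stand-in classifier: linear discriminant analysis from the MASS package
fit  <- MASS::lda(Species ~ ., data = train_d)
pred <- predict(fit, newdata = test_d)$class
ref  <- test_d$Species

cm_multi <- confusionMatrix(pred, ref)
cm_multi$table                                        # 3 x 3 contingency table
cm_multi$byClass[, c("Sensitivity", "Specificity")]   # one row per class

# Macro average: unweighted mean of the per-class values
macro_sens <- mean(cm_multi$byClass[, "Sensitivity"])
```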
Each row of cm_multi$byClass is one class treated as positive against everything else (one-vs-rest). The per-class view exposes asymmetries a single accuracy number hides. For multi-class problems the positive argument is ignored; macro and micro averages must be computed from byClass manually.
3. Set the positive class explicitly
The default positive class is the first factor level, alphabetical for character data. That is rarely the medically or commercially interesting class. Always set positive for binary problems.
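A small hand-built sketch of the difference. The ten labels below are made up for illustration so the numbers can be checked by eye: for "yes" there are 3 true positives, 1 false negative, 1 false positive, and 5 true negatives.

```r
library(caret)

# Made-up labels: 6 true "no", 4 true "yes"
ref  <- factor(rep(c("no", "yes"), times = c(6, 4)), levels = c("no", "yes"))
pred <- factor(c("no", "no", "no", "no", "no", "yes",
                 "yes", "yes", "yes", "no"),
               levels = c("no", "yes"))

cm_default <- confusionMatrix(pred, ref)                    # positive = "no" (first level)
cm_yes     <- confusionMatrix(pred, ref, positive = "yes")  # positive = "yes"

cm_default$positive
cm_default$byClass[c("Sensitivity", "Specificity")]
cm_yes$byClass[c("Sensitivity", "Specificity")]
```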
The matrix is identical either way; only the class treated as positive changes, so the values reported as sensitivity and specificity (and as PPV and NPV) swap. Mislabeling the positive class is the most common reason metrics look reasonable for the wrong reason.
4. Get F1, precision, and recall
Switch to mode = "prec_recall" for precision, recall, and F1, or mode = "everything" to report every per-class metric.
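A minimal sketch with made-up labels, chosen so that for "yes" there are 3 true positives, 1 false negative, 2 false positives, and 4 true negatives. It also recomputes F1 by hand from precision and recall as a sanity check.

```r
library(caret)

# Made-up labels: 4 true "yes" followed by 6 true "no"
ref  <- factor(rep(c("yes", "no"), times = c(4, 6)), levels = c("no", "yes"))
pred <- factor(c("yes", "yes", "yes", "no",
                 "yes", "yes", "no", "no", "no", "no"),
               levels = c("no", "yes"))

cm_pr <- confusionMatrix(pred, ref, positive = "yes", mode = "prec_recall")
cm_pr$byClass[c("Precision", "Recall", "F1")]

# F1 by hand: harmonic mean of precision and recall
p <- unname(cm_pr$byClass["Precision"])   # 3 / (3 + 2)
r <- unname(cm_pr$byClass["Recall"])      # 3 / (3 + 1)
f1_by_hand <- 2 * p * r / (p + r)
```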
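Precision is the same as positive predictive value (PPV); recall is the same as sensitivity. caret reports them under those names with mode = "prec_recall" to match the information-retrieval convention. F1 is the harmonic mean of precision and recall, so it drops sharply if either is weak. The mode = "everything" option reports all 11 per-class metrics at once: sensitivity, specificity, PPV, NPV, precision, recall, F1, prevalence, detection rate, detection prevalence, and balanced accuracy. The mode argument only changes what is printed; cm$byClass holds the same metrics whichever mode you pick.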
5. Pass a precomputed table
If you already have a contingency table from table(), hand it in directly. Useful when predictions and labels arrive as a precomputed cross-tabulation.
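A minimal sketch, assuming a made-up 2x2 table of counts with predictions in the rows and the reference in the columns. The dimnames labels "Prediction" and "Reference" mirror caret's defaults but are optional.

```r
library(caret)

# Made-up counts: rows are predictions, columns are the reference
tab <- as.table(matrix(c(40,  5,
                         10, 45),
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Prediction = c("yes", "no"),
                                       Reference  = c("yes", "no"))))

cm_tab <- confusionMatrix(tab, positive = "yes")
cm_tab$overall[c("Accuracy", "AccuracyNull", "AccuracyPValue")]
cm_tab$byClass[c("Sensitivity", "Specificity")]
```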
AccuracyPValue tests whether observed accuracy beats the no-information rate (always predicting the majority class). A low p-value means the model beats a constant predictor. The rows of the input table must be the predictions and the columns the reference; transpose the table and the value reported as sensitivity is really PPV, and the value reported as specificity is really NPV.
confusionMatrix() vs alternatives
caret's confusionMatrix() is the most comprehensive scorecard; tidymodels' yardstick::conf_mat() and base table() cover the simpler cases. The choice usually comes down to how much downstream tidyverse code consumes the output.
| Tool | Returns | Multi-class metrics | Confidence intervals on accuracy |
|---|---|---|---|
| caret::confusionMatrix() | List of overall + per-class metrics | Yes, one-vs-rest per class | Yes, exact binomial |
| yardstick::conf_mat() | A tibble-friendly conf_mat object | Yes, via summary() | No (use yardstick::accuracy() separately) |
| base::table() | Raw contingency table | None | None |
| MLmetrics::ConfusionMatrix() | A plain matrix | Limited | None |
Reach for confusionMatrix() when you want one call and a complete report. Use yardstick::conf_mat() when the rest of your code is tidymodels or you want to pipe into ggplot. The metric numbers are the same across libraries; what changes is the output shape. See the official caret reference at topepo.github.io/caret/measuring-performance.html for the full metric list.
Common pitfalls
Pitfall 1: factor levels mismatch between prediction and reference. If the model never predicted some classes, those levels are missing from a prediction factor built from the raw labels. Depending on the caret version, confusionMatrix() then either warns and realigns the levels or refuses to compare. Force the levels yourself: pred <- factor(pred, levels = levels(ref)).
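A minimal sketch of that fix, using made-up labels in which the classifier never predicts class "c". The name raw_pred is a hypothetical stand-in for whatever character vector your model returns.

```r
library(caret)

ref      <- factor(c("a", "b", "c", "a", "b", "c"))
raw_pred <- c("a", "b", "a", "a", "b", "b")   # "c" is never predicted

# factor(raw_pred) alone would only have levels "a" and "b";
# forcing the reference levels keeps the empty "c" row in the table
pred <- factor(raw_pred, levels = levels(ref))

cm_lv <- confusionMatrix(pred, ref)
cm_lv$table
```

The "c" row of the table is all zeros, which is exactly the information a dropped level would hide.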
Pitfall 2: leaving the positive class on the default first level. With levels c("no", "yes"), "no" becomes the positive class, so the value reported as sensitivity is really the specificity for "yes" and vice versa, leading to plausible but inverted conclusions. Always pass positive = "yes" (or your domain-relevant level) for binary outcomes.
Pitfall 3: comparing accuracy across imbalanced classes. If 95 percent of records are negative, predicting "negative" every time scores 95 percent accuracy. Kappa (in overall) and balanced accuracy and F1 (in byClass) are designed for that case; a large AccuracyPValue flags it.
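A minimal sketch of that scenario with made-up data: 95 negatives, 5 positives, and a "classifier" that always answers "negative".

```r
library(caret)

ref  <- factor(rep(c("negative", "positive"), times = c(95, 5)),
               levels = c("negative", "positive"))
pred <- factor(rep("negative", 100), levels = levels(ref))  # constant predictor

cm_imb <- confusionMatrix(pred, ref, positive = "positive")

cm_imb$overall[c("Accuracy", "Kappa", "AccuracyPValue")]  # high accuracy, Kappa of zero
cm_imb$byClass["Balanced Accuracy"]                       # 0.5: no better than chance
# F1 is not defined here because nothing was predicted positive
```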
Pitfall 4: treating one test matrix as the model's quality. A single train/test split gives one matrix; cross-validation gives one per fold. For stable estimates, average per-fold metrics from a cross-validated fit instead.
confusionMatrix() does not score probabilities. It needs hard class labels. For probability-based metrics (AUC, log loss, Brier score), pair it with pROC::roc() or pass caret::twoClassSummary as the summaryFunction in trainControl().
Try it yourself
Try it: Score a kNN classifier on the iris test split (use the train_d, test_d, ref, pred objects defined earlier in this post) and pull out the balanced accuracy column from byClass. Save the resulting vector to ex_bal_acc.
Explanation: byClass is a matrix when there are 3+ classes (one row per class, one column per metric). Subset with [, "Balanced Accuracy"] to pull a single metric across classes.
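One possible solution sketch. Because it has to run on its own, it rebuilds train_d, test_d, ref, and pred with an assumed 70/30 split and uses class::knn() with k = 5 as the kNN classifier; the seed, split, and k are illustrative assumptions.

```r
library(caret)

set.seed(1)

# Assumed stratified 70/30 split of iris
idx     <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_d <- iris[idx, ]
test_d  <- iris[-idx, ]

# kNN on the four numeric columns; class::knn() returns predicted labels directly
ref  <- test_d$Species
pred <- class::knn(train = train_d[, 1:4], test = test_d[, 1:4],
                   cl = train_d$Species, k = 5)

cm <- confusionMatrix(pred, ref)

# byClass is a matrix for 3+ classes: pull one column across all classes
ex_bal_acc <- cm$byClass[, "Balanced Accuracy"]
ex_bal_acc
```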
Related caret functions
After confusionMatrix(), these caret functions round out classification evaluation:
- train(): fits and resamples the model whose predictions you score
- predict.train(): produces the class-label vector you feed in as data
- trainControl(classProbs = TRUE): enables probability output for ROC-based tuning
- twoClassSummary(): drop-in for summaryFunction to get ROC, sensitivity, specificity during resampling
- varImp(): model-agnostic variable importance for a fitted classifier
FAQ
What does caret confusionMatrix() return exactly?
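It returns an S3 object of class confusionMatrix, which is a list with six named elements: positive (the level treated as the positive class), table (the raw k x k contingency table), overall (accuracy, Kappa, accuracy CI, no-information rate, accuracy p-value, McNemar p-value), byClass (per-class sensitivity, specificity, PPV, NPV, precision, recall, F1, prevalence, detection rate, detection prevalence, balanced accuracy), mode, and dots. Subset cm$overall["Accuracy"] to pull individual numbers.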
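A minimal sketch for exploring the returned object, using eight made-up labels so the values are easy to verify by hand.

```r
library(caret)

ref  <- factor(c("no", "no", "no", "yes", "yes", "yes", "no", "yes"),
               levels = c("no", "yes"))
pred <- factor(c("no", "no", "yes", "yes", "yes", "no", "no", "yes"),
               levels = c("no", "yes"))

cm <- confusionMatrix(pred, ref, positive = "yes")

names(cm)            # top-level elements of the list
names(cm$overall)    # names of the overall statistics
cm$overall["Accuracy"]       # 6 of 8 labels match
cm$byClass["Sensitivity"]    # 3 of 4 true "yes" are caught
```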
How do I plot a caret confusion matrix?
The matrix lives in cm$table. Convert to a data frame and use ggplot2::geom_tile() for a heatmap. For a quick base R version, fourfoldplot(cm$table) works for 2x2 matrices and mosaicplot(cm$table) works for k x k.
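A minimal heatmap sketch along those lines, using nine made-up three-class labels. The colour and label choices are illustrative only.

```r
library(caret)
library(ggplot2)

ref  <- factor(c("a", "b", "c", "a", "b", "c", "a", "b", "c"))
pred <- factor(c("a", "b", "c", "a", "c", "c", "b", "b", "c"),
               levels = levels(ref))
cm <- confusionMatrix(pred, ref)

# One row per cell of the table: Prediction, Reference, Freq
cm_df <- as.data.frame(cm$table)

p <- ggplot(cm_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), colour = "white")
p
```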
Why is sensitivity 100 percent but the model still bad?
Sensitivity ignores the negative class. A model that predicts "positive" for every record has 100 percent sensitivity and 0 percent specificity. Always read sensitivity, specificity, and prevalence together. Balanced accuracy and Kappa correct for the asymmetry.
Can confusionMatrix() handle multi-class outputs?
Yes, with no extra arguments. caret detects k > 2 levels and reports per-class metrics one-vs-rest. There is no built-in macro or micro averaging; compute that manually as mean(cm$byClass[, "F1"], na.rm = TRUE) or switch to yardstick::f_meas(df, truth, estimate, estimator = "macro").