Confusion Matrix Interpreter
A confusion matrix breaks down a classifier's predictions into true positives, false positives, true negatives, and false negatives. Paste the counts (or full caret::confusionMatrix output) to get accuracy, precision, recall, F1, kappa, MCC, and guidance on which metric to trust when classes are imbalanced.
New to confusion matrices? Read the 4-minute primer below.
What a confusion matrix is. A k by k table of predicted vs. actual class counts. For a binary classifier the four cells are TP (true positive), FP (false positive), FN (false negative), and TN (true negative). Every classifier metric is a different way of squashing those four numbers into one summary, so the matrix itself is the single source of truth.
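To make the layout concrete, here is a minimal base-R sketch (the labels are made up) that tabulates predicted vs. actual classes into the four cells:

```r
# Made-up binary labels: 1 = positive, 0 = negative
actual    <- c(1, 1, 1, 0, 0, 0, 1, 0, 0, 0)
predicted <- c(1, 0, 1, 0, 1, 0, 1, 0, 0, 0)

# Rows = predicted, columns = actual
table(Predicted = predicted, Actual = actual)
#          Actual
# Predicted 0 1
#         0 5 1    # TN = 5, FN = 1
#         1 1 3    # FP = 1, TP = 3
```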
How to read accuracy, precision, recall, F1. Accuracy is (TP + TN) / N: how often the model is right overall. Precision is TP / (TP + FP): when the model says “positive,” how often it’s correct. Recall (sensitivity) is TP / (TP + FN): of the actual positives, how many we caught. F1 is the harmonic mean of precision and recall, and it punishes imbalance between the two; F0.5 weights precision more, F2 weights recall more.
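In code, each definition is one line. A sketch in R, reusing the hypothetical counts from the small table above:

```r
TP <- 3; FP <- 1; FN <- 1; TN <- 5
N  <- TP + FP + FN + TN

accuracy  <- (TP + TN) / N                                   # 0.80
precision <- TP / (TP + FP)                                  # 0.75
recall    <- TP / (TP + FN)                                  # 0.75
f1        <- 2 * precision * recall / (precision + recall)   # 0.75
```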
Picking the right metric for your costs. Symmetric costs and roughly balanced classes? Accuracy + F1 are fine. False positives are expensive (spam blocking real email, biopsies on healthy patients)? Optimise precision; report F0.5. False negatives are expensive (missing a cancer, missing fraud)? Optimise recall; report F2. Want a single robust score that does not flatter trivial classifiers? Use MCC or balanced accuracy.
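If you need F0.5 or F2 directly, the general F-beta formula is a one-liner; the precision and recall values below are invented purely to show how beta shifts the score:

```r
# F-beta: beta < 1 leans toward precision, beta > 1 leans toward recall
fbeta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

fbeta(0.9, 0.6, beta = 1)    # 0.72  (plain F1)
fbeta(0.9, 0.6, beta = 0.5)  # ~0.82 (pulled toward the 0.9 precision)
fbeta(0.9, 0.6, beta = 2)    # ~0.64 (pulled toward the 0.6 recall)
```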
Class imbalance gotchas. A model that always predicts the majority class on 99%-prevalence data scores 99% accuracy and is useless. On imbalanced problems, ignore raw accuracy and report MCC, balanced accuracy, and per-class precision and recall instead. Cohen’s kappa subtracts off the agreement you would get by chance; MCC is a correlation between predicted and actual labels and ranges from -1 (perfectly wrong) through 0 (random) to +1 (perfect).
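A quick sketch of that trap with hypothetical counts (10 positives in 1,000 cases): accuracy looks excellent while MCC and kappa stay close to "barely better than chance":

```r
TP <- 1; FP <- 1; FN <- 9; TN <- 989
N  <- TP + FP + FN + TN

accuracy <- (TP + TN) / N                                       # 0.99

mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))           # ~0.22

p_o   <- accuracy                                               # observed agreement
p_e   <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2  # chance agreement
kappa <- (p_o - p_e) / (1 - p_e)                                # ~0.16
```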
Try a real-world example to load into the tool:
A roughly balanced inbox: 100 spam, 100 ham. The model has good precision and decent recall.
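One way to build that example yourself in R; the cell counts are assumptions chosen to match the description (100 actual spam, 100 actual ham, precision above recall), and recent versions of caret report Precision, Recall and F1 in byClass:

```r
library(caret)

# Rows = predicted, columns = actual; counts are hypothetical
tab <- as.table(matrix(c(80,  5,     # predicted spam: TP = 80, FP = 5
                         20, 95),    # predicted ham:  FN = 20, TN = 95
                       nrow = 2, byrow = TRUE,
                       dimnames = list(Prediction = c("spam", "ham"),
                                       Reference  = c("spam", "ham"))))

cm <- confusionMatrix(tab, positive = "spam")
cm$byClass[c("Precision", "Recall", "F1")]   # ~0.94, 0.80, ~0.86
```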
Read more: Anatomy of the metrics
Caveats: When this is the wrong tool
If you have one of the following, here is what to use instead:
- Multi-class with one-vs-rest needs: this tool computes per-class one-vs-rest precision/recall/F1 plus macro and weighted (support-weighted) averages. For one-vs-one ROC you need a probability-based tool.
- Predicted probabilities, not labels: with probabilities you can compute AUC, calibration, and the Brier score; those need probabilities, not a single threshold. See the effect-size converter; a full ROC tool is planned.
- Cost-sensitive evaluation: set beta in F-beta to match the cost ratio of FN vs. FP, or compute expected cost directly as C = c_FP * FP + c_FN * FN (see the sketch after this list). The matrix has all you need.
- Time series with concept drift: a static confusion matrix mixes pre- and post-drift errors. Slice by time window, or compute a rolling MCC/F1.
- Multi-label outputs (multiple classes per case): these need a different metric set (subset accuracy, Hamming loss), which is not in v1 of this tool.
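As a sketch of the expected-cost row above, with made-up error counts and a 10:1 FN-to-FP cost ratio:

```r
FP <- 5; FN <- 20          # hypothetical error counts
c_FP <- 1; c_FN <- 10      # assumed costs: a miss hurts 10x a false alarm

expected_cost <- c_FP * FP + c_FN * FN   # 1*5 + 10*20 = 205

# Rule of thumb from the list above: match beta to the FN:FP cost ratio
beta <- c_FN / c_FP                      # so report F10 alongside the cost
```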
- Logistic regression in R, end to end - including how to threshold predicted probabilities into a confusion matrix.
- Sensitivity, specificity and the prosecutor’s fallacy - why LR+ matters for diagnostic tests.
- AUC vs accuracy - what each one is and isn’t measuring.
Numerical accuracy: the accuracy confidence interval uses the Wilson score interval, and MCC and kappa are computed from the raw counts with no intermediate rounding. Values match caret::confusionMatrix and mltools::mcc to 5 decimal places.
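For reference, the Wilson score interval written out from raw counts; base R's prop.test without continuity correction returns the same interval. The 175-out-of-200 figure reuses the hypothetical spam example:

```r
wilson_ci <- function(correct, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- correct / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}

wilson_ci(175, 200)                            # ~0.822 to ~0.914
prop.test(175, 200, correct = FALSE)$conf.int  # matches
```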