Confusion Matrix Interpreter
A confusion matrix breaks down a classifier's predictions into true positives, false positives, true negatives, and false negatives. Paste the counts (or full caret::confusionMatrix output) to get accuracy, precision, recall, F1, kappa, MCC, and guidance on which metric to trust when classes are imbalanced.
New to confusion matrices? Here is a 4-minute primer.
What a confusion matrix is. A k by k table of predicted vs. actual class counts. For a binary classifier the four cells are TP (true positive), FP (false positive), FN (false negative), and TN (true negative). Every classifier metric is a different way of squashing those four numbers into one summary, so the matrix itself is the single source of truth.
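As a minimal sketch of how the table is tallied, the snippet below counts (actual, predicted) pairs for a toy spam filter. The function name `confusion_matrix`, the six labels, and the choice of rows for actual classes and columns for predicted classes are all assumptions made for illustration; tools differ on orientation, so check which axis holds the actual class before reading off the cells.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Return a k-by-k table. Assumed layout: rows = actual, columns = predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Toy data, invented for illustration only.
actual    = ["spam", "spam", "ham", "ham",  "spam", "ham"]
predicted = ["spam", "ham",  "ham", "spam", "spam", "ham"]

# "spam" is treated as the positive class, so it is listed first.
matrix = confusion_matrix(actual, predicted, labels=["spam", "ham"])

# With this layout the binary cells unpack as [[TP, FN], [FP, TN]].
(tp, fn), (fp, tn) = matrix
print(matrix)  # [[2, 1], [1, 2]]
```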
How to read accuracy, precision, recall, F1. Accuracy is (TP + TN) / N: how often the model is right overall. Precision is TP / (TP + FP): when the model says “positive,” how often it’s correct. Recall (sensitivity) is TP / (TP + FN): of the actual positives, how many we caught. F1 is the harmonic mean of precision and recall, and it punishes imbalance between the two; F0.5 weights precision more, F2 weights recall more.
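The formulas above translate directly into code. This is a sketch under stated assumptions: the function name `basic_metrics` and the example counts are hypothetical, and returning 0.0 when a denominator is zero is one common convention, not necessarily what this tool does.

```python
def basic_metrics(tp, fp, fn, tn, beta=1.0):
    """Accuracy, precision, recall and F-beta from the four binary cells."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    # F-beta is a weighted harmonic mean; beta = 1 gives the usual F1.
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_beta": f_beta}

# Hypothetical counts: 100 actual positives, 100 actual negatives.
f1 = basic_metrics(tp=80, fp=10, fn=20, tn=90)
f_half = basic_metrics(tp=80, fp=10, fn=20, tn=90, beta=0.5)
f_two = basic_metrics(tp=80, fp=10, fn=20, tn=90, beta=2.0)

for name, m in [("F1", f1), ("F0.5", f_half), ("F2", f_two)]:
    print(name, round(m["f_beta"], 4))
```

Because precision (about 0.89) exceeds recall (0.80) in these counts, F0.5 comes out above F1 and F2 comes out below it.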
Picking the right metric for your costs. Symmetric costs and roughly balanced classes? Accuracy + F1 are fine. False positives are expensive (spam blocking real email, biopsies on healthy patients)? Optimise precision; report F0.5. False negatives are expensive (missing a cancer, missing fraud)? Optimise recall; report F2. Want a single robust score that does not flatter trivial classifiers? Use MCC or balanced accuracy.
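To see why the choice of beta matters, the sketch below scores two invented classifiers on the same 200 cases. Model A is cautious (few false positives, many misses) and model B is aggressive (few misses, many false positives). The counts and the helper name `f_beta_from_counts` are hypothetical; the formula is the standard count form of F-beta.

```python
def f_beta_from_counts(tp, fp, fn, beta):
    """F-beta written in terms of counts: (1+b^2)TP / ((1+b^2)TP + b^2 FN + FP)."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Invented counts for two models evaluated on 100 positives and 100 negatives.
model_a = {"tp": 60, "fp": 5, "fn": 40}   # high precision, low recall
model_b = {"tp": 95, "fp": 40, "fn": 5}   # high recall, low precision

scores = {
    name: {beta: f_beta_from_counts(beta=beta, **cells) for beta in (0.5, 1.0, 2.0)}
    for name, cells in [("A", model_a), ("B", model_b)]
}

for name, by_beta in scores.items():
    print(name, {beta: round(value, 3) for beta, value in by_beta.items()})
```

F0.5 ranks model A first and F2 ranks model B first, so the metric you report should follow from which error is more expensive.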
Class imbalance gotchas. A model that always predicts the majority class on 99%-prevalence data scores 99% accuracy and is useless. On imbalanced problems, ignore raw accuracy and report MCC, balanced accuracy, and per-class precision and recall instead. Cohen’s kappa subtracts off the agreement you would get by chance; MCC is a correlation between predicted and actual labels and ranges from -1 (perfectly wrong) through 0 (random) to +1 (perfect).
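The majority-class trap is easy to reproduce. The sketch below scores a classifier that always predicts "negative" on 1,000 invented cases with 10 positives. The function name `imbalance_metrics` is hypothetical, and treating an undefined MCC (zero denominator) as 0 is an assumed convention.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, balanced accuracy, MCC and Cohen's kappa from binary counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (recall + specificity) / 2

    # MCC: correlation between predicted and actual labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed: undefined -> 0

    # Kappa: observed agreement corrected for agreement expected by chance.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    return {"accuracy": accuracy, "balanced_accuracy": balanced_accuracy,
            "mcc": mcc, "kappa": kappa}

# Always predicting the majority (negative) class: no positives are ever flagged.
result = imbalance_metrics(tp=0, fp=0, fn=10, tn=990)
print(result)
```

Accuracy is 0.99, while balanced accuracy is 0.5 and both MCC and kappa are 0, which is what a classifier with no discriminating power should score.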
Load a real-world example to try the tool.
A roughly balanced inbox: 100 spam, 100 ham. The model has good precision and decent recall.
We tallied your classifier's predictions against the ground truth and computed the standard performance metrics.
Caveats: when this is the wrong tool
If you have… / Use instead:
- Multi-class with one-vs-rest needs: This tool computes per-class one-vs-rest precision/recall/F1 plus macro and weighted (support-weighted) averages. For one-vs-one ROC, you need a probability-based tool.
- Predicted probabilities, not labels: You can compute AUC, calibration, and Brier score; those need probabilities, not a single threshold. See the effect-size converter; full ROC tool planned.
- Cost-sensitive evaluation: Set beta in F-beta to match the cost ratio of FN vs. FP, or compute expected-cost directly: C = c_FP * FP + c_FN * FN. The matrix has all you need.
- Time series with concept drift: A static confusion matrix mixes pre- and post-drift errors. Slice by time window, or compute a rolling MCC/F1.
- Multi-label outputs (multiple classes per case): Different metric set (subset accuracy, Hamming loss). Not v1 of this tool.
- Logistic regression in R, end to end, including how to threshold predicted probabilities into a confusion matrix.
- Sensitivity, specificity and the prosecutor’s fallacy: why LR+ matters for diagnostic tests.
- AUC vs accuracy: what each one is and isn’t measuring.
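The per-class one-vs-rest breakdown mentioned in the multi-class caveat can be sketched as follows. Each class in turn is treated as "positive" and all others as "negative", then the per-class F1 scores are averaged two ways. The function name `one_vs_rest`, the 3-class counts, and the rows-are-actual layout are assumptions for illustration, not this tool's internals.

```python
def one_vs_rest(matrix):
    """Per-class precision/recall/F1 plus macro and support-weighted F1.

    Assumed layout: matrix[i][j] = cases with actual class i predicted as class j.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    per_class = []
    for c in range(k):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp                       # rest of row c
        fp = sum(matrix[r][c] for r in range(k)) - tp  # rest of column c
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        per_class.append({"precision": precision, "recall": recall,
                          "f1": f1, "support": tp + fn})
    # Macro: every class counts equally. Weighted: classes count by support.
    macro_f1 = sum(m["f1"] for m in per_class) / k
    weighted_f1 = sum(m["f1"] * m["support"] for m in per_class) / total
    return per_class, macro_f1, weighted_f1

# Invented 3-class counts; the third class dominates the data.
matrix = [[8, 1, 1],
          [2, 6, 2],
          [0, 2, 28]]
per_class, macro_f1, weighted_f1 = one_vs_rest(matrix)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

The weighted average is pulled toward the large, well-predicted class, so it sits above the macro average here; reporting both shows whether minority classes are being carried by the majority.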
Numerical accuracy: the confidence interval for accuracy uses the Wilson score method; MCC and kappa are computed from raw counts with no intermediate rounding. caret::confusionMatrix and mltools::mcc give matching values to 5 decimal places.
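For reference, a Wilson score interval for accuracy can be sketched as below. This is the textbook form without continuity correction; whether this tool applies a correction is not stated, so treat that as an assumption. The function name `wilson_interval` and the example counts are hypothetical.

```python
import math

def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion (default z gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z ** 2
    denom = 1 + z2 / n
    centre = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical result: 170 correct predictions out of 200 (accuracy 0.85).
low, high = wilson_interval(successes=170, n=200)
print(round(low, 3), round(high, 3))
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and is not centred exactly on the observed accuracy, which matters most for small samples or accuracies near 0 or 1.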