r-statistics.co

Confusion Matrix Interpreter

A confusion matrix breaks down a classifier's predictions into true positives, false positives, true negatives, and false negatives. Paste the counts (or full caret::confusionMatrix output) to get accuracy, precision, recall, F1, kappa, MCC, and guidance on which metric to trust when classes are imbalanced.

New to confusion matrices? Read the 4-minute primer below.

What a confusion matrix is. A k by k table of predicted vs. actual class counts. For a binary classifier the four cells are TP (true positive), FP (false positive), FN (false negative), and TN (true negative). Every classifier metric is a different way of squashing those four numbers into one summary, so the matrix itself is the single source of truth.
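In R, table() on the predicted and actual label vectors gives exactly this matrix. A minimal sketch with made-up labels (the vectors below are illustrative only, not data from the tool):

# Made-up labels, illustrative only
actual    <- factor(c("pos","pos","pos","neg","neg","neg","neg","neg"), levels = c("pos","neg"))
predicted <- factor(c("pos","pos","neg","neg","neg","neg","pos","neg"), levels = c("pos","neg"))

cm <- table(Predicted = predicted, Actual = actual)
cm
#          Actual
# Predicted pos neg
#       pos   2   1    <- TP, FP
#       neg   1   4    <- FN, TN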

How to read accuracy, precision, recall, F1. Accuracy is (TP + TN) / N: how often the model is right overall. Precision is TP / (TP + FP): when the model says “positive,” how often it’s correct. Recall (sensitivity) is TP / (TP + FN): of the actual positives, how many we caught. F1 is the harmonic mean of precision and recall, and it punishes imbalance between the two; F0.5 weights precision more, F2 weights recall more.
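The four cells give every one of those metrics directly; a hedged sketch in R with placeholder counts (swap in your own TP, FP, FN, TN):

# Placeholder cell counts; replace with your own
TP <- 90; FP <- 10; FN <- 20; TN <- 80
N  <- TP + FP + FN + TN

accuracy  <- (TP + TN) / N                                  # 0.85
precision <- TP / (TP + FP)                                 # 0.90
recall    <- TP / (TP + FN)                                 # ~0.82
f1        <- 2 * precision * recall / (precision + recall)  # ~0.86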

Picking the right metric for your costs. Symmetric costs and roughly balanced classes? Accuracy + F1 are fine. False positives are expensive (spam blocking real email, biopsies on healthy patients)? Optimise precision; report F0.5. False negatives are expensive (missing a cancer, missing fraud)? Optimise recall; report F2. Want a single robust score that does not flatter trivial classifiers? Use MCC or balanced accuracy.

Class imbalance gotchas. A model that always predicts the majority class on 99%-prevalence data scores 99% accuracy and is useless. On imbalanced problems, ignore raw accuracy and report MCC, balanced accuracy, and per-class precision and recall instead. Cohen’s kappa subtracts off the agreement you would get by chance; MCC is a correlation between predicted and actual labels and ranges from -1 (perfectly wrong) through 0 (random) to +1 (perfect).

Binary or multi-class · Balanced or imbalanced · Runs in your browser

Try a real-world example: click one to load it.

📧 Spam classifier

A roughly balanced inbox: 100 spam, 100 ham. The model has good precision and decent recall.

R code (runnable) · Reproduce in R
Confusion heatmap (interactive) · Inference
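A sketch of how this example could be reproduced in R with caret::confusionMatrix; the cell counts below are invented to be consistent with the description above (100 spam, 100 ham, good precision, decent recall), not the tool's actual numbers:

library(caret)

# Invented cell counts, chosen only to match the description:
# the model catches 82 of the 100 spam (decent recall) and
# mislabels just 6 of the 100 ham as spam (good precision).
actual <- factor(c(rep("spam", 100), rep("ham", 100)), levels = c("spam", "ham"))
pred   <- factor(c(rep("spam", 82), rep("ham", 18),    # predictions on the spam
                   rep("spam", 6),  rep("ham", 94)),   # predictions on the ham
                 levels = c("spam", "ham"))

confusionMatrix(pred, actual, positive = "spam")
# Accuracy ~0.88, Precision (Pos Pred Value) ~0.93, Recall (Sensitivity) 0.82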

Anatomy of the metrics
accuracy = (TP + TN) / N
precision = TP / (TP + FP)
recall = TP / (TP + FN)
The accuracy paradox. Accuracy is (correct) / (total) and feels like the natural summary. On imbalanced data it is not. A “classifier” that always predicts the majority class scores at the prevalence of that class with no learning at all. Whenever your minority class is below ~20%, treat raw accuracy as suspect and lean on MCC or balanced accuracy.
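A quick way to see the paradox in R, using a made-up 99%-prevalence dataset and a degenerate always-predict-the-majority model:

# 990 negatives, 10 positives; the "model" always predicts the majority class
actual <- factor(c(rep("neg", 990), rep("pos", 10)), levels = c("pos", "neg"))
pred   <- factor(rep("neg", 1000), levels = c("pos", "neg"))

mean(pred == actual)                      # accuracy = 0.99, looks great
table(Predicted = pred, Actual = actual)  # but TP = 0: recall and F1 collapse
# MCC is 0/0 here (undefined); report it as 0: no better than chance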
F1 = 2 * P * R / (P + R)
F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
Precision vs. recall, and F-beta. Precision rises when the model is conservative; recall rises when it is aggressive. F1 is the harmonic mean and weights them equally. F0.5 (beta = 0.5) weights precision more, useful when false positives are expensive (spam, biopsies). F2 (beta = 2) weights recall more, useful when false negatives are expensive (missing a cancer, missing fraud). Pick beta to match what one mistake costs in the real world.
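The F-beta formula above drops straight into a small helper; the precision/recall values below are placeholders:

f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Placeholder precision/recall from a conservative model
f_beta(0.90, 0.60, beta = 1)    # F1   ~0.72: balanced view
f_beta(0.90, 0.60, beta = 0.5)  # F0.5 ~0.82: rewards the high precision
f_beta(0.90, 0.60, beta = 2)    # F2   ~0.64: punishes the low recall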
kappa = (po - pe) / (1 - pe)
po = observed agreement (= accuracy)
pe = chance agreement
Cohen’s kappa. Subtracts off the agreement you would get by chance, given the row and column margins. Landis & Koch read: 0 to 0.2 slight, 0.2 to 0.4 fair, 0.4 to 0.6 moderate, 0.6 to 0.8 substantial, > 0.8 almost perfect. For ordinal classes you can weight by the squared distance between predicted and actual rank (quadratic kappa), which penalises “3 vs. 5” more than “3 vs. 4.”
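Kappa needs only the observed accuracy and the chance agreement implied by the row and column margins; a sketch from placeholder counts:

# Placeholder cell counts
TP <- 90; FP <- 10; FN <- 20; TN <- 80
N  <- TP + FP + FN + TN

po <- (TP + TN) / N                                          # observed agreement (= accuracy): 0.85
pe <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2  # chance agreement from the margins: 0.50
kappa <- (po - pe) / (1 - pe)                                # 0.70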
MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Matthews Correlation Coefficient. A correlation between predicted and actual labels. Ranges from -1 (perfectly wrong) through 0 (random) to +1 (perfect). MCC is symmetric in the two classes and only goes high when the classifier does well on both, so it is the cleanest one-number summary on imbalanced binary data. The multi-class generalisation (Gorodkin) extends the same idea to k by k.
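The raw-count formula above is a one-liner in R (placeholder counts again; per the numerical-accuracy note at the bottom of the page, mltools::mcc on the same counts should agree):

# Placeholder cell counts
TP <- 90; FP <- 10; FN <- 20; TN <- 80

mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc   # ~0.70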
balanced_acc = (sensitivity + specificity) / 2
macro_F1 = mean over classes of per-class F1
micro_F1 = global F1 from pooled TP/FP/FN
Balanced accuracy and macro vs. micro. Balanced accuracy averages per-class recall, so flipping the prevalence does not change it. Macro-averaging gives every class equal weight (so a rare class contributes as much as a common one); micro-averaging pools the counts (so it follows class frequency). On heavy imbalance, macro is the honest read; micro tracks accuracy. On multi-class, always report both plus per-class precision/recall.
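A sketch on deliberately imbalanced illustrative counts, showing how balanced accuracy and macro-F1 expose what micro-F1 (here equal to accuracy) hides:

# Illustrative imbalanced counts: 50 positives, 950 negatives
TP <- 10; FN <- 40; FP <- 30; TN <- 920

recall_pos   <- TP / (TP + FN)                 # 0.20 (sensitivity)
recall_neg   <- TN / (TN + FP)                 # ~0.97 (specificity)
balanced_acc <- (recall_pos + recall_neg) / 2  # ~0.58

f1 <- function(p, r) 2 * p * r / (p + r)
f1_pos <- f1(TP / (TP + FP), recall_pos)       # ~0.22: the minority class is weak
f1_neg <- f1(TN / (TN + FN), recall_neg)       # ~0.96

macro_f1 <- mean(c(f1_pos, f1_neg))            # ~0.59: every class counts equally

# Micro-averaging pools one-vs-rest counts; with one label per case it equals accuracy
tp_pool <- TP + TN; fp_pool <- FP + FN; fn_pool <- FN + FP
micro_f1 <- 2 * tp_pool / (2 * tp_pool + fp_pool + fn_pool)   # 0.93: tracks class frequency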
Caveats: when this is the wrong tool
If you have… → Use instead
Multi-class with one-vs-rest needs → This tool computes per-class one-vs-rest precision/recall/F1 plus macro and weighted (support-weighted) averages. For one-vs-one ROC you need a probability-based tool.
Predicted probabilities, not labels → Compute AUC, calibration curves, and the Brier score; those need probabilities rather than a single thresholded label. See the effect-size converter; a full ROC tool is planned.
Cost-sensitive evaluation → Set beta in F-beta to match the cost ratio of FN vs. FP, or compute the expected cost directly: C = c_FP * FP + c_FN * FN (see the sketch below). The matrix has everything you need.
Time series with concept drift → A static confusion matrix mixes pre- and post-drift errors. Slice by time window, or compute a rolling MCC/F1.
Multi-label outputs (multiple classes per case) → A different metric set applies (subset accuracy, Hamming loss); not covered in v1 of this tool.
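The expected-cost calculation from the cost-sensitive row is just the formula above; a sketch with placeholder counts and costs (a false negative priced 20x a false positive, as in a screening setting):

# Placeholder counts and per-error costs
FP <- 30; FN <- 40
c_FP <- 1; c_FN <- 20

expected_cost <- c_FP * FP + c_FN * FN   # 830 cost units for this matrix
expected_cost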

Numerical accuracy: Wilson interval for accuracy CI; MCC and kappa computed from raw counts (no rounding). caret::confusionMatrix and mltools::mcc give matching values to 5 decimal places.
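The Wilson interval can be computed from the number of correct predictions alone; a minimal sketch with placeholder counts:

# Wilson score interval for accuracy, from x correct predictions out of n
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre  <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  halfwid <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - halfwid, upper = centre + halfwid)
}

wilson_ci(170, 200)   # placeholder counts: roughly 0.79 to 0.89
# prop.test(170, 200, correct = FALSE)$conf.int should give the same Wilson interval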