r-statistics.co

Confusion Matrix Interpreter

A confusion matrix breaks down a classifier's predictions into true positives, false positives, true negatives, and false negatives. Paste the counts (or full caret::confusionMatrix output) to get accuracy, precision, recall, F1, kappa, MCC, and guidance on which metric to trust when classes are imbalanced.

New to confusion matrices? Read the 4-minute primer below.

What a confusion matrix is. A k by k table of predicted vs. actual class counts. For a binary classifier the four cells are TP (true positive), FP (false positive), FN (false negative), and TN (true negative). Every classifier metric is a different way of squashing those four numbers into one summary, so the matrix itself is the single source of truth.
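In R, table() on the predicted and actual label vectors gives exactly this matrix. A minimal sketch with made-up labels (the vectors below are illustrative only, not data from the tool):

# Made-up labels, illustrative only
actual    <- factor(c("pos","pos","pos","neg","neg","neg","neg","neg"), levels = c("pos","neg"))
predicted <- factor(c("pos","pos","neg","neg","neg","neg","pos","neg"), levels = c("pos","neg"))

cm <- table(Predicted = predicted, Actual = actual)
cm
#          Actual
# Predicted pos neg
#       pos   2   1    <- TP, FP
#       neg   1   4    <- FN, TN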

How to read accuracy, precision, recall, F1. Accuracy is (TP + TN) / N: how often the model is right overall. Precision is TP / (TP + FP): when the model says “positive,” how often it’s correct. Recall (sensitivity) is TP / (TP + FN): of the actual positives, how many we caught. F1 is the harmonic mean of precision and recall, and it punishes imbalance between the two; F0.5 weights precision more, F2 weights recall more.
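The four cells give every one of those metrics directly; a hedged sketch in R with placeholder counts (swap in your own TP, FP, FN, TN):

# Placeholder cell counts; replace with your own
TP <- 90; FP <- 10; FN <- 20; TN <- 80
N  <- TP + FP + FN + TN

accuracy  <- (TP + TN) / N                                  # 0.85
precision <- TP / (TP + FP)                                 # 0.90
recall    <- TP / (TP + FN)                                 # ~0.82
f1        <- 2 * precision * recall / (precision + recall)  # ~0.86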

Picking the right metric for your costs. Symmetric costs and roughly balanced classes? Accuracy + F1 are fine. False positives are expensive (spam blocking real email, biopsies on healthy patients)? Optimise precision; report F0.5. False negatives are expensive (missing a cancer, missing fraud)? Optimise recall; report F2. Want a single robust score that does not flatter trivial classifiers? Use MCC or balanced accuracy.

Class imbalance gotchas. A model that always predicts the majority class on 99%-prevalence data scores 99% accuracy and is useless. On imbalanced problems, ignore raw accuracy and report MCC, balanced accuracy, and per-class precision and recall instead. Cohen’s kappa subtracts off the agreement you would get by chance; MCC is a correlation between predicted and actual labels and ranges from -1 (perfectly wrong) through 0 (random) to +1 (perfect).

Binary or multi-class · Balanced or imbalanced · Runs in your browser

Try a real-world example: click one to load it.

📧 Spam classifier

A roughly balanced inbox: 100 spam, 100 ham. The model has good precision and decent recall.

R code (runnable) · Reproduce in R
Confusion heatmap (interactive) · Inference
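A sketch of how this example could be reproduced in R with caret::confusionMatrix; the cell counts below are invented to be consistent with the description above (100 spam, 100 ham, good precision, decent recall), not the tool's actual numbers:

library(caret)

# Invented cell counts, chosen only to match the description:
# the model catches 82 of the 100 spam (decent recall) and
# mislabels just 6 of the 100 ham as spam (good precision).
actual <- factor(c(rep("spam", 100), rep("ham", 100)), levels = c("spam", "ham"))
pred   <- factor(c(rep("spam", 82), rep("ham", 18),    # predictions on the spam
                   rep("spam", 6),  rep("ham", 94)),   # predictions on the ham
                 levels = c("spam", "ham"))

confusionMatrix(pred, actual, positive = "spam")
# Accuracy ~0.88, Precision (Pos Pred Value) ~0.93, Recall (Sensitivity) 0.82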

Anatomy of the metrics
accuracy = (TP + TN) / N
precision = TP / (TP + FP)
recall = TP / (TP + FN)
The accuracy paradox. Accuracy is (correct) / (total) and feels like the natural summary. On imbalanced data it is not. A “classifier” that always predicts the majority class scores at the prevalence of that class with no learning at all. Whenever your minority class is below ~20%, treat raw accuracy as suspect and lean on MCC or balanced accuracy.
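A quick way to see the paradox in R, using a made-up 99%-prevalence dataset and a degenerate always-predict-the-majority model:

# 990 negatives, 10 positives; the "model" always predicts the majority class
actual <- factor(c(rep("neg", 990), rep("pos", 10)), levels = c("pos", "neg"))
pred   <- factor(rep("neg", 1000), levels = c("pos", "neg"))

mean(pred == actual)                      # accuracy = 0.99, looks great
table(Predicted = pred, Actual = actual)  # but TP = 0: recall and F1 collapse
# MCC is 0/0 here (undefined); report it as 0: no better than chance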
F1 = 2 * P * R / (P + R)
F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
Precision vs. recall, and F-beta. Precision rises when the model is conservative; recall rises when it is aggressive. F1 is the harmonic mean and weights them equally. F0.5 (beta = 0.5) weights precision more, useful when false positives are expensive (spam, biopsies). F2 (beta = 2) weights recall more, useful when false negatives are expensive (missing a cancer, missing fraud). Pick beta to match what one mistake costs in the real world.
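The F-beta formula above drops straight into a small helper; the precision/recall values below are placeholders:

f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Placeholder precision/recall from a conservative model
f_beta(0.90, 0.60, beta = 1)    # F1   ~0.72: balanced view
f_beta(0.90, 0.60, beta = 0.5)  # F0.5 ~0.82: rewards the high precision
f_beta(0.90, 0.60, beta = 2)    # F2   ~0.64: punishes the low recall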
kappa = (po - pe) / (1 - pe)
po = observed agreement (= accuracy)
pe = chance agreement
Cohen’s kappa. Subtracts off the agreement you would get by chance, given the row and column margins. Landis & Koch read: 0 to 0.2 slight, 0.2 to 0.4 fair, 0.4 to 0.6 moderate, 0.6 to 0.8 substantial, > 0.8 almost perfect. For ordinal classes you can weight by the squared distance between predicted and actual rank (quadratic kappa), which penalises “3 vs. 5” more than “3 vs. 4.”
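Kappa needs only the observed accuracy and the chance agreement implied by the row and column margins; a sketch from placeholder counts:

# Placeholder cell counts
TP <- 90; FP <- 10; FN <- 20; TN <- 80
N  <- TP + FP + FN + TN

po <- (TP + TN) / N                                          # observed agreement (= accuracy): 0.85
pe <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2  # chance agreement from the margins: 0.50
kappa <- (po - pe) / (1 - pe)                                # 0.70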
MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
Matthews Correlation Coefficient. A correlation between predicted and actual labels. Ranges from -1 (perfectly wrong) through 0 (random) to +1 (perfect). MCC is symmetric in the two classes and only goes high when the classifier does well on both, so it is the cleanest one-number summary on imbalanced binary data. The multi-class generalisation (Gorodkin) extends the same idea to k by k.
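The raw-count formula above is a one-liner in R (placeholder counts again; per the numerical-accuracy note at the bottom of the page, mltools::mcc on the same counts should agree):

# Placeholder cell counts
TP <- 90; FP <- 10; FN <- 20; TN <- 80

mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc   # ~0.70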
balanced_acc = (sensitivity + specificity) / 2
macro_F1 = mean over classes of per-class F1
micro_F1 = global F1 from pooled TP/FP/FN
Balanced accuracy and macro vs. micro. Balanced accuracy averages per-class recall, so flipping the prevalence does not change it. Macro-averaging gives every class equal weight (so a rare class contributes as much as a common one); micro-averaging pools the counts (so it follows class frequency). On heavy imbalance, macro is the honest read; micro tracks accuracy. On multi-class, always report both plus per-class precision/recall.
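A sketch on deliberately imbalanced illustrative counts, showing how balanced accuracy and macro-F1 expose what micro-F1 (here equal to accuracy) hides:

# Illustrative imbalanced counts: 50 positives, 950 negatives
TP <- 10; FN <- 40; FP <- 30; TN <- 920

recall_pos   <- TP / (TP + FN)                 # 0.20 (sensitivity)
recall_neg   <- TN / (TN + FP)                 # ~0.97 (specificity)
balanced_acc <- (recall_pos + recall_neg) / 2  # ~0.58

f1 <- function(p, r) 2 * p * r / (p + r)
f1_pos <- f1(TP / (TP + FP), recall_pos)       # ~0.22: the minority class is weak
f1_neg <- f1(TN / (TN + FN), recall_neg)       # ~0.96

macro_f1 <- mean(c(f1_pos, f1_neg))            # ~0.59: every class counts equally

# Micro-averaging pools one-vs-rest counts; with one label per case it equals accuracy
tp_pool <- TP + TN; fp_pool <- FP + FN; fn_pool <- FN + FP
micro_f1 <- 2 * tp_pool / (2 * tp_pool + fp_pool + fn_pool)   # 0.93: tracks class frequency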
Caveats: when this is the wrong tool
If you have… → Use instead
Multi-class with one-vs-rest needs → This tool computes per-class one-vs-rest precision/recall/F1 plus macro and weighted (support-weighted) averages. For one-vs-one ROC you need a probability-based tool.
Predicted probabilities, not labels → Compute AUC, calibration curves, and the Brier score; those need probabilities rather than a single thresholded label. See the effect-size converter; a full ROC tool is planned.
Cost-sensitive evaluation → Set beta in F-beta to match the cost ratio of FN vs. FP, or compute the expected cost directly: C = c_FP * FP + c_FN * FN (see the sketch below). The matrix has everything you need.
Time series with concept drift → A static confusion matrix mixes pre- and post-drift errors. Slice by time window, or compute a rolling MCC/F1.
Multi-label outputs (multiple classes per case) → A different metric set applies (subset accuracy, Hamming loss); not covered in v1 of this tool.
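The expected-cost calculation from the cost-sensitive row is just the formula above; a sketch with placeholder counts and costs (a false negative priced 20x a false positive, as in a screening setting):

# Placeholder counts and per-error costs
FP <- 30; FN <- 40
c_FP <- 1; c_FN <- 20

expected_cost <- c_FP * FP + c_FN * FN   # 830 cost units for this matrix
expected_cost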

Numerical accuracy: Wilson interval for accuracy CI; MCC and kappa computed from raw counts (no rounding). caret::confusionMatrix and mltools::mcc give matching values to 5 decimal places.
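The Wilson interval can be computed from the number of correct predictions alone; a minimal sketch with placeholder counts:

# Wilson score interval for accuracy, from x correct predictions out of n
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre  <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  halfwid <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - halfwid, upper = centre + halfwid)
}

wilson_ci(170, 200)   # placeholder counts: roughly 0.79 to 0.89
# prop.test(170, 200, correct = FALSE)$conf.int should give the same Wilson interval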