ROC / AUC Calculator
An ROC curve traces the trade-off between true-positive and false-positive rates as you sweep a classifier's threshold; the AUC summarizes overall discrimination. Paste outcomes plus predicted scores to get the curve, AUC with DeLong CI, three optimal thresholds (Youden, F1, cost-weighted), and a calibration check.
New to ROC and AUC? Read the 4-minute primer below.
What ROC and AUC mean. A binary classifier produces a score for each case. As you sweep the decision threshold from high to low, you trade specificity for sensitivity. The ROC curve plots true-positive rate (sensitivity) against false-positive rate (1 - specificity) at every threshold. AUC is the area under that curve and equals the probability that a randomly drawn positive case scores higher than a randomly drawn negative one. AUC of 0.5 is random; 1.0 is perfect; 0 is perfectly wrong.
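As a concrete illustration of that probabilistic identity, here is a minimal base-R sketch; the y and score vectors are made-up toy data, not part of this tool.

```r
# AUC as P(random positive scores higher than random negative),
# computed as a Mann-Whitney U statistic with mid-rank ties.
auc_rank <- function(y, score) {
  # y: 0/1 outcomes; score: predicted scores (higher = more positive)
  r <- rank(score)                        # rank() averages tied ranks
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
y     <- rbinom(200, 1, 0.3)
score <- y + rnorm(200)    # positives score higher on average
auc_rank(y, score)         # well above 0.5, well below 1 on this toy data
```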
How to read the curve and the threshold. The top-left corner is perfect. The diagonal is random. A curve that hugs the top-left has good discrimination. Each point on the curve corresponds to a specific threshold; sliding the threshold trades false alarms for misses. The threshold marker on the curve shows where you currently sit, and the confusion matrix updates so you can see exactly how many TPs, FPs, FNs, and TNs that threshold produces.
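A short sketch of that confusion-matrix update, reusing the toy y and score from the primer sketch above; the two threshold values are arbitrary.

```r
# Confusion matrix at one threshold: sliding thr trades FPs for FNs.
confusion_at <- function(y, score, thr) {
  pred <- as.integer(score >= thr)
  c(TP = sum(pred == 1 & y == 1), FP = sum(pred == 1 & y == 0),
    FN = sum(pred == 0 & y == 1), TN = sum(pred == 0 & y == 0))
}
confusion_at(y, score, thr = 0.5)
confusion_at(y, score, thr = 1.5)   # stricter: fewer FPs, more FNs
```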
Picking the right optimum. Youden’s J maximises sensitivity + specificity - 1, which is the same as the point on the curve farthest above the diagonal. F1 maximises 2PR / (P + R), which weighs precision and recall equally and ignores true negatives, so it suits problems where the positive class is what you care about and you want a single score. Cost-weighted minimises FP × cost_FP + FN × cost_FN, the right answer when you can put a number on a missed positive vs. a false alarm. They usually disagree, and that disagreement is information about your problem.
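Here is one way the three optima could be scanned, again on the toy data above; cost_fp = 1 and cost_fn = 5 are illustrative numbers, not defaults of this tool.

```r
# Scan every observed score as a candidate threshold and score it
# three ways: Youden's J, F1, and total misclassification cost.
best_thresholds <- function(y, score, cost_fp = 1, cost_fn = 5) {
  thr <- sort(unique(score))
  stats <- t(sapply(thr, function(t) {
    tp <- sum(score >= t & y == 1); fp <- sum(score >= t & y == 0)
    fn <- sum(score <  t & y == 1); tn <- sum(score <  t & y == 0)
    sens <- tp / (tp + fn)
    spec <- tn / (tn + fp)
    prec <- if (tp + fp == 0) 0 else tp / (tp + fp)
    f1   <- if (prec + sens == 0) 0 else 2 * prec * sens / (prec + sens)
    c(youden = sens + spec - 1, f1 = f1, cost = fp * cost_fp + fn * cost_fn)
  }))
  c(youden = thr[which.max(stats[, "youden"])],
    f1     = thr[which.max(stats[, "f1"])],
    cost   = thr[which.min(stats[, "cost"])])
}
best_thresholds(y, score)   # the three optima usually disagree
```

With cost_fn five times cost_fp, the cost-weighted threshold typically lands below Youden’s: missing a positive is expensive, so the sketch tolerates more false alarms.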
When ROC is the wrong tool. On heavily imbalanced data (positives below 5%), ROC flatters you because changes in the FP count barely move the FPR; report a precision-recall curve too. For multi-class problems, ROC becomes one-vs-rest per class plus a macro average; this tool’s v1 is binary. For survival outcomes, use time-dependent ROC. And ROC measures discrimination only: a model can have an AUC of 0.9 yet emit miscalibrated probabilities that overstate confidence. The Brier score and reliability bins below catch that.
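A sketch of that calibration check, assuming the scores have first been mapped to the probability scale; the 10-bin split is arbitrary.

```r
# Brier score: mean squared gap between predicted probability and outcome.
brier <- function(y, p) mean((p - y)^2)

# Reliability bins: within each probability bin, compare the mean
# prediction to the observed positive rate; large gaps mean miscalibration.
reliability_bins <- function(y, p, bins = 10) {
  b <- cut(p, breaks = seq(0, 1, length.out = bins + 1), include.lowest = TRUE)
  data.frame(mean_pred = tapply(p, b, mean),
             obs_rate  = tapply(y, b, mean),
             n         = as.vector(table(b)))
}

p <- plogis(score - 0.5)   # squash the toy scores into (0, 1) for illustration
brier(y, p)
reliability_bins(y, p)     # mean_pred far from obs_rate flags bad bins
```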
Try a real-world example: a well-separated screening problem with AUC near 0.95.
Read more: ROC anatomy and the DeLong derivation.
Caveats: when this is the wrong tool
| If you have… | Use instead |
| --- | --- |
| Multi-class outcomes (3+ classes) | This v1 is binary. Compute one-vs-rest ROC per class plus a macro-averaged AUC manually (see the sketch after this table), or pick a one-vs-rest tool. Multi-class is on the roadmap. |
| Survival outcomes (time-to-event) | Use time-dependent ROC (survivalROC, timeROC). The ROC curve becomes a function of the horizon t. See the survival analysis tutorial. |
| Heavy class imbalance (positives < 5%) | ROC is misleading because the FPR axis barely moves. Report a precision-recall curve and PR-AUC instead; they show the precision degradation that ROC hides. |
| Only a confusion matrix | You can’t reconstruct an ROC curve from a single threshold. Use the confusion matrix interpreter for accuracy / precision / recall / MCC at that one threshold. |
| Probabilities matter, not ranking | AUC is invariant under monotonic transforms; a model can have an AUC of 0.95 and be wildly miscalibrated. Always check the Brier score and the reliability diagram below the AUC. |
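For the multi-class row above, a minimal one-vs-rest sketch that reuses the auc_rank helper from the primer; labels is a factor of class labels and scores a hypothetical n × K matrix with one score column per class.

```r
# One-vs-rest macro AUC: treat each class in turn as the positive
# class, compute a binary AUC, then average the K values unweighted.
macro_auc_ovr <- function(labels, scores) {
  aucs <- sapply(colnames(scores), function(k)
    auc_rank(as.integer(labels == k), scores[, k]))
  list(per_class = aucs, macro = mean(aucs))
}
```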
- Logistic regression in R, end to end – how predicted probabilities become a ROC curve.
- Confusion matrix interpreter – once you pick a threshold, this tells you which metric to trust.
- Logistic regression diagnostics – Hosmer-Lemeshow, calibration, separation.
Numerical accuracy: AUC is computed as a Mann-Whitney U statistic with mid-rank ties (matching the pROC default). The DeLong CI is computed on the logit scale, then back-transformed. Results match pROC::auc() and pROC::ci.auc(method="delong") to 4 decimals.
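For readers who want to check that description against their own numbers, here is a base-R sketch of the DeLong variance and the logit-scale CI; it illustrates the procedure described above and can be compared with pROC::ci.auc(method = "delong") on real data.

```r
# DeLong (1988) variance of the AUC; CI built on the logit scale,
# then back-transformed, as described above.
delong_ci <- function(y, score, level = 0.95) {
  x <- score[y == 1]; z <- score[y == 0]   # positive and negative scores
  m <- length(x); n <- length(z)
  psi <- outer(x, z, function(a, b) (a > b) + 0.5 * (a == b))
  auc <- mean(psi)                         # same AUC as the mid-rank formula
  v10 <- rowMeans(psi)                     # placement value per positive
  v01 <- colMeans(psi)                     # placement value per negative
  se  <- sqrt(var(v10) / m + var(v01) / n)
  # Delta method: se of logit(AUC) is se / (AUC * (1 - AUC));
  # degenerates if the AUC is exactly 0 or 1.
  q  <- qnorm(1 - (1 - level) / 2)
  lo <- qlogis(auc) - q * se / (auc * (1 - auc))
  hi <- qlogis(auc) + q * se / (auc * (1 - auc))
  c(lower = plogis(lo), auc = auc, upper = plogis(hi))
}
delong_ci(y, score)   # compare: pROC::ci.auc(pROC::roc(y, score), method = "delong")
```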