r-statistics.co

ROC / AUC Calculator

An ROC curve traces the trade-off between true-positive and false-positive rates as you sweep a classifier's threshold; the AUC summarizes overall discrimination. Paste outcomes plus predicted scores to get the curve, AUC with DeLong CI, three optimal thresholds (Youden, F1, cost-weighted), and a calibration check.

New to ROC and AUC? Read the 4-minute primer below.

What ROC and AUC mean. A binary classifier produces a score for each case. As you sweep the decision threshold from high to low, you trade specificity for sensitivity. The ROC curve plots true-positive rate (sensitivity) against false-positive rate (1 - specificity) at every threshold. AUC is the area under that curve and equals the probability that a randomly drawn positive case scores higher than a randomly drawn negative one. AUC of 0.5 is random; 1.0 is perfect; 0 is perfectly wrong.
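
If you want to reproduce what the calculator reports, here is a minimal sketch in R with the pROC package; the outcome and score vectors are toy data, purely illustrative.

# Toy data: y is the 0/1 outcome, score is the model's predicted score
library(pROC)

y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)

roc_obj <- roc(response = y, predictor = score, quiet = TRUE)
auc(roc_obj)                          # area under the curve
ci.auc(roc_obj, method = "delong")    # 95% DeLong confidence interval
plot(roc_obj)                         # sensitivity vs specificity (pROC's default axes)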

How to read the curve and the threshold. The top-left corner is perfect. The diagonal is random. A curve that hugs the top-left has good discrimination. Each point on the curve corresponds to a specific threshold; sliding the threshold trades false alarms for misses. The threshold marker on the curve shows where you currently sit, and the confusion matrix updates so you can see exactly how many TPs, FPs, FNs, and TNs that threshold produces.

Picking the right optimum. Youden’s J maximises sensitivity + specificity - 1, which is the same as the point on the curve farthest above the diagonal. F1 maximises 2 P R / (P + R), which weighs precision and recall equally and is preferred when classes are balanced and you want a single score. Cost-weighted minimises FP * cost_FP + FN * cost_FN, which is the right answer when you can put a number on a missed positive vs. a false alarm. They usually disagree, and that disagreement is information about your problem.

When ROC is the wrong tool. On heavily imbalanced data (positives below 5%), ROC flatters you because changes in FP count barely move the FPR; report a precision-recall curve too. For multi-class problems, ROC becomes one-vs-rest per class plus a macro average; this v1 of the tool is binary only. For survival outcomes, use time-dependent ROC. And ROC measures discrimination only; a model can have an AUC of 0.9 yet emit miscalibrated probabilities that overstate confidence. The Brier score and reliability bins below catch that.

AUC + DeLong CI · Three optimal thresholds · Calibration check · Runs in your browser

Try a real-world example (click to load).

🔬 Breast cancer (good)

A well-separated screening problem. AUC near 0.95.

[Results panel: AUC with 95% DeLong CI, the three optimal thresholds (Youden J with sens / spec, F1 with prec / rec, cost-weighted with FP + k * FN), and a runnable R snippet that reproduces the analysis. Below it: the interactive ROC curve (FPR on x, TPR on y; diagonal = random), the score distribution by outcome, the current threshold (0.50 by default), and a calibration check reporting the Brier score and reliability bins. All panels populate once you paste data on the left.]

ROC anatomy and DeLong derivation
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
ROC = { (FPR(t), TPR(t)) : t in scores }
Sweeping the threshold. Sort cases by score, descending. As you walk down, every positive you encounter steps the curve up (TPR increases by 1/n_pos), every negative steps it right (FPR increases by 1/n_neg). The result is a staircase from (0,0) to (1,1). Tied scores give diagonal segments; the area under the staircase is the AUC.
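A minimal base-R sketch of that sweep, using the same style of toy vectors as the primer (illustrative only):

# Sweep the threshold from high to low and build the staircase
y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)

ord   <- order(score, decreasing = TRUE)       # walk down the sorted scores
y_ord <- y[ord]

tpr <- c(0, cumsum(y_ord == 1) / sum(y == 1))  # each positive steps the curve up
fpr <- c(0, cumsum(y_ord == 0) / sum(y == 0))  # each negative steps it right

plot(fpr, tpr, type = "s", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)                          # diagonal = random
# (tied scores would need grouping into diagonal segments; omitted here)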
AUC = U / (n_pos * n_neg)
U = sum over pairs (i, j) of [score_i > score_j] + 0.5 * [score_i = score_j], i positive, j negative
AUC equals P(score+ > score-). The Mann-Whitney U statistic counts pairs where a positive scores higher than a negative, with ties counted as half. Dividing by the number of such pairs (n_pos times n_neg) gives a probability. So AUC = 0.83 literally means: pick one positive and one negative at random; with probability 0.83 the positive scores higher. This makes AUC scale-free (any monotonic transform of score gives the same AUC).
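The same identity in code: AUC from mid-ranks, checked against literal pair counting (toy vectors again, illustrative):

# AUC via the Mann-Whitney U statistic with mid-rank ties
auc_rank <- function(y, score) {
  r     <- rank(score)                          # mid-ranks give tied pairs a weight of 0.5
  n_pos <- sum(y == 1); n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Literal definition: P(random positive outscores random negative), ties counted as half
auc_pairs <- function(y, score) {
  pos <- score[y == 1]; neg <- score[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)
auc_rank(y, score)    # identical to auc_pairs(y, score)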
V_10[i] = (1/n_neg) * sum_j psi(X_i, Y_j)   (i positive)
V_01[j] = (1/n_pos) * sum_i psi(X_i, Y_j)   (j negative)
psi(x, y) = 1 if x > y, 0.5 if x = y, 0 otherwise
SE(AUC) = sqrt( var(V_10)/n_pos + var(V_01)/n_neg )
DeLong’s structural components. DeLong, DeLong & Clarke-Pearson (1988) showed that AUC is the mean of structural components V_10 (one per positive) and V_01 (one per negative), and that its variance can be estimated nonparametrically from the sample variances of those components. This tool computes V_10 and V_01 directly, then forms the CI on the logit scale (log(AUC/(1-AUC))) and back-transforms, which keeps the interval inside [0,1] for AUC near 1.
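A sketch of that computation following the formulas above; the function name and toy data are illustrative, not the tool's internals.

# DeLong variance from structural components, CI formed on the logit scale
delong_ci <- function(y, score, level = 0.95) {
  pos <- score[y == 1]; neg <- score[y == 0]
  psi <- function(x, yv) (x > yv) + 0.5 * (x == yv)

  v10 <- sapply(pos, function(x)  mean(psi(x, neg)))   # one component per positive
  v01 <- sapply(neg, function(yv) mean(psi(pos, yv)))  # one component per negative

  auc <- mean(v10)                                     # equals the Mann-Whitney AUC
  se  <- sqrt(var(v10) / length(pos) + var(v01) / length(neg))

  # Delta method on log(AUC / (1 - AUC)) keeps the interval inside (0, 1)
  z      <- qnorm(1 - (1 - level) / 2)
  se_log <- se / (auc * (1 - auc))
  ci     <- plogis(log(auc / (1 - auc)) + c(-1, 1) * z * se_log)
  c(auc = auc, lower = ci[1], upper = ci[2])
}

y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)
delong_ci(y, score)   # compare with pROC::ci.auc(roc(y, score), method = "delong")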
Youden: argmax_t TPR(t) + (1 - FPR(t)) - 1
F1: argmax_t 2 P(t) R(t) / (P(t) + R(t))
Cost: argmin_t FP(t) + k * FN(t)
Three optima, three answers. Youden picks the point of maximum vertical distance above the diagonal; it weights sensitivity and specificity equally. F1 weights precision and recall equally and is the right choice when classes are balanced and you care about retrieval. Cost-weighted lets you set the relative cost of a missed positive (k=10 means a missed positive is ten times worse than a false alarm). When the three disagree, the disagreement tells you that costs and prevalence matter for your decision.
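A sketch that scores every observed threshold against all three criteria (k and the toy data are illustrative):

# Evaluate candidate thresholds and pick the three optima
y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)
k     <- 10                                    # a missed positive costs 10x a false alarm

thr  <- sort(unique(score))
crit <- t(sapply(thr, function(t) {
  pred <- as.integer(score >= t)
  tp <- sum(pred == 1 & y == 1); fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1); tn <- sum(pred == 0 & y == 0)
  tpr  <- tp / (tp + fn)
  fpr  <- fp / (fp + tn)
  prec <- if (tp + fp > 0) tp / (tp + fp) else NA
  f1   <- if (!is.na(prec) && prec + tpr > 0) 2 * prec * tpr / (prec + tpr) else NA
  c(youden = tpr - fpr, f1 = f1, cost = fp + k * fn)
}))

thr[which.max(crit[, "youden"])]   # Youden J optimum
thr[which.max(crit[, "f1"])]       # F1 optimum
thr[which.min(crit[, "cost"])]     # cost-weighted optimum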
Brier = mean( (score - y)^2 )
Calibration: bin scores into deciles, plot mean(score) vs mean(y) per bin
Discrimination is not calibration. AUC tells you whether high-risk cases score higher than low-risk ones, but says nothing about whether a predicted probability of 0.8 actually corresponds to an 80% event rate. The Brier score captures both (lower is better: 0 is perfect, always predicting 0.5 scores 0.25, and confidently wrong predictions approach 1). Reliability bins make miscalibration visible: if cases predicted at 0.8-0.9 have an empirical event rate of 0.5, your model is overconfident in that range.
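A sketch of that calibration check; this version uses equal-width 0.1 bins rather than score deciles, and the toy vectors are illustrative.

# Brier score and a simple reliability table
y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)

brier <- mean((score - y)^2)             # 0 is perfect; always predicting 0.5 gives 0.25

bins <- cut(score, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
reliability <- data.frame(
  mean_pred = tapply(score, bins, mean),  # average predicted probability per bin
  mean_obs  = tapply(y,     bins, mean),  # empirical event rate per bin
  n         = as.vector(table(bins))
)
reliability[!is.na(reliability$mean_pred), ]  # large mean_pred vs mean_obs gaps = miscalibration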
Caveats: when this is the wrong tool
If you have… → Use instead
Multi-class outcomes (3+ classes) → This v1 is binary. Compute one-vs-rest ROC per class plus a macro-averaged AUC manually, or pick a one-vs-rest tool. Multi-class is on the roadmap.
Survival outcomes (time-to-event) → Use time-dependent ROC (survivalROC, timeROC). The ROC curve becomes a function of the horizon t. See the survival analysis tutorial.
Heavy class imbalance (positives < 5%) → ROC is misleading because the FPR axis barely moves. Report a precision-recall curve and PR-AUC instead; they show the precision degradation that ROC hides.
Just a confusion matrix → You can’t reconstruct an ROC curve from a single threshold. Use the confusion matrix interpreter for accuracy / precision / recall / MCC at that one threshold.
Probabilities matter, not ranking → AUC is invariant under monotonic transforms; a model can have an AUC of 0.95 and be wildly miscalibrated. Always look at the Brier score and the reliability diagram below the AUC.

Numerical accuracy: AUC is computed as the Mann-Whitney U statistic with mid-rank ties (matching pROC's default). The DeLong CI is formed on the logit scale, then back-transformed. Results match pROC::auc() and pROC::ci.auc(method="delong") to 4 decimal places.