r-statistics.co

ROC / AUC Calculator

An ROC curve traces the trade-off between true-positive and false-positive rates as you sweep a classifier's threshold; the AUC summarizes overall discrimination. Paste outcomes plus predicted scores to get the curve, AUC with DeLong CI, three optimal thresholds (Youden, F1, cost-weighted), and a calibration check.

New to ROC and AUC? Read the 4-minute primer below.

What ROC and AUC mean. A binary classifier produces a score for each case. As you sweep the decision threshold from high to low, you trade specificity for sensitivity. The ROC curve plots true-positive rate (sensitivity) against false-positive rate (1 - specificity) at every threshold. AUC is the area under that curve and equals the probability that a randomly drawn positive case scores higher than a randomly drawn negative one. AUC of 0.5 is random; 1.0 is perfect; 0 is perfectly wrong.
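
If you want to reproduce what the calculator reports, here is a minimal sketch in R with the pROC package; the outcome and score vectors are toy data, purely illustrative.

# Toy data: y is the 0/1 outcome, score is the model's predicted score
library(pROC)

y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)

roc_obj <- roc(response = y, predictor = score, quiet = TRUE)
auc(roc_obj)                          # area under the curve
ci.auc(roc_obj, method = "delong")    # 95% DeLong confidence interval
plot(roc_obj)                         # sensitivity vs specificity (pROC's default axes)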

How to read the curve and the threshold. The top-left corner is perfect. The diagonal is random. A curve that hugs the top-left has good discrimination. Each point on the curve corresponds to a specific threshold; sliding the threshold trades false alarms for misses. The threshold marker on the curve shows where you currently sit, and the confusion matrix updates so you can see exactly how many TPs, FPs, FNs, and TNs that threshold produces.

Picking the right optimum. Youden’s J maximises sensitivity + specificity - 1, which is the same as the point on the curve farthest above the diagonal. F1 maximises 2 P R / (P + R), which weighs precision and recall equally and is preferred when classes are balanced and you want a single score. Cost-weighted minimises FP * cost_FP + FN * cost_FN, which is the right answer when you can put a number on a missed positive vs. a false alarm. They usually disagree, and that disagreement is information about your problem.

When ROC is the wrong tool. On heavily imbalanced data (positives below 5%), ROC flatters you because changes in FP count barely move the FPR; report a precision-recall curve too. For multi-class problems, ROC becomes one-vs-rest per class plus a macro average; this v1 of the tool is binary only. For survival outcomes, use time-dependent ROC. And ROC measures discrimination only; a model can have an AUC of 0.9 yet emit miscalibrated probabilities that overstate confidence. The Brier score and reliability bins below catch that.

AUC + DeLong CI · Three optimal thresholds · Calibration check · Runs in your browser

Try a real-world example (click to load).

🔬 Breast cancer (good)

A well-separated screening problem. AUC near 0.95.

[Results panel: AUC with 95% DeLong CI, the three optimal thresholds (Youden J with sens / spec, F1 with prec / rec, cost-weighted with FP + k * FN), and a runnable R snippet that reproduces the analysis. Below it: the interactive ROC curve (FPR on x, TPR on y; diagonal = random), the score distribution by outcome, the current threshold (0.50 by default), and a calibration check reporting the Brier score and reliability bins. All panels populate once you paste data on the left.]

ROC anatomy and DeLong derivation
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
ROC = { (FPR(t), TPR(t)) : t in scores }
Sweeping the threshold. Sort cases by score, descending. As you walk down, every positive you encounter steps the curve up (TPR increases by 1/n_pos), every negative steps it right (FPR increases by 1/n_neg). The result is a staircase from (0,0) to (1,1). Tied scores give diagonal segments; the area under the staircase is the AUC.
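A minimal base-R sketch of that sweep, using the same style of toy vectors as the primer (illustrative only):

# Sweep the threshold from high to low and build the staircase
y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)

ord   <- order(score, decreasing = TRUE)       # walk down the sorted scores
y_ord <- y[ord]

tpr <- c(0, cumsum(y_ord == 1) / sum(y == 1))  # each positive steps the curve up
fpr <- c(0, cumsum(y_ord == 0) / sum(y == 0))  # each negative steps it right

plot(fpr, tpr, type = "s", xlab = "FPR", ylab = "TPR")
abline(0, 1, lty = 2)                          # diagonal = random
# (tied scores would need grouping into diagonal segments; omitted here)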
AUC = U / (n_pos * n_neg)
U = sum over pairs (i, j) of [score_i > score_j] + 0.5 * [score_i = score_j], i positive, j negative
AUC equals P(score+ > score-). The Mann-Whitney U statistic counts pairs where a positive scores higher than a negative, with ties counted as half. Dividing by the number of such pairs (n_pos times n_neg) gives a probability. So AUC = 0.83 literally means: pick one positive and one negative at random; with probability 0.83 the positive scores higher. This makes AUC scale-free (any monotonic transform of score gives the same AUC).
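The same identity in code: AUC from mid-ranks, checked against literal pair counting (toy vectors again, illustrative):

# AUC via the Mann-Whitney U statistic with mid-rank ties
auc_rank <- function(y, score) {
  r     <- rank(score)                          # mid-ranks give tied pairs a weight of 0.5
  n_pos <- sum(y == 1); n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Literal definition: P(random positive outscores random negative), ties counted as half
auc_pairs <- function(y, score) {
  pos <- score[y == 1]; neg <- score[y == 0]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)
auc_rank(y, score)    # identical to auc_pairs(y, score)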
V_10[i] = (1/n_neg) * sum_j psi(X_i, Y_j)   (i positive)
V_01[j] = (1/n_pos) * sum_i psi(X_i, Y_j)   (j negative)
psi(x, y) = 1 if x > y, 0.5 if x = y, 0 otherwise
SE(AUC) = sqrt( var(V_10)/n_pos + var(V_01)/n_neg )
DeLong’s structural components. DeLong, DeLong & Clarke-Pearson (1988) showed that AUC is the mean of structural components V_10 (one per positive) and V_01 (one per negative), and that its variance can be estimated nonparametrically from the sample variances of those components. This tool computes V_10 and V_01 directly, then forms the CI on the logit scale (log(AUC/(1-AUC))) and back-transforms, which keeps the interval inside [0,1] for AUC near 1.
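A sketch of that computation following the formulas above; the function name and toy data are illustrative, not the tool's internals.

# DeLong variance from structural components, CI formed on the logit scale
delong_ci <- function(y, score, level = 0.95) {
  pos <- score[y == 1]; neg <- score[y == 0]
  psi <- function(x, yv) (x > yv) + 0.5 * (x == yv)

  v10 <- sapply(pos, function(x)  mean(psi(x, neg)))   # one component per positive
  v01 <- sapply(neg, function(yv) mean(psi(pos, yv)))  # one component per negative

  auc <- mean(v10)                                     # equals the Mann-Whitney AUC
  se  <- sqrt(var(v10) / length(pos) + var(v01) / length(neg))

  # Delta method on log(AUC / (1 - AUC)) keeps the interval inside (0, 1)
  z      <- qnorm(1 - (1 - level) / 2)
  se_log <- se / (auc * (1 - auc))
  ci     <- plogis(log(auc / (1 - auc)) + c(-1, 1) * z * se_log)
  c(auc = auc, lower = ci[1], upper = ci[2])
}

y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)
delong_ci(y, score)   # compare with pROC::ci.auc(roc(y, score), method = "delong")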
Youden: argmax_t TPR(t) + (1 - FPR(t)) - 1
F1: argmax_t 2 P(t) R(t) / (P(t) + R(t))
Cost: argmin_t FP(t) + k * FN(t)
Three optima, three answers. Youden picks the point of maximum vertical distance above the diagonal; it weights sensitivity and specificity equally. F1 weights precision and recall equally and is the right choice when classes are balanced and you care about retrieval. Cost-weighted lets you set the relative cost of a missed positive (k=10 means a missed positive is ten times worse than a false alarm). When the three disagree, the disagreement tells you that costs and prevalence matter for your decision.
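A sketch that scores every observed threshold against all three criteria (k and the toy data are illustrative):

# Evaluate candidate thresholds and pick the three optima
y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)
k     <- 10                                    # a missed positive costs 10x a false alarm

thr  <- sort(unique(score))
crit <- t(sapply(thr, function(t) {
  pred <- as.integer(score >= t)
  tp <- sum(pred == 1 & y == 1); fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1); tn <- sum(pred == 0 & y == 0)
  tpr  <- tp / (tp + fn)
  fpr  <- fp / (fp + tn)
  prec <- if (tp + fp > 0) tp / (tp + fp) else NA
  f1   <- if (!is.na(prec) && prec + tpr > 0) 2 * prec * tpr / (prec + tpr) else NA
  c(youden = tpr - fpr, f1 = f1, cost = fp + k * fn)
}))

thr[which.max(crit[, "youden"])]   # Youden J optimum
thr[which.max(crit[, "f1"])]       # F1 optimum
thr[which.min(crit[, "cost"])]     # cost-weighted optimum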
Brier = mean( (score - y)^2 )
Calibration: bin scores into deciles, plot mean(score) vs mean(y) per bin
Discrimination is not calibration. AUC tells you whether high-risk cases score higher than low-risk ones, but says nothing about whether a predicted probability of 0.8 actually corresponds to an 80% event rate. The Brier score captures both (lower is better: 0 is perfect, always predicting 0.5 scores 0.25, and confidently wrong predictions approach 1). Reliability bins make miscalibration visible: if cases predicted at 0.8-0.9 have an empirical event rate of 0.5, your model is overconfident in that range.
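A sketch of that calibration check; this version uses equal-width 0.1 bins rather than score deciles, and the toy vectors are illustrative.

# Brier score and a simple reliability table
y     <- c(0, 0, 1, 0, 1, 1, 0, 1, 1, 1)
score <- c(0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.30, 0.80, 0.70, 0.90)

brier <- mean((score - y)^2)             # 0 is perfect; always predicting 0.5 gives 0.25

bins <- cut(score, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
reliability <- data.frame(
  mean_pred = tapply(score, bins, mean),  # average predicted probability per bin
  mean_obs  = tapply(y,     bins, mean),  # empirical event rate per bin
  n         = as.vector(table(bins))
)
reliability[!is.na(reliability$mean_pred), ]  # large mean_pred vs mean_obs gaps = miscalibration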
Caveats: when this is the wrong tool
If you have… → Use instead
Multi-class outcomes (3+ classes) → This v1 is binary. Compute one-vs-rest ROC per class plus a macro-averaged AUC manually, or pick a one-vs-rest tool. Multi-class is on the roadmap.
Survival outcomes (time-to-event) → Use time-dependent ROC (survivalROC, timeROC). The ROC curve becomes a function of the horizon t. See the survival analysis tutorial.
Heavy class imbalance (positives < 5%) → ROC is misleading because the FPR axis barely moves. Report a precision-recall curve and PR-AUC instead; they show the precision degradation that ROC hides.
Just a confusion matrix → You can’t reconstruct an ROC curve from a single threshold. Use the confusion matrix interpreter for accuracy / precision / recall / MCC at that one threshold.
Probabilities matter, not ranking → AUC is invariant under monotonic transforms; a model can have an AUC of 0.95 and be wildly miscalibrated. Always look at the Brier score and the reliability diagram below the AUC.

Numerical accuracy: AUC is computed as the Mann-Whitney U statistic with mid-rank ties (matching pROC's default). The DeLong CI is formed on the logit scale, then back-transformed. Results match pROC::auc() and pROC::ci.auc(method="delong") to 4 decimal places.