yardstick roc_auc() in R: Score Classifier Ranking Quality
The yardstick roc_auc() function in R scores how well a classifier ranks positive cases above negative ones, accepting a tibble of truth labels and predicted probabilities and returning a tidy one-row summary you can group, average across classes, or combine with other metrics in a metric set.
roc_auc(df, truth, .pred_class1)                                # basic two-class call
roc_auc(df, truth = obs, .pred_class1)                          # named truth column
roc_auc(df, class, .pred_yes, event_level = "second")           # flip positive class
df |> group_by(fold) |> roc_auc(class, .pred_yes)               # by resample
roc_auc(df, class, .pred_a, .pred_b, .pred_c)                   # multiclass, all prob cols
roc_auc(df, class, .pred_a:.pred_c, estimator = "hand_till")    # hand_till averaging
roc_auc_vec(truth_vec, prob_vec)                                # vector interface
Need explanation? Read on for examples and pitfalls.
What roc_auc() measures
roc_auc() answers one question: how well does the model rank positives above negatives across every threshold? You pass a data frame with the truth column and one or more predicted-probability columns, and the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate equals the probability that a random positive scores higher than a random negative.
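The pairwise-probability reading can be checked by hand. The sketch below uses a made-up six-row example (the vectors truth and prob_yes are illustrative, not from any real model) and counts a tie as half a win, which is the usual convention for this interpretation.

```r
library(yardstick)

# Hypothetical toy data: three positives ("yes") and three negatives ("no").
truth <- factor(c("yes", "yes", "yes", "no", "no", "no"), levels = c("yes", "no"))
prob_yes <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2)

pos <- prob_yes[truth == "yes"]
neg <- prob_yes[truth == "no"]

# Compare every positive with every negative; a tie counts as half a win.
wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
manual_auc <- mean(wins)

# The vector interface should agree with the pairwise calculation.
yardstick_auc <- roc_auc_vec(truth, prob_yes)

c(manual = manual_auc, yardstick = yardstick_auc)
```

Eight of the nine positive-negative pairs are ordered correctly here, so both numbers come out at 8/9.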
A score of 1.0 means perfect ranking, 0.5 means random guessing, and below 0.5 means the model is inverted. Because AUC sweeps every threshold, it sidesteps the cutoff debate, which is why leaderboards on imbalanced data lean on it even though it ignores calibration.
Pair roc_auc() with brier_class() if calibration matters.
roc_auc() syntax and arguments
The signature follows the yardstick probability-metric family. Argument shape changes slightly between two-class and multiclass because multiclass needs one column per level.
| Argument | Description |
|---|---|
| data | A data frame with truth and probability columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| ... | Unquoted column names of the predicted probabilities. One column for binary; one per class for multiclass. |
| estimator | Averaging mode for multiclass: "hand_till" (default), "macro", or "macro_weighted". Ignored for binary. |
| na_rm | If TRUE, drop rows where any column is missing before scoring. |
| event_level | "first" or "second"; for binary, names the positive class. |
| case_weights | Optional unquoted column for weighted scoring. |
For binary problems, pass the probability column for the positive class as the third argument. For multiclass, pass every class probability column with tidyselect syntax, in the same order as the factor levels of truth. The truth column must be a factor; yardstick matches probability columns to levels by position, not by the name after .pred_.
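The two argument shapes can be sketched with the example data sets that ship with yardstick, two_class_example and hpc_cv. Their probability columns are named after the classes rather than with a .pred_ prefix, which is why the column order matters in the multiclass call.

```r
library(yardstick)

data(two_class_example, package = "yardstick")
data(hpc_cv, package = "yardstick")

# Binary: one probability column, the one for the event level ("Class1").
binary_auc <- roc_auc(two_class_example, truth, Class1)

# Multiclass: one column per level, supplied in factor-level order.
lvls <- levels(hpc_cv$obs)
multi_auc <- roc_auc(hpc_cv, obs, VF:L)

lvls
binary_auc
multi_auc
```

The .estimator column of each result shows which mode yardstick picked: binary for the first call and hand_till for the second.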
Score classifiers: four worked examples
The examples below build a two-class tibble so you can run roc_auc() end to end. Start with a small frame that mimics a churn-prediction problem.
Example 1 calls roc_auc() with the positive-class probability column. Because "churn" is the first factor level, yardstick treats it as the positive class.
A 0.682 AUC means a random churner scores higher than a random non-churner 68 percent of the time, the headline number for ranking-based decision systems.
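As a minimal sketch of Example 1, the frame below uses invented probabilities, so its AUC differs from the 0.682 quoted above. The name churn_df and every value in it are assumptions for illustration.

```r
library(yardstick)
library(tibble)

# Hypothetical churn predictions; values are invented for illustration.
churn_df <- tibble(
  truth = factor(
    c("churn", "churn", "churn", "churn",
      "stay", "stay", "stay", "stay", "stay", "stay"),
    levels = c("churn", "stay")  # "churn" comes first, so it is the default event
  ),
  .pred_churn = c(0.81, 0.64, 0.35, 0.52, 0.70, 0.30, 0.22, 0.45, 0.15, 0.58)
)

# Score the positive-class probability column against the truth factor.
ex1 <- roc_auc(churn_df, truth, .pred_churn)
ex1
```

For this made-up frame, 18 of the 24 churner/non-churner pairs are ranked correctly, so .estimate is 0.75.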
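Example 2 flips the positive class with event_level. Setting event_level = "second" makes "stay" the positive class, and yardstick then reads the supplied column as the probability of "stay". If you keep passing .pred_churn, the estimate becomes 1 minus the original AUC.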
The flipped estimate is the mirror score on the other class. Always state which class is positive in any AUC report; ambiguity here is the most common review comment.
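A sketch of the flip on a hypothetical frame (churn_df and its values are invented). It shows both outcomes: keeping .pred_churn mirrors the score, while switching to a matching .pred_stay column scores "stay" properly.

```r
library(yardstick)
library(tibble)

# Hypothetical churn predictions; values are invented for illustration.
p_churn <- c(0.81, 0.64, 0.35, 0.52, 0.70, 0.30, 0.22, 0.45, 0.15, 0.58)
churn_df <- tibble(
  truth = factor(rep(c("churn", "stay"), times = c(4, 6)),
                 levels = c("churn", "stay")),
  .pred_churn = p_churn,
  .pred_stay = 1 - p_churn
)

default_auc <- roc_auc(churn_df, truth, .pred_churn)

# Same column, flipped event: the score mirrors around 0.5.
mirrored <- roc_auc(churn_df, truth, .pred_churn, event_level = "second")

# Flipped event with the matching probability column: "stay" scored properly.
stay_auc <- roc_auc(churn_df, truth, .pred_stay, event_level = "second")

c(default = default_auc$.estimate,
  mirrored = mirrored$.estimate,
  stay = stay_auc$.estimate)
```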
Example 3 scores per resample fold. Cross-validated predictions pair with group_by() to give one AUC per fold without a loop.
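A sketch of the grouped call, assuming a hypothetical cv_preds frame with a fold column and two tiny folds of invented predictions:

```r
library(yardstick)
library(dplyr)
library(tibble)

# Hypothetical out-of-fold predictions; values are invented for illustration.
cv_preds <- tibble(
  fold = rep(c("fold1", "fold2"), each = 5),
  truth = factor(
    rep(c("churn", "churn", "stay", "stay", "stay"), times = 2),
    levels = c("churn", "stay")
  ),
  .pred_churn = c(0.8, 0.4, 0.5, 0.3, 0.1,
                  0.9, 0.7, 0.6, 0.2, 0.3)
)

# group_by() makes roc_auc() return one row per fold.
by_fold <- cv_preds |>
  group_by(fold) |>
  roc_auc(truth, .pred_churn)

by_fold
```

The result keeps the fold column next to .metric, .estimator, and .estimate, so it can be summarised with the usual dplyr verbs.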
Example 4 uses the vector interface for ad-hoc checks. roc_auc_vec() accepts a truth factor and a probability vector and returns a plain numeric scalar.
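A short sketch of the vector interface, using the two_class_example data that ships with yardstick instead of the churn frame:

```r
library(yardstick)

data(two_class_example, package = "yardstick")

# Pull the truth factor and the event-class probabilities out as plain vectors.
truth_vec <- two_class_example$truth
prob_vec <- two_class_example$Class1

# Returns a bare number rather than a tibble.
auc_value <- roc_auc_vec(truth_vec, prob_vec)
auc_value
```

Because the result is a plain numeric, it drops straight into a comparison or an if() check without pulling .estimate out of a tibble.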
metric_set(roc_auc, mn_log_loss, brier_class) builds a reusable function to score ranking and calibration together, the minimum honest report for probabilistic classifiers.
Multiclass averaging: hand_till, macro, and weighted
roc_auc() handles multiclass data, but the default averaging differs from recall(), which defaults to macro. Yardstick defaults to hand_till, the Hand-and-Till extension that averages AUC over every ordered class pair.
Scoring under each averaging mode highlights the choice:
| Estimator | What it does | When to use |
|---|---|---|
| hand_till | Average AUC over every ordered class pair | Default; insensitive to class prevalence |
| macro | One-vs-rest AUC per class, unweighted mean | Treat every class as equally important |
| macro_weighted | One-vs-rest AUC per class, weighted by prevalence | Reflect production traffic mix |
hand_till is the academic default because it does not depend on class frequencies. Production teams often prefer macro_weighted so the score tracks real traffic mix.
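The three modes can be compared side by side. This sketch assumes yardstick's bundled hpc_cv data (truth in obs, probability columns VF, F, M, L) as a stand-in for your own multiclass predictions.

```r
library(yardstick)

data(hpc_cv, package = "yardstick")

estimators <- c("hand_till", "macro", "macro_weighted")

# One AUC per averaging mode, collected into a named numeric vector.
multi_auc <- vapply(
  estimators,
  function(est) roc_auc(hpc_cv, obs, VF:L, estimator = est)$.estimate,
  numeric(1)
)

multi_auc
```

The gap between macro and macro_weighted gives a rough sense of how much class prevalence moves the score on your data.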
roc_auc vs pr_auc vs accuracy
Several metrics show up alongside AUC in a model report: the three in the heading plus two calibration scores. The table sorts out the overlap.
| Function | What it scores | When to prefer it |
|---|---|---|
| roc_auc() | Probability that positive outranks negative | Balanced data, ranking systems, threshold-free reports |
| pr_auc() | Area under the precision-recall curve | Severe class imbalance; positives are rare |
| accuracy() | Share of labels correct at one threshold | Reporting a deployed cutoff, sanity checks |
| mn_log_loss() | Mean negative log-likelihood | Calibration penalty for confident wrong predictions |
| brier_class() | Mean squared error between probability and truth | Calibration metric, lower is better |
roc_auc() reads optimistic on imbalanced data because the negative class dominates the false-positive denominator. On a 1:99 fraud problem, AUC can hit 0.95 while precision at the chosen cutoff stays below 0.20. Pair it with pr_auc() whenever positives are scarce.
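One way to report the two areas together is a metric set. The sketch below runs on yardstick's bundled two_class_example data, which is roughly balanced, so it only illustrates the mechanics rather than the imbalance effect; ranking_metrics is an illustrative name.

```r
library(yardstick)

data(two_class_example, package = "yardstick")

# Bundle ROC AUC, PR AUC, and log loss into one reusable scoring function.
ranking_metrics <- metric_set(roc_auc, pr_auc, mn_log_loss)

# Probability metrics share the same (data, truth, probability column) shape.
scores <- ranking_metrics(two_class_example, truth, Class1)
scores
```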
yardstick::roc_auc() is the equivalent of sklearn.metrics.roc_auc_score(). The estimator argument maps onto scikit-learn's multi_class and average parameters; hand_till aligns with multi_class="ovo", macro with multi_class="ovr", average="macro", and macro_weighted with multi_class="ovr", average="weighted".
Common pitfalls
Three small mistakes account for most roc_auc() errors.
The first is feeding class labels instead of probabilities. roc_auc() needs the probability of the positive class, not the predicted label. A common slip is passing .pred_class from parsnip::augment() when you meant .pred_yes.
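A sketch of the slip on yardstick's bundled two_class_example data, where predicted is the hard-label column and Class1 the probability column. It assumes a recent yardstick version, which rejects a factor estimate with an error; the exact message may vary between versions.

```r
library(yardstick)

data(two_class_example, package = "yardstick")

# Wrong: `predicted` is a factor of hard labels, not a probability.
bad <- tryCatch(
  roc_auc(two_class_example, truth, predicted),
  error = function(e) e
)

# Right: `Class1` holds the predicted probability of the event class.
good <- roc_auc(two_class_example, truth, Class1)

inherits(bad, "error")
good
```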
The second is forgetting that event_level and factor level order interact. The probability column name does not pin the positive class; the factor level order in truth does. If levels(df$truth) reads c("no", "yes"), then "no" is positive by default, so either pass .pred_no or keep .pred_yes and set event_level = "second".
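The level-order trap is easy to reproduce. The five-row frame below is invented for illustration, with "no" deliberately placed first in the factor levels.

```r
library(yardstick)
library(tibble)

# Hypothetical predictions; "no" is the first level, so it is the default event.
p_yes <- c(0.9, 0.6, 0.7, 0.2, 0.1)
df <- tibble(
  truth = factor(c("yes", "yes", "no", "no", "no"), levels = c("no", "yes")),
  .pred_yes = p_yes,
  .pred_no = 1 - p_yes
)

levels(df$truth)

# Mismatch: .pred_yes is read as the probability of "no", so the score inverts.
mismatch <- roc_auc(df, truth, .pred_yes)

# Two equivalent fixes: match the column to the default event, or move the event.
fix_col <- roc_auc(df, truth, .pred_no)
fix_level <- roc_auc(df, truth, .pred_yes, event_level = "second")

c(mismatch = mismatch$.estimate,
  fix_col = fix_col$.estimate,
  fix_level = fix_level$.estimate)
```

Checking levels() before scoring is the cheapest guard; an AUC far below 0.5 on a model that otherwise looks sensible is the usual symptom of the mismatch.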
The third pitfall is reporting AUC at face value on severely imbalanced data. A high AUC can hide a useless operating point because the false-positive denominator is dominated by negatives. Switch to pr_auc() whenever the minority class is under 10 percent.
Try it yourself
Try it: Use the built-in two_class_example data from yardstick. The truth column is truth, and the probability column for the positive class (Class1) is Class1. Compute the ROC AUC, then compute it again with the positive class flipped via event_level = "second". Save the second result to ex_auc_alt.
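One possible solution, written as a sketch; ex_auc is an extra name introduced here so the two results can be compared.

```r
library(yardstick)

data(two_class_example, package = "yardstick")

# Default direction: Class1 (the first level) is the positive class.
ex_auc <- roc_auc(two_class_example, truth, Class1)

# Flipped direction: Class2 becomes positive, same probability column.
ex_auc_alt <- roc_auc(two_class_example, truth, Class1, event_level = "second")

ex_auc
ex_auc_alt

# The two estimates should sum to 1.
ex_auc$.estimate + ex_auc_alt$.estimate
```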
Explanation: event_level = "second" flips Class2 to be the positive class while still scoring against the Class1 probability column. The result equals 1 - default_AUC, the mirror score; the default direction is the one you usually want.
Related yardstick metrics
roc_auc() is one entry in the yardstick probability-metric family. Reach for these when AUC alone is not enough:
- pr_auc() for area under the precision-recall curve, better on rare positives
- mn_log_loss() for log-likelihood penalty on confident wrong predictions
- brier_class() for calibration error of binary probabilities
- roc_curve() to extract threshold points for plotting
- accuracy(), recall(), and precision() for one-threshold label scoring
For the full set, see the yardstick reference index.
FAQ
Does roc_auc() need probabilities or class labels?
It needs probabilities. The function scores how well your model ranks positives above negatives across every threshold, which requires a continuous score, not a label. Pass the column holding the predicted probability of the positive class, usually .pred_<positive_class> from parsnip::augment(). Feeding the factor column .pred_class raises an error, because yardstick requires a numeric estimate.
What is a good ROC AUC score in R?
AUC ranges from 0 to 1, with 0.5 meaning random guessing. Rough conventions: 0.70 to 0.80 is acceptable for social-science classifiers, 0.80 to 0.90 is good for marketing and medical work, and above 0.90 is excellent but worth checking for leakage. Below 0.50 means the model is inverted; usually flip event_level rather than retrain.
Why does yardstick default to hand_till for multiclass AUC?
The Hand-and-Till estimator averages AUC over every ordered class pair, making the score independent of class prevalence. Macro and macro-weighted alternatives are available via estimator when you want one-vs-rest readings or want to reflect production traffic mix.
How is roc_auc different from accuracy?
accuracy() scores correct labels at one chosen threshold. roc_auc() integrates performance across every threshold and never needs a cutoff. A model that always predicts the majority class hits high accuracy but scores 0.5 AUC because it cannot rank anything. Report both when the positive class is rare.
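The majority-class case can be sketched with a made-up 5-in-100 fraud example in which the model gives every row the same score and always predicts "ok". All names and values are hypothetical.

```r
library(yardstick)

# Hypothetical labels: 5 fraud cases among 100 rows.
truth <- factor(c(rep("fraud", 5), rep("ok", 95)), levels = c("fraud", "ok"))

# A model that always predicts the majority class with one constant score.
pred_class <- factor(rep("ok", 100), levels = c("fraud", "ok"))
pred_fraud <- rep(0.01, 100)

acc <- accuracy_vec(truth, pred_class)  # share of correct labels
auc <- roc_auc_vec(truth, pred_fraud)   # ranking quality of the constant score

c(accuracy = acc, roc_auc = auc)
```

Accuracy comes out at 0.95 because 95 of the 100 labels are "ok", while the constant score ties every fraud/ok pair and leaves AUC at 0.5.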