yardstick roc_auc() in R: Score Classifier Ranking Quality

The yardstick roc_auc() function in R scores how well a classifier ranks positive cases above negative ones, accepting a tibble of truth labels and predicted probabilities and returning a tidy one-row summary you can group, average across classes, or combine with other metrics in a metric set.

By Selva Prabhakaran · Published May 22, 2026 · Last updated May 22, 2026

⚡ Quick Answer

roc_auc(df, truth, .pred_class1)                            # basic two-class call
roc_auc(df, truth = obs, .pred_class1)                      # named truth column
roc_auc(df, class, .pred_yes, event_level = "second")       # flip positive class
df |> group_by(fold) |> roc_auc(class, .pred_yes)           # by resample
roc_auc(df, class, .pred_a, .pred_b, .pred_c)               # multiclass, all prob cols
roc_auc(df, class, .pred_a:.pred_c, estimator = "hand_till") # hand_till averaging
roc_auc_vec(truth_vec, prob_vec)                            # vector interface

Need explanation? Read on for examples and pitfalls.


What roc_auc() measures

roc_auc() answers one question: how well does the model rank positives above negatives across every threshold? You pass a data frame with the truth column and one or more predicted-probability columns, and the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate equals the probability that a random positive scores higher than a random negative.

A score of 1.0 means perfect ranking, 0.5 means random guessing, and below 0.5 means the model is inverted. Because AUC sweeps every threshold, it sidesteps the cutoff debate, which is why leaderboards on imbalanced data lean on it even though it ignores calibration.
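The pairwise reading of AUC can be checked by hand. The sketch below is base R only and is not yardstick's internal code: `pairwise_auc()` is a hypothetical helper that compares every positive score with every negative score and counts ties as half a win.

```r
# Hypothetical helper: share of positive/negative pairs in which the
# positive case gets the higher score (ties count as half).
pairwise_auc <- function(truth, score, positive = levels(truth)[1]) {
  pos <- score[truth == positive]
  neg <- score[truth != positive]
  diffs <- outer(pos, neg, "-")          # one cell per positive/negative pair
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

truth <- factor(c("yes", "yes", "yes", "no", "no"), levels = c("yes", "no"))

perfect  <- c(0.9, 0.8, 0.7, 0.2, 0.1)  # every "yes" outranks every "no"
inverted <- 1 - perfect                 # same ordering, upside down
partial  <- c(0.9, 0.3, 0.7, 0.4, 0.1)  # one of the six pairs is out of order

pairwise_auc(truth, perfect)   # 1
pairwise_auc(truth, inverted)  # 0
pairwise_auc(truth, partial)   # 5/6, about 0.833
```

On data without tied scores, this pairwise share should agree with the area under the ROC curve.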

Key Insight

AUC measures ranking, not probability quality. A model that predicts 0.999 for every positive and 0.998 for every negative scores AUC 1.0 yet outputs nearly identical numbers. Pair roc_auc() with brier_class() if calibration matters.
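To make the 0.999 versus 0.998 case concrete, here is a minimal base R sketch. The Brier score is computed by hand as the mean squared gap between the predicted probability and the 0/1 outcome, which is an assumption about the definition rather than a call to brier_class().

```r
# 50 positives and 50 negatives with almost identical predicted probabilities
truth    <- factor(rep(c("pos", "neg"), each = 50), levels = c("pos", "neg"))
prob_pos <- ifelse(truth == "pos", 0.999, 0.998)

# Ranking quality: every positive still outranks every negative
auc <- mean(outer(prob_pos[truth == "pos"], prob_pos[truth == "neg"], ">"))

# Probability quality: mean squared gap to the 0/1 outcome (hand-rolled Brier)
y     <- as.numeric(truth == "pos")
brier <- mean((prob_pos - y)^2)

auc    # 1
brier  # about 0.498
```

For comparison, a model that always predicts 0.5 would get a Brier score of 0.25 under the same definition, so the perfectly ranked model here is far worse on calibration.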

roc_auc() syntax and arguments

The signature follows the yardstick probability-metric family. Argument shape changes slightly between two-class and multiclass because multiclass needs one column per level.


R: roc_auc generic signature

roc_auc(data, truth, ..., estimator = NULL, na_rm = TRUE, event_level = "first", case_weights = NULL, options = list())

Argument	Description
`data`	A data frame with truth and probability columns.
`truth`	Unquoted column name of the observed class labels (a factor).
`...`	Unquoted column names of the predicted probabilities. One column for binary; one per class for multiclass.
`estimator`	Averaging mode for multiclass: `"hand_till"` (default), `"macro"`, or `"macro_weighted"`. Ignored for binary.
`na_rm`	If `TRUE`, drop rows where any column is missing before scoring.
`event_level`	`"first"` or `"second"`; for binary, names the positive class.
`case_weights`	Optional unquoted column for weighted scoring.
`options`	Deprecated as of yardstick 1.0.0; anything passed here is ignored with a warning.

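For binary problems, pass the probability column for the positive class as the third argument. For multiclass, pass every class probability column with tidyselect syntax, in the same order as the factor levels of truth. The truth column must be a factor; yardstick matches probability columns to levels by position, not by the name after .pred_, so a mis-ordered selection silently scores the wrong class.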

Score classifiers: four worked examples

The examples below build a two-class tibble so you can run roc_auc() end to end. Start with a small frame that mimics a churn-prediction problem.

R: Two-class predictions with probabilities

library(yardstick)
library(dplyr)

set.seed(404)
churn <- tibble(
  truth = factor(
    sample(c("churn", "stay"), 200, replace = TRUE, prob = c(0.30, 0.70)),
    levels = c("churn", "stay")
  ),
  .pred_churn = runif(200, 0, 1)
) |>
  mutate(.pred_churn = ifelse(truth == "churn", pmin(1, .pred_churn + 0.25), .pred_churn))

head(churn, 4)
#> # A tibble: 4 x 2
#>   truth .pred_churn
#>   <fct>       <dbl>
#> 1 stay        0.770
#> 2 stay        0.882
#> 3 stay        0.249
#> 4 churn       0.952

Example 1 calls roc_auc() with the positive-class probability column. Because "churn" is the first factor level, yardstick treats it as the positive class.

R: Two-class AUC score

roc_auc(churn, truth, .pred_churn)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.682

A 0.682 AUC means a random churner scores higher than a random non-churner 68 percent of the time, the headline number for ranking-based decision systems.

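Example 2 flips the positive class with event_level. Setting event_level = "second" makes "stay" the positive class, and yardstick then reads the column you pass as the score for "stay". It does not transform the column, so handing it .pred_churn scores the churn ranking backwards.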

R: Flip the positive class

roc_auc(churn, truth, .pred_churn, event_level = "second")
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary         0.318

The flipped estimate is the mirror of Example 1: 0.318 = 1 - 0.682. Always state which class is positive in any AUC report; ambiguity here is the most common review comment.
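The mirror relationship can be reproduced on a toy vector in base R. This is a sketch under the assumption that flipping the event level changes only which class counts as positive while the score column stays as passed; `pairwise_auc()` is a hypothetical helper, not a yardstick function.

```r
# Hypothetical helper: share of pairs in which the positive class scores higher
pairwise_auc <- function(truth, score, positive) {
  diffs <- outer(score[truth == positive], score[truth != positive], "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

truth   <- factor(c("churn", "stay", "stay", "churn", "stay", "churn"),
                  levels = c("churn", "stay"))
p_churn <- c(0.80, 0.35, 0.60, 0.55, 0.20, 0.90)

auc_churn    <- pairwise_auc(truth, p_churn, "churn")     # matching class and column
auc_mismatch <- pairwise_auc(truth, p_churn, "stay")      # "stay" positive, churn scores
auc_stay     <- pairwise_auc(truth, 1 - p_churn, "stay")  # "stay" positive, stay scores

auc_churn     # 8/9, about 0.889
auc_mismatch  # 1/9, about 0.111 (the mirror)
auc_stay      # 8/9 again
```

When the score column really belongs to the new positive class, the AUC is unchanged; only a mismatch between class and column produces the mirrored value.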

Example 3 scores per resample fold. Cross-validated predictions pair with group_by() to give one AUC per fold without a loop.

R: Per-fold AUC

folded <- churn |>
  mutate(fold = rep(paste0("fold", 1:5), each = 40))

folded |>
  group_by(fold) |>
  roc_auc(truth, .pred_churn)
#> # A tibble: 5 x 4
#>   fold  .metric .estimator .estimate
#>   <chr> <chr>   <chr>          <dbl>
#> 1 fold1 roc_auc binary         0.665
#> 2 fold2 roc_auc binary         0.716
#> 3 fold3 roc_auc binary         0.654
#> 4 fold4 roc_auc binary         0.704
#> 5 fold5 roc_auc binary         0.671

Example 4 uses the vector interface for ad-hoc checks. roc_auc_vec() accepts a truth factor and a probability vector and returns a plain numeric scalar.

R: Vector interface for quick checks

roc_auc_vec(churn$truth, churn$.pred_churn)
#> [1] 0.6824

Tip

Bundle roc_auc with mn_log_loss and brier_class. metric_set(roc_auc, mn_log_loss, brier_class) builds a reusable function to score ranking and calibration together, the minimum honest report for probabilistic classifiers.

Multiclass averaging: hand_till, macro, and weighted

roc_auc() handles multiclass data, but its default averaging differs from recall(), which defaults to macro. For roc_auc(), yardstick defaults to hand_till, the Hand-and-Till extension that averages AUC over every ordered class pair.

R: Multiclass predictions tibble

set.seed(303)
n <- 180
multi <- tibble(
  truth = factor(
    sample(c("a", "b", "c"), n, replace = TRUE, prob = c(0.5, 0.3, 0.2)),
    levels = c("a", "b", "c")
  ),
  .pred_a = runif(n, 0, 1),
  .pred_b = runif(n, 0, 1),
  .pred_c = runif(n, 0, 1)
) |>
  mutate(
    s = .pred_a + .pred_b + .pred_c,
    .pred_a = .pred_a / s,
    .pred_b = .pred_b / s,
    .pred_c = .pred_c / s
  ) |>
  select(-s)

roc_auc(multi, truth, .pred_a, .pred_b, .pred_c)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc hand_till      0.514

Scoring under each averaging mode highlights the choice:

R: Compare averaging modes

modes <- c("hand_till", "macro", "macro_weighted")

lapply(modes, \(m) roc_auc(multi, truth, .pred_a, .pred_b, .pred_c, estimator = m)) |>
  bind_rows()
#> # A tibble: 3 x 3
#>   .metric .estimator     .estimate
#>   <chr>   <chr>              <dbl>
#> 1 roc_auc hand_till          0.514
#> 2 roc_auc macro              0.509
#> 3 roc_auc macro_weighted     0.511

Estimator	What it does	When to use
`hand_till`	Average AUC over every ordered class pair	Default; insensitive to class prevalence
`macro`	One-vs-rest AUC per class, unweighted mean	Treat every class as equally important
`macro_weighted`	One-vs-rest AUC per class, weighted by prevalence	Reflect production traffic mix

hand_till is the academic default because it does not depend on class frequencies. Production teams often prefer macro_weighted so the score tracks real traffic mix.
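The three averaging rules in the table can be sketched in base R to show where they diverge. This is one reading of the definitions above, not yardstick's source code: `multiclass_auc()` and `pair_auc()` are hypothetical helpers, and the probability matrix is assumed to have one column per level, in level order.

```r
# Hypothetical helper: pairwise AUC for one set of positive and negative scores
pair_auc <- function(pos, neg) {
  d <- outer(pos, neg, "-")
  mean((d > 0) + 0.5 * (d == 0))
}

# truth: factor; probs: numeric matrix, one column per level, in level order
multiclass_auc <- function(truth, probs) {
  lv <- levels(truth)

  # One-vs-rest AUC for each class, scored on that class's own column
  ovr <- vapply(seq_along(lv), function(k) {
    pair_auc(probs[truth == lv[k], k], probs[truth != lv[k], k])
  }, numeric(1))
  prevalence <- as.numeric(table(truth)) / length(truth)

  # Pairwise AUC for each class pair, averaged over both directions
  pairs <- combn(seq_along(lv), 2)
  ht <- apply(pairs, 2, function(ij) {
    i <- ij[1]; j <- ij[2]
    a_ij <- pair_auc(probs[truth == lv[i], i], probs[truth == lv[j], i])
    a_ji <- pair_auc(probs[truth == lv[j], j], probs[truth == lv[i], j])
    (a_ij + a_ji) / 2
  })

  c(hand_till = mean(ht), macro = mean(ovr), macro_weighted = sum(ovr * prevalence))
}

truth <- factor(c("a", "a", "a", "b", "b", "c"), levels = c("a", "b", "c"))
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.5, 0.3, 0.2),
  c(0.3, 0.5, 0.2),
  c(0.2, 0.6, 0.2),
  c(0.4, 0.4, 0.2),
  c(0.1, 0.2, 0.7)
)

res <- multiclass_auc(truth, probs)
round(res, 3)  # hand_till 0.944, macro 0.921, macro_weighted 0.903
```

The weighted score is lowest here because class "a", the most frequent class, has the weakest one-vs-rest AUC and carries half the weight.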

roc_auc vs pr_auc vs accuracy

Several metrics show up alongside AUC in a model report. The table sorts out the overlap.

Function	What it scores	When to prefer it
`roc_auc()`	Probability that positive outranks negative	Balanced data, ranking systems, threshold-free reports
`pr_auc()`	Area under the precision-recall curve	Severe class imbalance; positives are rare
`accuracy()`	Share of labels correct at one threshold	Reporting a deployed cutoff, sanity checks
`mn_log_loss()`	Mean negative log-likelihood	Calibration penalty for confident wrong predictions
`brier_class()`	Mean squared error between probability and truth	Calibration metric, lower is better

roc_auc() can read optimistic on imbalanced data because the false-positive rate divides by the large negative count, so even a sizeable number of false positives barely moves the curve. On a 1:99 fraud problem, AUC can hit 0.95 while precision at the chosen cutoff stays below 0.20. Pair it with pr_auc() whenever positives are scarce.
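A constructed example shows how a high AUC and a low precision can coexist. The numbers below are synthetic and chosen by hand for illustration (10 positives, 990 negatives, a hypothetical cutoff of 0.8995); they are not taken from a real fraud model.

```r
n_pos <- 10
n_neg <- 990

neg_score <- (1:n_neg) / 1000     # negatives spread evenly from 0.001 to 0.990
pos_score <- rep(0.9455, n_pos)   # positives score high, but inside the negative tail

# Ranking view: share of positive/negative pairs ordered correctly
auc <- mean(outer(pos_score, neg_score, ">"))

# Operating-point view at a fixed cutoff
cutoff    <- 0.8995
tp        <- sum(pos_score >= cutoff)   # all 10 positives are flagged
fp        <- sum(neg_score >= cutoff)   # 91 negatives are flagged too
precision <- tp / (tp + fp)
fpr       <- fp / n_neg

auc        # about 0.955
precision  # about 0.099
fpr        # about 0.092
```

The false-positive rate stays under 10 percent because it is divided by 990 negatives, yet roughly nine out of ten flagged cases are false alarms.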

Note

Coming from scikit-learn? yardstick::roc_auc() is the equivalent of sklearn.metrics.roc_auc_score(). The estimator argument maps onto scikit-learn's multi_class and average parameters; hand_till aligns with multi_class="ovo", macro with multi_class="ovr", average="macro", and macro_weighted with multi_class="ovr", average="weighted".

Common pitfalls

Three small mistakes account for most roc_auc() errors.

The first is feeding class labels instead of probabilities. roc_auc() needs the probability of the positive class, not the predicted label. A common slip is passing .pred_class from parsnip::augment() when you meant .pred_yes.

The second is forgetting that event_level and column order interact. The probability column name does not pin the positive class; the factor level order in truth does. If levels(df$truth) reads c("no", "yes"), then "no" is positive by default and you should pass .pred_no, not .pred_yes.

R: Fix: pin levels and pass the matching probability column

preds <- tibble(
  truth = factor(c("yes", "yes", "no", "no", "yes"), levels = c("yes", "no")),
  .pred_yes = c(0.85, 0.40, 0.10, 0.30, 0.70)
)

levels(preds$truth)
#> [1] "yes" "no"

# Every "yes" row scores above every "no" row, so the ranking is perfect
roc_auc(preds, truth, .pred_yes)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary             1
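A quick base R check before scoring can catch the level-order slip. The sketch assumes probability columns follow the `.pred_<level>` naming mentioned above for parsnip::augment(); the variable names are illustrative only.

```r
# Default factor levels are alphabetical, so "no" comes first here
truth <- factor(c("no", "yes", "yes", "no"))
levels(truth)

# With the default event level, the first level is the positive class
positive <- levels(truth)[1]
prob_col <- paste0(".pred_", positive)   # column to pass under that default
prob_col

# Alternative: move "yes" to the front so .pred_yes is the matching column
truth_yes_first <- relevel(truth, ref = "yes")
levels(truth_yes_first)
```

Either route works; the point is to make the positive class explicit rather than relying on alphabetical order.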

The third pitfall is reporting AUC at face value on severely imbalanced data. A high AUC can hide a useless operating point because the false-positive rate is computed over a very large pool of negatives. As a rule of thumb, report pr_auc() alongside it whenever the minority class is under 10 percent.

Warning

roc_auc() does not validate that probability columns sum to 1. Multiclass probabilities sum to 1 only if your model emits normalized output; if you stack three unnormalized score columns, the AUC still computes but the comparison loses its probabilistic meaning. Always normalize before scoring multiclass.
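Row-wise normalization is a one-liner in base R. The sketch below uses a made-up data frame of raw scores and divides each row by its total; the column names are assumptions that follow the `.pred_` convention used on this page.

```r
# Hypothetical raw scores that do not sum to 1 across classes
raw <- data.frame(
  .pred_a = c(0.9, 0.2, 0.4),
  .pred_b = c(0.6, 0.2, 0.4),
  .pred_c = c(0.3, 0.1, 0.2)
)
prob_cols <- c(".pred_a", ".pred_b", ".pred_c")

# Divide every probability column by the row total
row_total  <- rowSums(raw[prob_cols])
normalized <- raw
normalized[prob_cols] <- raw[prob_cols] / row_total

rowSums(normalized[prob_cols])  # each row now sums to 1
```

Note that rescaling each row by a different total can change how rows rank against each other within a column, so the AUC before and after normalization need not match.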

Try it yourself

Try it: Use the built-in two_class_example data from yardstick. The truth column is truth, and the probability column for the positive class (Class1) is Class1. Compute the ROC AUC, then compute it again with the positive class flipped via event_level = "second". Save the second result to ex_auc_alt.

R: Your turn: flip the positive class

library(yardstick)
data("two_class_example")

# Try it: roc_auc with flipped event level
ex_auc_alt <- # your code here

ex_auc_alt
#> Expected: one row, .estimator binary, .estimate near 0.061


R: Solution

library(dplyr)

ex_auc_alt <- two_class_example |>
  roc_auc(truth = truth, Class1, event_level = "second")

ex_auc_alt
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 roc_auc binary        0.0611

Explanation: event_level = "second" flips Class2 to be the positive class while still scoring against the Class1 probability column. The result equals 1 - default_AUC, the mirror score; the default direction is the one you usually want.

Related yardstick metrics

roc_auc() is one entry in the yardstick probability-metric family. Reach for these when AUC alone is not enough:

pr_auc() for area under the precision-recall curve, better on rare positives
mn_log_loss() for log-likelihood penalty on confident wrong predictions
brier_class() for calibration error of binary probabilities
roc_curve() to extract threshold points for plotting
accuracy(), recall(), and precision() for one-threshold label scoring

For the full set, see the yardstick reference index.

FAQ

Does roc_auc() need probabilities or class labels?

It needs probabilities. The function scores how well your model ranks positives above negatives across every threshold, which requires a continuous score, not a label. Pass the column holding the predicted probability of the positive class, usually .pred_<positive_class> from parsnip::augment(). Feeding the factor column .pred_class does not work: yardstick stops with an error because the estimate must be numeric.

What is a good ROC AUC score in R?

AUC ranges from 0 to 1, with 0.5 meaning random guessing. Rough conventions: 0.70 to 0.80 is acceptable for social-science classifiers, 0.80 to 0.90 is good for marketing and medical work, and above 0.90 is excellent but worth checking for leakage. Below 0.50 means the ranking is inverted; this usually points to a mismatch between the positive class and the probability column, so check event_level and the factor levels before retraining.

Why does yardstick default to hand_till for multiclass AUC?

The Hand-and-Till estimator averages AUC over every ordered class pair, making the score independent of class prevalence. Macro and macro-weighted alternatives are available via estimator when you want one-vs-rest readings or want to reflect production traffic mix.

How is roc_auc different from accuracy?

accuracy() scores correct labels at one chosen threshold. roc_auc() integrates performance across every threshold and never needs a cutoff. A model that always predicts the majority class hits high accuracy but scores 0.5 AUC because it cannot rank anything. Report both when the positive class is rare.
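The majority-class case is easy to verify in base R. This sketch uses a made-up 95:5 split and computes both numbers by hand rather than through yardstick; ties between scores count as half, which is an assumption about how constant scores are handled.

```r
# 95 negatives and 5 positives
truth <- factor(c(rep("no", 95), rep("yes", 5)), levels = c("yes", "no"))

# A "model" that always predicts the majority class with the same probability
pred_label <- factor(rep("no", 100), levels = c("yes", "no"))
prob_yes   <- rep(0.05, 100)

# Accuracy at the implied cutoff
accuracy <- mean(pred_label == truth)

# Pairwise AUC: every positive/negative pair is a tie, so each counts as half
d   <- outer(prob_yes[truth == "yes"], prob_yes[truth == "no"], "-")
auc <- mean((d > 0) + 0.5 * (d == 0))

accuracy  # 0.95
auc       # 0.5
```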
