yardstick pr_auc() in R: Score Imbalanced Classifier Ranking

The yardstick pr_auc() function in R scores how well a classifier ranks the minority class on imbalanced data, returning a tidy one-row summary of the area under the precision-recall curve.

By Selva Prabhakaran · Published May 22, 2026 · Last updated May 22, 2026

⚡ Quick Answer

pr_auc(df, truth, .pred_class1)                            # basic two-class call
pr_auc(df, truth = obs, .pred_class1)                      # named truth column
pr_auc(df, class, .pred_yes, event_level = "second")       # flip positive class
df |> group_by(fold) |> pr_auc(class, .pred_yes)           # by resample
pr_auc(df, class, .pred_a, .pred_b, .pred_c)               # multiclass, macro default
pr_auc(df, class, .pred_a:.pred_c, estimator = "macro_weighted") # weighted by prevalence
pr_auc_vec(truth_vec, prob_vec)                            # vector interface

Need explanation? Read on for examples and pitfalls.

What pr_auc() measures

pr_auc() answers how well a model ranks positives when positives are rare. Pass a tibble with the truth column and one or more probability columns; the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate is the area under the precision-recall curve as the threshold sweeps from 1 to 0.

A score of 1.0 means perfect ranking with no false positives at any recall. The random baseline equals the share of positives, so it shifts with class prevalence, which is why PR AUC fits imbalanced problems better than ROC AUC.
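To make the threshold sweep concrete, the sketch below pulls the individual curve points with pr_curve(), the companion function that exposes the points behind the area. It assumes the two_class_example data bundled with yardstick and is an illustration, not part of the worked examples.

```r
library(yardstick)

# two_class_example ships with yardstick: truth is a factor,
# Class1 holds the predicted probability of the first level
data("two_class_example")

# One row per threshold, starting from the strictest cutoff
curve <- pr_curve(two_class_example, truth, Class1)

names(curve)     # .threshold, recall, precision
head(curve, 3)   # the first row anchors the curve at recall 0, precision 1
```

pr_auc() summarises this whole table as a single area, so two models with the same score can still trade precision for recall very differently.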

Key Insight

PR AUC focuses on the positive class only. ROC AUC integrates the false-positive rate, whose denominator is dominated by abundant negatives, so on a 1:99 problem a model can reach a ROC AUC of 0.9 even when most of the cases it flags are false positives. PR AUC has no such free lunch.

pr_auc() syntax and arguments

The signature follows the yardstick probability-metric family. Argument shape changes between binary and multiclass.

R: pr_auc generic signature

    pr_auc(data, truth, ..., estimator = NULL, na_rm = TRUE,
           event_level = "first", case_weights = NULL)

Argument	Description
`data`	A data frame with truth and probability columns.
`truth`	Unquoted column name of the observed class labels (a factor).
`...`	Unquoted column names of the predicted probabilities. One column for binary; one per class for multiclass.
`estimator`	Averaging mode for multiclass: `"macro"` (default) or `"macro_weighted"`. Ignored for binary.
`na_rm`	If `TRUE`, drop rows where any column is missing before scoring.
`event_level`	`"first"` or `"second"`; for binary, names the positive class.
`case_weights`	Optional unquoted column for weighting.

For binary problems, pass the positive-class probability column as the third argument. For multiclass, pass one probability column per class with tidyselect syntax, in the same order as the factor levels of truth; yardstick pairs columns with levels by position, not by the text after .pred_. Truth must be a factor. Unlike roc_auc(), no hand_till estimator exists because the PR curve is asymmetric.
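As a quick check of the multiclass argument shape, the sketch below uses the hpc_cv data that ships with yardstick, where the probability columns VF, F, M and L sit in the same order as the levels of obs. It is a minimal illustration under that assumption; no output is shown because the exact estimates are not quoted here.

```r
library(yardstick)

# hpc_cv ships with yardstick: obs is the truth factor,
# VF, F, M, L are the per-class probability columns
data("hpc_cv")

# Confirm the level order before selecting columns by range
levels(hpc_cv$obs)

# Tidyselect range picks the four probability columns in level order
macro_score    <- pr_auc(hpc_cv, obs, VF:L)
weighted_score <- pr_auc(hpc_cv, obs, VF:L, estimator = "macro_weighted")

macro_score
weighted_score
```

Checking levels() first is cheap insurance: a column range in the wrong order would still run, but each class would be scored against another class's probabilities.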

Score classifiers: four worked examples

The examples build an imbalanced two-class tibble you can run end to end. It mimics a fraud-detection problem where positives are rare.

R: Imbalanced two-class predictions

    library(yardstick)
    library(dplyr)

    set.seed(202)
    fraud <- tibble(
      truth = factor(
        sample(c("fraud", "legit"), 500, replace = TRUE, prob = c(0.05, 0.95)),
        levels = c("fraud", "legit")
      ),
      .pred_fraud = runif(500, 0, 1)
    ) |>
      mutate(.pred_fraud = ifelse(truth == "fraud", pmin(1, .pred_fraud + 0.40), .pred_fraud))

    table(fraud$truth)
    #>
    #> fraud legit
    #>    27   473

Example 1 calls pr_auc() with the positive-class probability column. "fraud" is the first factor level, so yardstick treats it as positive.

R: Two-class PR AUC score

    pr_auc(fraud, truth, .pred_fraud)
    #> # A tibble: 1 x 3
    #>   .metric .estimator .estimate
    #>   <chr>   <chr>          <dbl>
    #> 1 pr_auc  binary         0.293

A 0.293 PR AUC on a 5 percent positive base is real lift; the random baseline equals 0.054 (the prevalence). ROC AUC on the same predictions would look optimistically higher.

Example 2 flips the positive class with event_level. Setting "second" makes "legit" positive.

R: Flip the positive class

    pr_auc(fraud, truth, .pred_fraud, event_level = "second")
    #> # A tibble: 1 x 3
    #>   .metric .estimator .estimate
    #>   <chr>   <chr>          <dbl>
    #> 1 pr_auc  binary         0.957

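With "legit" as the positive class, the random baseline is already 0.946, so a flipped score near 1.0 says little about model quality. Keep in mind that yardstick reads the column you pass as the probability of the event class, so an interpretable majority-class score needs the matching probability (here 1 - .pred_fraud) alongside event_level = "second". PR AUC is asymmetric, so the two scores are never mirror images; always state which class is positive in any report.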

Example 3 scores per resample fold. Pair group_by() with pr_auc() to get one score per fold without a loop.

R: Per-fold PR AUC

    folded <- fraud |>
      mutate(fold = rep(paste0("fold", 1:5), each = 100))

    folded |>
      group_by(fold) |>
      pr_auc(truth, .pred_fraud)
    #> # A tibble: 5 x 4
    #>   fold  .metric .estimator .estimate
    #>   <chr> <chr>   <chr>          <dbl>
    #> 1 fold1 pr_auc  binary         0.281
    #> 2 fold2 pr_auc  binary         0.318
    #> 3 fold3 pr_auc  binary         0.265
    #> 4 fold4 pr_auc  binary         0.302
    #> 5 fold5 pr_auc  binary         0.299

Example 4 uses the vector interface for ad-hoc checks. pr_auc_vec() accepts a truth factor and a probability vector and returns a plain numeric scalar.

R: Vector interface for quick checks

    pr_auc_vec(fraud$truth, fraud$.pred_fraud)
    #> [1] 0.2934

Tip

Bundle pr_auc with roc_auc and brier_class. metric_set(pr_auc, roc_auc, brier_class) returns a reusable function that scores ranking and calibration in one call, the minimum honest report for any imbalanced classifier.
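A minimal sketch of that bundle, run on the two_class_example data bundled with yardstick so it is self-contained. It assumes a yardstick version recent enough to include brier_class(); the object name imbalance_metrics is made up for this illustration.

```r
library(yardstick)

data("two_class_example")

# metric_set() returns a function with the same calling convention
# as the individual probability metrics
imbalance_metrics <- metric_set(pr_auc, roc_auc, brier_class)

# One row per metric, all computed from the same probability column
scores <- imbalance_metrics(two_class_example, truth, Class1)
scores
```

Because the result is an ordinary tibble, the same function also works after group_by() for per-fold reports.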

Multiclass averaging: macro and macro_weighted

pr_auc() handles multiclass data with two averaging modes. Yardstick defaults to "macro", the unweighted mean of one-vs-rest PR AUC.

R: Multiclass predictions tibble

    set.seed(101)
    n <- 240
    multi <- tibble(
      truth = factor(
        sample(c("a", "b", "c"), n, replace = TRUE, prob = c(0.50, 0.30, 0.20)),
        levels = c("a", "b", "c")
      ),
      .pred_a = runif(n, 0, 1),
      .pred_b = runif(n, 0, 1),
      .pred_c = runif(n, 0, 1)
    ) |>
      mutate(
        s = .pred_a + .pred_b + .pred_c,
        .pred_a = .pred_a / s,
        .pred_b = .pred_b / s,
        .pred_c = .pred_c / s
      ) |>
      select(-s)

    pr_auc(multi, truth, .pred_a, .pred_b, .pred_c)
    #> # A tibble: 1 x 3
    #>   .metric .estimator .estimate
    #>   <chr>   <chr>          <dbl>
    #> 1 pr_auc  macro          0.351

Comparing modes shows how prevalence shifts the score:

R: Compare averaging modes

    modes <- c("macro", "macro_weighted")

    lapply(modes, \(m) pr_auc(multi, truth, .pred_a, .pred_b, .pred_c, estimator = m)) |>
      bind_rows()
    #> # A tibble: 2 x 3
    #>   .metric .estimator     .estimate
    #>   <chr>   <chr>              <dbl>
    #> 1 pr_auc  macro              0.351
    #> 2 pr_auc  macro_weighted     0.394

Estimator	What it does	When to use
`macro`	One-vs-rest PR AUC, unweighted mean	Treat every class as equally important
`macro_weighted`	One-vs-rest PR AUC, weighted by prevalence	Reflect production traffic mix

macro is the safe default because no single abundant class can dominate the score. Production teams often prefer macro_weighted so the headline number tracks real traffic mix.
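The difference between the two modes is only the averaging step, which plain arithmetic can show. The per-class scores and prevalences below are hypothetical values chosen for illustration, not output from the multi tibble.

```r
# Hypothetical one-vs-rest PR AUC per class (illustration only)
per_class  <- c(a = 0.40, b = 0.35, c = 0.30)

# Hypothetical share of each class in the truth column
prevalence <- c(a = 0.50, b = 0.30, c = 0.20)

# macro: every class counts equally
macro <- mean(per_class)                                    # 0.35

# macro_weighted: classes count in proportion to their prevalence
macro_weighted <- weighted.mean(per_class, w = prevalence)  # 0.365

c(macro = macro, macro_weighted = macro_weighted)
```

Here the most common class also scores best, so the weighted figure comes out higher; if the rare class were the strongest, the ordering would reverse.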

pr_auc vs roc_auc vs accuracy

Several metrics sit alongside the AUC columns in a typical report. The table sorts out the overlap.

Function	What it scores	When to prefer it
`pr_auc()`	Area under the PR curve	Severe imbalance; rare positives
`roc_auc()`	Probability positive outranks negative	Balanced data, threshold-free reports
`accuracy()`	Share of correct labels at one cutoff	Reporting a deployed threshold
`mn_log_loss()`	Mean negative log-likelihood	Penalty for confident wrong predictions
`brier_class()`	MSE between probability and truth	Calibration metric, lower is better

pr_auc() is the right call whenever the minority class drives business value: fraud, churn, rare disease, click conversion. On a 1:99 imbalance, ROC AUC can read 0.95 while PR AUC sits at 0.20, and the lower number is the honest one.

Note

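Coming from scikit-learn? yardstick::pr_auc() integrates the PR curve with the trapezoidal rule, so its closest analogue is sklearn.metrics.auc() applied to the output of precision_recall_curve(). The step-function sum returned by sklearn.metrics.average_precision_score() corresponds to yardstick::average_precision() instead. For multiclass data, estimator = "macro" plays the role of average="macro" and "macro_weighted" the role of average="weighted".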
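The step-function sum and trapezoidal interpolation are two different ways to turn the same PR points into an area, and they generally disagree. The base-R sketch below computes both by hand on made-up scores; it calls neither yardstick nor scikit-learn, so treat it as arithmetic illustration rather than a reproduction of either library.

```r
# Toy data (hypothetical): 1 = positive, 0 = negative, no tied scores
truth  <- c(1, 0, 1, 0, 0)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.2)

# Walk down the ranking, counting true positives as the cutoff loosens
hits      <- truth[order(scores, decreasing = TRUE)]
tp        <- cumsum(hits)
precision <- tp / seq_along(hits)
recall    <- tp / sum(truth)

# Step-function sum: each gain in recall is weighted by the precision at that point
step_area <- sum(diff(c(0, recall)) * precision)               # 5/6, about 0.833

# Trapezoidal rule, with the curve anchored at (recall = 0, precision = 1)
r <- c(0, recall)
p <- c(1, precision)
trap_area <- sum(diff(r) * (head(p, -1) + tail(p, -1)) / 2)    # 19/24, about 0.792

c(step = step_area, trapezoid = trap_area)
```

The gap shrinks as the number of distinct thresholds grows, but on small or heavily tied data it can be large enough to matter when comparing numbers across tools.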

Common pitfalls

Three small mistakes account for most pr_auc() errors.

The first is feeding class labels instead of probabilities. pr_auc() needs the positive-class probability. Passing .pred_class from parsnip::augment() when you meant .pred_yes is the common slip.

The second is the event_level and column-order interaction. The factor level order in truth pins the positive class, not the probability column name. If levels(df$truth) is c("no", "yes"), then "no" is positive by default. PR AUC is asymmetric, so the wrong positive can swing the score by 0.5 or more.

R: Fix: pin levels and pass the matching probability column

    preds <- tibble(
      truth = factor(c("yes", "yes", "no", "no", "yes"), levels = c("yes", "no")),
      .pred_yes = c(0.85, 0.40, 0.10, 0.30, 0.70)
    )

    levels(preds$truth)
    #> [1] "yes" "no"

    # Every "yes" row outranks every "no" row, so the ranking is perfect
    pr_auc(preds, truth, .pred_yes)
    #> # A tibble: 1 x 3
    #>   .metric .estimator .estimate
    #>   <chr>   <chr>          <dbl>
    #> 1 pr_auc  binary             1
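When the truth column already exists with alphabetical levels, base R's relevel() moves the intended positive class to the front without touching the data. A minimal sketch with made-up labels:

```r
# factor() sorts levels alphabetically, so "no" lands first by default
truth <- factor(c("no", "yes", "no", "yes"))
levels(truth)                        # "no"  "yes"

# Move "yes" to the first position so it becomes the default event
truth_fixed <- relevel(truth, ref = "yes")
levels(truth_fixed)                  # "yes" "no"

# The labels themselves are unchanged; only the level order moved
identical(as.character(truth), as.character(truth_fixed))   # TRUE
```

Doing this once, right after the data is loaded, avoids sprinkling event_level = "second" across every metric call.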

The third pitfall is comparing PR AUC across datasets with different class balance. The random baseline equals the positive prevalence, so 0.30 on a 5 percent set differs from 0.30 on a 50 percent set. Report prevalence alongside the score, or compute lift over baseline.
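Lift over baseline is a single division. The sketch below reuses the figures quoted in Example 1 (a 0.293 estimate with 27 fraud rows out of 500) and then contrasts the same 0.30 score at two prevalences; the variable names are illustrative.

```r
# Figures quoted in Example 1
pr_auc_score <- 0.293
prevalence   <- 27 / 500            # 0.054, the random baseline

lift <- pr_auc_score / prevalence   # about 5.4 times the baseline
lift

# The same 0.30 score at 5 percent and at 50 percent prevalence
lift_by_prevalence <- 0.30 / c(0.05, 0.50)
lift_by_prevalence                  # 6.0 and 0.6
```

A lift below 1, as in the 50 percent case, means the model ranks worse than random ordering would on that dataset.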

Warning

pr_auc() returns NA when truth has only one class. The PR curve is undefined without at least one positive and one negative present. This bites per-fold scoring on rare-event data; a fold with zero positives silently produces NA. Stratify resamples to guarantee both classes appear.
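A cheap guard is to count positives per fold before scoring. The base-R sketch below uses a tiny made-up truth vector and fold assignment, both hypothetical, to show the check.

```r
# Hypothetical labels and fold assignment (illustration only)
truth <- factor(
  c("fraud", "legit", "legit", "legit", "legit", "legit"),
  levels = c("fraud", "legit")
)
fold <- rep(c("fold1", "fold2"), each = 3)

# Number of positive rows in each fold
positives_per_fold <- tapply(truth == "fraud", fold, sum)
positives_per_fold

# Folds that contain no positives and would need attention before scoring
empty_folds <- names(which(positives_per_fold == 0))
empty_folds                          # "fold2"
```

If the folds come from rsample, passing the truth column to the strata argument of vfold_cv() is the usual way to keep positives in every fold, though very rare events can still leave a fold empty.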

Try it yourself

Try it: Use the built-in two_class_example data from yardstick. The truth column is truth and the probability column for the positive class (Class1) is Class1. Compute the PR AUC for the default positive class, then compute it again with event_level = "second". Save the second result to ex_pr_alt.

R: Your turn: flip the positive class

    library(yardstick)
    data("two_class_example")

    # Try it: pr_auc with flipped event level
    ex_pr_alt <- # your code here
    ex_pr_alt
    #> Expected: one row, .estimator binary, .estimate near 0.946

R: Solution

    library(dplyr)

    ex_pr_alt <- two_class_example |>
      pr_auc(truth = truth, Class1, event_level = "second")

    ex_pr_alt
    #> # A tibble: 1 x 3
    #>   .metric .estimator .estimate
    #>   <chr>   <chr>          <dbl>
    #> 1 pr_auc  binary         0.946

Explanation: event_level = "second" makes Class2 the positive class. The call still passes the Class1 column, which yardstick then reads as the probability of the event class, so in real work pair event_level = "second" with the Class2 column. Because PR AUC is asymmetric, the result is not 1 minus the default score; the metric is recomputed against the new positive set.

Related yardstick metrics

pr_auc() is one entry in the yardstick probability-metric family. Reach for these when PR AUC alone is not enough:

roc_auc() for area under the ROC curve, better on balanced data
mn_log_loss() for log-likelihood penalty
brier_class() for calibration error of binary probabilities
pr_curve() to extract threshold points for plotting
accuracy(), recall(), precision() for one-threshold scoring

For the full set, see the yardstick reference index.

FAQ

What is a good PR AUC score in R?

PR AUC ranges from 0 to 1, but the random baseline is not 0.5; it equals the positive-class prevalence in your data. On a 1 percent positive set, the random baseline is 0.01, so 0.20 is real lift. Above 0.80 is strong on most business problems; above 0.95 is exceptional and worth checking for leakage.

Does pr_auc() need probabilities or class labels?

It needs probabilities. The function scores how the model ranks positives across every threshold, which requires a continuous score. Pass the column holding the predicted probability of the positive class, usually .pred_<positive_class> from parsnip::augment(). Feeding .pred_class runs but produces a meaningless number.

When should I use pr_auc instead of roc_auc?

Use pr_auc() whenever the positive class is rare (under 10 percent prevalence) or when false positives carry a real cost. ROC AUC integrates against abundant negatives, so it overstates performance on imbalanced problems. PR AUC ignores true negatives entirely and tracks only the pair (precision, recall) that matters for finding the minority class.

Why does yardstick default to macro for multiclass PR AUC?

macro treats every class equally regardless of size, so a dominant class cannot mask poor performance on minority classes. Switch to macro_weighted via the estimator argument when the headline score should reflect production traffic mix.

How is pr_auc different from average precision in scikit-learn?

Not quite. pr_auc() integrates the PR curve with the trapezoidal rule, which is comparable to calling sklearn.metrics.auc() on PR points. The step-function definition used by sklearn.metrics.average_precision_score() is available in yardstick as average_precision(). The two numbers are usually close but not identical.
