yardstick pr_auc() in R: Score Imbalanced Classifier Ranking
The yardstick pr_auc() function in R scores how well a classifier ranks the minority class on imbalanced data, returning a tidy one-row summary of the area under the precision-recall curve.
    pr_auc(df, truth, .pred_class1)                                    # basic two-class call
    pr_auc(df, truth = obs, .pred_class1)                              # named truth column
    pr_auc(df, class, .pred_yes, event_level = "second")               # flip positive class
    df |> group_by(fold) |> pr_auc(class, .pred_yes)                   # by resample
    pr_auc(df, class, .pred_a, .pred_b, .pred_c)                       # multiclass, macro default
    pr_auc(df, class, .pred_a:.pred_c, estimator = "macro_weighted")   # weighted by prevalence
    pr_auc_vec(truth_vec, prob_vec)                                    # vector interface
Need explanation? Read on for examples and pitfalls.
What pr_auc() measures
pr_auc() answers how well a model ranks positives when positives are rare. Pass a tibble with the truth column and one or more probability columns; the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate is the area under the precision-recall curve as the threshold sweeps from 1 to 0.
A score of 1.0 means perfect ranking with no false positives at any recall. The random baseline equals the share of positives, so it shifts with class prevalence, which is why PR AUC fits imbalanced problems better than ROC AUC.
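The claim that the random baseline tracks prevalence can be checked with a small simulation. This is a minimal sketch, assuming a hypothetical 5 percent positive rate and scores drawn independently of the truth; the names `truth`, `random_score`, and `baseline_pr_auc` are illustrative only.

```r
library(yardstick)

set.seed(123)
n <- 20000
prevalence <- 0.05

# Truth factor with the rare class as the first level (yardstick's default event)
truth <- factor(
  ifelse(runif(n) < prevalence, "pos", "neg"),
  levels = c("pos", "neg")
)

# Scores drawn independently of the truth: a model with no ranking skill
random_score <- runif(n)

observed_prevalence <- mean(truth == "pos")
baseline_pr_auc <- pr_auc_vec(truth, random_score)

# The two numbers should sit close to each other
c(prevalence = observed_prevalence, pr_auc = baseline_pr_auc)
```

Changing `prevalence` and rerunning should move the PR AUC of the skill-free scores along with it, which is the behaviour described above.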
pr_auc() syntax and arguments
The signature follows the yardstick probability-metric family. Argument shape changes between binary and multiclass.
| Argument | Description |
|---|---|
| data | A data frame with truth and probability columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| ... | Unquoted column names of the predicted probabilities. One column for binary; one per class for multiclass. |
| estimator | Averaging mode for multiclass: "macro" (default) or "macro_weighted". Ignored for binary. |
| na_rm | If TRUE, drop rows where any column is missing before scoring. |
| event_level | "first" or "second"; for binary, names the positive class. |
| case_weights | Optional unquoted column for weighting. |
For binary problems, pass the positive-class probability column as the third argument. For multiclass, pass every class probability column with tidyselect syntax. Truth must be a factor whose levels match the probability column names after .pred_. Unlike roc_auc(), no hand_till estimator exists because the PR curve is asymmetric.
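The two call shapes can be sketched with datasets that ship with yardstick. This example assumes `two_class_example` (binary, probability column `Class1`) and `hpc_cv` (four classes, probability columns `VF` through `L`) are available after loading the package; they stand in for your own data.

```r
library(yardstick)

# Binary: one probability column, belonging to the event level
# (the first factor level by default)
binary_res <- pr_auc(two_class_example, truth, Class1)

# Multiclass: one probability column per class, selected with tidyselect
multi_res <- pr_auc(hpc_cv, obs, VF:L)

binary_res
multi_res
```

Both calls return the same three columns; the `.estimator` column records whether the binary or the macro path was used.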
Score classifiers: four worked examples
The examples build an imbalanced two-class tibble you can run end to end. It mimics a fraud-detection problem where positives are rare.
Example 1 calls pr_auc() with the positive-class probability column. "fraud" is the first factor level, so yardstick treats it as positive.
A 0.293 PR AUC on a 5 percent positive base is real lift; the random baseline equals 0.054 (the prevalence). ROC AUC on the same predictions would look optimistically higher.
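Example 1 can be sketched as follows. The data here are a hypothetical simulation, not the article's original tibble, so the estimate it prints will not reproduce the 0.293 quoted above; the column names `.pred_fraud` and `.pred_legit` and the simulation parameters are assumptions made for illustration.

```r
library(yardstick)
library(tibble)

set.seed(2024)
n <- 5000

# Hypothetical simulation: roughly 5% fraud, scores only loosely informative
is_fraud <- rbinom(n, size = 1, prob = 0.05)
score <- plogis(-3 + 1.5 * is_fraud + rnorm(n))

fraud_df <- tibble(
  # "fraud" is the first level, so yardstick treats it as the positive class
  class = factor(ifelse(is_fraud == 1, "fraud", "legit"),
                 levels = c("fraud", "legit")),
  .pred_fraud = score,
  .pred_legit = 1 - score
)

prevalence <- mean(fraud_df$class == "fraud")  # random-ranking baseline
res <- pr_auc(fraud_df, class, .pred_fraud)

res
prevalence
```

Comparing `res$.estimate` against `prevalence` gives the lift over a model with no ranking skill.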
Example 2 flips the positive class with event_level. Setting "second" makes "legit" positive.
The flipped estimate is the PR AUC for the majority class, which sits near 1.0 because the majority baseline is already 0.946. PR AUC is asymmetric, so the two scores are never mirror images; always state which class is positive in any report.
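A sketch of Example 2 on simulated data follows. The tibble and the `.pred_*` column names are hypothetical, and the code assumes that when `event_level = "second"` is set you also pass the probability column for that second level (`.pred_legit` here).

```r
library(yardstick)
library(tibble)

set.seed(2024)
n <- 5000

# Hypothetical simulation: roughly 5% fraud
is_fraud <- rbinom(n, size = 1, prob = 0.05)
score <- plogis(-3 + 1.5 * is_fraud + rnorm(n))

fraud_df <- tibble(
  class = factor(ifelse(is_fraud == 1, "fraud", "legit"),
                 levels = c("fraud", "legit")),
  .pred_fraud = score,
  .pred_legit = 1 - score
)

# Default: first level ("fraud") is positive
default_res <- pr_auc(fraud_df, class, .pred_fraud)

# Flipped: second level ("legit") is positive, scored with its own probability
flipped_res <- pr_auc(fraud_df, class, .pred_legit, event_level = "second")

default_res
flipped_res
```

The two estimates do not sum to 1, which is a quick way to see the asymmetry for yourself.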
Example 3 scores per resample fold. Pair group_by() with pr_auc() to get one score per fold without a loop.
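Per-fold scoring might look like the sketch below. The `fold` column is a hypothetical random assignment into five groups rather than output from a real resampling workflow, and the simulated data and column names are assumptions.

```r
library(yardstick)
library(tibble)
library(dplyr)

set.seed(2024)
n <- 5000

# Hypothetical simulation: roughly 5% fraud, plus a made-up fold label
is_fraud <- rbinom(n, size = 1, prob = 0.05)
score <- plogis(-3 + 1.5 * is_fraud + rnorm(n))

fraud_df <- tibble(
  class = factor(ifelse(is_fraud == 1, "fraud", "legit"),
                 levels = c("fraud", "legit")),
  .pred_fraud = score,
  fold = sample(rep(paste0("fold", 1:5), length.out = n))
)

# One row per fold; yardstick respects the dplyr grouping
by_fold <- fraud_df |>
  group_by(fold) |>
  pr_auc(class, .pred_fraud)

by_fold
```

The spread of `.estimate` across folds gives a rough sense of how stable the score is with this few positives per fold.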
Example 4 uses the vector interface for ad-hoc checks. pr_auc_vec() accepts a truth factor and a probability vector and returns a plain numeric scalar.
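The vector interface can be illustrated with the `two_class_example` data bundled with yardstick, used here as a stand-in for the fraud tibble.

```r
library(yardstick)

# Plain vectors in, plain number out
est <- pr_auc_vec(
  truth = two_class_example$truth,
  estimate = two_class_example$Class1
)

est

# Same value as the data-frame interface, without the tibble wrapper
tidy_est <- pr_auc(two_class_example, truth, Class1)$.estimate
all.equal(est, tidy_est)
```

Because the result is a bare numeric, it drops straight into `ifelse()`, `stopifnot()`, or a custom summary without unpacking a tibble.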
metric_set(pr_auc, roc_auc, brier_class) returns a reusable function that scores ranking and calibration in one call, the minimum honest report for any imbalanced classifier.
Multiclass averaging: macro and macro_weighted
pr_auc() handles multiclass data with two averaging modes. Yardstick defaults to "macro", the unweighted mean of one-vs-rest PR AUC.
Comparing modes shows how prevalence shifts the score:
| Estimator | What it does | When to use |
|---|---|---|
| macro | One-vs-rest PR AUC, unweighted mean | Treat every class as equally important |
| macro_weighted | One-vs-rest PR AUC, weighted by prevalence | Reflect production traffic mix |
macro is the safe default because no single abundant class can dominate the score. Production teams often prefer macro_weighted so the headline number tracks real traffic mix.
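The two averaging modes can be compared on the `hpc_cv` dataset bundled with yardstick, used here as a stand-in for your own multiclass predictions. Its truth column is `obs` and its class probability columns run from `VF` to `L`.

```r
library(yardstick)

# Unweighted mean of the one-vs-rest scores (the default)
macro <- pr_auc(hpc_cv, obs, VF:L)

# Same one-vs-rest scores, weighted by how often each class occurs
weighted <- pr_auc(hpc_cv, obs, VF:L, estimator = "macro_weighted")

# Class shares that drive the weights
prop.table(table(hpc_cv$obs))

rbind(macro, weighted)
```

When the class shares are uneven, as they are in `hpc_cv`, the two rows differ; the size of the gap indicates how much the abundant classes pull the weighted score.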
pr_auc vs roc_auc vs accuracy
Several metrics tend to be reported alongside PR AUC. The table sorts out the overlap.
| Function | What it scores | When to prefer it |
|---|---|---|
| pr_auc() | Area under the PR curve | Severe imbalance; rare positives |
| roc_auc() | Probability positive outranks negative | Balanced data, threshold-free reports |
| accuracy() | Share of correct labels at one cutoff | Reporting a deployed threshold |
| mn_log_loss() | Mean negative log-likelihood | Penalty for confident wrong predictions |
| brier_class() | MSE between probability and truth | Calibration metric, lower is better |
pr_auc() is the right call whenever the minority class drives business value: fraud, churn, rare disease, click conversion. On a 1:99 imbalance, ROC AUC can read 0.95 while PR AUC sits at 0.20, and the lower number is the honest one.
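The gap between the two AUCs can be seen on simulated data. This is a sketch under assumed parameters (about 1 percent positives, moderately informative scores); the exact numbers depend on the simulation and are not meant to reproduce the 0.95 and 0.20 figures above. The names `sim` and `ranking_metrics` are illustrative.

```r
library(yardstick)
library(tibble)

set.seed(99)
n <- 20000

# Hypothetical 1:99 problem
is_pos <- rbinom(n, size = 1, prob = 0.01)
sim <- tibble(
  class = factor(ifelse(is_pos == 1, "event", "other"),
                 levels = c("event", "other")),
  .pred_event = plogis(-4 + 2 * is_pos + rnorm(n))
)

# Score both ranking metrics on the same predictions in one call
ranking_metrics <- metric_set(roc_auc, pr_auc)
scores <- ranking_metrics(sim, truth = class, .pred_event)

scores
```

Reading the two rows side by side, together with the 1 percent prevalence, shows how much of the ROC AUC comes from the abundance of easy negatives.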
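yardstick::pr_auc() integrates the points returned by pr_curve() with the trapezoidal rule, so it is closer to sklearn.metrics.auc() applied to PR points than to sklearn.metrics.average_precision_score(). The step-function counterpart in yardstick is average_precision(). The estimator argument maps to the average argument of scikit-learn's average_precision_score(): "macro" to average="macro", "macro_weighted" to average="weighted".
Common pitfalls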
Three small mistakes account for most pr_auc() errors.
The first is feeding class labels instead of probabilities. pr_auc() needs the positive-class probability. Passing .pred_class from parsnip::augment() when you meant .pred_yes is the common slip.
The second is the event_level and column-order interaction. The factor level order in truth pins the positive class, not the probability column name. If levels(df$truth) is c("no", "yes"), then "no" is positive by default. PR AUC is asymmetric, so the wrong positive can swing the score by 0.5 or more.
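A quick check of the level order, with two ways to make "yes" the positive class, might look like this. The toy vectors are hypothetical and exist only to show that both routes give the same score.

```r
library(yardstick)

# Hypothetical toy data: "no" is the first level, "yes" is the class of interest
truth <- factor(
  c("no", "no", "yes", "no", "yes", "no", "no", "no", "yes", "no"),
  levels = c("no", "yes")
)
prob_yes <- c(0.10, 0.35, 0.80, 0.20, 0.65, 0.05, 0.55, 0.30, 0.45, 0.15)

# The class yardstick treats as positive by default
levels(truth)[1]

# Option 1: keep the levels and point yardstick at the second one
via_event_level <- pr_auc_vec(truth, prob_yes, event_level = "second")

# Option 2: reorder the levels so the class of interest comes first
truth_releveled <- relevel(truth, ref = "yes")
via_relevel <- pr_auc_vec(truth_releveled, prob_yes)

c(event_level = via_event_level, relevel = via_relevel)
```

Releveling once at data-preparation time avoids having to remember `event_level` in every later metric call.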
The third pitfall is comparing PR AUC across datasets with different class balance. The random baseline equals the positive prevalence, so 0.30 on a 5 percent set differs from 0.30 on a 50 percent set. Report prevalence alongside the score, or compute lift over baseline.
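Reporting prevalence and lift next to the score takes a few lines. The sketch below uses the bundled `two_class_example` data and defines lift as the ratio of PR AUC to prevalence, which is one reasonable convention rather than a yardstick feature.

```r
library(yardstick)

# Share of the positive class (first level, "Class1")
prevalence <- mean(two_class_example$truth == "Class1")

estimate <- pr_auc_vec(two_class_example$truth, two_class_example$Class1)

# Assumed definition: how many times better than a skill-free ranking
lift <- estimate / prevalence

data.frame(prevalence = prevalence, pr_auc = estimate, lift = lift)
```

Two models scored on datasets with different balance can then be compared on the lift column instead of the raw estimate.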
Try it yourself
Try it: Use the built-in two_class_example data from yardstick. The truth column is truth and the probability column for the positive class (Class1) is Class1. Compute the PR AUC for the default positive class, then compute it again with event_level = "second". Save the second result to ex_pr_alt.
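One possible solution, written as a sketch that follows the exercise literally and keeps the `Class1` column in both calls:

```r
library(yardstick)

# Default: the first level (Class1) is the positive class
ex_pr_default <- pr_auc(two_class_example, truth, Class1)

# Flipped: the second level (Class2) becomes the positive class,
# still scored against the Class1 probability column as the exercise asks
ex_pr_alt <- pr_auc(two_class_example, truth, Class1, event_level = "second")

ex_pr_default
ex_pr_alt
```

To score Class2 with its own probabilities instead, the `Class2` column would be passed together with `event_level = "second"`.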
Explanation: event_level = "second" flips Class2 to be the positive class while still scoring against the Class1 probability column. Because PR AUC is asymmetric, the result is not 1 minus the default score; the metric is recomputed against the new positive set.
Related yardstick metrics
pr_auc() is one entry in the yardstick probability-metric family. Reach for these when PR AUC alone is not enough:
- roc_auc() for area under the ROC curve, better on balanced data
- mn_log_loss() for log-likelihood penalty
- brier_class() for calibration error of binary probabilities
- pr_curve() to extract threshold points for plotting
- accuracy(), recall(), precision() for one-threshold scoring
For the full set, see the yardstick reference index.
FAQ
What is a good PR AUC score in R?
PR AUC ranges from 0 to 1, but the random baseline is not 0.5; it equals the positive-class prevalence in your data. On a 1 percent positive set, the random baseline is 0.01, so 0.20 is real lift. Above 0.80 is strong on most business problems; above 0.95 is exceptional and worth checking for leakage.
Does pr_auc() need probabilities or class labels?
It needs probabilities. The function scores how the model ranks positives across every threshold, which requires a continuous score. Pass the column holding the predicted probability of the positive class, usually .pred_<positive_class> from parsnip::augment(). Feeding .pred_class does not give a usable score: current yardstick versions expect a numeric estimate and reject a factor column with an error.
When should I use pr_auc instead of roc_auc?
Use pr_auc() whenever the positive class is rare (under 10 percent prevalence) or when false positives carry a real cost. ROC AUC integrates against abundant negatives, so it overstates performance on imbalanced problems. PR AUC ignores true negatives entirely and tracks only the pair (precision, recall) that matters for finding the minority class.
Why does yardstick default to macro for multiclass PR AUC?
macro treats every class equally regardless of size, so a dominant class cannot mask poor performance on minority classes. Switch to macro_weighted via the estimator argument when the headline score should reflect production traffic mix.
How is pr_auc different from average precision in scikit-learn?
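They are close but not identical. Yardstick's pr_auc() applies the trapezoidal rule to the points from pr_curve(), which is the linear interpolation that sklearn.metrics.auc() would apply to PR points. The step-function definition used by sklearn.metrics.average_precision_score() corresponds to yardstick's average_precision(); use that function when you need numbers comparable with scikit-learn's average precision.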
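The relationship between the two definitions can be checked by hand from the curve points. This is a sketch on the bundled `two_class_example` data, assuming (per the yardstick documentation as best understood here) that pr_auc() integrates the pr_curve() points with the trapezoidal rule and that average_precision() sums recall increments weighted by precision.

```r
library(yardstick)

curve <- pr_curve(two_class_example, truth, Class1)
r <- curve$recall
p <- curve$precision

# Trapezoidal rule over the (recall, precision) points
trapezoid_area <- sum(diff(r) * (head(p, -1) + tail(p, -1)) / 2)

# Step-function sum: each recall increment weighted by the precision reached
step_area <- sum(diff(r) * tail(p, -1))

pr_auc_est <- pr_auc_vec(two_class_example$truth, two_class_example$Class1)
avg_prec_est <- average_precision_vec(two_class_example$truth, two_class_example$Class1)

c(
  pr_auc = pr_auc_est,
  trapezoid = trapezoid_area,
  average_precision = avg_prec_est,
  step = step_area
)
```

On well-populated curves the two areas usually differ only in the third decimal place, but the gap can widen when there are few positives or many tied scores.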