yardstick accuracy() in R: Score Classification Models
The yardstick accuracy() function in R returns the proportion of correctly classified rows for a fitted classification model, accepting a tibble of truth and prediction columns and returning a tidy one-row summary you can group, average, or chart.
    accuracy(df, truth, estimate)                          # basic two-class call
    accuracy(df, truth = obs, estimate = pred)             # named arguments
    accuracy(df, class, .pred_class)                       # default parsnip output
    df |> group_by(fold) |> accuracy(class, .pred_class)   # by resample
    accuracy(df, class, .pred_class, na_rm = TRUE)         # drop missing rows (the default)
    accuracy_vec(truth_vec, pred_vec)                      # vector interface
    sens(df, class, .pred_class, event_level = "second")   # event_level applies to sens(), precision(), recall(); accuracy() is symmetric
Need explanation? Read on for examples and pitfalls.
What accuracy() measures
accuracy() counts how many predictions match the truth and divides by the total rows. You pass a data frame containing both the observed labels and the predicted labels, and the function returns a one-row tibble with three columns: .metric, .estimator, and .estimate. The estimate is a number between 0 and 1 where 1 means every row was classified correctly.
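To make the definition concrete, here is a minimal sketch that compares accuracy() with the same proportion computed by hand. The data frame toy and its columns obs and pred are made up for illustration.

```r
library(yardstick)

# Hypothetical toy data: six observed and predicted labels with shared levels.
lvls <- c("yes", "no")
toy <- data.frame(
  obs  = factor(c("yes", "yes", "no", "no", "yes", "no"),  levels = lvls),
  pred = factor(c("yes", "no",  "no", "no", "yes", "yes"), levels = lvls)
)

# One-row summary with .metric, .estimator, and .estimate.
acc_tbl <- accuracy(toy, truth = obs, estimate = pred)
print(acc_tbl)

# The same number by hand: matching rows divided by total rows.
manual <- mean(toy$obs == toy$pred)
print(manual)
```

Four of the six rows match, so both values should come out at two thirds.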
The function works for two-class and multiclass problems out of the box. For multiclass cases it ignores per-class weighting, since accuracy is, by definition, a global score. If you need class-balanced reporting, reach for bal_accuracy() or use precision() and recall() per class.
Accuracy is the right headline metric when classes are roughly balanced and every error costs the same. It becomes misleading the moment one class dominates the data, because a model that always predicts the majority can still score high.
accuracy() syntax and arguments
The signature mirrors every other yardstick class metric. That means the same call shape works whether you score logistic regression, a random forest, or a neural net.
| Argument | Description |
|---|---|
| data | A data frame with the truth and estimate columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| estimate | Unquoted column name of the predicted class labels (a factor with matching levels). |
| na_rm | If TRUE (the default), drop rows where either column is missing before scoring. |
| case_weights | Optional column of row weights, useful for survey or class-rebalanced data. |
| event_level | "first" or "second"; selects the positive class in asymmetric two-class metrics such as sens(), precision(), and recall(). It is not part of the accuracy() signature, because accuracy is the same whichever level counts as positive. |
Truth and estimate must both be factors with identical levels. The most common error you will hit is passing a character vector or a factor with mismatched levels, both covered in the pitfalls section.
Score classification models: four examples
The examples below build a small two-class data set, then a multiclass one, so you can run accuracy() against both shapes. First, load yardstick and create a simple two-class prediction frame.
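The setup step might look like the sketch below. The labels "spam" and "ham" and the name two_class are placeholders chosen for illustration, not data from a real model.

```r
library(yardstick)

# Hypothetical two-class predictions; both columns are factors with the same levels.
lvls <- c("spam", "ham")
two_class <- data.frame(
  truth    = factor(c("spam", "ham", "spam", "ham", "ham",  "spam", "ham", "spam"), levels = lvls),
  estimate = factor(c("spam", "ham", "ham",  "ham", "spam", "spam", "ham", "spam"), levels = lvls)
)

str(two_class)
```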
Example 1 calls accuracy() with positional arguments. The function picks up the truth and estimate columns by position and returns the tidy summary.
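A minimal sketch of the positional call follows. It rebuilds an illustrative two_class frame (hypothetical labels) so the block runs on its own.

```r
library(yardstick)

# Hypothetical two-class frame: 8 rows, 6 of them predicted correctly.
lvls <- c("spam", "ham")
two_class <- data.frame(
  truth    = factor(c("spam", "ham", "spam", "ham", "ham",  "spam", "ham", "spam"), levels = lvls),
  estimate = factor(c("spam", "ham", "ham",  "ham", "spam", "spam", "ham", "spam"), levels = lvls)
)

# Positional arguments: data, then truth, then estimate.
ex1 <- accuracy(two_class, truth, estimate)
print(ex1)
```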
The .estimator column reports binary because yardstick detected exactly two factor levels. For multiclass data it will switch to multiclass.
Example 2 scores a multiclass setup with named arguments. Named arguments are clearer when the truth and estimate columns sit next to other features.
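One way to sketch the multiclass case is shown below. The frame multi, its id column, and the three species labels are invented for the example.

```r
library(yardstick)

# Hypothetical three-class predictions stored next to an unrelated id column.
lvls3 <- c("setosa", "versicolor", "virginica")
multi <- data.frame(
  id   = 1:6,
  obs  = factor(c("setosa", "versicolor", "virginica", "setosa", "versicolor", "virginica"), levels = lvls3),
  pred = factor(c("setosa", "virginica",  "virginica", "setosa", "versicolor", "setosa"),    levels = lvls3)
)

# Named arguments make it explicit which column plays which role.
ex2 <- accuracy(multi, truth = obs, estimate = pred)
print(ex2)
```

With three factor levels in the truth column, the .estimator value should read multiclass rather than binary.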
Example 3 groups scoring by resample fold. When you store predictions from cross-validation in a single tibble, group_by() plus accuracy returns one score per group, which makes per-fold comparisons trivial.
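The grouped pattern can be sketched as follows, assuming dplyr is installed. The frame cv_preds and its two folds are hypothetical stand-ins for real resampling output.

```r
library(dplyr)
library(yardstick)

# Hypothetical cross-validation predictions: two folds of four rows each.
lvls <- c("yes", "no")
cv_preds <- data.frame(
  fold        = rep(c("Fold1", "Fold2"), each = 4),
  class       = factor(c("yes", "no", "yes", "no", "yes", "yes", "no", "no"),  levels = lvls),
  .pred_class = factor(c("yes", "no", "no",  "no", "yes", "no",  "no", "yes"), levels = lvls)
)

# One accuracy row per fold; the grouping column is kept in the result.
ex3 <- cv_preds |>
  group_by(fold) |>
  accuracy(class, .pred_class)
print(ex3)
```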
Example 4 uses the vector interface for ad-hoc checks. When you do not have a tibble handy, accuracy_vec() takes two factors directly and returns a numeric scalar instead of a one-row tibble.
The vector form is the right choice inside map() calls or unit tests where you want a plain number, not a tibble.
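A short sketch of the vector interface, using two made-up factors named truth_vec and pred_vec:

```r
library(yardstick)

# Hypothetical factors with identical levels.
truth_vec <- factor(c("a", "b", "a", "b"), levels = c("a", "b"))
pred_vec  <- factor(c("a", "b", "b", "b"), levels = c("a", "b"))

# Returns a plain numeric scalar, not a tibble.
ex4 <- accuracy_vec(truth_vec, pred_vec)
print(ex4)

# A scalar is convenient in assertions, for example inside a unit test.
stopifnot(ex4 >= 0, ex4 <= 1)
```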
Chain group_by() |> accuracy() |> ggplot() to plot per-fold or per-model scores with no reshaping.

accuracy() compared with related metrics
Accuracy is the most intuitive metric, but rarely the best one in isolation. The table below summarizes when to pick another yardstick metric.
| Metric | Best use case | Limitation |
|---|---|---|
| accuracy() | Balanced classes, equal error costs | Misleading on imbalanced data |
| bal_accuracy() | Imbalanced classes, equal error costs | Hides which class drives errors |
| precision(), recall() | Asymmetric costs, fraud, medical screening | Two numbers to track, not one |
| f_meas() | Need a single imbalance-aware score | The default beta = 1 weights precision and recall equally, which may not match your error costs |
| roc_auc() | Compare probability rankings across thresholds | Requires probability predictions, not class labels |
A practical workflow is to report accuracy as the headline number, then add at least one threshold-aware metric like roc_auc() or a class-aware metric like bal_accuracy() whenever the class distribution is skewed.
Common pitfalls
Three small mistakes account for most accuracy() errors. Each one has a clear fix.
The first is passing character vectors instead of factors. yardstick requires factors so it can validate that both columns share the same levels. The fix is a factor() call on each column, using the same levels argument for both.
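As a sketch of that fix, the block below converts two character columns to factors with a shared levels vector before scoring. The frame raw and its columns are hypothetical.

```r
library(yardstick)

# Hypothetical predictions that arrived as character columns.
raw <- data.frame(
  obs  = c("yes", "no",  "yes"),
  pred = c("yes", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Convert both columns with one shared levels vector, so a level that never
# appears in pred ("no" here) is still declared.
lvls <- c("yes", "no")
fixed <- transform(
  raw,
  obs  = factor(obs,  levels = lvls),
  pred = factor(pred, levels = lvls)
)

fixed_acc <- accuracy(fixed, obs, pred)
print(fixed_acc)
```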
The second pitfall is mismatched factor levels. If truth has levels c("yes", "no") but estimate has levels c("no", "yes"), two-class accuracy still works, but the positive class flips, which changes downstream precision and recall scores. Always set both factors with the same levels argument.
The third pitfall is calling accuracy on regression output. The function rejects numeric inputs, but a silent failure mode is when class probabilities (numeric) get accidentally passed as the estimate. The fix is to use predict(model, type = "class") or pull .pred_class from parsnip::augment() output, not the probability columns.
Pair accuracy() with bal_accuracy() or a confusion matrix when classes are skewed.

Try it yourself
Try it: Use the built-in two_class_example data from yardstick. Compute the overall accuracy, then group by the truth column and report accuracy per truth value. Save the per-group result to ex_acc_by_class.
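One possible solution is sketched below. It assumes dplyr is installed and relies on the truth and predicted columns that ship with two_class_example; the name overall_acc is just a placeholder.

```r
library(dplyr)
library(yardstick)

# Overall accuracy on the built-in example data.
overall_acc <- accuracy(two_class_example, truth, predicted)
print(overall_acc)

# Accuracy within each observed class.
ex_acc_by_class <- two_class_example |>
  group_by(truth) |>
  accuracy(truth, predicted)
print(ex_acc_by_class)
```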
Explanation: Grouping by truth before calling accuracy() is the same as computing recall for each class. yardstick still returns a tidy tibble, so you can pipe it into ggplot() or arrange() without reshaping.
Related yardstick metrics
accuracy() is one entry in the yardstick class-metric family. Reach for these neighbors when accuracy alone is not enough:
- bal_accuracy() averages per-class recall, robust to imbalance
- precision() and recall() for asymmetric error costs
- f_meas() for a single number combining precision and recall
- kap() for chance-corrected agreement
- mcc() for the Matthews correlation coefficient
- roc_auc() and pr_auc() for probability-based ranking metrics
For the full set, see the yardstick reference index.
FAQ
Why does accuracy() return a tibble instead of a number?
The tidy-data return shape is the defining yardstick convention. Every metric, from accuracy() to roc_auc(), returns the same three columns: .metric, .estimator, and .estimate. That uniformity lets you bind_rows() multiple metric calls or chain group_by() |> metric() without writing reshape code. When you just want the scalar, call accuracy_vec() instead.
Does accuracy() work for multiclass classification?
Yes. yardstick detects the number of factor levels in the truth column and switches the .estimator value between binary and multiclass automatically. The arithmetic is identical: total correct divided by total rows. For multiclass problems you often want bal_accuracy() or per-class precision() and recall() alongside accuracy, since a single multiclass accuracy hides which class the model is failing on.
How do I compute accuracy from a parsnip workflow?
Pipe predict() or augment() output into accuracy. With augment(fit, new_data = test), the output already contains truth and .pred_class columns, so the call becomes augment(fit, test) |> accuracy(class, .pred_class). Replace class with the name of your outcome column.
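A rough end-to-end sketch, assuming parsnip is installed, is shown below. The mtcars split, the am ~ wt formula, and the names cars, train, test, and fit_glm are illustrative choices, not a recommended modeling setup.

```r
library(parsnip)
library(yardstick)

# Hypothetical setup: recode am as a two-level factor outcome.
cars <- transform(
  mtcars,
  am = factor(am, levels = c(1, 0), labels = c("manual", "automatic"))
)

# Naive alternating split, only to keep the example short.
train <- cars[seq(1, nrow(cars), by = 2), ]
test  <- cars[seq(2, nrow(cars), by = 2), ]

# glm may warn about fitted probabilities of 0 or 1 on data this small.
fit_glm <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ wt, data = train)

# augment() adds .pred_class next to the outcome column, so accuracy()
# can score the result directly.
wf_acc <- augment(fit_glm, new_data = test) |>
  accuracy(am, .pred_class)
print(wf_acc)
```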
What is the difference between accuracy() and bal_accuracy()?
accuracy() is a global count of correct predictions. bal_accuracy() is the average of per-class recall (also known as macro recall). On balanced data they return similar numbers; on imbalanced data they can diverge sharply. If 95 percent of rows belong to one class, a model that predicts only that class gets 95 percent accuracy and 50 percent balanced accuracy. Always report both when classes are skewed.
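The 95 percent scenario can be reproduced with a small synthetic frame. The labels "pos" and "neg" and the name imb are assumptions made for this sketch.

```r
library(yardstick)

# Hypothetical imbalanced data: 5 positives, 95 negatives, and a "model"
# that always predicts the majority class.
lvls <- c("pos", "neg")
imb <- data.frame(
  truth = factor(c(rep("pos", 5), rep("neg", 95)), levels = lvls),
  pred  = factor(rep("neg", 100), levels = lvls)
)

acc <- accuracy(imb, truth, pred)$.estimate
bal <- bal_accuracy(imb, truth, pred)$.estimate

print(c(accuracy = acc, bal_accuracy = bal))
```

Recall is 0 for the minority class and 1 for the majority class, so balanced accuracy averages to 0.5 while plain accuracy stays at 0.95.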
Can I weight rows when computing accuracy?
Yes. Pass a case_weights column when you call accuracy. Common uses are survey weights, class-rebalancing weights from themis, or the same case weights you supplied to parsnip::fit(). The result is a weighted proportion correct, still bounded between 0 and 1.
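A minimal sketch of weighted scoring follows, assuming a yardstick version with case-weight support (1.0.0 or later). The frame wtd and its weight column wt are made up.

```r
library(yardstick)

# Hypothetical predictions with a numeric weight per row.
lvls <- c("yes", "no")
wtd <- data.frame(
  class       = factor(c("yes", "no", "yes", "no"), levels = lvls),
  .pred_class = factor(c("yes", "no", "no",  "no"), levels = lvls),
  wt          = c(1, 1, 4, 2)
)

# Unweighted: 3 of 4 rows are correct.
unweighted <- accuracy(wtd, class, .pred_class)$.estimate

# Weighted: correct weight (1 + 1 + 2) divided by total weight (8).
weighted <- accuracy(wtd, class, .pred_class, case_weights = wt)$.estimate

print(c(unweighted = unweighted, weighted = weighted))
```

The single misclassified row carries half of the total weight, which is why the weighted score drops well below the unweighted one.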
Summary
accuracy() is the simplest yardstick classification metric and the easiest to misuse. Use it as a fast sanity check, switch to bal_accuracy() or per-class metrics when classes are imbalanced, and lean on the vector interface only when you do not need the tidy tibble shape. Combined with group_by() it scales to any resampling scheme without reshaping.