yardstick spec() in R: Score True Negative Rate
The yardstick spec() function in R returns the share of actual negatives the classifier correctly rejected, accepting a tibble of truth and prediction columns and returning a tidy one-row summary you can group, average across classes, or pair with sens() in a metric set.
    spec(df, truth, estimate)                                 # basic two-class call
    spec(df, truth = obs, estimate = pred)                    # named arguments
    spec(df, class, .pred_class)                              # default parsnip output
    df |> group_by(fold) |> spec(class, .pred_class)          # by resample
    spec(df, class, .pred_class, estimator = "macro")         # multiclass macro average
    spec(df, class, .pred_class, event_level = "second")      # flip positive class
    spec_vec(truth_vec, pred_vec)                             # vector interface
Need explanation? Read on for examples and pitfalls.
What spec() measures
spec() answers a single question: of every real negative case, how many did the model correctly reject? You pass a data frame with observed and predicted labels, and the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate is true negatives over the sum of true negatives and false positives, a number between 0 and 1.
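The ratio can be checked by hand. The sketch below uses a made-up six-row example (the labels and the names lv, truth, and pred are assumptions for illustration) and compares the manual count against spec_vec():

```r
library(yardstick)

# Hypothetical toy labels; "spam" is the first level, so it is the positive class
lv    <- c("spam", "ham")
truth <- factor(c("spam", "spam", "ham", "ham", "ham",  "ham"), levels = lv)
pred  <- factor(c("spam", "ham",  "ham", "ham", "spam", "ham"), levels = lv)

# Only rows whose true label is the negative class ("ham") enter the ratio
tn <- sum(truth == "ham" & pred == "ham")   # real ham correctly kept
fp <- sum(truth == "ham" & pred == "spam")  # real ham wrongly flagged

manual <- tn / (tn + fp)
manual
spec_vec(truth, pred)  # should agree with the manual ratio
```

The two rows whose truth is "spam" never touch the calculation, which is why specificity says nothing about how many positives the model catches.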
Specificity is the headline metric whenever a false positive is expensive. Spam filters, fraud alerts, and confirmatory drug tests all share the shape where wrongly flagging a clean case costs trust or a needless follow-up. Accuracy hides that asymmetry; spec() exposes it directly.
spec() syntax and arguments
The signature mirrors the rest of the yardstick class-metric family. The same shape works for any classifier whose predictions land in a tibble.
| Argument | Description |
|---|---|
| data | A data frame with truth and estimate columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| estimate | Unquoted column name of the predicted class labels (a factor with matching levels). |
| estimator | Averaging mode for multiclass: "binary", "macro", "macro_weighted", or "micro". Defaults to "binary" for two-class data and "macro" for multiclass. |
| na_rm | If TRUE, drop rows where either column is missing before scoring. |
| event_level | "first" or "second"; controls which factor level counts as the positive class. The negative class is then whichever level is not positive. |
Both truth and estimate must be factors with identical levels. The common slip is passing class probabilities; spec() needs labels, so feed .pred_class from parsnip::augment(), not a .pred_<level> column.
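When labels arrive as character columns, one way to meet the factor requirement is to convert both columns with the same explicit levels. This is a minimal sketch on made-up data; the data frame raw and its columns obs and pred are hypothetical names:

```r
library(yardstick)

# Hypothetical raw predictions with character labels, e.g. as read from a CSV
raw <- data.frame(
  obs  = c("ham", "spam", "ham", "ham",  "spam"),
  pred = c("ham", "ham",  "ham", "spam", "spam")
)

# Same levels, same order in both columns; listing "spam" first makes it positive
lv <- c("spam", "ham")
raw$obs  <- factor(raw$obs,  levels = lv)
raw$pred <- factor(raw$pred, levels = lv)

res <- spec(raw, truth = obs, estimate = pred)
res
```

Setting levels explicitly, rather than relying on alphabetical order, also makes the choice of positive class visible in the code.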
Score classifiers: four worked examples
The examples below build a two-class frame and a multiclass frame so you can run spec() against both shapes. Start with a small spam-filter tibble where flagging a real email as spam is the costly mistake.
Example 1 calls spec() with positional arguments. Because "spam" is the first factor level, yardstick treats it as positive, so spec() reports the fraction of real ham kept out of the spam folder.
A 0.713 specificity means the filter correctly accepted 71 percent of real emails. The other 29 percent landed in spam by mistake, a false-positive rate users notice fast.
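The original data for this example is not shown, so the sketch below rebuilds a stand-in. The inbox frame, its column names, and the counts are all assumptions, chosen only so the result rounds to the 0.713 quoted above:

```r
library(yardstick)

# Hypothetical counts: 87 real ham (62 kept, 25 flagged), 41 real spam (13 caught, 28 missed)
lv <- c("spam", "ham")
n  <- c(62, 25, 13, 28)
inbox <- data.frame(
  truth = factor(rep(c("ham", "ham",  "spam", "spam"), times = n), levels = lv),
  pred  = factor(rep(c("ham", "spam", "spam", "ham"),  times = n), levels = lv)
)

# Positional call: data, truth column, estimate column
ex1 <- spec(inbox, truth, pred)
ex1
round(ex1$.estimate, 3)  # 62 / 87
```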
Example 2 flips the positive class with event_level. Setting event_level = "second" makes "ham" positive, so spec() now scores how well the model rejects real spam.
Specificity drops to 0.317 because the model labels many real spam messages as ham. Always state which class is positive when you report the number.
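A sketch of the flipped call, using a hypothetical stand-in inbox whose counts are assumptions chosen so the result rounds to the 0.317 quoted above:

```r
library(yardstick)

# Hypothetical counts: 87 real ham (62 kept, 25 flagged), 41 real spam (13 caught, 28 missed)
lv <- c("spam", "ham")
n  <- c(62, 25, 13, 28)
inbox <- data.frame(
  truth = factor(rep(c("ham", "ham",  "spam", "spam"), times = n), levels = lv),
  pred  = factor(rep(c("ham", "spam", "spam", "ham"),  times = n), levels = lv)
)

# "ham" (second level) becomes positive, so real spam is now the negative class
ex2 <- spec(inbox, truth, pred, event_level = "second")
ex2
round(ex2$.estimate, 3)  # 13 / 41
```

With these counts the denominator switches from the 87 real ham messages to the 41 real spam messages, which is why the same predictions give a very different number.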
Example 3 groups scoring by resample fold. Cross-validated predictions pair with group_by() to give one specificity per fold without writing a loop.
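A rough sketch of fold-wise scoring on simulated predictions. The cv_preds tibble, its fold labels, and the random labels are made up for illustration and do not come from a real model:

```r
library(yardstick)
library(dplyr)

set.seed(1)
lv <- c("spam", "ham")

# Hypothetical cross-validated predictions: 3 folds of 40 rows each
cv_preds <- tibble(
  fold  = rep(paste0("Fold", 1:3), each = 40),
  truth = factor(sample(lv, 120, replace = TRUE), levels = lv),
  pred  = factor(sample(lv, 120, replace = TRUE), levels = lv)
)

# spec() respects the grouping and returns one row per fold
by_fold <- cv_preds |>
  group_by(fold) |>
  spec(truth, pred)
by_fold
```

The grouping column is carried into the result, so the output can go straight into a plot or a summarise() call.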
Example 4 uses the vector interface for ad-hoc checks. spec_vec() accepts two factors and returns a scalar, handy inside summarise() or a diagnostic.
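A small sketch of the vector interface, on made-up factors (truth_vec and pred_vec are hypothetical names), first standalone and then inside summarise():

```r
library(yardstick)
library(dplyr)

lv <- c("spam", "ham")
truth_vec <- factor(c("ham", "ham",  "ham", "spam", "spam", "ham"), levels = lv)
pred_vec  <- factor(c("ham", "spam", "ham", "spam", "ham",  "ham"), levels = lv)

# Returns a bare number rather than a tibble
scalar <- spec_vec(truth_vec, pred_vec)
scalar

# The same call slots into summarise() next to other summaries
smry <- tibble(truth = truth_vec, pred = pred_vec) |>
  summarise(specificity = spec_vec(truth, pred), n = n())
smry
```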
metric_set(sens, spec, bal_accuracy) returns a reusable function that scores all three at once, the minimum honest report for an imbalanced classifier.
Multiclass averaging: macro, weighted, and micro
spec() handles multiclass data, but the answer depends on the averaging rule. yardstick computes specificity per class (treating each in turn as positive) and combines the values.
Scoring under each averaging mode gives a useful comparison:
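One way to run that comparison is sketched below on hpc_cv, a multiclass example dataset bundled with yardstick (truth in obs, predictions in pred). Using this dataset is an assumption; the original example's data is not shown:

```r
library(yardstick)
library(dplyr)

# Score the same predictions under each multiclass averaging rule
modes <- c("macro", "macro_weighted", "micro")
by_mode <- bind_rows(
  lapply(modes, function(m) spec(hpc_cv, obs, pred, estimator = m))
)
by_mode
```

The .estimator column records which rule produced each row, so the three results stay distinguishable after binding.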
| Estimator | What it does | When to use |
|---|---|---|
| macro | Unweighted mean of per-class specificity | Treat every class as equally important, regardless of size |
| macro_weighted | Mean weighted by class prevalence in truth | Reflect the class mix in the real population |
| micro | Pool all predictions, then compute one ratio | Match the global true-negative rate across pooled outcomes |
| binary | Score one positive class only | Two-class problems, set by default |
Imbalanced studies often report macro specificity so a rare class counts equally. Production dashboards prefer macro_weighted to track real traffic prevalence.
spec vs sens vs precision
Several confusion-matrix metrics sit around specificity; the right comparison depends on which side of the matrix matters. The table sorts the overlap.
| Function | What it scores | Same as |
|---|---|---|
| spec() | True negatives over actual negatives | Specificity, true negative rate |
| sens() | True positives over actual positives | Sensitivity, recall, true positive rate |
| precision() | True positives over predicted positives | Positive predictive value (PPV) |
| npv() | True negatives over predicted negatives | Negative predictive value |
| bal_accuracy() | Mean of sens and spec | Balanced accuracy |
| roc_auc() | Ranking over (1 - specificity, sensitivity) | Area under the ROC curve |
spec() and sens() are the natural pair: spec scores the negative class, sens scores the positive class, and ROC curves trace their trade-off. Do not confuse spec() with npv(): spec conditions on truth, npv conditions on the prediction.
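The difference between conditioning on truth and conditioning on the prediction shows up in the denominators. The sketch below works from a plain table() of yardstick's bundled two_class_example data (Class1 is the first level, so it is positive); the cell names tn, fp, and fn are just local variables:

```r
library(yardstick)

# Rows are predictions, columns are truth
cm <- table(pred = two_class_example$predicted, truth = two_class_example$truth)
tn <- as.numeric(cm["Class2", "Class2"])  # predicted negative, actually negative
fp <- as.numeric(cm["Class1", "Class2"])  # predicted positive, actually negative
fn <- as.numeric(cm["Class2", "Class1"])  # predicted negative, actually positive

spec_manual <- tn / (tn + fp)  # divides by actual negatives (a truth column)
npv_manual  <- tn / (tn + fn)  # divides by predicted negatives (a prediction row)

spec_manual
spec(two_class_example, truth, predicted)
npv_manual
npv(two_class_example, truth, predicted)
```

Both metrics share the same numerator; only the denominator changes, which is why they can move in opposite directions when class balance shifts.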
scikit-learn has no specificity_score. The equivalent is recall_score(y_true, y_pred, pos_label=<negative class>), or compute it from confusion_matrix(). The estimator argument maps to scikit-learn's average: "macro", "micro", and "macro_weighted" align with "macro", "micro", and "weighted".
Common pitfalls
Three small mistakes account for most spec() errors.
The first is reporting high specificity without checking sensitivity. A model that calls nothing positive scores 1.0 specificity and zero sensitivity, looks great, and never raises an alarm. Always pair specificity with sens() before claiming progress.
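To make the failure mode concrete, here is a sketch with a hypothetical "model" that never raises an alarm. The fraud/ok labels and the 5-in-100 base rate are made up for illustration:

```r
library(yardstick)

lv <- c("fraud", "ok")  # "fraud" is the first level, so it is the positive class
truth       <- factor(rep(c("fraud", "ok"), times = c(5, 95)), levels = lv)
never_flags <- factor(rep("ok", 100), levels = lv)  # predicts "ok" for every row

# Bundle the three metrics into one reusable scoring function
report <- metric_set(sens, spec, bal_accuracy)
res <- report(data.frame(truth, never_flags), truth = truth, estimate = never_flags)
res
```

Specificity alone looks perfect here; the sensitivity and balanced-accuracy rows are what reveal that the model catches nothing.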
The second is forgetting to set event_level when the rare class is the second factor level. The order of levels() decides which class is positive, and a flipped order silently inverts the negative class. Confirm with levels(inbox$truth) before scoring.
The third pitfall is treating spec() as the inverse of recall on the same class. Specificity conditions on negative truth; recall on positive truth. They share a denominator only when you flip the positive class.
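The relationship can be checked directly. A short sketch on yardstick's bundled two_class_example data: specificity with the default positive class should equal sensitivity once the positive class is flipped.

```r
library(yardstick)

truth <- two_class_example$truth
pred  <- two_class_example$predicted

# Class1 positive: specificity is the share of real Class2 rows predicted Class2
spec_first <- spec_vec(truth, pred)

# Class2 positive: sensitivity is that same share of real Class2 rows
sens_second <- sens_vec(truth, pred, event_level = "second")

all.equal(spec_first, sens_second)
```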
Try it yourself
Try it: Use the built-in two_class_example data from yardstick. Compute the specificity score with the default positive class, then compute it again with the positive class flipped via event_level = "second". Save the second result to ex_spec_alt.
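One possible solution, sketched with the truth and predicted columns of two_class_example (ex_spec is a hypothetical name for the first result; ex_spec_alt is the name the exercise asks for):

```r
library(yardstick)

# Default: the first level (Class1) is the positive class
ex_spec <- spec(two_class_example, truth, predicted)

# Flipped: the second level (Class2) is the positive class
ex_spec_alt <- spec(two_class_example, truth, predicted, event_level = "second")

ex_spec
ex_spec_alt
```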
Explanation: event_level = "second" treats Class2 as positive, so specificity now scores the rejection rate of real Class1 cases. The default "first" would have scored real-Class2 rejection instead.
Related yardstick metrics
spec() is one entry in the yardstick class-metric family. Reach for these when specificity alone is not enough:
- sens() for the positive-class catch rate, the natural counterpart to spec
- bal_accuracy() averages sens and spec, robust to imbalance
- npv() for predicted-negative correctness
- roc_auc() for threshold-free rankings across the sens, spec trade-off
- conf_mat() to see the full confusion matrix
For the full set, see the yardstick reference index.
FAQ
Is spec() the same as specificity in yardstick?
Yes, spec() and specificity() are aliases that return identical numbers. Both compute true negatives over actual negatives, expose the same event_level and estimator arguments, and ship with vector cousins (spec_vec(), specificity_vec()). The short alias mirrors sens(); the long form reads better in clinical reports.
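A quick check of the alias on yardstick's bundled two_class_example data; only the estimates are compared here, since the .metric label may differ between the two calls:

```r
library(yardstick)

short <- spec(two_class_example, truth, predicted)
long  <- specificity(two_class_example, truth, predicted)

# Same number from both spellings
identical(short$.estimate, long$.estimate)

# The vector versions agree as well
all.equal(
  spec_vec(two_class_example$truth, two_class_example$predicted),
  specificity_vec(two_class_example$truth, two_class_example$predicted)
)
```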
Why does spec() return a tibble instead of a number?
The tidy return shape is the defining yardstick convention. Every metric returns the same three columns: .metric, .estimator, and .estimate. That uniformity lets you bind_rows() multiple metric calls or chain group_by() |> metric() without reshape code. For the scalar, call spec_vec() instead.
How is spec() different from precision?
precision() asks how many predicted positives were correct (TP / predicted positives). spec() asks how many actual negatives were correctly rejected (TN / actual negatives). They condition on different quantities: precision conditions on the prediction, specificity on the truth. Report both whenever false positives are costly.
Why does spec() return NaN on my multiclass data?
A NaN specificity means a denominator was zero: for some class there were no actual negatives (true negatives plus false positives equals zero), which happens when every row being scored belongs to that one class. It usually shows up in small groups or resample folds. Check the class counts in truth within each group, and stratify your resample so every fold contains more than one class.