yardstick f_meas() in R: Score the F-Measure of Classifiers
The yardstick f_meas() function in R computes the F-measure of a classifier, a single score that combines precision and recall through their harmonic mean. It accepts a tibble of truth and prediction columns, supports beta tuning for asymmetric costs, and returns a tidy one-row summary you can group, average across classes, or feed into a metric set.
    f_meas(df, truth, estimate)                               # basic F1 score
    f_meas(df, truth, estimate, beta = 2)                     # F2, weights recall
    f_meas(df, truth, estimate, beta = 0.5)                   # F0.5, weights precision
    f_meas(df, class, .pred_class)                            # default parsnip output
    df |> group_by(fold) |> f_meas(class, .pred_class)        # by resample
    f_meas(df, class, .pred_class, estimator = "macro")       # multiclass macro F1
    f_meas(df, class, .pred_class, event_level = "second")    # flip positive class
    f_meas_vec(truth_vec, pred_vec)                           # vector interface
Need explanation? Read on for examples and pitfalls.
What f_meas() measures
f_meas() answers one question: how well does the model balance catching positives and avoiding false alarms? You pass a data frame with observed labels and predicted labels, and the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate is the harmonic mean of precision and recall, a number between 0 and 1.
The harmonic mean punishes extremes. A model with 1.0 precision and 0.01 recall scores about 0.02, not the 0.5 that a plain arithmetic mean would report. That property makes the F-measure useful whenever a high score requires the model to be good at both jobs at once.
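The arithmetic behind that claim can be checked in a few lines of base R. This is only an illustrative sketch; the variable names are made up for the example and nothing here calls yardstick.

```r
# Precision and recall from the lopsided model described in the text
precision <- 1.0
recall <- 0.01

# Arithmetic mean: hides how weak the recall is
arithmetic_mean <- (precision + recall) / 2

# Harmonic mean, the F1 definition: 2 * P * R / (P + R)
harmonic_mean <- 2 * precision * recall / (precision + recall)

round(arithmetic_mean, 3)  # 0.505
round(harmonic_mean, 3)    # 0.02
```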
F1 is the default and weights precision and recall equally. The beta argument tunes that weighting: values above 1 favor recall, values below 1 favor precision. Pick beta by which kind of error costs more.
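To see how beta shifts the balance, the standard F-beta formula can be written as a small helper. The function f_beta() below is a hypothetical illustration, not part of yardstick, and the inputs are the rounded precision and recall this article reports for two_class_example.

```r
# Hypothetical helper (not a yardstick function): F-beta from precision and recall
f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

# Rounded precision and recall reported in this article for two_class_example
p <- 0.819
r <- 0.880

f1     <- f_beta(p, r)              # equal weighting
f2     <- f_beta(p, r, beta = 2)    # pulled toward recall
f_half <- f_beta(p, r, beta = 0.5)  # pulled toward precision

c(F0.5 = f_half, F1 = f1, F2 = f2)
```

Because recall is the larger of the two inputs here, the score rises as beta grows and falls as beta shrinks.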
f_meas() syntax and arguments
The signature matches the rest of the yardstick class-metric family. The same call works for logistic regression, random forest, xgboost, or any classifier whose predictions land in a tibble.
| Argument | Description |
|---|---|
| data | Data frame with truth and estimate columns. |
| truth | Unquoted column of observed class labels (factor). |
| estimate | Unquoted column of predicted class labels (factor, same levels). |
| beta | Recall weight relative to precision. 1 = F1; >1 favors recall; <1 favors precision. |
| estimator | Multiclass averaging: "binary", "macro", "macro_weighted", "micro". Defaults to "binary" for two-class data, "macro" for multiclass. |
| na_rm | If TRUE, drop rows where either column is missing. |
| event_level | "first" or "second"; picks which factor level is positive. |
Both columns must be factors with identical levels. The most common mistake is passing class probabilities to estimate; f_meas() needs class labels, so feed .pred_class from parsnip::augment().
Score classifiers: five worked examples
The examples use yardstick's built-in two_class_example data, then a multiclass tibble. Load yardstick and inspect the bundled predictions first.
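A minimal way to load and inspect the data might look like this. It assumes yardstick is installed; two_class_example ships with the package.

```r
library(yardstick)

# two_class_example holds observed labels (truth), class probabilities
# (Class1, Class2), and hard class predictions (predicted)
data(two_class_example)

head(two_class_example)

# The first level is treated as the positive class by default
levels(two_class_example$truth)
```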
Example 1 calls f_meas() with positional arguments. "Class1" is the first factor level, so yardstick treats it as the positive class by default.
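A sketch of that call, assuming the truth and predicted columns of two_class_example (ex1 is just an illustrative name):

```r
library(yardstick)
data(two_class_example)

# truth and estimate are passed by position, right after the data frame
ex1 <- f_meas(two_class_example, truth, predicted)
ex1
```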
An F1 of 0.849 sits between this classifier's precision (0.819) and recall (0.880), as the harmonic mean always will.
Example 2 tunes the recall-precision balance with beta. Switching from F1 to F2 raises the weight on recall.
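One way to run the comparison, again on two_class_example (the object names are illustrative):

```r
library(yardstick)
data(two_class_example)

# Same data, three different weightings of recall versus precision
ex_f1   <- f_meas(two_class_example, truth, predicted)              # beta = 1
ex_f2   <- f_meas(two_class_example, truth, predicted, beta = 2)    # favors recall
ex_half <- f_meas(two_class_example, truth, predicted, beta = 0.5)  # favors precision

c(F0.5 = ex_half$.estimate, F1 = ex_f1$.estimate, F2 = ex_f2$.estimate)
```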
A higher beta pulls the score toward recall (0.880). Use F2 for cancer screening, fraud, or any setting where missing a positive is costly.
Example 3 flips the positive class with event_level. Setting event_level = "second" makes "Class2" the positive class.
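A sketch of the flipped call next to the default, so the two estimates can be compared side by side (object names are illustrative):

```r
library(yardstick)
data(two_class_example)

# Default: the first factor level ("Class1") is the positive class
ex_first <- f_meas(two_class_example, truth, predicted)

# Flipped: the second factor level ("Class2") becomes the positive class
ex_second <- f_meas(two_class_example, truth, predicted, event_level = "second")

c(first = ex_first$.estimate, second = ex_second$.estimate)
```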
The score barely moves because the classes are roughly balanced; on imbalanced data the swing is dramatic.
Example 4 groups scoring by resample fold. Cross-validated predictions in one tibble pair with group_by() to give one F1 per group.
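The sketch below illustrates the grouped pattern. Note that two_class_example has no fold column, so the fold variable here is a made-up assignment added purely for illustration; with real resampling output you would group by the fold identifier your workflow produced. It also assumes dplyr is installed.

```r
library(yardstick)
library(dplyr)
data(two_class_example)

# Hypothetical fold labels (1 to 5), assigned round-robin for illustration only
scored <- two_class_example
scored$fold <- rep(1:5, length.out = nrow(scored))

# yardstick respects dplyr groups: one row of output per fold
by_fold <- scored |>
  group_by(fold) |>
  f_meas(truth, predicted)

by_fold
```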
Example 5 uses the vector interface. f_meas_vec() accepts two factors and returns a numeric scalar instead of a tibble.
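A short sketch of the vector interface, passing the two factor columns directly (f1_value is an illustrative name):

```r
library(yardstick)
data(two_class_example)

# Two factor vectors in, one plain number out
f1_value <- f_meas_vec(
  truth = two_class_example$truth,
  estimate = two_class_example$predicted
)

f1_value
is.numeric(f1_value)
```

The vector form is handy inside custom functions or loops where a tibble would only get in the way.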
metric_set(precision, recall, f_meas) builds a reusable function you pipe predictions into to get all three scores at once, so you never have to read F1 in isolation.
Multiclass averaging: macro, weighted, and micro
f_meas() handles multiclass data, but the answer depends on the averaging rule.
hpc_cv has four imbalanced classes (VF, F, M, L). Score it under each averaging mode.
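A sketch of scoring hpc_cv under the three multiclass estimators. In this dataset obs holds the observed class and pred the predicted class; the object names are illustrative.

```r
library(yardstick)
data(hpc_cv)

# Same predictions, three averaging rules
f_macro    <- f_meas(hpc_cv, obs, pred, estimator = "macro")
f_weighted <- f_meas(hpc_cv, obs, pred, estimator = "macro_weighted")
f_micro    <- f_meas(hpc_cv, obs, pred, estimator = "micro")

# Stack the one-row results for a side-by-side view
rbind(f_macro, f_weighted, f_micro)
```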
| Estimator | What it does | When to use |
|---|---|---|
| macro | Unweighted mean of per-class F1 | Every class equally important regardless of size |
| macro_weighted | Mean weighted by class prevalence | Reflect the real population class mix |
| micro | Pool all predictions, compute one ratio | Track the global hit rate |
| binary | Score one positive class only | Two-class problems (default) |
The 0.18 gap between macro and macro_weighted is the imbalance signature: small classes (L, M) score lower than the large class (VF), so the unweighted mean drags the headline down. Choose macro when small classes matter to your stakeholders.
f_meas compared with related metrics
The F-measure is one slot in the classification scorecard. Reach for each yardstick metric where it fits.
| Metric | Best use case | Limitation |
|---|---|---|
| f_meas() | One imbalance-aware score for precision + recall | Hides the trade-off |
| precision() | Costly false positives (spam, fraud flags) | Ignores missed positives |
| recall() | Costly false negatives (cancer screening) | Ignores wasted positives |
| accuracy() | Balanced classes, equal error costs | Misleading on imbalanced data |
| bal_accuracy() | Mean per-class recall, robust to imbalance | Ignores precision |
| pr_auc() | Threshold-free ranking under heavy imbalance | Needs probabilities, not class labels |
A practical workflow reports precision, recall, and f_meas() together. Add pr_auc() whenever the threshold is still tunable; F-measure assumes it is fixed.
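One way to report the three scores together is a metric set. The sketch below assumes two_class_example; class_metrics is just an illustrative name for the function that metric_set() returns.

```r
library(yardstick)
data(two_class_example)

# Bundle the three class metrics into one reusable function
class_metrics <- metric_set(precision, recall, f_meas)

# One call, three rows: precision, recall, and F1
scores <- class_metrics(two_class_example, truth = truth, estimate = predicted)
scores
```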
yardstick::f_meas() equals sklearn.metrics.f1_score() at beta = 1 and fbeta_score() for other betas. The estimator argument maps to scikit-learn's average: "macro", "micro", "macro_weighted" line up with "macro", "micro", "weighted".
Common pitfalls
Three small mistakes account for most f_meas() errors.
The first is feeding probabilities instead of class labels. A probability cast to a factor produces a meaningless score. Always pull .pred_class from augment(), not .pred_<level>.
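If all you have is a probability column, convert it to class labels explicitly before scoring. The sketch below uses the Class1 probability column in two_class_example; the 0.5 cutoff is an assumption chosen for illustration, not something f_meas() applies for you.

```r
library(yardstick)
data(two_class_example)

# Assumed cutoff of 0.5 on the probability of "Class1"
cutoff <- 0.5

# Build a factor of hard labels with the same levels, in the same order, as truth
hard_class <- factor(
  ifelse(two_class_example$Class1 >= cutoff, "Class1", "Class2"),
  levels = levels(two_class_example$truth)
)

# Score the hard labels, never the raw probabilities
f_meas_vec(two_class_example$truth, hard_class)
```

Setting levels explicitly matters: if the factor levels differ from those of truth, yardstick will raise an error rather than guess.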
The second pitfall is forgetting that beta is asymmetric. beta = 2 does not double the recall component; it raises the weight on recall by a factor of beta^2 = 4. Read the F-beta definition before tuning.
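The third pitfall is reporting macro F1 without checking for empty classes. If the model never predicts a class, that class's precision is undefined. Depending on your yardstick version, the macro average may then come back as NA or NaN, or the undefined class may be dropped from the average with a warning, which changes what the number means. Switch to estimator = "macro_weighted" so empty classes drop by weight, or inspect conf_mat() to see what is missing.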
Try it yourself
Try it: Use the built-in two_class_example data from yardstick. Compute F1 (the default), then compute F2 to put more weight on recall. Save the F2 result to ex_f2.
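One possible solution, sketched with the truth and predicted columns of two_class_example (ex_f1 is an extra illustrative name; ex_f2 is the object the exercise asks for):

```r
library(yardstick)
data(two_class_example)

# F1: the default, beta = 1
ex_f1 <- f_meas(two_class_example, truth, predicted)

# F2: beta = 2 puts more weight on recall
ex_f2 <- f_meas(two_class_example, truth, predicted, beta = 2)

ex_f1
ex_f2
```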
Explanation: beta = 2 weights recall four times as heavily as precision (beta^2 sets the ratio). Because this classifier's recall (0.880) is higher than its precision (0.819), F2 rises above F1.
Related yardstick metrics
f_meas() is one entry in the yardstick class-metric family. Reach for these neighbors when F-measure alone is not enough:
- precision() and recall() for the two components
- accuracy() and bal_accuracy() for overall hit rate
- kap() and mcc() for chance-corrected agreement
- pr_auc() and roc_auc() for threshold-free probability rankings
- conf_mat() to see the full confusion matrix
See the yardstick reference index for the full set.
FAQ
What is the difference between f_meas() and the F1 score?
f_meas() is the general F-beta function; F1 is the specific case where beta = 1. yardstick uses the more general name because the same function computes F1, F2, F0.5, and any other beta. When people say "F1" they mean the default call; when they say "F-measure" they may mean any beta. Set beta explicitly so reviewers know the weighting.
Does f_meas() handle multiclass classification?
Yes, with one extra decision: how to average per-class scores. yardstick defaults to estimator = "macro" for multiclass data, which computes F1 for each class against the rest and takes an unweighted mean. Pass estimator = "macro_weighted" to weight by class prevalence, or "micro" to pool predictions globally.
When should I use F2 instead of F1?
Use F2 when missing a positive costs more than a false alarm: medical screening, fraud detection, equipment failure prediction. Use F0.5 in the opposite case, where false positives are the expensive error.
Why does f_meas return NaN on my multiclass data?
A NaN F-measure means at least one class has precision or recall undefined: the model never predicted it, or the truth never contained it. Switch to estimator = "macro_weighted" so empty classes drop by weight, or inspect conf_mat() to confirm which class is missing.