yardstick mcc() in R: Matthews Correlation Coefficient
The yardstick mcc() function in R computes the Matthews correlation coefficient between observed and predicted class labels, returning a single number between -1 and 1 that stays meaningful even when the classes are heavily imbalanced.
```r
mcc(df, truth, estimate)                          # basic two-class call
mcc(df, truth = obs, estimate = pred)             # named arguments
mcc(df, class, .pred_class)                       # default parsnip output
df |> group_by(fold) |> mcc(class, .pred_class)   # by resample
mcc(df, class, .pred_class, na_rm = TRUE)         # drop missing rows
mcc_vec(truth_vec, pred_vec)                      # vector interface
mcc(df, class, .pred_class, case_weights = w)     # weighted scoring
```
Need explanation? Read on for examples and pitfalls.
What mcc() measures
mcc() is a correlation coefficient between predicted and observed classes. For two classes it equals the phi coefficient, computed from all four cells of the confusion matrix. The score is 1 for perfect agreement, 0 for chance, and -1 for perfectly inverted predictions.
The function takes a data frame with two factor columns and returns a one-row tibble with .metric, .estimator, and .estimate. Because every confusion-matrix cell enters the formula, the score collapses whenever the model leans on one class at the expense of another.
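A minimal sketch of that return shape, using a small made-up data frame. The object name toy, the column names obs and pred, and the values are illustrative assumptions, not data from this article.

```r
library(yardstick)

# Hypothetical toy data: both columns are factors with the same levels.
lv <- c("yes", "no")
toy <- data.frame(
  obs  = factor(c("yes", "yes", "no", "no", "no", "yes"), levels = lv),
  pred = factor(c("yes", "no", "no", "no", "yes", "yes"), levels = lv)
)

res <- mcc(toy, truth = obs, estimate = pred)
res          # a one-row tibble
names(res)   # ".metric" ".estimator" ".estimate"
```

With two factor levels, the .estimator column should read binary.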
Reach for MCC when you want a single headline number that resists class imbalance. It is a popular choice in bioinformatics and fraud-detection benchmarks for that reason.
mcc() syntax and arguments
The signature follows the same pattern as the other yardstick class metrics: a data frame, a truth column, and an estimate column. The same call shape works whether you score a logistic regression, a random forest, or a neural net.
| Argument | Description |
|---|---|
| data | A data frame with the truth and estimate columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| estimate | Unquoted column name of the predicted class labels (a factor with matching levels). |
| na_rm | If TRUE (the default), drop rows where either column is missing before scoring. |
| case_weights | Optional column of row weights, useful for survey or class-rebalanced data. |
| ... | Currently unused; reserved for future extensions. |
Both truth and estimate must be factors with identical levels in the same order. There is no estimator argument here, because yardstick chooses the binary or multiclass formula automatically from the number of factor levels.
Examples: MCC for binary and multiclass
The examples below score a binary problem first, then a multiclass one. Start with a deliberately imbalanced two-class data set where one label dominates.
Example 1 contrasts accuracy with MCC on the same imbalanced data. Accuracy looks healthy because the model agrees with the dominant class most of the time. MCC tells the real story.
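The sketch below is one way to set up such a comparison. The data are hypothetical: the counts (30 "yes" rows, 170 "no" rows, 3 true positives, 8 false positives) are assumptions chosen to produce an imbalanced problem, not the article's original data set.

```r
library(yardstick)

# Hypothetical imbalanced data: 30 "yes" rows followed by 170 "no" rows.
lv <- c("yes", "no")
df <- data.frame(
  truth = factor(rep(c("yes", "no"), times = c(30, 170)), levels = lv),
  # Among the 30 "yes" rows the model finds 3 and misses 27;
  # among the 170 "no" rows it wrongly flags 8 and correctly rejects 162.
  estimate = factor(
    c(rep(c("yes", "no"), times = c(3, 27)),
      rep(c("yes", "no"), times = c(8, 162))),
    levels = lv
  )
)

acc <- accuracy(df, truth, estimate)
m   <- mcc(df, truth, estimate)

acc$.estimate  # 165 correct out of 200 = 0.825
m$.estimate    # 270 / sqrt(11 * 30 * 170 * 189), about 0.083
```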
Accuracy is 0.825 but MCC sits barely above zero. The 82 percent figure is inflated by correct rejections of the majority class; MCC sees that the model adds almost nothing on top of always predicting "no".
Example 2 scores a multiclass setup. For three or more classes, yardstick uses the multiclass MCC due to Gorodkin (2004): the result is a scalar that generalizes the binary phi coefficient to a K by K confusion table.
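A minimal sketch with three classes follows. The data frame df3 and its labels a, b, c are made up for illustration; each class has three rows and the model gets two of each right.

```r
library(yardstick)

# Hypothetical three-class data; both factors share the same level order.
lv <- c("a", "b", "c")
df3 <- data.frame(
  truth    = factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c"), levels = lv),
  estimate = factor(c("a", "a", "b", "b", "b", "c", "c", "c", "a"), levels = lv)
)

res3 <- mcc(df3, truth, estimate)
res3
# Hand check with the K-class formula: 6 correct out of 9 rows, every
# row and column total equal to 3, so (6*9 - 27) / sqrt(54 * 54) = 0.5
```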
The .estimator column reports multiclass. There is no macro vs micro choice to make; yardstick computes the single scalar directly from the full confusion matrix.
Example 3 groups scoring by resample fold. With cross-validation predictions in one tibble, group_by() plus mcc returns one score per fold.
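A sketch of the grouped call, assuming a hypothetical tibble cv_preds with a fold column and parsnip-style class and .pred_class columns. The two folds and their labels are invented for illustration.

```r
library(yardstick)
library(dplyr)

lv <- c("yes", "no")

# Hypothetical cross-validation predictions: two folds of six rows each.
cv_preds <- tibble(
  fold = rep(c("Fold1", "Fold2"), each = 6),
  class = factor(
    c("yes", "yes", "no", "no", "no", "yes",
      "yes", "no", "no", "yes", "no", "no"),
    levels = lv
  ),
  .pred_class = factor(
    c("yes", "no", "no", "no", "yes", "yes",
      "yes", "no", "no", "yes", "no", "yes"),
    levels = lv
  )
)

# One MCC row per fold; the grouping column is kept in the result.
by_fold <- cv_preds |>
  group_by(fold) |>
  mcc(class, .pred_class)

by_fold
```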
Example 4 uses the vector interface for ad-hoc checks. mcc_vec() takes two factors directly and returns a numeric scalar instead of a tibble.
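A minimal sketch of the vector interface. The two factors truth_vec and pred_vec are hypothetical inputs created only for this example.

```r
library(yardstick)

lv <- c("yes", "no")
truth_vec <- factor(c("yes", "yes", "no", "no", "no", "yes"), levels = lv)
pred_vec  <- factor(c("yes", "no", "no", "no", "yes", "yes"), levels = lv)

# Returns a bare numeric value rather than a tibble.
score <- mcc_vec(truth_vec, pred_vec)
score
is.numeric(score)
```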
The vector form is the right choice inside map() calls, custom tuning code, or unit tests where a tibble would be overhead.
Define class_metrics <- metric_set(accuracy, mcc); then class_metrics(df, truth, estimate) returns a two-row tibble you can pivot_wider() or plot directly.
mcc() compared with related metrics
MCC sits in the same family as f_meas() and bal_accuracy(), but summarizes the confusion matrix differently. The table below summarizes when to swap one for another.
| Metric | Best use case | Limitation |
|---|---|---|
| mcc() | Single robust score on imbalanced or balanced data | Harder to explain than accuracy or F1 |
| accuracy() | Balanced classes, equal error costs | Misleading on skewed data |
| bal_accuracy() | Imbalanced classes, equal cost per class | Hides class-specific failure modes |
| f_meas() | Precision-recall blend, harmonic mean | Equal precision and recall weighting is a choice |
| j_index() | Two-class screening, Youden's J | Binary only |
| kap() | Chance-corrected agreement | Sensitive to marginal frequencies |
A practical rule: report MCC as the headline whenever any class is below 30 percent of the data, and add roc_auc() if you have probabilities.
Common pitfalls
Three small mistakes account for most mcc() errors. Each one has a clear fix.
The first is passing character vectors instead of factors. yardstick needs factors to confirm that both columns share the same levels in the same order. The fix is one factor() call:
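A sketch of that conversion, assuming a hypothetical data frame raw whose obs and pred columns arrived as character vectors. Using one shared levels vector for both columns keeps the level order aligned.

```r
library(yardstick)

# Hypothetical predictions stored as plain character columns.
raw <- data.frame(
  obs  = c("yes", "no", "no", "yes", "no"),
  pred = c("yes", "no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# Convert both columns with the same level set, in the same order.
lv <- c("yes", "no")
fixed <- transform(
  raw,
  obs  = factor(obs, levels = lv),
  pred = factor(pred, levels = lv)
)

res_fixed <- mcc(fixed, obs, pred)
res_fixed
```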
The second pitfall is reading MCC on the same scale as accuracy. MCC ranges from -1 to 1, not 0 to 1. As a rough guide, a score of 0.3 is solid and 0.7 is excellent, though what counts as good depends on the problem. Label the scale on plots so stakeholders used to accuracy do not misread it.
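The third pitfall is a degenerate confusion matrix. When the model predicts only one class, one factor in the denominator goes to zero and the coefficient is mathematically undefined. Implementations handle this differently (some report 0 by convention, others report NA), so check what your installed yardstick version returns. Inspect conf_mat() before trusting a near-zero or missing score.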
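The sketch below shows the zero denominator on hypothetical data where the model never predicts "yes". It does not assert what mcc_vec() reports for this case; the last line is there so you can inspect the value your installed version gives.

```r
library(yardstick)

lv <- c("yes", "no")
d <- data.frame(
  truth = factor(c("yes", "no", "no", "no", "yes", "no"), levels = lv),
  pred  = factor(rep("no", 6), levels = lv)  # the model never predicts "yes"
)

# Confusion matrix: rows are predictions, columns are truth.
cm <- conf_mat(d, truth, pred)
cm

tab <- cm$table
tp <- tab["yes", "yes"]; fp <- tab["yes", "no"]
fn <- tab["no", "yes"];  tn <- tab["no", "no"]

# The "predicted yes" total (tp + fp) is 0, so the whole product is 0.
denom <- (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
denom

# Inspect what yardstick reports here rather than assuming a value.
mcc_vec(d$truth, d$pred)
```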
Print conf_mat() next to the score so you can tell the two cases apart: a model that guesses at random versus one that predicts a single class.
Try it yourself
Try it: Use the built-in hpc_cv data from yardstick, which contains predictions across ten resamples for a multiclass problem. Compute MCC per resample, then sort the result. Save the per-resample tibble to ex_mcc_by_fold.
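One possible solution is sketched below. It assumes the hpc_cv columns are named obs (truth), pred (predicted class), and Resample (fold identifier), as in the data set shipped with yardstick.

```r
library(yardstick)
library(dplyr)

data(hpc_cv)

# One multiclass MCC per resample, sorted from highest to lowest.
ex_mcc_by_fold <- hpc_cv |>
  group_by(Resample) |>
  mcc(obs, pred) |>
  arrange(desc(.estimate))

ex_mcc_by_fold
```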
Explanation: Grouping by Resample before calling mcc gives one multiclass MCC per fold. Sorting by .estimate highlights the strongest fold, which you would then inspect with conf_mat() for class-specific patterns.
Related yardstick metrics
mcc() is one entry in the yardstick class-metric family. Reach for these neighbors when MCC alone is not enough:
- accuracy() for the unbalanced baseline you compare against
- bal_accuracy() for the macro-recall view of the same problem
- f_meas() for a precision-recall blend
- kap() for chance-corrected agreement
- j_index() for the binary Youden's J
- roc_auc() and pr_auc() for probability-based ranking metrics
For the full set, see the yardstick reference index.
FAQ
How is MCC calculated for two classes?
For binary classification, MCC reads the four cells (TP, TN, FP, FN) and returns TP times TN minus FP times FN, divided by the square root of the product of the four marginal totals: (TP + FP), (TP + FN), (TN + FP), and (TN + FN). The result is the phi coefficient, identical to a Pearson correlation between two binary vectors.
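The following sketch checks that formula by hand on hypothetical data and compares it with both a Pearson correlation of the 0/1 encodings and mcc_vec(). The vectors truth and pred are made up for illustration.

```r
library(yardstick)

lv <- c("yes", "no")
truth <- factor(c("yes", "yes", "no", "no", "no", "yes", "no", "no"), levels = lv)
pred  <- factor(c("yes", "no", "no", "no", "yes", "yes", "no", "no"), levels = lv)

# The four confusion-matrix cells, counted directly.
tp <- sum(truth == "yes" & pred == "yes")
tn <- sum(truth == "no"  & pred == "no")
fp <- sum(truth == "no"  & pred == "yes")
fn <- sum(truth == "yes" & pred == "no")

manual <- (tp * tn - fp * fn) /
  sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Pearson correlation between the two 0/1 indicator vectors (phi).
pearson <- cor(as.integer(truth == "yes"), as.integer(pred == "yes"))

manual                                   # 7 / 15
all.equal(manual, pearson)               # TRUE
all.equal(manual, mcc_vec(truth, pred))  # TRUE
```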
When should I prefer mcc() over accuracy()?
Use MCC whenever any class is rare or the costs of false positives and false negatives are not symmetric. Accuracy can sit at 95 percent while a model ignores the minority class entirely; MCC drops to near zero in that case and exposes the failure with a single number.
Does mcc() work for multiclass classification?
Yes. yardstick switches .estimator from binary to multiclass based on the number of factor levels. The multiclass formula is the Gorodkin generalization of the phi coefficient and returns a single scalar in [-1, 1] from the full K by K confusion matrix.
Can I compute MCC from a parsnip workflow?
Pipe augment() output into mcc. With augment(fit, new_data = test) the result already contains the truth column and .pred_class, so the call becomes augment(fit, test) |> mcc(class, .pred_class). Replace class with the actual outcome column name.
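A hedged end-to-end sketch of that pipeline. It assumes the parsnip and modeldata packages are installed, uses the two_class_dat example data (predictors A and B, outcome Class), and splits rows by alternating index purely for illustration; the object names glm_fit, train, and test are hypothetical.

```r
library(parsnip)
library(yardstick)

# Example data from the modeldata package (an assumption of this sketch).
data(two_class_dat, package = "modeldata")

# Simple deterministic split, for illustration only.
idx   <- seq(1, nrow(two_class_dat), by = 2)
train <- two_class_dat[idx, ]
test  <- two_class_dat[-idx, ]

# Logistic regression with the default glm engine.
glm_fit <- logistic_reg() |>
  fit(Class ~ A + B, data = train)

# augment() adds .pred_class alongside the original columns.
scored <- augment(glm_fit, new_data = test)
res_wf <- mcc(scored, truth = Class, estimate = .pred_class)
res_wf
```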
What does an MCC of zero really mean?
Zero indicates no correlation between predicted and observed classes, which usually means the model predicts at random. A model that predicts a single class for every row is a separate, degenerate case: the formula's denominator is zero, and depending on the implementation and version the result may be reported as 0 or as NA, even if that class dominates truth. Inspect conf_mat() to distinguish "random" from "degenerate" before reporting.
Summary
mcc() is the right yardstick metric the moment class balance becomes a worry. Read it next to accuracy() so the gap quantifies how much imbalance is hiding, and reach for roc_auc() or f_meas() when you need probability rankings or a precision-recall blend. With group_by() it scales to any resampling scheme without reshaping.