yardstick kap() in R: Cohen's Kappa for Classification
The yardstick kap() function in R computes Cohen's Kappa between observed and predicted class labels, returning a chance-corrected score in [-1, 1] and acting as the default companion to accuracy() in tidymodels.
    kap(df, truth, estimate)                          # basic two-class call
    kap(df, truth = obs, estimate = pred)             # named arguments
    kap(df, class, .pred_class)                       # default parsnip output
    df |> group_by(fold) |> kap(class, .pred_class)   # by resample
    kap(df, class, .pred_class, na_rm = TRUE)         # drop missing rows
    kap_vec(truth_vec, pred_vec)                      # vector interface
    kap(df, class, .pred_class, case_weights = w)     # weighted scoring
Need explanation? Read on for examples and pitfalls.
What kap() measures
kap() reports agreement between predictions and truth after subtracting the agreement expected by chance. The formula is (po - pe) / (1 - pe), where po is the observed share of matching labels and pe is the share expected under independence given the marginal class frequencies.
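The formula can be worked through by hand in base R. The sketch below uses a made-up 2 by 2 confusion matrix (the counts and the object name cm are assumptions for illustration only) and computes po, pe, and Kappa step by step.

```r
# Hypothetical confusion matrix: rows are truth, columns are predictions.
cm <- matrix(
  c(40, 10,
     5, 45),
  nrow = 2, byrow = TRUE,
  dimnames = list(truth = c("yes", "no"), estimate = c("yes", "no"))
)

n  <- sum(cm)
po <- sum(diag(cm)) / n                       # observed share of matching labels
pe <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected under independence
kappa <- (po - pe) / (1 - pe)

c(po = po, pe = pe, kappa = kappa)
```

With these counts po is 0.85 and pe is 0.50, so Kappa comes out at 0.70: the model recovers 70 percent of the agreement that chance alone would not deliver.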
The function takes a data frame with two factor columns and returns a one-row tibble (.metric, .estimator, .estimate). A Kappa of 1 means perfect agreement; 0 means chance; negative values mean worse than guessing the marginals.
Reach for kap when you want one chance-corrected number that works for two classes or many. yardstick::metrics() reports it alongside accuracy() by default for class predictions, so many tidymodels users see it without naming it. Note that tune::tune_grid() does not include it in its default metric set; pass a metric_set() that contains kap if you want it there.
Two models with the same accuracy can land at very different Kappa values depending on how aligned their predictions are with the class prior, because that alignment changes pe.
kap() syntax and arguments
The signature mirrors every other yardstick class metric. The same call shape works whether you score a logistic regression, a random forest, or an xgboost classifier.
| Argument | Description |
|---|---|
| data | A data frame with the truth and estimate columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| estimate | Unquoted column name of the predicted class labels (a factor with matching levels). |
| weighting | One of "none" (the default), "linear", or "quadratic"; the latter two penalise predictions that land further from the true level. |
| na_rm | If TRUE (the default), drop rows where either column is missing before scoring. |
| case_weights | Optional column of row weights, useful for survey or class-rebalanced data. |
| ... | Currently unused; reserved for future extensions. |
Both truth and estimate must be factors with identical levels in the same order. yardstick picks the binary or multiclass formula from the level count, so there is no estimator argument to set.
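As a rough illustration of these arguments, the sketch below scores a hypothetical four-row frequency table and uses case_weights to say how many observations each row stands for. The data frame counts and its column w are invented for this example.

```r
library(yardstick)

lvls <- c("neg", "pos")

# Hypothetical frequency table: one row per (truth, estimate) combination,
# with w holding the number of observations in that cell.
counts <- data.frame(
  truth    = factor(c("neg", "neg", "pos", "pos"), levels = lvls),
  estimate = factor(c("neg", "pos", "neg", "pos"), levels = lvls),
  w        = c(50, 10, 5, 35)
)

# Both columns are factors with identical levels in the same order.
stopifnot(identical(levels(counts$truth), levels(counts$estimate)))

weighted_kap <- kap(counts, truth, estimate, case_weights = w)
weighted_kap
```

Scoring the compact table with weights should give the same Kappa as expanding it to 100 individual rows, which is convenient when predictions arrive pre-aggregated.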
Examples: Kappa for binary and multiclass
The examples below score a binary problem first, then a multiclass one. Start with a deliberately imbalanced two-class data set where one label dominates.
Example 1 contrasts accuracy with Kappa on the same imbalanced data. Accuracy looks healthy because the model agrees with the dominant class most of the time. Kappa removes that free agreement.
Accuracy is 0.825 but Kappa is barely above zero. The 82 percent figure is inflated by correct rejections of the majority class; Kappa cancels that free agreement and exposes how little the predictor adds.
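The original listing for this example is not reproduced here. As a stand-in, the sketch below builds a hypothetical 200-row data set (90 percent "neg") whose counts were picked so that accuracy equals the 0.825 quoted above; the Kappa it produces, about 0.05, comes from these invented counts and need not match the author's data exactly.

```r
library(yardstick)

lvls <- c("neg", "pos")

# Hypothetical imbalanced predictions: 162 true negatives, 18 false positives,
# 17 false negatives, 3 true positives.
imbalanced <- data.frame(
  truth    = factor(rep(c("neg", "neg", "pos", "pos"), times = c(162, 18, 17, 3)),
                    levels = lvls),
  estimate = factor(rep(c("neg", "pos", "neg", "pos"), times = c(162, 18, 17, 3)),
                    levels = lvls)
)

acc_res <- accuracy(imbalanced, truth, estimate)
kap_res <- kap(imbalanced, truth, estimate)

rbind(acc_res, kap_res)
```

Here pe is already 0.816 from the marginals alone, so an accuracy of 0.825 leaves almost nothing for the model to claim.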
Example 2 scores a multiclass setup. For three or more classes, yardstick computes Kappa from the full K by K confusion matrix.
The .estimator column reports multiclass. There is no macro vs micro switch; yardstick reduces the K by K matrix to one scalar via the Cohen formula generalised to K classes.
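A minimal multiclass sketch, assuming a hypothetical three-class problem with 100 rows (the class names and counts are invented for illustration):

```r
library(yardstick)

lvls <- c("a", "b", "c")

# Hypothetical 3-class predictions, built row block by row block:
# truth a (50 rows), truth b (30 rows), truth c (20 rows).
multi <- data.frame(
  truth = factor(rep(lvls, times = c(50, 30, 20)), levels = lvls),
  estimate = factor(
    c(rep(lvls, times = c(40, 6, 4)),    # predictions when truth is "a"
      rep(lvls, times = c(5, 20, 5)),    # predictions when truth is "b"
      rep(lvls, times = c(5, 4, 11))),   # predictions when truth is "c"
    levels = lvls
  )
)

multi_kap <- kap(multi, truth, estimate)
multi_kap
```

With these counts po is 0.71 and pe is 0.38, giving a Kappa of roughly 0.53 from the same formula used in the binary case.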
Example 3 groups scoring by resample fold. With CV predictions in one tibble, group_by() plus kap returns one score per fold.
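A sketch of the grouped call, assuming a hypothetical tibble cv_preds with a fold column and simulated predictions in which roughly a quarter of the labels are flipped:

```r
library(yardstick)
library(dplyr)

set.seed(123)
lvls <- c("no", "yes")

# Simulated cross-validation predictions: 5 folds of 40 rows each.
cv_preds <- tibble(
  fold  = rep(paste0("Fold", 1:5), each = 40),
  class = factor(sample(lvls, 200, replace = TRUE), levels = lvls),
  flip  = runif(200) < 0.25
) |>
  mutate(
    # Where flip is TRUE, predict the opposite label; otherwise copy the truth.
    .pred_class = factor(
      ifelse(flip, rev(lvls)[as.integer(class)], as.character(class)),
      levels = lvls
    )
  )

by_fold <- cv_preds |>
  group_by(fold) |>
  kap(class, .pred_class)

by_fold
```

The grouping column is carried through to the result, so the output has one row per fold with fold in front of the usual three metric columns.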
Example 4 uses the vector interface for ad-hoc checks. kap_vec() takes two factors directly and returns a numeric scalar instead of a tibble.
Use the vector form inside map() calls, custom tuning code, or unit tests where a tibble is overhead.
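A small sketch of the vector interface, using two invented six-element factors:

```r
library(yardstick)

lvls <- c("spam", "ham")

# Hypothetical labels; both factors share the same levels in the same order.
truth_vec <- factor(c("spam", "spam", "ham", "ham", "ham", "spam"), levels = lvls)
pred_vec  <- factor(c("spam", "ham", "ham", "ham", "spam", "spam"), levels = lvls)

k <- kap_vec(truth_vec, pred_vec)
k
```

Four of six labels match (po = 2/3) and pe is 0.5, so k is a bare numeric of about 0.33 that can go straight into an expect_equal() or a comparison.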
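class_metrics <- metric_set(accuracy, kap) builds the same pair that yardstick::metrics() reports for class predictions, so class_metrics(df, truth, estimate) returns a two-row tibble you can pivot_wider() to compare runs. Pass the same object to the metrics argument of tune::tune_grid() to score tuning runs with it.
kap() compared with related metrics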
Kappa sits in the chance-corrected family with mcc(), but they answer different questions. The table below summarises when to swap one for another.
| Metric | Best use case | Limitation |
|---|---|---|
| kap() | Chance-corrected agreement, reported by metrics() by default | Sensitive to marginal class frequencies |
| accuracy() | Balanced classes, equal error costs | Misleading on skewed data |
| mcc() | Single robust score on heavily imbalanced data | Harder to explain than Kappa |
| bal_accuracy() | Imbalanced classes, equal cost per class | Hides class-specific failure modes |
| f_meas() | Precision-recall blend, harmonic mean | Equal precision and recall weighting is a choice |
| j_index() | Two-class screening, Youden's J | Natively binary; multiclass only through averaging |
A practical rule: report Kappa next to accuracy whenever any class drops below 30 percent, and add mcc() if marginals are unstable across resamples.
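For ordinal classes, consider a weighted Kappa via kap()'s weighting argument, vcd::Kappa(), or irr::kappa2(). With the default weighting = "none", the yardstick variant assumes all misclassifications are equally bad.
Common pitfalls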
Three small mistakes account for most kap() errors. Each one has a clear fix.
The first is passing character vectors instead of factors. yardstick needs factors to confirm both columns share the same levels in the same order. The fix is one factor() call:
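A minimal sketch of that fix, assuming a hypothetical data frame raw whose label columns arrived as character vectors:

```r
library(yardstick)

# Hypothetical raw predictions stored as character columns.
raw <- data.frame(
  truth    = c("yes", "no", "yes", "no", "yes"),
  estimate = c("yes", "no", "no",  "no", "yes"),
  stringsAsFactors = FALSE
)

# Declare the levels once and reuse them so both columns agree in order.
lvls <- c("yes", "no")
fixed <- transform(
  raw,
  truth    = factor(truth, levels = lvls),
  estimate = factor(estimate, levels = lvls)
)

fixed_kap <- kap(fixed, truth, estimate)
fixed_kap
```

Setting levels explicitly, rather than letting factor() infer them, also protects against a class that happens to be absent from the predictions.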
The second pitfall is reading Kappa on the accuracy scale. Landis and Koch labelled values above 0.61 as substantial and above 0.81 as almost perfect; a Kappa of 0.4 is fair, not failing. Annotate plots with this scale so accuracy-trained stakeholders do not misread it.
The third pitfall is comparing Kappa across data sets with very different class balance. The chance baseline pe depends on marginal frequencies, so a Kappa of 0.5 on a balanced split is not the same evidence as 0.5 on a 95/5 split. Always report class proportions alongside.
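The effect of class balance can be sketched in base R. The helper kappa_from_rates() below is hypothetical (not part of yardstick); it derives Kappa for a binary classifier from its prevalence, sensitivity, and specificity, so the same classifier can be scored under two different class splits.

```r
# Hypothetical helper: Kappa implied by prevalence, sensitivity and specificity.
kappa_from_rates <- function(prevalence, sens, spec) {
  po       <- prevalence * sens + (1 - prevalence) * spec        # observed agreement
  pred_pos <- prevalence * sens + (1 - prevalence) * (1 - spec)  # share predicted positive
  pe       <- prevalence * pred_pos + (1 - prevalence) * (1 - pred_pos)
  (po - pe) / (1 - pe)
}

# Same classifier quality (80% sensitivity, 80% specificity), different balance.
k_balanced <- kappa_from_rates(prevalence = 0.50, sens = 0.8, spec = 0.8)
k_skewed   <- kappa_from_rates(prevalence = 0.05, sens = 0.8, spec = 0.8)

c(balanced = k_balanced, skewed = k_skewed)
```

Accuracy is 0.80 in both cases, yet Kappa is 0.60 on the balanced split and about 0.22 on the 95/5 split, which is why the class proportions belong next to the score.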
Use conf_mat() to tell "random" apart from "single-class collapse" when Kappa is near zero.
Try it yourself
Try it: Use the built-in hpc_cv data from yardstick, which contains predictions across ten resamples for a multiclass problem. Compute Kappa per resample, then sort the result. Save the per-resample tibble to ex_kap_by_fold.
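One possible solution, sketched with dplyr; it assumes the hpc_cv columns obs (truth), pred (predicted class), and Resample (fold identifier):

```r
library(yardstick)
library(dplyr)

# hpc_cv ships with yardstick: obs is the truth, pred the predicted class,
# and Resample identifies the cross-validation fold.
ex_kap_by_fold <- hpc_cv |>
  group_by(Resample) |>
  kap(obs, pred) |>
  arrange(desc(.estimate))

ex_kap_by_fold
```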
Explanation: Grouping by Resample before calling kap returns one multiclass Kappa per fold. Sorting by .estimate surfaces the strongest fold, which you would then inspect with conf_mat() for class-specific patterns.
Related yardstick metrics
kap() is one entry in the yardstick class-metric family. Reach for these neighbors when Kappa alone is not enough:
- accuracy() for the raw agreement baseline you compare against
- mcc() for a chi-square based score that holds up under extreme skew
- bal_accuracy() for the macro-recall view of the same problem
- f_meas() for a precision-recall blend
- j_index() for the binary Youden's J
- roc_auc() and pr_auc() for probability-based ranking metrics
For the full set, see the yardstick reference index.
FAQ
How is Cohen's Kappa calculated for two classes?
For binary classification, Kappa reads the confusion matrix to compute observed agreement po (the share of rows where prediction matches truth) and expected agreement pe (the agreement you would see if truth and prediction were drawn independently from their marginal frequencies). The score is (po - pe) / (1 - pe): the numerator measures chance-corrected gain, the denominator scales it to a max of 1.
When should I prefer kap() over accuracy()?
Use Kappa whenever any class is rare or the model could clear a high accuracy bar by predicting the majority class. Accuracy can sit at 95 percent while Kappa lands near zero, exposing the imbalance instantly. yardstick's metrics() reports both by default for class predictions.
Does kap() support weighted Kappa?
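Yes, in recent releases. By default yardstick treats every misclassification equally, which is fine for nominal labels but loses information for ordinal ratings where misclassifying "high" as "low" is worse than as "medium". For those cases kap() accepts weighting = "linear" or weighting = "quadratic", with distance judged by the order of the factor levels. vcd::Kappa() and irr::kappa2(..., weight = "squared") remain alternatives.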
Can I compute Kappa from a parsnip workflow?
Yes. Pipe augment() output into kap. With augment(fit, new_data = test) the result already contains the truth column and .pred_class, so the call becomes augment(fit, test) |> kap(class, .pred_class). Replace class with the actual outcome column name in your data.
What does a negative Kappa mean?
A negative Kappa means the model agrees with truth less often than chance would predict from the marginals. That is rare and usually signals a flipped factor level, a wrong target column, or a model that memorised the inverse pattern. Inspect conf_mat() and recheck factor levels before blaming the predictor.
Summary
kap() is the chance-corrected partner to accuracy(), and yardstick::metrics() reports the two together for class predictions. Read it next to accuracy whenever class balance could inflate the raw agreement, and reach for mcc() when you need a score less sensitive to marginal frequencies. With group_by() it scales to any resampling scheme; metric_set(accuracy, kap) builds the same pair for use in tune_grid().
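The metric_set(accuracy, kap) pairing can be sketched as follows, using a hypothetical balanced data set of 100 rows:

```r
library(yardstick)

lvls <- c("yes", "no")

# Hypothetical predictions: 45 + 40 correct, 5 + 10 wrong.
scored <- data.frame(
  truth    = factor(rep(c("yes", "yes", "no", "no"), times = c(45, 5, 10, 40)),
                    levels = lvls),
  estimate = factor(rep(c("yes", "no", "yes", "no"), times = c(45, 5, 10, 40)),
                    levels = lvls)
)

# Bundle both metrics into one callable that returns a two-row tibble.
class_metrics <- metric_set(accuracy, kap)
both <- class_metrics(scored, truth = truth, estimate = estimate)
both
```

The result has one row for accuracy (0.85 here) and one for kap (0.70 here), in the order the metrics were given to metric_set().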