yardstick mn_log_loss() in R: Cross-Entropy Loss Metric
The yardstick mn_log_loss() function in R scores how well a classifier's predicted probabilities match the observed classes, returning a tidy one-row summary of the mean negative log-likelihood (cross-entropy) for binary or multiclass problems.
    mn_log_loss(df, truth, .pred_class1)                      # binary call
    mn_log_loss(df, truth = obs, .pred_class1)                # named truth column
    mn_log_loss(df, class, .pred_yes, event_level = "second") # flip positive class
    df |> group_by(fold) |> mn_log_loss(class, .pred_yes)     # per resample
    mn_log_loss(df, class, .pred_a, .pred_b, .pred_c)         # multiclass
    mn_log_loss(df, class, .pred_a:.pred_c, sum = TRUE)       # sum, not mean
    mn_log_loss_vec(truth_vec, prob_vec)                      # vector interface
Need explanation? Read on for examples and pitfalls.
What mn_log_loss() measures
mn_log_loss() answers how confident and correct a probabilistic classifier is. Pass a tibble with the truth column and one or more probability columns; the function returns a one-row tibble with .metric, .estimator, and .estimate. The estimate is the mean negative log-likelihood of the observed classes under the predicted probabilities.
A perfect model scores 0. A model that always predicts the marginal class prevalence scores the entropy of the truth distribution. Lower is better, and the metric is unbounded above: a confident wrong prediction drives the loss toward infinity.
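The three reference points above can be checked by hand. The sketch below computes binary log loss straight from its definition in base R. `log_loss_binary()` is a hypothetical helper written for illustration, not a yardstick function, and it applies no clipping.

```r
# Hypothetical helper: mean negative log-likelihood for a binary outcome.
# y is 1 for the event and 0 otherwise; p is the predicted event probability.
log_loss_binary <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

y <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)   # 30% events

# Near-perfect predictions: loss close to 0
near_perfect <- log_loss_binary(y, ifelse(y == 1, 0.999, 0.001))

# Always predicting the prevalence: loss equals the entropy of the truth
prevalence <- mean(y)
baseline <- log_loss_binary(y, rep(prevalence, length(y)))
entropy <- -(prevalence * log(prevalence) +
               (1 - prevalence) * log(1 - prevalence))

# One confident wrong prediction dominates the mean
p_wrong <- ifelse(y == 1, 0.9, 0.1)
p_wrong[1] <- 1e-10                    # true event given almost zero probability
confident_wrong <- log_loss_binary(y, p_wrong)

print(c(near_perfect = near_perfect, baseline = baseline,
        entropy = entropy, confident_wrong = confident_wrong))
```

In this toy data a single badly wrong row lifts the mean well above the no-information baseline, even though the other nine predictions are good.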
mn_log_loss() syntax and arguments
The signature follows the yardstick probability-metric family. Argument shape changes between binary and multiclass.
| Argument | Description |
|---|---|
| data | A data frame with truth and probability columns. |
| truth | Unquoted column name of the observed class labels (a factor). |
| ... | Unquoted column names of the predicted probabilities. One column for binary; one per class for multiclass. |
| na_rm | If TRUE, drop rows where any column is missing before scoring. |
| sum | If TRUE, return the summed log loss instead of the mean. |
| event_level | "first" or "second"; for binary, names the positive class. |
| case_weights | Optional unquoted column for weighting rows. |
For binary, pass the positive-class probability column. For multiclass, pass every class probability column with tidyselect syntax in the same order as levels(truth). Truth must be a factor. By convention the probability columns are named .pred_ plus the level name, but yardstick pairs columns with levels by position, not by name. Yardstick clips probabilities to [1e-15, 1 - 1e-15] so a 0 or 1 never produces -Inf.
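To see what clipping buys you, the sketch below applies the bounds quoted above by hand. `clip_prob()` and its `eps` default are illustrative assumptions that mirror the value stated in this article. They are not copied from yardstick's source, and the exact bound may differ between versions.

```r
# Hypothetical helper: squeeze probabilities into [eps, 1 - eps].
# eps = 1e-15 is the bound quoted in the text, taken here as an assumption.
clip_prob <- function(p, eps = 1e-15) {
  pmin(pmax(p, eps), 1 - eps)
}

# Probability assigned to the true class in three rows;
# the last row is a hard 0 on the true class.
p_true_class <- c(0.8, 0.6, 0)

unclipped <- -log(p_true_class)            # third element is Inf
clipped <- -log(clip_prob(p_true_class))   # third element is large but finite

print(unclipped)
print(clipped)
```

Clipping keeps the mean finite, but the clipped row still contributes a very large loss.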
Score classifiers: four worked examples
The examples build a two-class tibble you can run end to end. It mimics a churn-prediction problem with moderate signal.
Example 1 calls mn_log_loss() with the positive-class probability column. "yes" is the first factor level, so yardstick treats it as positive.
A 0.503 mean log loss beats the no-information baseline of a 30/70 split (an entropy of around 0.611), so the predictions carry real information.
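The 0.611 reference figure is the entropy of the class distribution and can be reproduced directly. `class_entropy()` is a hypothetical helper introduced for this sketch.

```r
# Hypothetical helper: entropy (in nats) of a vector of class proportions.
class_entropy <- function(prevalence) {
  -sum(prevalence * log(prevalence))
}

baseline_30_70 <- class_entropy(c(0.3, 0.7))   # the split discussed above
baseline_50_50 <- class_entropy(c(0.5, 0.5))   # balanced case, equals log(2)

print(round(c(baseline_30_70, baseline_50_50), 3))
# 0.611 0.693
```

The more imbalanced the classes, the lower this baseline, so a raw log loss is only meaningful next to the baseline for the same data.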
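Example 2 changes the event level with event_level. Setting "second" tells yardstick that the supplied probability column refers to the second factor level, "no".
Log loss itself is symmetric: scoring .pred_yes against "yes" and scoring 1 - .pred_yes against "no" give the same number, so unlike roc_auc() or pr_auc() the choice of positive class does not change the math. What event_level controls is which level the probability column is read as belonging to. Pair event_level = "second" with the second level's probability (for example 1 - .pred_yes); passing .pred_yes unchanged would score it as the probability of "no" and return a different, misleading number.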
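The symmetry can be verified by hand in base R. The data below are hypothetical toy values, and the calculation is a direct use of the definition rather than a call into yardstick.

```r
truth <- factor(c("yes", "no", "no", "yes", "no"), levels = c("yes", "no"))
pred_yes <- c(0.8, 0.3, 0.1, 0.6, 0.4)   # hypothetical P(yes) per row

# Score P(yes) with "yes" as the event
loss_yes <- -mean(ifelse(truth == "yes", log(pred_yes), log(1 - pred_yes)))

# Score P(no) = 1 - P(yes) with "no" as the event
pred_no <- 1 - pred_yes
loss_no <- -mean(ifelse(truth == "no", log(pred_no), log(1 - pred_no)))

print(c(loss_yes = loss_yes, loss_no = loss_no))
```

Both values agree up to floating-point error, because each row contributes the log of the probability given to its observed class either way.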
Example 3 scores per resample fold. Pair group_by() with mn_log_loss() to get one score per fold without a loop.
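What the grouped call computes can be sketched by hand: a per-row loss, then a mean within each fold. The data frame below is a hypothetical stand-in (not the churn tibble used in this article), and the calculation uses base R rather than yardstick.

```r
# Hypothetical toy data: two folds of four rows each
df <- data.frame(
  fold = rep(c("fold1", "fold2"), each = 4),
  class = factor(c("yes", "no", "no", "yes", "no", "no", "yes", "no"),
                 levels = c("yes", "no")),
  .pred_yes = c(0.9, 0.2, 0.4, 0.7, 0.1, 0.3, 0.6, 0.5)
)

# Per-row negative log-likelihood of the observed class
row_loss <- -ifelse(df$class == "yes",
                    log(df$.pred_yes),
                    log(1 - df$.pred_yes))

# Mean log loss within each fold
per_fold <- tapply(row_loss, df$fold, mean)
print(per_fold)
```

With equal fold sizes, the average of the per-fold scores equals the overall mean log loss; with unequal folds it does not, which is one reason to report fold sizes next to fold scores.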
Example 4 uses the vector interface for ad-hoc checks. mn_log_loss_vec() accepts a truth factor and a probability vector and returns a plain numeric scalar.
metric_set(mn_log_loss, roc_auc, brier_class) returns a reusable function that scores likelihood, ranking, and calibration in a single call, the minimum honest report for any probabilistic classifier.
Multiclass log loss
mn_log_loss() generalizes to K classes without an averaging argument. Yardstick uses the categorical cross-entropy formula directly, so the multiclass estimator label is simply "multiclass".
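As a rough illustration of the categorical cross-entropy formula, the sketch below picks out the probability assigned to each row's observed class and averages the negative logs. The three-class data are hypothetical toy values, so the result is not the 1.144 figure discussed next, and the code is a hand calculation rather than yardstick's implementation.

```r
truth <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))

# One column per class, in the same order as levels(truth); rows sum to 1
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.6, 0.3,
    0.2, 0.2, 0.6,
    0.3, 0.4, 0.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, levels(truth))
)

# Probability assigned to the observed class in each row
p_observed <- probs[cbind(seq_along(truth), as.integer(truth))]

multiclass_loss <- -mean(log(p_observed))
uniform_reference <- log(nlevels(truth))   # loss of always predicting 1/K

print(c(multiclass_loss = multiclass_loss,
        uniform_reference = uniform_reference))
```

Only the column of the observed class enters each row's loss; the other columns matter indirectly because the row must sum to 1.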
The score 1.144 sits just above the uniform-prediction reference of log(3) = 1.099, so these random probabilities are slightly worse than always predicting 1/3 per class. Switching to sum = TRUE returns the unscaled sum:
| Argument | What it returns | When to use |
|---|---|---|
| sum = FALSE (default) | Mean log loss across rows | Comparable across datasets, fold sizes |
| sum = TRUE | Summed log loss across rows | Aggregating across batches of equal weight |
Default to the mean for headline scores; use sum = TRUE only when you plan to divide by a custom denominator.
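The relationship between the two modes is simple arithmetic, sketched here with hypothetical per-row losses in base R. Pooling batches by adding sums and row counts, then dividing once, recovers the overall mean regardless of batch size.

```r
# Hypothetical per-row losses for four rows
row_loss <- -log(c(0.7, 0.6, 0.6, 0.3))

summed <- sum(row_loss)      # what a summed log loss reports
averaged <- mean(row_loss)   # what the default mean reports

# Two batches of unequal size, pooled with a custom denominator
batch_a <- row_loss[1:3]
batch_b <- row_loss[4]
pooled <- (sum(batch_a) + sum(batch_b)) / (length(batch_a) + length(batch_b))

print(c(summed = summed, averaged = averaged, pooled = pooled))
```

Averaging the two batch means instead would overweight the single-row batch, which is the situation the summed form is meant to avoid.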
mn_log_loss vs roc_auc vs brier_class
Four of the five metrics below take predicted probabilities as input; accuracy() is included as the hard-label contrast. The table sorts out the overlap.
| Function | What it scores | When to prefer it |
|---|---|---|
| mn_log_loss() | Mean negative log-likelihood | Penalize overconfident wrong predictions |
| roc_auc() | Probability positive outranks negative | Threshold-free ranking on balanced data |
| pr_auc() | Area under the PR curve | Rare-positive ranking on imbalanced data |
| brier_class() | Mean squared error of probabilities | Calibration of binary probabilities |
| accuracy() | Share of correct labels at one cutoff | Reporting a deployed threshold |
mn_log_loss() is the right call whenever probability quality matters more than ranking or a single threshold. Boosting and neural net objectives are typically log loss, so reporting it keeps train-time and test-time metrics aligned.
yardstick::mn_log_loss() computes the same quantity as sklearn.metrics.log_loss(), so results agree up to any difference in the clipping bound. Both clip probabilities to avoid -Inf and both default to the mean over rows. Pass sum = TRUE in yardstick to match normalize = False in sklearn.
Common pitfalls
Three small mistakes account for most mn_log_loss() errors.
The first is feeding class labels instead of probabilities. mn_log_loss() needs the positive-class probability for binary or one column per class for multiclass. Passing .pred_class from parsnip::augment() raises an error because hard labels are not numeric in [0, 1].
The second is column-order drift on multiclass. The ... arguments are matched against levels(truth) by position. If levels(truth) is c("a", "b", "c") but you pass .pred_b, .pred_a, .pred_c, yardstick assigns the wrong probability to each class.
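The effect of column-order drift is easy to reproduce by hand. In this sketch, `positional_loss()` is a hypothetical helper that pairs columns with levels(truth) by position only, which is the matching behavior described above; the data are toy values.

```r
truth <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))
pred <- data.frame(
  .pred_a = c(0.7, 0.1, 0.2, 0.6),
  .pred_b = c(0.2, 0.8, 0.1, 0.3),
  .pred_c = c(0.1, 0.1, 0.7, 0.1)
)

# Hypothetical helper: column i is treated as the probability of level i
positional_loss <- function(truth, prob_cols) {
  m <- as.matrix(prob_cols)
  -mean(log(m[cbind(seq_along(truth), as.integer(truth))]))
}

right_order <- positional_loss(truth, pred[, c(".pred_a", ".pred_b", ".pred_c")])
wrong_order <- positional_loss(truth, pred[, c(".pred_b", ".pred_a", ".pred_c")])

print(c(right_order = right_order, wrong_order = wrong_order))

# A cheap guard before scoring: column names must line up with the levels
stopifnot(identical(names(pred), paste0(".pred_", levels(truth))))
```

The swapped call runs without complaint and returns a plausible-looking number, so a name check like the final line is worth adding before any multiclass scoring.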
The third pitfall is unnormalized probabilities. Yardstick does not check that the K columns sum to 1 for multiclass. If they sum to 1.10 or 0.85 from a softmax bug, the score is still a number, just a wrong one. Verify rowSums() equals 1 before scoring.
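A minimal version of that check, using a hypothetical probability matrix with one deliberately broken row. The tolerance of 1e-8 is an arbitrary choice for this sketch.

```r
# Hypothetical class probabilities, one row per observation
probs <- rbind(
  c(0.7, 0.2, 0.1),
  c(0.5, 0.4, 0.2),   # sums to 1.10: not a valid distribution
  c(0.2, 0.2, 0.6)
)

row_totals <- rowSums(probs)

# Compare with a tolerance; exact equality fails on floating-point sums
bad_rows <- which(abs(row_totals - 1) > 1e-8)

print(row_totals)
print(bad_rows)
```

Any row index reported in `bad_rows` points at an upstream problem that should be fixed before the score is trusted.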
Yardstick clips probabilities to [1e-15, 1 - 1e-15], so a hard 0 on the true class produces a loss near 34.5 per row, not Inf. One such row can dominate the mean. Cap engine outputs to [0.001, 0.999] before scoring.
Try it yourself
Try it: Use the built-in two_class_example data from yardstick. The truth column is truth and the probability column for the positive class (Class1) is Class1. Compute the mean log loss, then compute the summed log loss with sum = TRUE. Save the second result to ex_mll_sum.
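One way to write the solution, assuming yardstick is installed (the two_class_example data ships with the package). `ex_mll_mean` is a name introduced here for the first result; only `ex_mll_sum` is required by the exercise.

```r
library(yardstick)

# Mean log loss (the default)
ex_mll_mean <- mn_log_loss(two_class_example, truth, Class1)

# Summed log loss across all rows
ex_mll_sum <- mn_log_loss(two_class_example, truth, Class1, sum = TRUE)

print(ex_mll_mean)
print(ex_mll_sum)

# The two estimates differ only by the row count
print(ex_mll_sum$.estimate / nrow(two_class_example))
```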
Explanation: Setting sum = TRUE returns the summed log loss across all 500 rows. Dividing by 500 recovers the default mean estimate of about 0.328; the two numbers are the same metric at different scales.
Related yardstick metrics
mn_log_loss() is one entry in the yardstick probability-metric family. Reach for these when log loss alone is not enough:
- roc_auc() for threshold-free ranking quality
- pr_auc() for ranking on imbalanced data
- brier_class() for calibration error of binary probabilities
- classification_cost() for cost-weighted misclassification
- accuracy(), recall(), precision() for one-threshold scoring
For the full set, see the yardstick reference index.
FAQ
What is a good mn_log_loss score in R?
Mean log loss is unbounded above and bounded below by 0, so "good" depends on the class prior. The no-information baseline is the entropy -sum(p * log(p)) of the class proportions. On a balanced two-class problem that baseline is log(2) = 0.693; any model scoring below it adds value. Strong calibrated classifiers on real data typically land between 0.10 and 0.40.
Does mn_log_loss() need probabilities or class labels?
It needs probabilities. The function evaluates the negative log-likelihood of the observed classes under the predicted distribution, which requires numbers in [0, 1] per class. Pass probability columns from parsnip::augment(), usually named .pred_<class>. Feeding .pred_class (the hard label) raises a type error because the column is a factor.
When should I use mn_log_loss instead of accuracy?
Use mn_log_loss() whenever the downstream decision depends on a probability rather than a hard label. Calibration drives expected value in fraud, churn, advertising, and medical screening; accuracy throws that information away. Report log loss as the primary metric and accuracy as a sanity check on the deployed threshold.
Why does yardstick clip probabilities before computing log loss?
A predicted probability of exactly 0 for the true class would push log(p) to -Inf, and a single such row would make the mean loss infinite. Yardstick clips inputs to [1e-15, 1 - 1e-15], matching scikit-learn. The clip means a hard wrong call costs roughly 34.5 per row, severe but finite.
How is mn_log_loss different from cross-entropy in deep learning?
They are the same quantity. Categorical cross-entropy in Keras or PyTorch is the multiclass mean negative log-likelihood that mn_log_loss() computes. The main difference is input shape and scale: frameworks take a tensor of class scores (probabilities or, as in PyTorch's CrossEntropyLoss, raw logits), while yardstick expects one probability column per class.