tune collect_metrics() in R: Extract Tuning Metrics
The tune collect_metrics() function in R pulls performance metrics out of a tune_results object, returning a tidy tibble with one row per hyperparameter candidate (and metric) so you can rank, plot, or join the results downstream.
collect_metrics(res) # aggregated mean and std_err
collect_metrics(res, summarize = FALSE) # per-resample, per-candidate
collect_metrics(res, type = "wide") # one column per metric
collect_metrics(last_fit_obj) # test set metrics from last_fit()
collect_metrics(res) |> filter(.metric == "rmse") # keep one metric
collect_metrics(res) |> arrange(mean) # rank candidates by mean
collect_metrics(res) |> group_by(.metric) |> slice_min(mean, n = 1) # lowest mean per metric (best for rmse, worst for rsq)
Need explanation? Read on for examples and pitfalls.
What collect_metrics() does in one sentence
collect_metrics() unpacks the .metrics list column of a tune_results object into a flat, mergeable summary. When tune_grid(), fit_resamples(), tune_bayes(), or last_fit() runs, the returned object has one row per resample, and each row stores that resample's numbers in a nested list column called .metrics. collect_metrics() walks that column, averages across folds by default, and hands you a tibble with the tuning parameters, .metric, .estimator, mean, n, std_err, and a .config identifier for each candidate.
This is the function you almost always call right after a tuning run finishes. show_best() and autoplot() rely on the same data; collect_metrics() is the version you reach for when you want full control over filtering, grouping, or joining.
Set up a tune_results object
You need a fitted tuning result before collect_metrics() does anything useful. The block below tunes the elastic-net penalty and mixture on a 5-fold split of the Ames housing data so the rest of the page has a real object to work with.
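A minimal sketch of such a setup follows. The predictor subset, the log10 transform of Sale_Price, the seed, and the grid size are illustrative assumptions rather than the page's original choices; the glmnet package is assumed to be installed.

```r
library(tidymodels)  # attaches tune, rsample, recipes, parsnip, workflows, yardstick, dplyr

set.seed(123)

# The Ames housing data ships with the modeldata package
data(ames, package = "modeldata")
ames <- ames |> mutate(Sale_Price = log10(Sale_Price))

# 5-fold cross-validation splits
folds <- vfold_cv(ames, v = 5)

# Assumed predictor subset, chosen only to keep the example small
rec <- recipe(Sale_Price ~ Gr_Liv_Area + Year_Built + Lot_Area + Neighborhood,
              data = ames) |>
  step_other(Neighborhood, threshold = 0.05) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Elastic net: both penalty and mixture are marked for tuning
spec <- linear_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(spec)

# Ten candidates scored on RMSE and R-squared in every fold
res <- tune_grid(
  wf,
  resamples = folds,
  grid = 10,
  metrics = metric_set(rmse, rsq)
)

res  # one row per fold, with .metrics as a list column
```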
The .metrics list column is exactly what collect_metrics() will flatten.
collect_metrics() syntax and arguments
The signature is intentionally minimal.
x is the tuning result. summarize = TRUE aggregates across folds; summarize = FALSE returns one row per fold per candidate per metric. type = "wide" pivots .metric into columns, which is handy for plotting two metrics against the same parameter axis.
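Because the argument list has changed across tune releases (type, in particular, is only present in versions that support wide output), it can be worth checking what the installed version accepts. The sketch below looks up the S3 method for tune_results objects and prints its arguments; the variable name cm_method is just a placeholder.

```r
library(tune)

# Find the method that handles tune_results objects in the installed version
cm_method <- getS3method("collect_metrics", "tune_results")

# Print the argument list with defaults
args(cm_method)

# Argument names only, handy for a programmatic check
arg_names <- names(formals(cm_method))
arg_names

# TRUE when the installed version supports type = "wide"
"type" %in% arg_names
```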
These arguments interact with the input shape. The table below shows which columns to expect for each combination, so you know which downstream code to write.
| Input type | summarize | Returns | Key columns |
|---|---|---|---|
| tune_grid / tune_bayes | TRUE | one row per candidate per metric | params, .metric, mean, n, std_err |
| tune_grid / tune_bayes | FALSE | one row per resample per candidate per metric | params, id, .metric, .estimate |
| fit_resamples | TRUE | one row per metric | .metric, mean, n, std_err |
| last_fit | (ignored) | one row per metric on the test set | .metric, .estimate |
Aggregated vs per-resample results
The default aggregates across folds; pass summarize = FALSE to keep individual fold scores. Aggregated output is what you usually want for model selection. Per-resample output is what you want for variance plots, paired comparisons, or to feed into perf_mod() from the tidyposterior package.
Notice the column rename: aggregated output has mean and std_err, per-resample has .estimate and id. Code that consumes both modes must check which columns exist.
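The relationship between the two modes can be checked by hand. The sketch below assumes res is the tune_grid() result from the setup section and recomputes the per-candidate mean from the per-fold rows; the names agg, fold, manual, and check are placeholders.

```r
library(tidymodels)

# Assumes `res` is the tune_grid() result built in the setup section
agg  <- collect_metrics(res)                      # mean, n, std_err
fold <- collect_metrics(res, summarize = FALSE)   # id, .estimate

# Columns that exist in only one of the two modes
setdiff(names(agg), names(fold))
setdiff(names(fold), names(agg))

# Rebuild the aggregate from the fold-level rows
manual <- fold |>
  group_by(.config, .metric) |>
  summarize(
    mean_manual = mean(.estimate, na.rm = TRUE),
    n_manual    = sum(!is.na(.estimate)),
    .groups     = "drop"
  )

# Side-by-side comparison; the difference should be numerically zero
check <- agg |>
  left_join(manual, by = c(".config", ".metric")) |>
  mutate(diff = abs(mean - mean_manual))

check |> select(.config, .metric, mean, mean_manual, n, n_manual, diff)
```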
The n column tells you how many resamples actually scored. If n is less than the resample count you set up, some folds failed to produce a value for that candidate. Check collect_notes(res) to see why and decide whether the mean is still trustworthy.
Long vs wide format
Wide format gives one row per candidate and one column per metric. This is what you want when you plot rmse against rsq, or when a downstream join expects a single key per parameter set.
The std_err columns are dropped in wide mode. Use long for variance work, wide for plotting and joining.
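A short sketch of the wide form, assuming res is the tune_grid() result from the setup section (scored on rmse and rsq) and that the installed tune version supports the type argument:

```r
library(tidymodels)

# Assumes `res` is the tune_grid() result built in the setup section
wide <- collect_metrics(res, type = "wide")

# One row per candidate; the metric names have become columns
names(wide)

# Plot one metric against the other, one point per candidate
ggplot(wide, aes(x = rmse, y = rsq)) +
  geom_point() +
  labs(x = "RMSE (mean across folds)", y = "R-squared (mean across folds)")
```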
Joining with show_best() and select_best()
collect_metrics() is the source; show_best() and select_best() are convenience wrappers over the same summarized metrics. They filter to one metric, sort, and slice. You can replicate either by hand with collect_metrics() and add custom filters along the way.
The helpers work on one metric at a time, chosen with the metric argument. When you want every metric side by side (for example a custom yardstick metric next to the defaults in a metric_set), collect_metrics() is the call that gives you the full set.
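A sketch of the by-hand equivalent, assuming res is the tune_grid() result from the setup section with penalty and mixture as tuning parameters. The mixture >= 0.5 filter is an arbitrary illustration of a custom constraint.

```r
library(tidymodels)

# Assumes `res` is the tune_grid() result built in the setup section

# Hand-rolled leaderboard: top five candidates by RMSE (lower is better)
top5_manual <- collect_metrics(res) |>
  filter(.metric == "rmse") |>
  arrange(mean) |>
  slice_head(n = 5)

# The helper, for comparison
top5_helper <- show_best(res, metric = "rmse", n = 5)

# A custom constraint the helpers do not offer directly:
# best RMSE among candidates that lean toward the lasso
best_lasso_like <- collect_metrics(res) |>
  filter(.metric == "rmse", mixture >= 0.5) |>
  slice_min(mean, n = 1, with_ties = FALSE) |>
  select(penalty, mixture, .config)

best_lasso_like
```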
Common pitfalls
Three failure modes account for most of the friction.
- Calling collect_metrics() on a workflow instead of the result. The workflow has no metrics yet. Run tune_grid() or fit_resamples() first; pass the returned object.
- Assuming mean always exists. With summarize = FALSE, the column is .estimate. Code that hard-codes mean breaks. Branch on summarize or rename early.
- Forgetting the metric filter. slice_min(mean, n = 1) without a .metric filter picks the row with the smallest value across rmse, rsq, and any others. Always filter to a single metric or group by .metric before slicing.
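The second and third pitfalls can be reproduced without a tuning run. The sketch below uses a hand-made tibble shaped like collect_metrics() output; the numbers are made up for illustration and the helper value_col() is hypothetical.

```r
library(dplyr)
library(tibble)

# Hand-made stand-in for collect_metrics() output; values are illustrative only
metrics <- tribble(
  ~penalty, ~mixture, ~.metric, ~mean,
  0.001,    0.2,      "rmse",   2.10,
  0.001,    0.2,      "rsq",    0.74,
  0.100,    0.8,      "rmse",   2.60,
  0.100,    0.8,      "rsq",    0.61
)

# Pitfall: no metric filter, so rmse and rsq values compete on one scale.
# The smallest number in the table is an rsq value from the weaker candidate.
wrong <- metrics |> slice_min(mean, n = 1)
wrong

# Fix: one metric at a time, with the direction that metric calls for
best_rmse <- metrics |> filter(.metric == "rmse") |> slice_min(mean, n = 1)
best_rsq  <- metrics |> filter(.metric == "rsq")  |> slice_max(mean, n = 1)

# Hypothetical helper for the mean vs .estimate pitfall:
# report which value column a collect_metrics() result carries
value_col <- function(x) if ("mean" %in% names(x)) "mean" else ".estimate"
value_col(metrics)
```

Note that grouping by .metric and calling slice_min() still picks the lowest value within each metric, which is the worst candidate for metrics where higher is better.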
Try it yourself
Try it: From the res tibble above, use collect_metrics() to find the candidate with the smallest RMSE and save its penalty and mixture to ex_best.
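One possible solution, assuming res is the tune_grid() result from the setup section with penalty and mixture as its tuning parameters:

```r
library(tidymodels)

# Assumes `res` is the tune_grid() result built in the setup section
ex_best <- collect_metrics(res) |>
  filter(.metric == "rmse") |>   # keep RMSE rows only
  arrange(mean) |>               # smallest mean first
  slice_head(n = 1) |>           # take the top row
  select(penalty, mixture)       # keep just the tuning parameters

ex_best
```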
Explanation: collect_metrics() flattens the nested .metrics column, the filter keeps RMSE rows only, and slice_head() after arrange() picks the smallest mean. select() trims to the two tuning parameters so ex_best is a one-row, two-column tibble ready to plug into finalize_workflow().
Related tune functions
- show_best() returns the top-N candidates pre-sorted on one metric.
- select_best() returns the single best candidate as a one-row tibble.
- collect_predictions() pulls out-of-fold predictions instead of metrics.
- autoplot() on tune_results uses collect_metrics() output to draw faceted curves.
- finalize_workflow() takes the tibble select_best() returns and fills in the tune() placeholders.
For the official reference, see the tune package documentation.
FAQ
What is the difference between collect_metrics() and show_best()?
show_best() is collect_metrics() plus a filter on one metric, an arrange by mean in that metric's preferred direction (ascending for rmse, descending for rsq), and a slice_head() of the top N. If you want all metrics, custom filters, or the full table for plotting, call collect_metrics() directly. If you just want the leaderboard for one metric, show_best() saves three lines.
Why does collect_metrics() return NA for std_err sometimes?
NA appears when only one resample produced a finite value for that metric. With n = 1, the standard error is undefined. The fix is upstream: check collect_notes() for fold-level errors, add more resamples, or remove problematic candidates.
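A quick diagnostic along these lines, assuming res is the tune_grid() result from the setup section. For a tune_grid() or fit_resamples() result the object has one row per resample, so nrow(res) is used here as the expected fold count; that shortcut is an assumption that does not carry over to iterative results such as tune_bayes().

```r
library(tidymodels)

# Assumes `res` is the tune_grid() result built in the setup section
n_folds <- nrow(res)  # one row per resample for tune_grid() results

# Candidates that lost folds or have an undefined standard error
suspect <- collect_metrics(res) |>
  filter(n < n_folds | is.na(std_err))

suspect

# Warnings and errors recorded during tuning, one row per note
collect_notes(res)
```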
Can I use collect_metrics() with last_fit()?
Yes. last_fit() returns a one-row tibble that behaves like a single-resample tune_results object. collect_metrics() on it gives the test-set scores (one row per metric in the metric_set you passed). There is no summarize choice in that case because there is only one fold.
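A self-contained sketch of that case. The plain linear model, the two predictors, and the 80/20 split are illustrative assumptions, chosen so the example runs without a prior tuning step.

```r
library(tidymodels)

set.seed(123)

data(ames, package = "modeldata")
ames <- ames |> mutate(Sale_Price = log10(Sale_Price))

# Single train/test split
split <- initial_split(ames, prop = 0.8)

# Fit on the training portion, score on the held-out portion
final_fit <- last_fit(
  linear_reg(),
  Sale_Price ~ Gr_Liv_Area + Year_Built,
  split = split,
  metrics = metric_set(rmse, rsq)
)

# One row per metric, with .estimate instead of mean / n / std_err
test_metrics <- collect_metrics(final_fit)
test_metrics
```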
Does collect_metrics() work with custom yardstick metrics?
Yes. Any metric that was part of the metric_set() at tune time is included automatically. The column shape is identical: .metric carries the custom metric name and the rest of the columns are unchanged.
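As a small illustration, the sketch below builds a renamed variant of an existing yardstick metric with metric_tweak() rather than writing a metric from scratch, then confirms the new name appears in the collected output. The name huber_2, the delta value, and the mtcars model are all arbitrary choices for the example.

```r
library(tidymodels)

set.seed(123)

# A tweaked metric: Huber loss with delta = 2, reported under its own name
huber_2 <- metric_tweak("huber_2", huber_loss, delta = 2)

# Resample a simple linear model on mtcars with the tweaked metric included
res_custom <- fit_resamples(
  linear_reg(),
  mpg ~ wt + hp,
  resamples = vfold_cv(mtcars, v = 5),
  metrics = metric_set(rmse, huber_2)
)

# The tweaked metric shows up in .metric next to rmse
cm_custom <- collect_metrics(res_custom)
cm_custom
```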
Can I pivot back from wide to long?
Yes, but the more direct path is to re-call collect_metrics(res, type = "long"). The wide form drops std_err, so manual pivoting from wide cannot reconstruct the standard errors. Always derive both views from the same tune_results object.