yardstick rmse() in R: Score Regression Models

The yardstick rmse() function in R returns the root mean squared error of a regression model, accepting a tibble with truth and estimate columns and producing a tidy one-row summary that reads in the same units as the outcome variable.

By Selva Prabhakaran · Published May 23, 2026 · Last updated May 23, 2026

⚡ Quick Answer

```r
rmse(df, truth, estimate)                          # basic call
rmse(df, truth = obs, estimate = pred)             # named arguments
rmse(solubility_test, solubility, prediction)      # built-in example data
df |> group_by(fold) |> rmse(obs, pred)            # one score per resample
rmse(df, obs, pred, na_rm = TRUE)                  # drop missing rows (the default)
rmse_vec(truth_vec, pred_vec)                      # vector interface, returns a number
rmse(df, obs, pred, case_weights = w)              # weighted RMSE
```

Need explanation? Read on for examples and pitfalls.


What rmse() measures

rmse() squares each residual, averages the squares, then takes the square root. You pass a data frame with the observed numeric outcome and the predicted values, and the function returns a one-row tibble with three columns: .metric, .estimator, and .estimate. The estimate is a non-negative number on the same scale as the outcome, so an RMSE of 0.7 on a solubility score means a typical prediction misses by roughly 0.7 units, with large misses weighted more heavily than in a plain average of absolute errors.
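Written out, the three steps above are the following formula, where y_i is the observed value, ŷ_i the prediction, and n the number of scored rows:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

The outer square root undoes the squaring of the units, which is why the result lands back on the outcome's scale.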

Because squaring punishes large misses, RMSE is more outlier-sensitive than mean absolute error. That sensitivity is a feature when one big miss matters more than many small ones, and a bug when noisy outliers dominate the score. Pair RMSE with mae() and rsq() whenever you compare regression models.
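A small sketch of that sensitivity, using hypothetical residuals rather than a real model: nine predictions miss by 1 and one misses by 10, and the same vectors are scored with rmse_vec() and mae_vec().

```r
library(yardstick)

# Hypothetical residuals: nine predictions miss by 1, one misses by 10
truth    <- rep(0, 10)
estimate <- c(rep(1, 9), 10)

rmse_vec(truth, estimate)  # sqrt((9 + 100) / 10) = sqrt(10.9), about 3.30
mae_vec(truth, estimate)   # (9 + 10) / 10 = 1.9
```

If the large miss were just another miss of 1, both metrics would equal 1. The single outlier more than triples RMSE but less than doubles MAE.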

Key Insight

RMSE inherits the units of your outcome. An RMSE of 3 on a price-in-dollars target means typical errors of about $3; on a log-price target it means a multiplicative factor of about exp(3). Always interpret RMSE on the scale you trained on.
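One way to see that RMSE carries the outcome's units is to rescale the outcome. The sketch below uses hypothetical prices and scores the same predictions once in dollars and once in cents.

```r
library(yardstick)

# Hypothetical prices in dollars and matching predictions
truth_usd    <- c(10, 12, 15, 20)
estimate_usd <- c(11, 12, 13, 24)

rmse_usd   <- rmse_vec(truth_usd, estimate_usd)              # sqrt(21 / 4), about 2.29
rmse_cents <- rmse_vec(truth_usd * 100, estimate_usd * 100)  # same predictions in cents

rmse_cents / rmse_usd  # 100, up to floating-point rounding
```

Multiplying the outcome by 100 multiplies RMSE by 100, so the number is only meaningful next to the units it was computed in.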

rmse() syntax and arguments

The signature matches every other yardstick numeric metric. Once you know the shape, the same call works for mae(), rsq(), mape(), and the rest of the regression family.


Signature of the rmse() data frame method

```r
rmse(data, truth, estimate, na_rm = TRUE, case_weights = NULL, ...)
```

Argument	Description
`data`	A data frame with the truth and estimate columns.
`truth`	Unquoted column name of the observed numeric outcome.
`estimate`	Unquoted column name of the predicted numeric values.
`na_rm`	If `TRUE`, drop rows where either column is missing before scoring.
`case_weights`	Optional column of row weights for survey or importance-weighted data.

Truth and estimate must both be numeric; factor, character, or logical inputs raise an error. If you fitted a classification model, reach for accuracy() or roc_auc() instead.

Score regression models: four examples

The examples below use yardstick's built-in solubility_test data, which ships a real regression prediction set. First, load the package and inspect the data.

Load yardstick and inspect solubility data

```r
library(yardstick)
library(dplyr)
data("solubility_test")
head(solubility_test, 4)
#> # A tibble: 4 x 2
#>   solubility prediction
#>        <dbl>      <dbl>
#> 1       0.93       0.99
#> 2       0.85       0.71
#> 3       0.17       0.18
#> 4      -0.99      -0.57
```

Example 1 calls rmse() with positional arguments. The function picks up truth and estimate by position and returns the tidy summary.

Basic rmse score on solubility predictions

```r
rmse(solubility_test, solubility, prediction)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard       0.722
```

The .estimator column reports standard because RMSE has no binary or multiclass variant. The estimate of 0.722 is in log-solubility units.

Example 2 uses named arguments on a fresh prediction frame. Named arguments are clearer when truth and estimate sit next to other features in a larger tibble.

Named-argument rmse on a custom frame

```r
set.seed(1)
custom <- tibble(
  obs  = rnorm(100, mean = 50, sd = 10),
  pred = rnorm(100, mean = 50, sd = 10)
)
rmse(custom, truth = obs, estimate = pred)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard        14.1
```

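Example 3 groups scoring by resample fold. When cross-validation predictions live in one tibble, group_by() plus rmse() returns one score per fold, giving instant per-resample diagnostics. The fold labels below are assigned round-robin purely for illustration; real folds would come from your resampling scheme.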

Per-fold rmse from a cross-validation tibble

```r
folded <- solubility_test |>
  mutate(fold = rep(paste0("fold", 1:5), length.out = n()))

folded |>
  group_by(fold) |>
  rmse(truth = solubility, estimate = prediction)
#> # A tibble: 5 x 4
#>   fold  .metric .estimator .estimate
#>   <chr> <chr>   <chr>          <dbl>
#> 1 fold1 rmse    standard       0.730
#> 2 fold2 rmse    standard       0.713
#> 3 fold3 rmse    standard       0.737
#> 4 fold4 rmse    standard       0.709
#> 5 fold5 rmse    standard       0.722
```
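To turn the per-fold scores into a single cross-validated summary, one option is to average them and report their spread. This sketch reuses the same illustrative round-robin folds; the summary column names are hypothetical.

```r
library(yardstick)
library(dplyr)
data("solubility_test")

# Same illustrative round-robin fold labels as Example 3
folded <- solubility_test |>
  mutate(fold = rep(paste0("fold", 1:5), length.out = n()))

per_fold <- folded |>
  group_by(fold) |>
  rmse(truth = solubility, estimate = prediction)

# Collapse the per-fold scores into a mean and a spread
cv_summary <- per_fold |>
  ungroup() |>
  summarise(
    mean_rmse = mean(.estimate),
    sd_rmse   = sd(.estimate),
    n_folds   = n()
  )
cv_summary
```

The mean of fold-level RMSEs is not identical to the RMSE of all predictions pooled together, because the square root is applied within each fold before averaging.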

Example 4 uses the vector interface for quick checks. Inside map() calls or unit tests, rmse_vec() returns a plain scalar instead of a one-row tibble.

Vector interface returns a numeric scalar

```r
rmse_vec(solubility_test$solubility, solubility_test$prediction)
#> [1] 0.7222584
```

Use the vector form when you need a scalar for thresholds or unit tests; otherwise stay with the data-frame form so you can bind, group, or plot scores.
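As a sketch of the unit-test use case, the checks below assert two properties any RMSE implementation should satisfy, on hypothetical toy vectors: a perfect prediction scores zero, and shifting every prediction by a constant 1 gives an RMSE of exactly 1.

```r
library(yardstick)

# Hypothetical toy predictions for a unit test
truth    <- c(2, 4, 6)
estimate <- c(2, 4, 6)

# A perfect model should score exactly zero
stopifnot(rmse_vec(truth, estimate) == 0)

# Shifting every prediction by 1 should give an RMSE of exactly 1
stopifnot(isTRUE(all.equal(rmse_vec(truth, estimate + 1), 1)))
```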

Tip

Bind multiple metric calls with bind_rows(). Because every yardstick metric returns the same three columns, bind_rows(rmse(...), mae(...), rsq(...)) produces a tidy three-row scorecard you can pass straight to ggplot() or gt() with no reshaping.
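A minimal version of that scorecard on the built-in solubility_test data could look like this:

```r
library(yardstick)
library(dplyr)
data("solubility_test")

# Each call returns the same three columns, so the rows stack cleanly
scorecard <- bind_rows(
  rmse(solubility_test, truth = solubility, estimate = prediction),
  mae(solubility_test,  truth = solubility, estimate = prediction),
  rsq(solubility_test,  truth = solubility, estimate = prediction)
)
scorecard
```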

rmse() compared with related metrics

RMSE is a common headline metric for regression, but it is rarely the only one worth reporting. Pick a partner from the table when RMSE alone hides the story.

Metric	Best use case	Limitation
`rmse()`	Errors in outcome units, large misses matter	Heavily penalises outliers
`mae()`	Errors in outcome units, robust to outliers	Does not single out large misses; every unit of error counts the same
`rsq()`	Need a unit-free 0-to-1 goodness-of-fit	Can mask large systematic bias
`mape()`	Communicate relative error to non-technical readers	Explodes when truth is near zero
`rmsle()`	Outcome spans orders of magnitude, skewed	Undefined when truth or estimate is -1 or lower (it works on log(1 + x))
`ccc()`	Want agreement plus calibration in one number	Less familiar to most readers

A safe default is RMSE as the headline number, MAE as an outlier-robust check, and R-squared for unit-free comparison across models with the same target.
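One way to report that default trio in a single call is yardstick's metric_set(), which is not covered elsewhere on this page; it bundles several metrics into one function that takes the same data, truth, and estimate arguments. A sketch on the built-in data:

```r
library(yardstick)
data("solubility_test")

# Build one function that returns RMSE, MAE and R-squared together
reg_metrics <- metric_set(rmse, mae, rsq)

reg_scores <- reg_metrics(solubility_test, truth = solubility, estimate = prediction)
reg_scores
```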

Common pitfalls

Three mistakes come up often with rmse(). Each has a short fix.

The first is passing a factor column. yardstick refuses factor inputs because a factor implies classification, not regression. The fix is to cast to numeric before scoring:

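Fix: cast factor predictions to numeric

```r
library(yardstick)
library(dplyr)

bad <- tibble(obs = c(1.2, 3.4, 5.6), pred = factor(c("1", "3", "6")))
# rmse(bad, obs, pred)  # would error: estimate must be numeric, not a factor

# Convert via character; as.numeric() on the factor alone returns level codes 1, 2, 3
bad <- bad |> mutate(pred = as.numeric(as.character(pred)))
rmse(bad, obs, pred)
#> # A tibble: 1 x 3
#>   .metric .estimator .estimate
#>   <chr>   <chr>          <dbl>
#> 1 rmse    standard       0.346
```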

The second pitfall is comparing RMSE across different target transformations. An RMSE of 0.2 on log-price and 50 on raw-price are not on the same scale. Back-transform predictions before scoring, or stick to one target representation.
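A sketch of the back-transform step, using hypothetical prices and hypothetical predictions from a model fitted on log(price):

```r
library(yardstick)

# Hypothetical prices and predictions from a model fitted on log(price)
price    <- c(100, 250, 400, 1000)
log_pred <- log(c(110, 230, 420, 900))

# RMSE on the log scale: unit-free, roughly a relative error
rmse_log <- rmse_vec(log(price), log_pred)

# Back-transform with exp() first to get an RMSE in price units
rmse_raw <- rmse_vec(price, exp(log_pred))  # sqrt(2725), about 52.2

c(log_scale = rmse_log, price_scale = rmse_raw)
```

Both numbers describe the same predictions; only the second can be compared with an RMSE from a model that was fitted on raw prices.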

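The third pitfall is missing values. With na_rm = TRUE (the default), yardstick silently drops rows where truth or estimate is NA, which changes the number of rows the average is taken over. When comparing models that handle missingness differently, filter the prediction frame to the rows every model scored so each RMSE is computed on the same rows.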
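The sketch below shows both behaviours on hypothetical data: the default drops the NA row, na_rm = FALSE returns NA, and complete.cases() keeps only rows that two hypothetical models both predicted.

```r
library(yardstick)

# Hypothetical vectors with one missing truth value
truth    <- c(1, 2, 3, NA)
estimate <- c(1, 2, 5, 4)

rmse_vec(truth, estimate)                 # NA row dropped: sqrt(4 / 3), about 1.15
rmse_vec(truth, estimate, na_rm = FALSE)  # NA

# Hypothetical predictions from two models; model_b is missing one row
preds <- data.frame(
  obs     = c(1, 2, 3, 4),
  model_a = c(1.5, 2, 2.5, 4),
  model_b = c(1, NA, 3.5, 4)
)

# Keep only rows every model scored, so both RMSEs share a denominator
common <- preds[complete.cases(preds), ]
rmse_vec(common$obs, common$model_a)
rmse_vec(common$obs, common$model_b)
```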

Warning

One big outlier can swamp RMSE. Because errors are squared, a single residual of 100 adds as much to the sum of squared errors as a hundred residuals of 10 (100² = 100 × 10² = 10,000). When you see RMSE jump after a small data change, plot the residuals first; do not chase the metric.
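A quick sketch of how far one miss can move the score, with hypothetical residuals: a hundred predictions that each miss by 1, then the same set with a single miss of 100.

```r
library(yardstick)

# Hypothetical residuals: 100 predictions that each miss by 1
truth    <- rep(0, 100)
estimate <- rep(1, 100)
rmse_before <- rmse_vec(truth, estimate)  # 1

# Change a single prediction so it misses by 100 instead
estimate[100] <- 100
rmse_after <- rmse_vec(truth, estimate)   # sqrt((99 + 10000) / 100), about 10.05

100^2 == 100 * 10^2  # TRUE: one miss of 100 equals a hundred misses of 10
```

One changed row out of a hundred multiplies RMSE by roughly ten.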

Try it yourself

Try it: Use the built-in solubility_test data. Compute the overall RMSE, then mutate a bucket column that splits predictions into "low" (below 0) and "high" (0 or above), group by bucket, and report RMSE per group. Save the per-bucket result to ex_rmse_by_bucket.

Your turn: bucketed rmse

```r
library(yardstick)
library(dplyr)
data("solubility_test")

# Try it: rmse by prediction bucket
ex_rmse_by_bucket <- # your code here
ex_rmse_by_bucket
#> Expected: 2 rows, one per bucket
```


Solution

```r
ex_rmse_by_bucket <- solubility_test |>
  mutate(bucket = if_else(prediction < 0, "low", "high")) |>
  group_by(bucket) |>
  rmse(truth = solubility, estimate = prediction)
ex_rmse_by_bucket
#> # A tibble: 2 x 4
#>   bucket .metric .estimator .estimate
#>   <chr>  <chr>   <chr>          <dbl>
#> 1 high   rmse    standard       0.706
#> 2 low    rmse    standard       0.739
```

Explanation: Grouping before calling rmse() returns one row per bucket without reshaping. The "low" bucket has slightly higher RMSE, which suggests the model is a touch noisier when it predicts low solubility values.

Related yardstick metrics

rmse() is one entry in the yardstick numeric-metric family. Reach for these neighbors when RMSE alone is not enough:

mae() for an outlier-robust error in the same units
mape() for percentage error reporting to stakeholders
rsq() for a unit-free 0-to-1 goodness-of-fit
rmsle() for skewed targets where multiplicative error matters
ccc() for concordance correlation, useful in clinical and reliability work
metrics() to compute several regression scores in a single call

For the full set, see the yardstick reference index.

FAQ

What is a good RMSE value?

There is no universal threshold. RMSE is in the units of your outcome, so a "good" value depends on the spread of that outcome. Compare RMSE against the standard deviation of the truth column: if RMSE is much smaller, the model is adding signal; if it is close, the model is barely better than predicting the mean. Always interpret alongside rsq() and a residual plot.
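A sketch of that comparison on the built-in data: score the model, then score a baseline that always predicts the mean of the truth column.

```r
library(yardstick)
data("solubility_test")

y <- solubility_test$solubility
model_rmse <- rmse_vec(y, solubility_test$prediction)

# Baseline: always predict the mean of the truth column
baseline_rmse <- rmse_vec(y, rep(mean(y), length(y)))

c(model = model_rmse, mean_only = baseline_rmse, sd = sd(y))
```

The mean-only baseline's RMSE is the standard deviation computed with n rather than n - 1 in the denominator, so it sits just below sd(y). A model worth keeping should land well under it.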

How is rmse() different from sqrt(mean((y - yhat)^2))?

They return the same number. yardstick wraps the formula in a function that validates inputs, handles missing values, supports case_weights, and returns a tidy tibble that integrates with metrics() and group_by(). Use rmse() everywhere for consistency with the tidymodels workflow.
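A quick check of that equivalence on the built-in data:

```r
library(yardstick)
data("solubility_test")

y    <- solubility_test$solubility
yhat <- solubility_test$prediction

by_hand         <- sqrt(mean((y - yhat)^2))
yardstick_value <- rmse_vec(y, yhat)

all.equal(by_hand, yardstick_value)  # TRUE
```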

Can I weight rows when computing RMSE?

Yes. Pass a numeric case_weights column to rmse(). Common uses are survey weights, frequency weights, or importance weights (tidymodels represents the last two with hardhat::frequency_weights() and hardhat::importance_weights()). The result is a weighted root mean squared error on the outcome scale. Weights must be non-negative.
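A minimal sketch, assuming yardstick 1.0.0 or later (where case_weights was introduced) and a hypothetical weight column w. The hand calculation takes a weighted mean of the squared errors before the square root.

```r
library(yardstick)

# Hypothetical predictions with a weight column
df <- data.frame(
  obs  = c(1, 2, 3, 4),
  pred = c(1.5, 2, 2, 5),
  w    = c(1, 1, 2, 4)
)

weighted <- rmse(df, truth = obs, estimate = pred, case_weights = w)
weighted

# Same value by hand: weighted mean of squared errors, then square root
sqrt(weighted.mean((df$obs - df$pred)^2, w = df$w))  # sqrt(6.25 / 8), about 0.884
```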

Why does rmse() return a tibble instead of a number?

The tidy return shape is the yardstick convention. Every metric returns the same three columns: .metric, .estimator, .estimate. That uniformity lets you bind_rows() calls or pipe into group_by() with no reshape step. Call rmse_vec() when you only need the scalar.

Should I prefer RMSE or MAE for tuning?

If you care most about avoiding large individual errors, tune on RMSE. If your business cost is roughly linear in error size, tune on MAE. The difference matters when the error distribution is heavy-tailed: RMSE favours models that flatten the tail; MAE favours models that minimise typical error.
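The difference traces back to a standard result: among constant predictions, the mean minimises squared error and the median minimises absolute error. A sketch on a hypothetical skewed outcome:

```r
library(yardstick)

# Hypothetical skewed outcome with one large value
y <- c(1, 2, 2, 3, 3, 4, 40)
pred_mean   <- rep(mean(y), length(y))    # about 7.86, pulled up by the 40
pred_median <- rep(median(y), length(y))  # 3

rmse_mean   <- rmse_vec(y, pred_mean)
rmse_median <- rmse_vec(y, pred_median)
mae_mean    <- mae_vec(y, pred_mean)      # about 9.18
mae_median  <- mae_vec(y, pred_median)    # 6

# RMSE prefers the mean, MAE prefers the median
c(rmse_mean = rmse_mean, rmse_median = rmse_median,
  mae_mean = mae_mean, mae_median = mae_median)
```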

Summary

rmse() is the default regression scorecard in yardstick. Use it for the headline number, lean on mae() and rsq() as partners, and switch to rmse_vec() only when you need a scalar. Combined with group_by() it scales to any resampling scheme without reshaping, and combined with metrics() or bind_rows() of several metric calls it produces a clean multi-metric report ready to publish.
