rsample loo_cv() in R: Leave-One-Out Cross-Validation
The rsample loo_cv() function in R builds leave-one-out cross-validation splits, where each fold holds out exactly one observation and trains on the remaining n-1 rows. It returns one split per row, so it is the right tool for small datasets where you need an almost unbiased estimate of model performance.
    folds <- loo_cv(df)                # one split per row of df
    nrow(folds)                        # equals nrow(df)
    dim(folds)                         # n folds, 2 columns (splits, id)
    analysis(folds$splits[[1]])        # n-1 training rows for fold 1
    assessment(folds$splits[[1]])      # the 1 held-out row for fold 1
    # fit_resamples() from tune does not accept loo_cv() resamples; loop over folds$splits instead
Need explanation? Read on for examples and pitfalls.
What loo_cv() does
loo_cv() generates one resample per observation. For a data frame with n rows, it returns an rset with n splits. Each split's analysis set holds n-1 rows and its assessment set holds exactly 1 row. This is the limiting case of k-fold cross-validation where k equals n.
Among k-fold schemes, leave-one-out gives the lowest-bias estimate of out-of-sample model error because every fit uses almost the full dataset. The trade-off is computational cost: you fit the model n times instead of 5 or 10. For most practical problems with hundreds of rows or more, 10-fold cross-validation gives a similar estimate at a fraction of the cost.
Syntax and arguments
loo_cv() takes a data frame and returns an rset. The signature is short because there is no v argument, no repeats argument, no strata argument. The number of folds is fixed at nrow(data).
The output is a tibble with two columns. splits is a list-column where each element is an rsplit object pointing at indices into the original data. id is a character label, useful for joining per-fold metrics back to the resample table.
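A minimal sketch of the call and its return value, assuming the rsample package is installed. The data frame df is an illustrative choice (the first 10 rows of the built-in mtcars data), not something the function requires.

```r
library(rsample)

# Illustrative data: first 10 rows of the built-in mtcars data set
df <- mtcars[1:10, ]

# loo_cv() needs only the data frame
folds <- loo_cv(df)

nrow(folds)                # one resample per row of df
names(folds)               # the two columns: splits and id
class(folds$splits[[1]])   # each element of splits is an rsplit object
head(folds$id, 3)          # character labels, one per resample
```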
loo_cv() examples
Inspect a single split to see the n-1 / 1 partition. Use analysis() for the training rows and assessment() for the held-out row.
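As a small sketch of that inspection, again using the first 10 rows of mtcars as assumed example data:

```r
library(rsample)

df <- mtcars[1:10, ]   # illustrative data, 10 rows
folds <- loo_cv(df)

first_split <- folds$splits[[1]]

train_rows <- analysis(first_split)     # the n - 1 rows used for fitting
held_out   <- assessment(first_split)   # the single held-out row

dim(train_rows)   # 9 rows, same columns as df
dim(held_out)     # 1 row, same columns as df
```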
For a model evaluation loop, iterate over the splits, fit the model on the analysis set, then predict the single row in the assessment set.
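One way to write such a loop is sketched below. The model formula mpg ~ wt + hp and the helper name predict_held_out are illustrative assumptions; any model with a predict() method would follow the same pattern.

```r
library(rsample)

folds <- loo_cv(mtcars)

# Hypothetical helper: fit on the analysis rows, predict the one assessment row
predict_held_out <- function(split) {
  fit <- lm(mpg ~ wt + hp, data = analysis(split))
  held_out <- assessment(split)
  data.frame(
    observed  = held_out$mpg,
    predicted = as.numeric(predict(fit, newdata = held_out))
  )
}

# One row of observed/predicted values per fold
loo_predictions <- do.call(rbind, lapply(folds$splits, predict_held_out))

# Pool the held-out errors across all folds into a single RMSE
loo_rmse <- sqrt(mean((loo_predictions$observed - loo_predictions$predicted)^2))
loo_rmse
```

Pooling the errors before taking the square root matters here: each fold contributes a single residual, so a per-fold RMSE would just be the absolute error of one prediction.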
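With tidymodels, be aware that fit_resamples() from the tune package does not currently accept loo_cv() objects: it stops with an error saying leave-one-out cross-validation is not supported. Use the manual loop over splits for LOO, and pass vfold_cv(df) to fit_resamples() when you want the workflow-based interface.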
With one held-out row per fold, report the metric pooled across all folds, not the per-fold values.
loo_cv() vs vfold_cv() and other resampling
Pick the resampler that matches your data size and bias-variance trade-off. Most analysts default to 10-fold CV. Leave-one-out becomes attractive only at small n.
| Function | When to use | Folds | Cost |
|---|---|---|---|
| loo_cv(df) | n under 100, unbiased estimate needed | n | High (n fits) |
| vfold_cv(df, v = 10) | General purpose, 100 to 10,000 rows | 10 | Low |
| vfold_cv(df, v = 5, repeats = 5) | Variance-reduced estimate | 25 | Medium |
| bootstraps(df) | Standard errors, confidence intervals | 25 by default | Medium |
| mc_cv(df) | Tune the train/test ratio per fold | configurable | Low to Medium |
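A quick way to check the fold counts in the table is to build each resample object and count its rows. This is a sketch on the built-in mtcars data (32 rows), with an arbitrary seed because every function here except loo_cv() draws random partitions.

```r
library(rsample)

set.seed(123)   # arbitrary seed; only the random resamplers depend on it

df <- mtcars    # 32 rows

resample_counts <- c(
  loo_cv     = nrow(loo_cv(df)),                        # n resamples
  vfold_10   = nrow(vfold_cv(df, v = 10)),              # 10 folds
  vfold_5x5  = nrow(vfold_cv(df, v = 5, repeats = 5)),  # 5 folds x 5 repeats
  bootstraps = nrow(bootstraps(df)),                    # default number of resamples
  mc_cv      = nrow(mc_cv(df))                          # default number of resamples
)
resample_counts
```

The row count of each object is also the number of model fits a full evaluation needs, which is what the Cost column summarizes.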
Common pitfalls
Three mistakes show up regularly. Each has a small fix.
- Running loo_cv() on a large dataset. With n = 5,000 rows and a model that takes 1 second to fit, you wait 83 minutes per evaluation. Switch to vfold_cv(df, v = 10) unless you have a specific reason to pay the full LOO cost.
- Mistaking the rset for a list of data frames. loo_cv() returns a tibble of rsplit objects, not actual sub-data. You must call analysis(split) or assessment(split) to materialize the rows for that fold.
- Trying to stratify. loo_cv() has no strata argument because each held-out fold is one row, so stratification is meaningless. If you want stratified small-sample CV, use vfold_cv() with a smaller fold count and a strata column, for example vfold_cv(df, v = 5, strata = y). Do not reach for vfold_cv(df, v = nrow(df)): recent rsample releases reject that call and point you back to loo_cv().
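The second pitfall is easy to verify. The sketch below, on the first 10 rows of mtcars as assumed example data, shows that an rsplit is not a data frame until analysis() or assessment() is called on it.

```r
library(rsample)

df <- mtcars[1:10, ]   # illustrative data
folds <- loo_cv(df)
split <- folds$splits[[1]]

is.data.frame(split)               # FALSE: the rsplit only stores row indices and a reference to df
is.data.frame(analysis(split))     # TRUE: analysis() materializes the training rows
is.data.frame(assessment(split))   # TRUE: assessment() materializes the held-out row
```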
Call set.seed() before the resampling run to keep it reproducible.
Try it yourself
Try it: Run leave-one-out cross-validation on the first 10 rows of mtcars with the formula mpg ~ wt, then compute the LOO root-mean-squared error. Save the result to ex_loo_rmse.
Explanation: Each fold fits a one-variable linear regression on 9 rows and predicts the 10th. Squaring the held-out residuals and averaging gives the LOO mean squared error; the square root is the RMSE you report.
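One possible solution is sketched here. The names ex_df, ex_folds, and ex_residuals are illustrative; only ex_loo_rmse comes from the exercise statement.

```r
library(rsample)

ex_df <- mtcars[1:10, ]     # first 10 rows, as the exercise asks
ex_folds <- loo_cv(ex_df)   # 10 splits, one per row

ex_residuals <- numeric(nrow(ex_folds))
for (i in seq_len(nrow(ex_folds))) {
  split <- ex_folds$splits[[i]]
  fit <- lm(mpg ~ wt, data = analysis(split))   # fit on 9 rows
  held_out <- assessment(split)                 # the 1 held-out row
  ex_residuals[i] <- held_out$mpg - predict(fit, newdata = held_out)
}

# Average the squared held-out residuals, then take the square root
ex_loo_rmse <- sqrt(mean(ex_residuals^2))
ex_loo_rmse
```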
Related rsample functions
These rsample functions cover related resampling needs. Reach for them when LOO does not fit.
- vfold_cv(): k-fold cross-validation, the default choice for medium datasets
- initial_split(): a single train/test split for a held-out test set
- bootstraps(): bootstrap resamples for standard errors and confidence intervals
- mc_cv(): Monte Carlo cross-validation with a configurable train ratio
- rolling_origin(): time-series-aware resampling that respects observation order
Full argument reference at rsample.tidymodels.org.
FAQ
When should I use loo_cv() instead of vfold_cv()?
Use loo_cv() when your dataset has fewer than about 100 rows and you need the least-biased estimate of out-of-sample error. With 50 rows, 10-fold cross-validation removes 5 rows per fold; for some models that 10 percent loss noticeably changes the fit. Leave-one-out removes one row at a time, so each training fold is almost the full data. The cost is that you fit the model n times instead of 10, which is fine at small n and prohibitive at large n.
Why does loo_cv() not have a v argument like vfold_cv()?
The v argument controls the number of folds. Leave-one-out is the special case where v equals the number of rows, so the argument is redundant. If you want a configurable number of folds, call vfold_cv() with v = 10 for the standard choice, and add strata there if you need stratification. Do not expect vfold_cv(df, v = nrow(df)) to stand in for LOO: recent rsample releases reject that call with an error and direct you to loo_cv().
Is leave-one-out cross-validation the same as jackknife resampling?
Yes, in the way the splits are built. The jackknife also drops one observation at a time. The naming differs by application: statisticians call it the jackknife when used for bias and variance estimation of a statistic, and leave-one-out cross-validation when used for predictive model evaluation. loo_cv() builds the splits; what you compute on them decides the label.
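To illustrate the jackknife reading of the same splits, the sketch below computes a jackknife standard error for a sample mean. The choice of statistic and the column name x are assumptions made for the example; the standard error uses the usual jackknife formula.

```r
library(rsample)

# Illustrative one-column data set built from mtcars$mpg
x_df <- data.frame(x = mtcars$mpg)
folds <- loo_cv(x_df)
n <- nrow(x_df)

# Leave-one-out replicates of the statistic (here, the sample mean)
theta_i <- vapply(
  folds$splits,
  function(split) mean(analysis(split)$x),
  numeric(1)
)

# Jackknife standard error: sqrt((n - 1) / n * sum of squared deviations)
jack_se <- sqrt((n - 1) / n * sum((theta_i - mean(theta_i))^2))
jack_se
```

For the sample mean, this value matches the textbook standard error sd(x) / sqrt(n), which makes it a convenient sanity check.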
Does loo_cv() work with grouped data?
No, loo_cv() has no group argument. For grouped cross-validation, use group_vfold_cv() with the column that identifies the grouping. Leaving v unset holds out one whole group per fold, which is the grouped analogue of leave-one-out; setting v gives grouped k-fold. When every group contains a single row, leaving one group out is the same as standard LOO.
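A short sketch of group_vfold_cv(), using the cyl column of mtcars as an assumed grouping variable. With v left unset, each fold holds out every row of one group.

```r
library(rsample)

# Group by number of cylinders (4, 6, or 8); v is left unset on purpose
group_folds <- group_vfold_cv(mtcars, group = "cyl")

nrow(group_folds)   # one fold per distinct cyl value

# Every assessment set contains rows from exactly one group
held_out_groups <- vapply(
  group_folds$splits,
  function(split) length(unique(assessment(split)$cyl)),
  integer(1)
)
held_out_groups
```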
How long will loo_cv() take to run on my dataset?
Multiply your single-model fit time by the number of rows. A linear regression that takes 0.01 seconds on 200 rows finishes in 2 seconds. A random forest that takes 5 seconds on 1,000 rows takes 83 minutes for the full LOO sweep. Estimate the budget before launching the full run; switch to 10-fold cross-validation if the answer exceeds your patience.
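The budget estimate can be sketched as a one-line helper. The function name estimate_loo_seconds is hypothetical, and the estimate assumes that fitting on n - 1 rows takes about as long as the single fit you timed.

```r
# Hypothetical helper: total LOO time is roughly single-fit time times row count
estimate_loo_seconds <- function(fit_seconds, n_rows) {
  fit_seconds * n_rows
}

estimate_loo_seconds(0.01, 200)      # linear regression case: 2 seconds
estimate_loo_seconds(5, 1000) / 60   # random forest case: about 83 minutes

# Timing one fit on your own data gives the fit_seconds input
one_fit <- system.time(lm(mpg ~ wt, data = mtcars))[["elapsed"]]
estimate_loo_seconds(one_fit, nrow(mtcars))
```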