rsample group_vfold_cv() in R: Group-Aware CV Splits

The rsample group_vfold_cv() function in R builds v-fold cross-validation splits that keep every observation from the same group inside a single fold, so models trained on hierarchical data never see leaked information from the assessment set.

By Selva Prabhakaran · Published May 22, 2026 · Last updated May 22, 2026

⚡ Quick Answer

group_vfold_cv(df, patient_id, v = 4)              # 4 group-aware folds
group_vfold_cv(df, patient_id)                     # leave-one-group-out
group_vfold_cv(df, patient_id, v = 5, balance = "observations")  # balance rows
group_vfold_cv(df, patient_id, v = 5, balance = "groups")        # equal groups per fold
group_vfold_cv(df, patient_id, v = 5, repeats = 3) # repeated grouped CV
analysis(folds$splits[[1]])                        # training rows of fold 1
assessment(folds$splits[[1]])                      # held-out rows of fold 1
set.seed(42); group_vfold_cv(df, patient_id)       # reproducible folds

Need explanation? Read on for examples and pitfalls.


What group_vfold_cv() does

group_vfold_cv() partitions a data frame into v folds so that every observation from a single group lands in exactly one fold. It belongs to the rsample package, the resampling engine of the tidymodels stack. Groups are defined by a variable you pass to the group argument: patient IDs, customer IDs, sentence IDs, sensor IDs, anything that ties multiple rows to the same underlying unit.

The function returns a tibble of rsplit objects in a splits list-column. Each split stores the row indices that belong to the analysis (training) portion and the assessment (held-out) portion of one fold. Because group membership is preserved, a patient who appears five times in the data will contribute all five rows to one and only one fold.

Key Insight

Group leakage is the silent killer of clustered-data models. If the same patient, customer, or session appears in both the training and assessment portion of a fold, the model has effectively seen the test answer key during training. Ordinary vfold_cv() cannot prevent this. group_vfold_cv() is the fix.
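To make the leakage concrete, here is a minimal simulation sketch. The cohort, the effect sizes, and the deliberately memorizing "patient mean" model are all assumptions chosen for illustration; only vfold_cv(), group_vfold_cv(), analysis(), and assessment() come from rsample.

```r
library(rsample)

set.seed(123)

# Hypothetical cohort: 20 patients, 5 visits each, strong patient-level effect
n_patients <- 20
patient_effect <- rnorm(n_patients, mean = 0, sd = 10)
sim <- data.frame(
  patient_id = rep(paste0("P", seq_len(n_patients)), each = 5),
  y = 130 + rep(patient_effect, each = 5) + rnorm(n_patients * 5, sd = 1)
)

# Illustrative "memorizing" model: predict each patient's mean from the
# analysis set, falling back to the overall mean for unseen patients
rmse_for_split <- function(s) {
  train <- analysis(s)
  test <- assessment(s)
  patient_means <- vapply(split(train$y, train$patient_id), mean, numeric(1))
  pred <- unname(patient_means[test$patient_id])
  pred[is.na(pred)] <- mean(train$y)
  sqrt(mean((test$y - pred)^2))
}

naive_folds <- vfold_cv(sim, v = 5)
grouped_folds <- group_vfold_cv(sim, group = patient_id, v = 5)

naive_rmse <- mean(vapply(naive_folds$splits, rmse_for_split, numeric(1)))
grouped_rmse <- mean(vapply(grouped_folds$splits, rmse_for_split, numeric(1)))

# The naive estimate looks far better only because patients leak across folds
round(c(naive = naive_rmse, grouped = grouped_rmse), 2)
```

Under these assumptions the naive RMSE sits near the visit-level noise, while the grouped RMSE reflects the much larger between-patient variation, which is the error you would actually face on a new patient.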

Syntax and arguments

The function takes two required arguments, data and group, plus several optional settings.


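R: group_vfold_cv signature

group_vfold_cv(
  data,
  group = NULL,   # must be supplied despite the NULL default
  v = NULL,
  repeats = 1,
  balance = c("groups", "observations"),
  ...,
  strata = NULL,  # available in recent rsample releases
  pool = 0.1      # available in recent rsample releases
)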

Argument	Purpose	Default
`data`	Data frame or tibble to resample	required
`group`	Column whose values define the groups	required
`v`	Number of folds; `NULL` means leave-one-group-out	`NULL`
`repeats`	Number of independent resampling rounds	`1`
`balance`	`"groups"` for equal group counts per fold; `"observations"` for equal row counts	`"groups"`
`strata`	Optional stratification variable; must be constant within each group (recent rsample releases)	`NULL`
`pool`	Proportion of the data below which a stratum is pooled with others (used with `strata`)	`0.1`

Setting v = NULL produces one fold per unique group, the leave-one-group-out variant. Setting v to a smaller integer buckets groups together. The balance argument matters when group sizes are uneven and you care about fold-to-fold sample size stability.
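The idea of bucketing groups into folds can be sketched in a few lines of base R. This is a conceptual illustration under assumed inputs (a hypothetical layout of 8 groups with 5 rows each), not rsample's actual implementation.

```r
set.seed(1)

# Hypothetical layout: 8 groups with 5 rows each
group_ids <- rep(paste0("P", 1:8), each = 5)
v <- 4

# Shuffle the unique groups, then deal them round-robin into v folds
unique_groups <- unique(group_ids)
fold_of_group <- setNames(
  rep_len(seq_len(v), length(unique_groups)),
  sample(unique_groups)
)

# Every row inherits the fold of its group, so a group never straddles folds
row_fold <- unname(fold_of_group[group_ids])

table(row_fold)                                             # rows per fold
tapply(group_ids, row_fold, function(g) length(unique(g)))  # groups per fold
```

Because the assignment happens at the group level and rows only inherit it, group membership is preserved by construction. Setting v equal to the number of unique groups in this sketch gives the leave-one-group-out case.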

group_vfold_cv() examples

Five examples cover the practical shape of grouped resampling. They use a small synthetic patient cohort with 8 patients and 5 visits each. Each block builds on the previous one, so variables persist across runs.

Group-aware 4-fold CV

Build the data first, then ask for 4 folds. A naive split would shuffle visits across folds; the grouped call keeps all five visits per patient together.

R: Load tidymodels and build group data

library(tidymodels)
set.seed(42)

patients <- tibble(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit = rep(1:5, times = 8),
  bp = round(rnorm(40, mean = 130, sd = 8))
)

head(patients, 6)
#> # A tibble: 6 × 3
#>   patient_id visit    bp
#>   <chr>      <int> <dbl>
#> 1 P1             1   141
#> 2 P1             2   125
#> 3 P1             3   133
#> 4 P1             4   135
#> 5 P1             5   133
#> 6 P2             1   129

Each patient has five rows. Pass the data and the group column to group_vfold_cv().

R: Build 4 group-aware folds

folds <- group_vfold_cv(patients, group = patient_id, v = 4)
folds
#> # Group 4-fold cross-validation
#> # A tibble: 4 × 2
#>   splits          id
#>   <list>          <chr>
#> 1 <split [30/10]> Resample1
#> 2 <split [30/10]> Resample2
#> 3 <split [30/10]> Resample3
#> 4 <split [30/10]> Resample4

Each split shows [30/10], meaning 30 analysis rows and 10 assessment rows. With 8 patients and 4 folds, each fold holds out exactly 2 patients, which is 10 rows.

Confirm zero group leakage

No patient should appear in both halves of a split. Verify it directly with intersect().

R: Confirm no patient appears in both sets

train1 <- analysis(folds$splits[[1]])
test1 <- assessment(folds$splits[[1]])
intersect(unique(train1$patient_id), unique(test1$patient_id))
#> character(0)

An empty character vector is the success signal. If you ever see a non-empty intersection here, group preservation has broken and you have a bug upstream.
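Checking only the first fold leaves the others unverified. The sketch below repeats the check across every split; count_shared_groups() is a hypothetical helper written for this illustration, not an rsample function, and the data frame mirrors the patient cohort used in this section.

```r
library(rsample)

set.seed(42)
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit = rep(1:5, times = 8)
)
folds <- group_vfold_cv(patients, group = patient_id, v = 4)

# Hypothetical helper: number of group IDs shared by both halves of a split
count_shared_groups <- function(s, group_col) {
  length(intersect(analysis(s)[[group_col]], assessment(s)[[group_col]]))
}

shared <- vapply(
  folds$splits,
  count_shared_groups,
  integer(1),
  group_col = "patient_id"
)

shared                       # one count per fold
stopifnot(all(shared == 0))  # fails loudly if any fold leaks
```

A stopifnot() like this is cheap enough to leave in a modeling script as a guard before any model is fit.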

Leave-one-group-out (v = NULL)

Drop the v argument to get one fold per group. With 8 patients, you get 8 folds and each fold holds out exactly one patient.

R: Leave-one-group-out cross-validation

logo <- group_vfold_cv(patients, group = patient_id)
nrow(logo)
#> [1] 8

assessment(logo$splits[[1]]) |> pull(patient_id) |> unique()
#> [1] "P1"

Leave-one-group-out is appropriate when you have a moderate number of groups (10 to 50) and need every group evaluated independently. With hundreds of groups it gets expensive, so pick a smaller v.

Tip

Choose v based on group count, not row count. With 50 patients, v = 5 gives 10 patients per fold; with 500 patients, v = 10 gives 50 per fold. Keep at least 5 groups in every assessment fold so per-fold metrics are not dominated by a single noisy group.
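The arithmetic behind this tip is easy to check before resampling. groups_per_fold() below is a hypothetical helper for illustration, and it assumes the default balance = "groups", where groups are spread as evenly as possible across folds.

```r
# Hypothetical helper: how many groups land in each assessment fold for a given v
groups_per_fold <- function(group_ids, v) {
  n_groups <- length(unique(group_ids))
  c(
    n_groups = n_groups,
    min_per_fold = floor(n_groups / v),
    max_per_fold = ceiling(n_groups / v)
  )
}

groups_per_fold(rep(seq_len(50), each = 3), v = 5)    # 50 groups, 10 per fold
groups_per_fold(rep(seq_len(500), each = 3), v = 10)  # 500 groups, 50 per fold
groups_per_fold(rep(seq_len(8), each = 5), v = 4)     # 8 groups, only 2 per fold
```

The last call shows why small cohorts are awkward: with 8 groups, no choice of v keeps 5 groups in every assessment fold.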

Balance by observations instead of groups

Switch the balance argument when group sizes vary widely. The default balance = "groups" produces folds with similar group counts but uneven row counts; balance = "observations" flips the priority.

R: Balance folds by observations instead of group count

uneven <- tibble(
  group_id = c(rep("A", 2), rep("B", 4), rep("C", 10), rep("D", 14)),
  y = rnorm(30)
)

obs_folds <- group_vfold_cv(uneven, group = group_id, v = 2, balance = "observations")
map_int(obs_folds$splits, ~ nrow(assessment(.x)))
#> [1] 14 16

Compare that to the default. Each of the two folds holds two groups, but the row counts are less even (12 versus 18 in this run; the exact numbers depend on which groups are paired).

R: Default balance puts equal group counts per fold

grp_folds <- group_vfold_cv(uneven, group = group_id, v = 2, balance = "groups")
map_int(grp_folds$splits, ~ nrow(assessment(.x)))
#> [1] 12 18
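Row counts alone hide half the picture. The sketch below tabulates both groups and rows per held-out fold for each balance setting; summarise_folds() is a hypothetical helper, and the seed is an arbitrary choice, so the pairing of groups may differ from the outputs above.

```r
library(rsample)

set.seed(11)
uneven <- data.frame(
  group_id = c(rep("A", 2), rep("B", 4), rep("C", 10), rep("D", 14)),
  y = rnorm(30)
)

# Hypothetical helper: groups and rows held out in each fold
summarise_folds <- function(folds) {
  data.frame(
    fold = folds$id,
    n_groups = vapply(
      folds$splits,
      function(s) length(unique(assessment(s)$group_id)),
      integer(1)
    ),
    n_rows = vapply(folds$splits, function(s) nrow(assessment(s)), integer(1))
  )
}

by_groups <- summarise_folds(
  group_vfold_cv(uneven, group = group_id, v = 2, balance = "groups")
)
by_obs <- summarise_folds(
  group_vfold_cv(uneven, group = group_id, v = 2, balance = "observations")
)

by_groups  # equal group counts, row counts free to vary
by_obs     # row counts as even as the group sizes allow
```

Printing both tables side by side makes the trade-off explicit before you commit to one setting.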

Repeated grouped CV

Repeats stabilize the performance metric. Set repeats = 3 to run the grouping procedure three independent times.

R: Repeated grouped CV

set.seed(7)
rep_folds <- group_vfold_cv(patients, group = patient_id, v = 4, repeats = 3)
nrow(rep_folds)
#> [1] 12

You get 12 splits total (4 folds times 3 repeats). Average your metric across all 12 for a tighter estimate.
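As a rough sketch of that averaging step, the code below scores every split with a placeholder metric and then averages per repeat and overall. The metric (RMSE of predicting the analysis-set mean) is an assumption standing in for whatever model you would actually fit.

```r
library(rsample)

set.seed(7)
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  bp = round(rnorm(40, mean = 130, sd = 8))
)
rep_folds <- group_vfold_cv(patients, group = patient_id, v = 4, repeats = 3)

# Placeholder metric (an assumption): RMSE of predicting the analysis-set mean
split_rmse <- vapply(rep_folds$splits, function(s) {
  sqrt(mean((assessment(s)$bp - mean(analysis(s)$bp))^2))
}, numeric(1))

tapply(split_rmse, rep_folds$id, mean)  # one average per repeat
mean(split_rmse)                        # overall average across all 12 splits
```

Comparing the per-repeat averages gives a quick sense of how much the estimate moves with the random assignment of groups to folds.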

group_vfold_cv() vs vfold_cv()

Use group_vfold_cv() the moment your rows are not independent. The two functions look similar but solve different problems.

Function	Use when	Risk if misused
`vfold_cv()`	Rows are independent (one row per unit)	None for IID data
`group_vfold_cv()`	Multiple rows belong to the same unit (panel, repeated measures, hierarchical)	Performance estimates inflated by leakage if you use vfold_cv() instead

Here is the leakage demonstrated. The naive vfold_cv() on the patient data splits one patient's visits across both halves.

R: Why ordinary v-fold leaks group signal

set.seed(1)
naive <- vfold_cv(patients, v = 4)
train_p <- analysis(naive$splits[[1]])$patient_id
test_p <- assessment(naive$splits[[1]])$patient_id
length(intersect(unique(train_p), unique(test_p)))
#> [1] 8

All 8 patients appear in both the training and assessment portion of fold 1. A model fit on the training portion has already seen rows belonging to every patient it will be scored against. The resulting RMSE will be optimistically low.

Warning

Grouping has to be applied at every resampling stage, not just one. If you call initial_split() without grouping, the same group can land in both the training and test sets; if you then call vfold_cv() on the training portion, groups can also cross folds inside the training data. Use group_initial_split() paired with group_vfold_cv() for a fully group-safe pipeline.
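A minimal sketch of that two-stage pipeline follows. It assumes an rsample version that provides group_initial_split() (1.1.0 or later, to the best of my knowledge) and uses a hypothetical cohort of 20 patients so the training portion keeps enough groups for three folds.

```r
library(rsample)

set.seed(42)
# Hypothetical cohort: 20 patients, 5 visits each
patients <- data.frame(
  patient_id = rep(paste0("P", 1:20), each = 5),
  bp = round(rnorm(100, mean = 130, sd = 8))
)

# Stage 1: group-aware train/test split
init <- group_initial_split(patients, group = patient_id, prop = 0.75)
train <- training(init)
test <- testing(init)

# Stage 2: group-aware folds inside the training portion only
folds <- group_vfold_cv(train, group = patient_id, v = 3)

# No patient should be shared between the training and test sets
intersect(train$patient_id, test$patient_id)
```

The test set is never passed to group_vfold_cv(), so it stays untouched until the final evaluation.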

Common pitfalls

Three mistakes account for most failures with this function.

Forgetting the group argument. The function does not infer the grouping variable; you have to pass it. Missing it errors immediately.

R: Pitfall: missing group argument

group_vfold_cv(patients, v = 4)
#> Error in `group_vfold_cv()`:
#> ! `group` should be a single character value for the column that will be
#>   used for splitting.
# (exact wording may vary across rsample versions)

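Over-engineering how the group column is passed. A bare column name (group = patient_id) and a literal string (group = "patient_id") both work. When the column name is stored in a character variable, injecting it with double-bang !! is unnecessary, and how it behaves depends on the installed rlang and tidyselect versions. Passing the string is the simpler route; passing the variable itself also works, though tidyselect may print a note suggesting all_of().

R: Pitfall: do not double-bang

group_var <- "patient_id"

# Avoid: injection is unnecessary here and version-dependent
# group_vfold_cv(patients, group = !!group_var, v = 4)

# Prefer: a literal string, or the character variable itself
group_vfold_cv(patients, group = "patient_id", v = 4)
group_vfold_cv(patients, group = group_var, v = 4)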

Setting v larger than the group count. If you have 8 patients and ask for v = 10, the function errors. The maximum useful v is the unique group count.
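One defensive pattern, sketched below, is to count the groups first and cap v accordingly. The variable names are illustrative; the tryCatch() call simply captures whatever error message the installed rsample version produces rather than assuming its wording.

```r
library(rsample)

patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  bp = 1:40
)

n_groups <- length(unique(patients$patient_id))
requested_v <- 10

# Asking for more folds than groups fails; capture the message instead of stopping
result <- tryCatch(
  group_vfold_cv(patients, group = patient_id, v = requested_v),
  error = function(e) conditionMessage(e)
)
result

# Guard: never ask for more folds than there are groups
safe_v <- min(requested_v, n_groups)
folds <- group_vfold_cv(patients, group = patient_id, v = safe_v)
nrow(folds)
```

Capping at the group count silently turns the request into leave-one-group-out, so it is worth logging safe_v whenever it differs from what was requested.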

Try it yourself

Try it: Build a leave-one-group-out resampling object from mtcars using cyl as the group variable, then count how many folds you got. Save the result to ex_cv.

R: Your turn: LOGO on mtcars

# Try it: leave-one-cyl-out CV
ex_cv <- # your code here
nrow(ex_cv)
#> Expected: 3


R: Solution

ex_cv <- group_vfold_cv(mtcars, group = cyl)
nrow(ex_cv)
#> [1] 3

Explanation: mtcars$cyl has 3 unique values (4, 6, 8). With v left at the default of NULL, group_vfold_cv() builds one fold per group, giving 3 splits.

Related rsample functions

rsample vfold_cv() for standard v-fold CV when rows are independent
rsample loo_cv() for leave-one-out CV at the row level
rsample mc_cv() for Monte Carlo random splits
rsample initial_split() for the underlying train/test partition
rsample bootstraps() for bootstrap resampling

External reference: the rsample group_vfold_cv documentation on the tidymodels site lists the full argument set and edge cases.

FAQ

What is the difference between group_vfold_cv and vfold_cv?

vfold_cv() shuffles rows randomly into folds and assumes rows are independent. group_vfold_cv() assigns entire groups (all rows sharing a group ID) to one fold, so the same group never appears in both the training and assessment portion of a split. Use the grouped variant whenever multiple rows belong to the same patient, customer, session, or hierarchical unit.

When should I leave v as NULL?

Leave v = NULL to get leave-one-group-out cross-validation, one fold per unique group. It works well when group counts are moderate (roughly 10 to 50). With hundreds of groups, the full pass becomes slow and individual fold estimates become noisy, so pick a smaller integer v instead.

Can I stratify and group at the same time?

Yes, in recent rsample releases. group_vfold_cv() accepts a strata argument, with the restriction that the stratification variable must be constant within each group, because a group cannot be split across strata. If your stratifier varies within a group, or your installed version predates the argument, compute a group-level summary of the stratifier first (for example, the majority class per group) and stratify on that.

Does group_vfold_cv work with time series?

It can keep panels of time series together (one panel per group), but it does not enforce chronological order inside a fold. For time-aware grouped resampling, combine grouping with a rolling origin or use sliding_window() from rsample with a grouping pre-step. Pure group_vfold_cv() is fine for cross-sectional clustered data, not sequential forecasting.

Is the function reproducible?

Yes. Call set.seed() immediately before group_vfold_cv() and the same groups will land in the same folds every run. The seed governs the random assignment of groups to folds; the rows inside each group never reshuffle because they always move together.
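A quick way to confirm this on your own data is to build the folds twice with the same seed and compare which groups each fold holds out. held_out() is a hypothetical helper for this check, and the data frame mirrors the 8-patient cohort from the examples.

```r
library(rsample)

patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit = rep(1:5, times = 8)
)

# Hypothetical helper: the set of patients held out by each fold
held_out <- function(folds) {
  lapply(folds$splits, function(s) sort(unique(assessment(s)$patient_id)))
}

set.seed(42)
run_a <- group_vfold_cv(patients, group = patient_id, v = 4)

set.seed(42)
run_b <- group_vfold_cv(patients, group = patient_id, v = 4)

# Same seed, same group-to-fold assignment
identical(held_out(run_a), held_out(run_b))
```

If any other random call sits between set.seed() and group_vfold_cv(), the assignment changes, which is why the seed belongs immediately before the resampling call.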
