rsample group_vfold_cv() in R: Group-Aware CV Splits
The rsample group_vfold_cv() function in R builds v-fold cross-validation splits that keep every observation from the same group inside a single fold, so models trained on hierarchical data never see leaked information from the assessment set.
    group_vfold_cv(df, patient_id, v = 4)                            # 4 group-aware folds
    group_vfold_cv(df, patient_id)                                   # leave-one-group-out
    group_vfold_cv(df, patient_id, v = 5, balance = "observations")  # balance rows
    group_vfold_cv(df, patient_id, v = 5, balance = "groups")        # equal groups per fold
    group_vfold_cv(df, patient_id, v = 5, repeats = 3)               # repeated grouped CV
    analysis(folds$splits[[1]])                                      # training rows of fold 1
    assessment(folds$splits[[1]])                                    # held-out rows of fold 1
    set.seed(42); group_vfold_cv(df, patient_id)                     # reproducible folds
Need explanation? Read on for examples and pitfalls.
What group_vfold_cv() does
group_vfold_cv() partitions a data frame into v folds so that every observation from a single group lands in exactly one fold. It belongs to the rsample package, the resampling engine of the tidymodels stack. Groups are defined by a variable you pass to the group argument: patient IDs, customer IDs, sentence IDs, sensor IDs, anything that ties multiple rows to the same underlying unit.
The function returns a tibble of rsplit objects in a splits list-column. Each split stores the row indices that belong to the analysis (training) portion and the assessment (held-out) portion of one fold. Because group membership is preserved, a patient who appears five times in the data will contribute all five rows to one and only one fold.
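To make that structure concrete, here is a minimal sketch. The data frame visits and its columns are hypothetical names chosen for illustration. It resamples a tiny cohort and checks that one patient's five rows are always held out together or not at all.

```r
library(rsample)

# Hypothetical data: 4 patients with 5 visits each (names are illustrative).
visits <- data.frame(
  patient_id = rep(c("A", "B", "C", "D"), each = 5),
  value      = seq_len(20)
)

set.seed(1)
folds <- group_vfold_cv(visits, group = patient_id, v = 2)

# The result is a tibble whose `splits` list-column holds rsplit objects.
class(folds$splits[[1]])

# For each split, count how many of patient A's rows are held out.
# Grouping means the count is either all five rows or none of them.
held_out_A <- vapply(
  folds$splits,
  function(s) sum(assessment(s)$patient_id == "A"),
  integer(1)
)
held_out_A
```

analysis() and assessment() return ordinary data frames, so any downstream modeling code can consume them directly.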
vfold_cv() cannot prevent rows from the same group from landing on both sides of a split. group_vfold_cv() is the fix.
Syntax and arguments
The signature has two arguments you must supply (data and group) plus three main tuning knobs.
| Argument | Purpose | Default |
|---|---|---|
| data | Data frame or tibble to resample | required |
| group | Column whose values define the groups | required |
| v | Number of folds; NULL means leave-one-group-out | NULL |
| repeats | Number of independent resampling rounds | 1 |
| balance | "groups" for equal group counts per fold; "observations" for equal row counts | "groups" |
Setting v = NULL produces one fold per unique group, the leave-one-group-out variant. Setting v to a smaller integer buckets groups together. The balance argument matters when group sizes are uneven and you care about fold-to-fold sample size stability.
group_vfold_cv() examples
Five examples cover the practical shape of grouped resampling. They use a small synthetic patient cohort with 8 patients and 5 visits each. Each block builds on the previous one, so variables persist across runs.
Group-aware 4-fold CV
Build the data first, then ask for 4 folds. A naive split would shuffle visits across folds; the grouped call keeps all five visits per patient together.
Each patient has five rows. Pass the data and the group column to group_vfold_cv().
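The steps above can be sketched as follows. The cohort is synthetic, and the object name patients and the columns visit and outcome are assumptions made for illustration.

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each (40 rows in total).
set.seed(42)
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8),
  outcome    = rnorm(40)
)

# Four group-aware folds: every patient is held out in exactly one fold.
folds <- group_vfold_cv(patients, group = patient_id, v = 4)
folds

# Row counts in the analysis and assessment portions of each split.
fold_sizes <- data.frame(
  analysis   = vapply(folds$splits, function(s) nrow(analysis(s)), integer(1)),
  assessment = vapply(folds$splits, function(s) nrow(assessment(s)), integer(1))
)
fold_sizes
```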
Each split shows [30/10], meaning 30 analysis rows and 10 assessment rows. With 8 patients and 4 folds, each fold holds out exactly 2 patients, which is 10 rows.
Confirm zero group leakage
No patient should appear in both halves of a split. Verify it directly with intersect().
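A minimal sketch of that check, assuming the same hypothetical patients cohort (rebuilt here so the block runs on its own):

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each.
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8),
  stringsAsFactors = FALSE
)

set.seed(42)
folds <- group_vfold_cv(patients, group = patient_id, v = 4)

# Patients present on both sides of the first split (there should be none).
first_split <- folds$splits[[1]]
shared <- intersect(
  analysis(first_split)$patient_id,
  assessment(first_split)$patient_id
)
shared

# The same check across every split: number of shared patients per fold.
leaks <- vapply(
  folds$splits,
  function(s) length(intersect(analysis(s)$patient_id, assessment(s)$patient_id)),
  integer(1)
)
leaks
```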
An empty character vector is the success signal. If you ever see a non-empty intersection here, group preservation has broken and you have a bug upstream.
Leave-one-group-out (v = NULL)
Drop the v argument to get one fold per group. With 8 patients, you get 8 folds and each fold holds out exactly one patient.
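A short sketch of the leave-one-group-out call, again on a hypothetical 8-patient cohort defined inline:

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each.
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8)
)

# Omitting v (it defaults to NULL) requests one fold per unique group.
logo <- group_vfold_cv(patients, group = patient_id)
nrow(logo)

# Number of distinct patients held out by each split.
held_out <- vapply(
  logo$splits,
  function(s) length(unique(assessment(s)$patient_id)),
  integer(1)
)
held_out
```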
Leave-one-group-out is appropriate when you have a moderate number of groups (10 to 50) and need every group evaluated independently. With hundreds of groups it gets expensive, so pick a smaller v.
With 50 patients, v = 5 gives 10 patients per fold; with 500 patients, v = 10 gives 50 per fold. Keep at least 5 groups in every assessment fold so per-fold metrics are not dominated by a single noisy group.
Balance by observations instead of groups
Switch the balance argument when group sizes vary widely. The default balance = "groups" produces folds with similar group counts but uneven row counts; balance = "observations" flips the priority.
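As a rough sketch, the call below uses balance = "observations" on a hypothetical uneven cohort (one patient with 30 rows and three with 5 each; all names are illustrative). The exact row counts per fold depend on the random draw, so none are asserted here.

```r
library(rsample)

# Hypothetical uneven cohort: one large group and three small ones.
uneven <- data.frame(
  patient_id = rep(c("P1", "P2", "P3", "P4"), times = c(30, 5, 5, 5)),
  row_id     = seq_len(45)
)

set.seed(42)
by_rows <- group_vfold_cv(
  uneven,
  group   = patient_id,
  v       = 2,
  balance = "observations"
)

# Held-out row counts per fold; groups still stay intact.
rows_held_out <- vapply(
  by_rows$splits,
  function(s) nrow(assessment(s)),
  integer(1)
)
rows_held_out
```

With a single dominant group, no assignment can make the folds perfectly even, because the large group cannot be split.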
Compare that to the default. Two folds, two groups each, but very different row counts.
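A sketch of the default behavior on a hypothetical uneven cohort (one patient with 30 rows, three with 5 each; names are illustrative). Whichever fold receives the large patient ends up with far more held-out rows.

```r
library(rsample)

# Hypothetical uneven cohort: one large group and three small ones.
uneven <- data.frame(
  patient_id = rep(c("P1", "P2", "P3", "P4"), times = c(30, 5, 5, 5)),
  row_id     = seq_len(45)
)

set.seed(42)
# balance defaults to "groups": equal group counts, not equal row counts.
by_groups <- group_vfold_cv(uneven, group = patient_id, v = 2)

groups_per_fold <- vapply(
  by_groups$splits,
  function(s) length(unique(assessment(s)$patient_id)),
  integer(1)
)
rows_per_fold <- vapply(
  by_groups$splits,
  function(s) nrow(assessment(s)),
  integer(1)
)
groups_per_fold
rows_per_fold
```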
Repeated grouped CV
Repeats stabilize the performance metric. Set repeats = 3 to run the grouping procedure three independent times.
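A minimal sketch of repeated grouped CV on a hypothetical 8-patient cohort defined inline:

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each.
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8)
)

set.seed(42)
# Three independent rounds of 4-fold grouped CV.
repeated <- group_vfold_cv(patients, group = patient_id, v = 4, repeats = 3)
nrow(repeated)
repeated
```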
You get 12 splits total (4 folds times 3 repeats). Average your metric across all 12 for a tighter estimate.
group_vfold_cv() vs vfold_cv()
Use group_vfold_cv() the moment your rows are not independent. The two functions look similar but solve different problems.
| Function | Use when | Risk if misused |
|---|---|---|
| vfold_cv() | Rows are independent (one row per unit) | None for IID data |
| group_vfold_cv() | Multiple rows belong to the same unit (panel, repeated measures, hierarchical) | Performance estimates inflated by leakage if you use vfold_cv() instead |
The leakage is easy to demonstrate: a naive vfold_cv() on the patient data splits one patient's visits across both halves.
All 8 patients appear in both the training and assessment portion of fold 1. A model fit on the training portion has already seen rows belonging to every patient it will be scored against. The resulting RMSE will be optimistically low.
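A sketch of how such a check might look, using a hypothetical 8-patient cohort defined inline. The number of overlapping patients depends on the random seed and rsample version, so your count may differ from the one quoted above.

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each.
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8),
  stringsAsFactors = FALSE
)

set.seed(42)
# Row-level folds: visits are shuffled without regard to patient.
naive <- vfold_cv(patients, v = 4)

first_split <- naive$splits[[1]]
leaked <- intersect(
  analysis(first_split)$patient_id,
  assessment(first_split)$patient_id
)

# Patients whose visits sit on both sides of the first split.
leaked
length(leaked)
```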
If you call initial_split() without grouping and then call vfold_cv() on the training portion, the same group can still cross folds inside the training data. Use group_initial_split() paired with group_vfold_cv() for a fully group-safe pipeline.
Common pitfalls
Three mistakes account for most failures with this function.
- Forgetting the group argument. The function does not infer the grouping variable; you have to pass it. Missing it errors immediately.
- Quoting the group name unnecessarily. Tidyselect-style unquoted names work; so do strings; mixing them with double-bang !! does not.
- Setting v larger than the group count. If you have 8 patients and ask for v = 10, the function errors. The maximum useful v is the unique group count.
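The pitfalls above can be reproduced with a short sketch on a hypothetical 8-patient cohort. tryCatch() captures the errors so the script keeps running; the exact error messages vary by rsample version and are not shown.

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each.
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8)
)

# Pitfall 1: omitting `group` raises an error instead of guessing a column.
no_group <- tryCatch(
  group_vfold_cv(patients, v = 4),
  error = function(e) e
)
inherits(no_group, "error")

# Pitfall 2: an unquoted column name and a string both select the column.
unquoted <- group_vfold_cv(patients, group = patient_id, v = 4)
quoted   <- group_vfold_cv(patients, group = "patient_id", v = 4)
c(nrow(unquoted), nrow(quoted))

# Pitfall 3: asking for more folds than there are groups raises an error.
too_many <- tryCatch(
  group_vfold_cv(patients, group = patient_id, v = 10),
  error = function(e) e
)
inherits(too_many, "error")
```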
Try it yourself
Try it: Build a leave-one-group-out resampling object from mtcars using cyl as the group variable, then count how many folds you got. Save the result to ex_cv.
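One possible solution, sketched with the built-in mtcars data:

```r
library(rsample)

# v is left at its default of NULL, so each distinct cyl value gets its own fold.
ex_cv <- group_vfold_cv(mtcars, group = cyl)

# Number of folds in the resampling object.
nrow(ex_cv)
```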
Explanation: mtcars$cyl has 3 unique values (4, 6, 8). With v left at the default of NULL, group_vfold_cv() builds one fold per group, giving 3 splits.
Related rsample functions
- rsample vfold_cv() for standard v-fold CV when rows are independent
- rsample loo_cv() for leave-one-out CV at the row level
- rsample mc_cv() for Monte Carlo random splits
- rsample initial_split() for the underlying train/test partition
- rsample bootstraps() for bootstrap resampling
External reference: the rsample group_vfold_cv documentation on the tidymodels site lists the full argument set and edge cases.
FAQ
What is the difference between group_vfold_cv and vfold_cv?
vfold_cv() shuffles rows randomly into folds and assumes rows are independent. group_vfold_cv() assigns entire groups (all rows sharing a group ID) to one fold, so the same group never appears in both the training and assessment portion of a split. Use the grouped variant whenever multiple rows belong to the same patient, customer, session, or hierarchical unit.
When should I leave v as NULL?
Leave v = NULL to get leave-one-group-out cross-validation, one fold per unique group. It works well when group counts are moderate (roughly 10 to 50). With hundreds of groups, the full pass becomes slow and individual fold estimates become noisy, so pick a smaller integer v instead.
Can I stratify and group at the same time?
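It depends on your rsample version. Recent releases of rsample add a strata argument to group_vfold_cv(), with the constraint that the stratification variable must be constant within each group, because a group cannot be split across strata. If your stratifier varies inside a group, compute a group-level summary of it first and stratify on that summary. On older versions without strata support, the grouping constraint and the stratification constraint have to be reconciled by hand in the same way before you pass the groups through group_vfold_cv().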
Does group_vfold_cv work with time series?
It can keep panels of time series together (one panel per group), but it does not enforce chronological order inside a fold. For time-aware grouped resampling, combine grouping with a rolling origin or use sliding_window() from rsample with a grouping pre-step. Pure group_vfold_cv() is fine for cross-sectional clustered data, not sequential forecasting.
Is the function reproducible?
Yes. Call set.seed() immediately before group_vfold_cv() and the same groups will land in the same folds every run. The seed governs the random assignment of groups to folds; the rows inside each group never reshuffle because they always move together.
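A minimal sketch of that reproducibility check. The helper held_out_ids() is a hypothetical name introduced here, and the cohort is synthetic.

```r
library(rsample)

# Hypothetical cohort: 8 patients, 5 visits each.
patients <- data.frame(
  patient_id = rep(paste0("P", 1:8), each = 5),
  visit      = rep(1:5, times = 8)
)

# Hypothetical helper: the patients held out by each split, in fold order.
held_out_ids <- function(rset) {
  lapply(
    rset$splits,
    function(s) sort(unique(as.character(assessment(s)$patient_id)))
  )
}

set.seed(42)
run_1 <- group_vfold_cv(patients, group = patient_id, v = 4)

set.seed(42)
run_2 <- group_vfold_cv(patients, group = patient_id, v = 4)

# Same seed, same group-to-fold assignment.
same_folds <- identical(held_out_ids(run_1), held_out_ids(run_2))
same_folds
```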