rsample group_initial_split() in R: Group-Safe Splits
The rsample group_initial_split() function in R creates a train/test partition that keeps every row of a group in the same set, preventing data leakage when observations are clustered by patient, customer, store, or session.
    group_initial_split(df, group_var)                      # default 75/25 by group
    group_initial_split(df, group_var, prop = 0.8)          # custom group proportion
    group_initial_split(df, group_var, strata = y)          # group split + outcome strata
    training(split)                                         # extract training rows
    testing(split)                                          # extract testing rows
    set.seed(123); group_initial_split(df, group_var)       # reproducible group split
    group_initial_split(df, group_var, pool = 0.05)         # small group pooling
Need explanation? Read on for examples and pitfalls.
What group_initial_split() does
group_initial_split() partitions a data frame once by group, so every row sharing a group identifier stays together in training or testing. It belongs to the rsample package, the resampling engine of the tidymodels ecosystem. Instead of sampling individual rows, it samples whole groups and assigns each one entirely to a single side of the split.
The point of a grouped split is to stop information from a single subject leaking across both sets. If a patient has ten visits and three appear in training while seven appear in testing, the model sees the same patient twice and reports an inflated accuracy that will not hold on new patients. group_initial_split() guarantees that does not happen.
Syntax and arguments
The signature mirrors initial_split() with one extra argument that names the grouping column.
The arguments you reach for in practice:
- data: the data frame or tibble to partition.
- group: unquoted column name that holds the group identifier (subject ID, customer ID, store ID).
- prop: target fraction of rows in training. The actual ratio shifts slightly because whole groups are assigned at random.
- strata: optional outcome column, applied at the group level when stratification is needed.
- pool: strata levels smaller than this fraction get pooled together before splitting.
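As a quick orientation, the sketch below prints the signature of whichever rsample version is installed and then makes one call with the main arguments named explicitly. The sales data frame and its store_id and revenue columns are invented for illustration.

```r
library(rsample)

# Print the signature of the installed version instead of relying on memory
args(group_initial_split)

# Hypothetical toy data: 12 stores with 6 weekly rows each
set.seed(1)
sales <- data.frame(
  store_id = rep(1:12, each = 6),
  revenue  = rnorm(72, mean = 100, sd = 15)
)

set.seed(123)
store_split <- group_initial_split(
  data  = sales,     # data frame to partition
  group = store_id,  # unquoted grouping column
  prop  = 3/4        # target share of data for training
)

# Printing the split shows how many rows landed on each side
store_split
```

Naming the arguments is optional, but it makes it harder to confuse group with strata later on.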
group_initial_split() examples
Basic grouped split with simulated patient data
Call group_initial_split() with a data frame and the grouping column to see how rows distribute by group. Each subject_id ends up entirely in one set.
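A minimal sketch of that call, assuming a simulated patients data frame with 20 subjects and 10 visits each (the column names, seeds, and biomarker values are made up for the example):

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  visit      = rep(1:10, times = 20),
  biomarker  = rnorm(200)
)

set.seed(123)
patient_split <- group_initial_split(patients, group = subject_id)

train <- training(patient_split)
test  <- testing(patient_split)

# Row counts on each side of the split
c(train = nrow(train), test = nrow(test))

# Visits per subject in each set: every listed subject keeps all 10 visits
table(train$subject_id)
table(test$subject_id)
```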
Confirm no subject appears in both sets
The whole point of grouped splitting is that the training and testing subject lists are disjoint. Inspect the unique IDs in each set to verify.
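One way to run that check with base R, again on a simulated patients data frame (an assumption of this sketch, not part of rsample):

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  biomarker  = rnorm(200)
)

set.seed(123)
patient_split <- group_initial_split(patients, group = subject_id)

train_ids <- unique(training(patient_split)$subject_id)
test_ids  <- unique(testing(patient_split)$subject_id)

# Subjects present in both sets; an empty result means no leakage by subject
overlap <- intersect(train_ids, test_ids)
length(overlap)
```

The same intersect() check is cheap enough to leave in a pipeline as a guard, for example wrapped in stopifnot().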
Set a custom group proportion
Pass prop to change the share of data assigned to training. With 20 equal-sized subjects, prop = 0.8 sends 16 subjects to training and 4 to testing.
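The subject counts on each side can be checked directly. This sketch assumes the same kind of simulated data, with 20 subjects of equal size:

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  biomarker  = rnorm(200)
)

set.seed(123)
split_80 <- group_initial_split(patients, group = subject_id, prop = 0.8)

# Number of distinct subjects on each side of the split
n_train_subjects <- length(unique(training(split_80)$subject_id))
n_test_subjects  <- length(unique(testing(split_80)$subject_id))
c(train = n_train_subjects, test = n_test_subjects)
```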
Stratify the outcome while preserving groups
Combine group with strata when the outcome is imbalanced and observations are clustered. rsample assigns each group to a stratum first, then splits within strata.
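A sketch of a stratified grouped split follows. It assumes a recent rsample version in which group_initial_split() accepts strata, and it gives every subject a single outcome label, on the assumption that the stratification variable should not vary within a group. The data, the 8/12 outcome mix, and the seeds are invented.

```r
library(rsample)

set.seed(42)
n_subjects <- 20

# One outcome label per subject (8 events, 12 non-events), repeated on every visit
subject_outcome <- rep(c("event", "no_event"), times = c(8, 12))

patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:n_subjects), each = 10),
  outcome    = factor(rep(subject_outcome, each = 10)),
  biomarker  = rnorm(200)
)

set.seed(123)
strat_split <- group_initial_split(patients, group = subject_id, strata = outcome)

strat_train <- training(strat_split)
strat_test  <- testing(strat_split)

# Outcome mix in each set; the two tables should look broadly similar
prop.table(table(strat_train$outcome))
prop.table(table(strat_test$outcome))
```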
group_initial_split() vs other split functions
group_initial_split() is the right choice whenever rows share an identifier that the model should not see twice. The other rsample splitters serve different scenarios.
| Function | Produces | Use when |
|---|---|---|
| group_initial_split() | One split with groups intact | Clustered or repeated-measures data |
| initial_split() | One row-level random split | Independent observations |
| group_vfold_cv() | k folds, groups intact | Cross-validation on clustered data |
| initial_time_split() | One time-ordered split | Time series or panel data |
| group_bootstraps() | Many group-level resamples | Variance estimates on clustered data |
A clinical pipeline often pairs two of these: group_initial_split() to carve off a final patient-level hold-out, then group_vfold_cv() on the training patients for hyperparameter tuning.
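That pairing can be sketched as follows. The patients data, the seeds, and the choice of v = 5 are assumptions for illustration; model fitting and tuning are left out.

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  biomarker  = rnorm(200)
)

# Step 1: patient-level hold-out
set.seed(123)
holdout_split <- group_initial_split(patients, group = subject_id, prop = 0.75)
train <- training(holdout_split)

# Step 2: grouped folds on the training patients only
set.seed(456)
folds <- group_vfold_cv(train, group = subject_id, v = 5)

# Count subjects shared between analysis and assessment rows in each fold
leaks <- vapply(folds$splits, function(s) {
  length(intersect(analysis(s)$subject_id, assessment(s)$subject_id))
}, integer(1))
leaks
```

Because the folds are built from training(holdout_split) alone, the hold-out patients never influence tuning.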
Common pitfalls
Three mistakes account for most grouped-split bugs.
- Forgetting set.seed(). group_initial_split() shuffles groups at random. Without a seed before the call, the partition changes every run and your reported metrics drift between sessions.
- Using initial_split() on clustered data. This is the silent failure mode this function fixes. A normal random split scatters a single subject across both sets, the model memorizes that subject, and offline accuracy looks great while production accuracy collapses.
- Confusing group with strata. group keeps observations together; strata balances an outcome variable. They solve different problems and can be combined. Passing the outcome to group will create one group per outcome value and ruin the split.
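The second pitfall is easy to see side by side. This sketch counts how many subjects end up in both sets under a row-level split and under a grouped split; the data and the helper shared_ids() are hypothetical.

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  biomarker  = rnorm(200)
)

# Helper (not part of rsample): number of subjects present in both sets
shared_ids <- function(split) {
  length(intersect(training(split)$subject_id, testing(split)$subject_id))
}

set.seed(123)
row_split <- initial_split(patients, prop = 0.75)

set.seed(123)
grp_split <- group_initial_split(patients, group = subject_id, prop = 0.75)

c(row_level = shared_ids(row_split), grouped = shared_ids(grp_split))
```

With ten rows per subject, a row-level split will almost always scatter subjects across both sets, while the grouped split reports zero shared subjects.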
Try it yourself
Try it: Use the simulated patients data above. Make a 70/30 group_initial_split by subject_id, then save the unique training subject IDs to ex_train_ids.
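One possible solution, sketched with rsample and base R only. The patients data frame is rebuilt here so the snippet runs on its own; its seed and column names are assumptions.

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  biomarker  = rnorm(200)
)

set.seed(123)
ex_split <- group_initial_split(patients, group = subject_id, prop = 0.70)

# Unique subject IDs that landed in training
ex_train_ids <- unique(training(ex_split)$subject_id)
ex_train_ids
length(ex_train_ids)
```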
Explanation: prop = 0.70 applied to 20 equal-sized subjects puts 14 in training and 6 in testing. The row ratio is exactly 70/30 here only because every subject contributes the same number of rows; in real data, with uneven group sizes, the row ratio drifts away from the target.
Related rsample functions
group_initial_split() is the entry point for grouped workflows; these extend it.
- training() and testing(): extract the two data frames from a grouped split object.
- group_vfold_cv(): build k-fold cross-validation folds with groups intact.
- group_bootstraps(): generate group-level bootstrap resamples.
- group_mc_cv(): Monte Carlo cross-validation that respects group membership.
- initial_split(): the row-level counterpart for independent observations.
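For readers coming from Python, the closest scikit-learn counterpart is GroupShuffleSplit followed by a single next() call on its split() iterator. The group argument plays the role of scikit-learn's groups parameter. GroupShuffleSplit itself has no stratification option, so strata has no direct counterpart there.
FAQ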
When should I use group_initial_split() instead of initial_split()?
Use group_initial_split() whenever a single real-world entity contributes multiple rows. Common cases include repeated patient visits, multiple orders per customer, multiple sessions per user, longitudinal measurements, and panel data. A regular initial_split() shuffles rows independently, so the same subject can land in training and testing. The model then partly memorizes that subject and offline performance becomes optimistic.
Does group_initial_split() guarantee an exact 75/25 row ratio?
No. The function guarantees that whole groups are kept together and treats prop as a target, not a promise. Because groups cannot be cut, the realized ratio can only approximate it. With 20 equal-sized subjects and prop = 0.75, you get 15 groups in training and 5 in testing, but once group sizes vary the row counts depend on how many observations each selected subject contributes. For tighter row balance, prefer groups of similar size or accept the drift.
Can I use group_initial_split() with a numeric group identifier?
Yes. The group argument accepts any column type that uniquely identifies a cluster, including integers, characters, and factors. rsample treats each distinct value as a group label. If the column has accidental duplicates across what you consider different clusters, those clusters are merged into one group, which usually is not what you want. Verify the column has the right cardinality before splitting.
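A quick cardinality check can catch merged clusters before the split. In this hypothetical example two sites both number their patients 1 to 5, so the raw ID merges unrelated patients; pasting site and ID together is one way to restore the intended groups. All column names here are invented.

```r
library(rsample)

# Hypothetical data: sites A and B each have patients 1..5 with 4 visits apiece
visits <- data.frame(
  site       = rep(c("A", "B"), each = 20),
  patient_no = rep(rep(1:5, each = 4), times = 2),
  value      = seq_len(40)
)

# patient_no alone treats A-1 and B-1 as the same person: 5 distinct values
n_raw <- length(unique(visits$patient_no))

# A composite key separates them again: 10 distinct values
visits$patient_key <- paste(visits$site, visits$patient_no, sep = "_")
n_key <- length(unique(visits$patient_key))

c(raw = n_raw, composite = n_key)

# Split on the composite key, not the raw number
set.seed(123)
key_split <- group_initial_split(visits, group = patient_key)
```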
How does group_initial_split() interact with set.seed()?
Call set.seed() with any integer immediately before the function. The random selection of groups is deterministic given a fixed seed and a fixed R version, so the same script reproduces the same partition. Running other random-number code between the seed and the split, switching R versions, or rerunning with a different rsample version can shift which groups end up where. Keep the seed and the split in the same chunk for reproducibility.
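A small sketch of that behaviour, assuming simulated patients data and an arbitrary seed of 2024: two splits made right after the same seed should select the same training subjects.

```r
library(rsample)

# Simulated data (hypothetical): 20 subjects, 10 visits each
set.seed(42)
patients <- data.frame(
  subject_id = rep(sprintf("S%02d", 1:20), each = 10),
  biomarker  = rnorm(200)
)

# Same seed immediately before each call
set.seed(2024)
split_a <- group_initial_split(patients, group = subject_id)

set.seed(2024)
split_b <- group_initial_split(patients, group = subject_id)

ids_a <- sort(unique(training(split_a)$subject_id))
ids_b <- sort(unique(training(split_b)$subject_id))

# TRUE when both runs picked the same training subjects
same_partition <- identical(ids_a, ids_b)
same_partition
```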