rsample initial_split() in R: Make a Train/Test Split
The rsample initial_split() function in R creates a single train/test partition of your data, so you can fit a model on one slice and measure honest performance on data it never saw.
initial_split(df)                          # default 75/25 split
initial_split(df, prop = 0.8)              # custom 80/20 split
initial_split(df, strata = y)              # stratified by outcome
training(split)                            # extract training set
testing(split)                             # extract testing set
initial_split(df, prop = 0.7, strata = y)  # stratified custom split
set.seed(123); initial_split(df)           # reproducible split
Need explanation? Read on for examples and pitfalls.
What initial_split() does
initial_split() partitions a data frame once into a training set and a testing set. It belongs to the rsample package, the resampling engine of the tidymodels ecosystem. The function does not copy your data twice. It returns a lightweight rsplit object that stores the row indices for each partition, and you pull the actual data frames out later with training() and testing().
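A minimal sketch of what the returned object looks like, assuming rsample is installed and using the built-in iris data. The `in_id` element is an internal detail of the rsplit object and is inspected here only for illustration; it may change between package versions.

```r
library(rsample)

set.seed(123)
iris_split <- initial_split(iris)

# The split is an rsplit object, not a data frame
class(iris_split)
is.data.frame(iris_split)

# Internally it holds the integer row indices of the training rows
# (in_id is an implementation detail, shown for illustration only)
head(iris_split$in_id)
length(iris_split$in_id)
```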
The point of a train/test split is honesty. You fit and tune your model on the training set, then score it once on the testing set to estimate how it will behave on genuinely new data. Touching the testing set during development leaks information and inflates your reported accuracy.
Syntax and arguments
The signature is short, and most calls only need the first one or two arguments.
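Rather than quoting the signature from memory, the sketch below prints it from whichever rsample version is installed, so the argument list and defaults you see are the ones your session actually uses.

```r
library(rsample)

# Print the full argument list of the installed initial_split()
args(initial_split)

# Look up a single default, here the training proportion
formals(initial_split)$prop
```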
The arguments you reach for in practice:
- data: the data frame or tibble to partition.
- prop: the fraction of rows assigned to training. The default 3/4 gives a 75/25 split.
- strata: a column name. The split is performed within each level of this variable so class proportions are preserved.
- breaks: when strata is numeric, the number of quantile bins used to stratify.
- pool: strata levels smaller than this fraction of the data are pooled together before splitting.
initial_split() examples
Basic 75/25 split
Call initial_split() on a data frame and print the result to see the partition sizes. The iris dataset has 150 rows, so the default 3/4 proportion sends 112 rows to training.
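A minimal sketch of the basic call, assuming rsample is installed. The seed value 123 and the object name `iris_split` are arbitrary choices for illustration.

```r
library(rsample)

set.seed(123)                      # fix the random shuffle
iris_split <- initial_split(iris)  # default prop = 3/4

# Printing shows the training/testing/total row counts
iris_split
# <Training/Testing/Total>
# <112/38/150>
```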
Extract the training and testing sets
The split object is not a data frame, so extract the two sets with training() and testing(). These helpers return ordinary data frames you can pass to any modeling function.
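The extraction step could look like the sketch below. It rebuilds the split so the block runs on its own; the names `iris_train` and `iris_test` are illustrative.

```r
library(rsample)

set.seed(123)
iris_split <- initial_split(iris)

# Pull the two data frames out of the split object
iris_train <- training(iris_split)
iris_test  <- testing(iris_split)

dim(iris_train)  # 112 rows, 5 columns
dim(iris_test)   # 38 rows, 5 columns
```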
Set a custom proportion
Pass prop to change the split ratio. A value of 0.80 keeps 80 percent of rows for training, a common choice when the dataset is small.
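A short sketch of an 80/20 split on iris, assuming the same setup as before. With 150 rows, 80 percent is exactly 120 training rows.

```r
library(rsample)

set.seed(123)
iris_split_80 <- initial_split(iris, prop = 0.80)

nrow(training(iris_split_80))  # 120
nrow(testing(iris_split_80))   # 30
```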
Stratify by the outcome
Use strata to keep the outcome distribution balanced across both sets. Without stratification a random split can leave one class over-represented in training. Here every species stays evenly split.
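A sketch of a stratified split on iris, using Species as the stratification column. Tabulating the outcome in each set is a quick way to check the balance; the object name `iris_strat` is illustrative.

```r
library(rsample)

set.seed(123)
iris_strat <- initial_split(iris, strata = Species)

# Each species should appear the same number of times within a set
table(training(iris_strat)$Species)
table(testing(iris_strat)$Species)
```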
initial_split() vs other resampling functions
initial_split() makes one split; the other rsample functions make many. Choose based on how you plan to estimate model performance.
| Function | Produces | Use when |
|---|---|---|
| initial_split() | One train/test split | Final hold-out evaluation |
| vfold_cv() | k cross-validation folds | Tuning or robust performance estimates |
| bootstraps() | Many resamples with replacement | Small data, variance estimates |
| initial_time_split() | One time-ordered split | Time series, no shuffling allowed |
| group_initial_split() | One split keeping groups intact | Grouped or clustered observations |
A typical tidymodels workflow uses both: initial_split() to carve off a final testing set, then vfold_cv() on the training set to tune the model.
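That two-stage pattern might be sketched as follows. The choice of 5 folds and the object names are assumptions for illustration, not a recommendation.

```r
library(rsample)

set.seed(123)

# Stage 1: carve off the final testing set
iris_split <- initial_split(iris, strata = Species)
iris_train <- training(iris_split)

# Stage 2: build cross-validation folds from the training rows only
iris_folds <- vfold_cv(iris_train, v = 5, strata = Species)

nrow(iris_folds)  # one row per fold: 5
```

The testing set from `testing(iris_split)` stays untouched until the tuned model is ready for its single final evaluation.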
iris_split shows <112/38/150> rather than rows because rsample keeps pointers into the original data frame. This is why splits are cheap to create and why training() and testing() are the only way to get usable data frames out.
Common pitfalls
Three mistakes account for most train/test split bugs.
- Forgetting set.seed(). initial_split() shuffles rows at random. Without set.seed() before the call, every run produces a different split and your results are not reproducible. Set the seed in the same script, right before the split.
- Using the testing set too early. Calling testing() to inspect or tune your model leaks information. Reserve it for one final evaluation after all modeling decisions are locked.
- Stratifying a continuous outcome without breaks. When strata is numeric, rsample bins it into quantiles using breaks. The default of 4 is usually fine, but very skewed targets may need a different value, and tiny strata levels trigger the pool warning.
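The seed pitfall is easy to check for yourself. The sketch below resets the same seed before two calls and compares the resulting training sets; the seed value is arbitrary.

```r
library(rsample)

set.seed(123)
split_a <- initial_split(iris)

set.seed(123)
split_b <- initial_split(iris)

# Same seed right before the call: identical training rows
identical(training(split_a), training(split_b))  # TRUE

# No seed reset: the shuffle continues from the current random state,
# so this split will generally differ from split_a
split_c <- initial_split(iris)
identical(training(split_a), training(split_c))
```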
set.seed(123) followed by initial_split() gives the same partition every time, but moving the seed to another script or changing the R version can shift results. Keep the seed and the split together.
Try it yourself
Try it: Split the mtcars dataset into a 70/30 train/test partition, then save the training rows to ex_train.
Solution
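One possible solution, sketched under the assumption that rsample is installed. The name `ex_split` and the seed value are illustrative; only `ex_train` is required by the exercise.

```r
library(rsample)

set.seed(123)
ex_split <- initial_split(mtcars, prop = 0.70)

# Save the training rows as requested
ex_train <- training(ex_split)

nrow(ex_train)           # 22
nrow(testing(ex_split))  # 10
```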
Explanation: prop = 0.70 sends 70 percent of the 32 rows to training. The product 32 * 0.70 = 22.4 is rounded down, so 22 rows land in ex_train and 10 in the testing set.
Related rsample functions
initial_split() is the entry point; these functions extend the workflow.
- training() and testing(): extract the two data frames from a split object.
- vfold_cv(): build k-fold cross-validation folds from the training set.
- bootstraps(): generate bootstrap resamples for variance estimates.
- group_initial_split(): split while keeping all rows of a group together.
- initial_validation_split(): create a three-way train/validation/test split.
Coming from Python? initial_split() plus training() and testing() is the tidymodels equivalent of scikit-learn's train_test_split(X, y, test_size=0.25). The strata argument plays the role of scikit-learn's stratify.
FAQ
What is the default split ratio for initial_split()?
The default prop argument is 3/4, which sends 75 percent of rows to the training set and 25 percent to the testing set. For a 150-row dataset that is a 112/38 split. Pass a different prop value, such as 0.8, to change the ratio. There is no single correct ratio: 75/25 and 80/20 are both common, and larger datasets can afford a smaller testing fraction.
How do I make initial_split() reproducible?
Call set.seed() with any integer immediately before initial_split(). The function selects training rows at random, so a fixed seed guarantees the same partition every time the script runs. Keep the seed and the split in the same script. Re-running with the same seed and the same R version reproduces the exact rows in each set.
What is the difference between initial_split() and vfold_cv()?
initial_split() creates one train/test split for a final hold-out evaluation. vfold_cv() creates k folds for cross-validation, used to tune hyperparameters or get a more stable performance estimate. They work together: split off a testing set with initial_split(), then apply vfold_cv() to the training portion during model tuning.
When should I use the strata argument?
Use strata when the outcome is imbalanced, such as a rare-event classification target. Stratified splitting preserves the class proportions in both the training and testing sets, so neither set is missing a class or skewed. For a numeric outcome, rsample bins the variable into quantiles first, controlled by the breaks argument.
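For the numeric case, a hedged sketch using Sepal.Length from iris as a stand-in for a continuous outcome. The column choice and `breaks = 4` are illustrative; exact set sizes depend on how rows fall into the quantile bins, so none are stated here.

```r
library(rsample)

set.seed(123)

# Sepal.Length is numeric, so rsample bins it into quantiles before splitting
num_split <- initial_split(iris, prop = 0.75, strata = Sepal.Length, breaks = 4)

# The two sets should cover a similar range of the outcome
summary(training(num_split)$Sepal.Length)
summary(testing(num_split)$Sepal.Length)
```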
Can initial_split() handle time series data?
Not directly. initial_split() shuffles rows, which breaks the time order a forecasting model depends on. Use initial_time_split() instead. It keeps the first prop fraction of rows as training and the most recent rows as testing, so the split respects chronology.
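A small sketch of a chronological split. The data frame `ts_df` and its columns are made up for illustration; the rows are assumed to be already sorted by time.

```r
library(rsample)

set.seed(123)

# Hypothetical daily series, ordered oldest to newest
ts_df <- data.frame(day = 1:100, value = cumsum(rnorm(100)))

ts_split <- initial_time_split(ts_df, prop = 0.8)

# Training holds the earliest rows, testing the most recent ones
range(training(ts_split)$day)  # 1 80
range(testing(ts_split)$day)   # 81 100
```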