caret upSample() in R: Oversample Minority Class for Balance
The upSample() function in caret balances a classification dataset by sampling rows from minority classes with replacement until every class matches the majority count. It returns one combined data frame containing the predictors plus a renamed outcome column, and it works for binary or multiclass factor targets out of the box.
upSample(x = predictors, y = labels) # default, balances all classes
upSample(x, y, yname = "Outcome") # rename the outcome column
upSample(x, y, list = TRUE) # list with $x and $y separately
upSample(train[, -1], train$Class) # typical training-only call
upSample(iris[, 1:4], iris$Species) # adds no rows when already balanced
table(upSample(x, y)$Class) # confirm equal class counts
set.seed(1); upSample(x, y) # reproducible draws
nrow(upSample(x, y)) # majority count times n classes
Need explanation? Read on for examples and pitfalls.
What upSample() does in one sentence
upSample() is caret's random oversampler for classification. You give it a predictor matrix and a factor outcome, and it draws rows from each minority class with replacement until every class has the same count as the largest one, returning a single data frame with the predictors plus a Class column.
The function exists because most classifiers minimise overall error, and an imbalanced training set lets a model hit high accuracy by predicting the majority for everything. A fraud-detection model trained on 99% legitimate transactions and 1% fraud can score 99% accuracy by predicting "legitimate" for every row, while catching zero fraud. Random oversampling on the training set lifts the minority signal so the loss function pays equal attention to each class. upSample() is the right starting point before reaching for SMOTE or cost-sensitive learning, and the baseline against which more sophisticated balancing strategies should be measured.
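The accuracy trap in that fraud example can be checked with a few lines of base R. This is a minimal sketch on made-up labels (990 legitimate, 10 fraud, scaled down from the 99%/1% split above); the variable names are illustrative only.

```r
# Hypothetical labels: 990 legitimate transactions and 10 fraudulent ones.
actual <- factor(rep(c("legitimate", "fraud"), times = c(990, 10)),
                 levels = c("legitimate", "fraud"))

# A "model" that predicts the majority class for every row.
predicted <- factor(rep("legitimate", length(actual)), levels = levels(actual))

# Overall accuracy looks excellent.
accuracy <- mean(predicted == actual)

# Recall on the fraud class shows that nothing was caught.
recall_fraud <- sum(predicted == "fraud" & actual == "fraud") / sum(actual == "fraud")

accuracy      # 0.99
recall_fraud  # 0
```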
upSample() syntax and arguments
upSample() takes the predictors and the outcome separately, not a formula or a single data frame. Four arguments cover every option the function exposes.
The signature is short:
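As a quick check, the argument names and defaults can be read off the installed function itself. The sketch below assumes caret is installed; the toy objects x and y are placeholders invented for illustration.

```r
library(caret)

# Print the formal arguments of upSample() as installed.
args(upSample)
arg_names <- names(formals(upSample))

# Placeholder data: 8 majority rows and 2 minority rows.
x <- data.frame(x1 = 1:10, x2 = 10:1)
y <- factor(rep(c("a", "b"), times = c(8, 2)))

# The same call with every default spelled out.
set.seed(1)
result <- upSample(x, y, list = FALSE, yname = "Class")
table(result$Class)
```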
- x: predictor data frame or matrix. Do not include the outcome column here.
- y: factor outcome vector. The class with the largest count sets the target size for every other class.
- list: if FALSE (the default), return a single data frame. Set to TRUE for a named list with $x and $y separately.
- yname: name for the outcome column when list = FALSE. Defaults to "Class".
The returned object has as many rows as the number of classes times the majority count, so a 200/30 binary problem becomes 400 rows. For multiclass, every minority class is independently sampled up to the majority size, so a 100/60/20 split produces 300 rows. All original rows are kept, and the added rows are exact duplicates of existing minority observations.
The closest Python equivalent is RandomOverSampler().fit_resample(X, y) from imblearn.over_sampling. Both draw minority rows with replacement until classes balance; upSample() returns one combined data frame while imblearn returns X and y as separate arrays.

upSample() examples by use case
Most oversampling workflows split first, then upsample only the training partition. The examples below build from the basic call up to a pipeline with caret::train() and a held-out test set.
A single binary oversampling call on the toy data:
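A minimal sketch of that call, assuming a simulated toy data set with 200 majority and 30 minority rows. The names toy_x and toy_y and the random predictors are made up for illustration.

```r
library(caret)

# Hypothetical toy data: two numeric predictors, 200 vs 30 class split.
set.seed(42)
toy_x <- data.frame(x1 = rnorm(230), x2 = rnorm(230))
toy_y <- factor(rep(c("majority", "minority"), times = c(200, 30)))

table(toy_y)                    # class counts before balancing

set.seed(1)                     # make the random draws reproducible
balanced <- upSample(x = toy_x, y = toy_y)

table(balanced$Class)           # class counts after balancing
nrow(balanced)                  # 400
```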
The minority class jumped from 30 rows to 200, matching the majority. The 170 added rows are bootstrap copies of the original 30 minority observations, so the predictor space is unchanged but each minority point now appears about 6.7 times on average.
The same call with list = TRUE keeps predictors and labels separate:
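A short sketch of the list form, again on simulated toy data (toy_x and toy_y are illustrative names):

```r
library(caret)

# Hypothetical toy data: 200 majority rows, 30 minority rows.
set.seed(42)
toy_x <- data.frame(x1 = rnorm(230), x2 = rnorm(230))
toy_y <- factor(rep(c("majority", "minority"), times = c(200, 30)))

set.seed(1)
parts <- upSample(toy_x, toy_y, list = TRUE)

names(parts)      # the two components, x and y
dim(parts$x)      # predictors only, no outcome column
table(parts$y)    # balanced factor of labels
```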
Use list = TRUE when downstream code expects X and y as separate arguments, for example glmnet() or xgboost::xgb.DMatrix().
A train/test split before oversampling, then balanced training:
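One way to sketch this workflow, assuming the same simulated 200/30 toy data and a 70% stratified split with createDataPartition(). The object names are illustrative.

```r
library(caret)

# Hypothetical toy data with a 200/30 class split.
set.seed(42)
toy <- data.frame(x1 = rnorm(230), x2 = rnorm(230),
                  Class = factor(rep(c("majority", "minority"), times = c(200, 30))))

# 1. Stratified split first, on the untouched data.
set.seed(1)
idx <- createDataPartition(toy$Class, p = 0.7, list = FALSE)[, 1]
train_raw <- toy[idx, ]
test_set  <- toy[-idx, ]

# 2. Upsample the training partition only.
set.seed(2)
train_bal <- upSample(x = train_raw[, c("x1", "x2")], y = train_raw$Class)

table(train_bal$Class)   # balanced training classes
table(test_set$Class)    # test set keeps the natural imbalance
```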
The training set is balanced 140/140 while the test set keeps the natural imbalance. Evaluating on the imbalanced test set is what makes the metrics honest; balancing the test set hides the real deployment distribution.
Multiclass oversampling works the same way:
Every non-majority class is sampled with replacement up to the majority size. A 3-class problem with counts 100/60/20 becomes 300 rows after the call.
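The 100/60/20 case can be sketched with simulated data. The class labels a, b, c and the predictors are invented for illustration.

```r
library(caret)

# Hypothetical three-class outcome with counts 100 / 60 / 20.
set.seed(42)
y3 <- factor(rep(c("a", "b", "c"), times = c(100, 60, 20)))
x3 <- data.frame(x1 = rnorm(180), x2 = rnorm(180))

set.seed(1)
bal3 <- upSample(x3, y3)

table(bal3$Class)   # every class now has 100 rows
nrow(bal3)          # 300
```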
Let train() do it for you when tuning. Pass trainControl(sampling = "up") to caret's train() and the oversampling runs inside each resample, not on the full training set. This gives an honest performance estimate because the upsampling is recomputed per fold instead of leaking duplicated rows across the cross-validation splits.

upSample() vs downSample() and other balancing tools
upSample() adds minority copies; downSample() drops majority rows; SMOTE creates synthetic minority points. Pick by what you can afford to lose and how much variance you tolerate.
| Method | Sample size | Risk | Best for |
|---|---|---|---|
| upSample(x, y) | n_classes * majority_count | overfits to repeated minority rows | small datasets where dropping majority hurts |
| downSample(x, y) | n_classes * minority_count | throws away majority information | very large datasets, fast training |
| themis::step_smote() | synthetic interpolations | distorts feature manifold | continuous features, large enough minority to interpolate |
| trainControl(sampling = "up") | resamples inside CV | none above resample noise | hyperparameter tuning with caret |
| weights in train() | unchanged | weights only, no resampling | log-loss or weighted likelihood models |
For most first attempts, oversample with upSample(), tune with trainControl(sampling = "up"), then compare against downSample() on the same folds. The choice often comes down to dataset size: with 10,000 majority and 200 minority rows, downsampling drops the training frame to 400 and the majority signal collapses, while upsampling keeps the majority intact and inflates the minority to 10,000.
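The row counts in that 10,000/200 comparison can be verified directly. A minimal sketch on simulated data; the predictors are random noise and only the class counts matter here.

```r
library(caret)

# Hypothetical data: 10,000 majority rows and 200 minority rows.
set.seed(42)
y <- factor(rep(c("majority", "minority"), times = c(10000, 200)))
x <- data.frame(x1 = rnorm(length(y)), x2 = rnorm(length(y)))

set.seed(1)
up   <- upSample(x, y)     # minority inflated to the majority count
down <- downSample(x, y)   # majority cut to the minority count

table(up$Class)
table(down$Class)
c(up = nrow(up), down = nrow(down))   # 20000 and 400
```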
Class weights are a third lever. Passing a per-row weight vector to train() tells the loss function to penalise minority errors more heavily without changing the row count. Weights are usually cheaper than oversampling, but only work when the model exposes a weights argument.
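One common way to build such a weight vector is inverse class frequency, sketched below in base R. The scheme and the variable names are assumptions for illustration, not something caret prescribes.

```r
# Hypothetical outcome with a 200/30 class split.
y <- factor(rep(c("majority", "minority"), times = c(200, 30)))

# Inverse-frequency weights: each class ends up with the same total weight.
class_freq <- table(y)
w <- as.numeric(length(y) / (nlevels(y) * class_freq[as.character(y)]))

tapply(w, y, sum)   # both classes sum to 115
length(w)           # still 230 rows, nothing was resampled
```

The resulting w can then be passed as the weights argument of train() for models that accept case weights.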
Common pitfalls
Three mistakes cause most upsampling bugs.
The first is oversampling before the train/test split. If you call upSample() on the full data and then split, identical minority copies appear on both sides of the split, so the model is evaluated on rows it has already seen during training. The test metric will look excellent and collapse in production. Always split first.
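The leak is easy to measure. The sketch below deliberately does it the wrong way round on simulated data, then counts how many minority test rows have an exact twin in the training partition. All names are illustrative.

```r
library(caret)

# Hypothetical toy data with a 200/30 class split.
set.seed(42)
toy_x <- data.frame(x1 = rnorm(230), x2 = rnorm(230))
toy_y <- factor(rep(c("majority", "minority"), times = c(200, 30)))

# WRONG ORDER on purpose: upsample the full data, then split.
set.seed(1)
leaky <- upSample(toy_x, toy_y)
idx <- createDataPartition(leaky$Class, p = 0.7, list = FALSE)[, 1]
train_part <- leaky[idx, ]
test_part  <- leaky[-idx, ]

# Build a key per row so exact duplicates can be matched across partitions.
row_key <- function(d) paste(d$x1, d$x2)
test_minority <- test_part[test_part$Class == "minority", ]

# Share of minority test rows that also sit in the training partition.
leak_share <- mean(row_key(test_minority) %in% row_key(train_part))
leak_share
```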
The second is reading test metrics on a balanced test set. After upSample(), accuracy on the balanced sample is meaningless because the prior distribution has been rewritten. Score on the natural-prevalence test partition and report precision, recall, F1, and AUPRC, not accuracy.
The third is forgetting to set a seed. upSample() draws random row indexes, so two calls produce different oversampled frames. Seed once before upSample() and the same balanced training frame is rebuilt every run.
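A small sketch of the seeding behaviour on simulated toy data (names are illustrative):

```r
library(caret)

# Hypothetical toy data with a 200/30 class split.
set.seed(42)
toy_x <- data.frame(x1 = rnorm(230), x2 = rnorm(230))
toy_y <- factor(rep(c("majority", "minority"), times = c(200, 30)))

set.seed(1)
first <- upSample(toy_x, toy_y)

set.seed(1)                       # same seed, same draws
second <- upSample(toy_x, toy_y)

third <- upSample(toy_x, toy_y)   # no reseed, so the draws continue from the RNG state

same_with_seed <- identical(first, second)
same_with_seed                    # TRUE
identical(first, third)           # expected to differ from the seeded frames
```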
Try it yourself
Try it: Take the built-in iris dataset, drop 40 rows from the virginica class to create imbalance, then use upSample() to rebalance and confirm all three species reach 50 rows.
Solution
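One possible solution is sketched below. Dropping the first 40 virginica rows is an arbitrary choice; any 40 would do.

```r
library(caret)

# Drop 40 of the 50 virginica rows to create a 50/50/10 imbalance.
virginica_rows <- which(iris$Species == "virginica")
imbalanced <- iris[-virginica_rows[1:40], ]
table(imbalanced$Species)

# Rebalance, keeping the original outcome name.
set.seed(1)
balanced <- upSample(x = imbalanced[, 1:4], y = imbalanced$Species, yname = "Species")

table(balanced$Species)   # 50 rows per species
nrow(balanced)            # 150
```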
Explanation: setosa and versicolor already have 50 rows so they are passed through untouched. virginica jumps from 10 to 50 via sampling with replacement, leaving the balanced frame with 150 rows and a uniform class distribution.
Related caret functions
upSample() is one piece of caret's class-balancing toolkit.
- downSample(x, y): random undersampling of the majority class to match the minority size.
- createDataPartition(y, p = 0.7): stratified train/test split, runs before any sampling.
- trainControl(sampling = "up"): runs upSample() inside each resample during train().
- train(method, weights = w): class-weighted fitting without resampling.
- confusionMatrix(predicted, actual): scores a classifier on the natural-prevalence test set.
See the caret documentation on subsampling for class imbalance for the authoritative reference.
FAQ
Why does upSample() rename my outcome column to Class?
The default yname = "Class" makes the returned frame work with caret's formula interface, where train(Class ~ ., data = balanced) finds the outcome by name. Pass yname = "Species" to keep the original name. Renaming is purely cosmetic: the column is still a factor with the same levels as the input, extended with the resampled rows.
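A brief sketch of the renaming, using simulated toy data and a made-up column name "Outcome":

```r
library(caret)

# Hypothetical toy data with a 200/30 class split.
set.seed(42)
toy_x <- data.frame(x1 = rnorm(230), x2 = rnorm(230))
toy_y <- factor(rep(c("majority", "minority"), times = c(200, 30)))

set.seed(1)
renamed <- upSample(toy_x, toy_y, yname = "Outcome")

names(renamed)             # predictors first, then the renamed outcome column
levels(renamed$Outcome)    # same levels as the input factor
```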
Should I oversample the validation set too?
No. Oversample only the training partition. The validation and test sets should reflect the real-world class prior so metrics generalise to production. Scoring on a balanced validation set inflates accuracy and obscures the precision-recall trade-off at deployment.
How does upSample() compare with SMOTE?
upSample() duplicates existing minority rows, so the model sees the same points many times. SMOTE generates synthetic rows by interpolating between minority neighbours, which fills the predictor space with points that never occurred and can hurt when features are categorical, especially high-cardinality ones. Start with upSample() for a baseline; switch to SMOTE only when interpolation is plausible.
Can I combine upSample() with cross-validation?
Yes, by passing trainControl(sampling = "up") to train(). caret then upsamples inside each CV fold so the resample estimates are honest. Calling upSample() on the full training set before train() leaks duplicated rows across folds.
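A minimal sketch of in-fold upsampling, assuming simulated data, 5-fold cross-validation, and a logistic regression via method = "glm". The data, the 1.5 shift that gives the minority class some signal, and the model choice are all illustrative assumptions.

```r
library(caret)

# Hypothetical toy data: 200 majority rows, 30 minority rows.
set.seed(42)
toy <- data.frame(x1 = rnorm(230), x2 = rnorm(230))
toy$Class <- factor(rep(c("majority", "minority"), times = c(200, 30)))

# Shift the minority class so the predictors carry some signal.
is_minority <- toy$Class == "minority"
toy$x1[is_minority] <- toy$x1[is_minority] + 1.5

# Upsampling is applied inside each resample, not to the full data.
ctrl <- trainControl(method = "cv", number = 5, sampling = "up")

set.seed(1)
fit <- train(Class ~ ., data = toy, method = "glm", trControl = ctrl)

fit$results   # resampled performance, estimated on held-out fold rows
```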