caret upSample() in R: Oversample Minority Class for Balance

The upSample() function in caret balances a classification dataset by sampling rows from minority classes with replacement until every class matches the majority count. It returns one combined data frame containing the predictors plus a renamed outcome column, and it works for binary or multiclass factor targets out of the box.

By Selva Prabhakaran · Published May 22, 2026 · Last updated May 22, 2026

⚡ Quick Answer

upSample(x = predictors, y = labels)                  # default, balances all classes
upSample(x, y, yname = "Outcome")                     # rename the outcome column
upSample(x, y, list = TRUE)                           # list with $x and $y separately
upSample(train[, -1], train$Class)                    # typical training-only call
upSample(iris[, 1:4], iris$Species)                   # no-op when already balanced
table(upSample(x, y)$Class)                           # confirm equal class counts
set.seed(1); upSample(x, y)                           # reproducible draws
nrow(upSample(x, y))                                  # majority count times n classes

Need explanation? Read on for examples and pitfalls.

What upSample() does in one sentence

upSample() is caret's random oversampler for classification. You give it a predictor matrix and a factor outcome, and it draws rows from each minority class with replacement until every class has the same count as the largest one, returning a single data frame with the predictors plus a Class column.

The function exists because most classifiers minimise overall error, and an imbalanced training set lets a model hit high accuracy by predicting the majority for everything. A fraud-detection model trained on 99% legitimate transactions and 1% fraud can score 99% accuracy by predicting "legitimate" for every row, while catching zero fraud. Random oversampling on the training set lifts the minority signal so the loss function pays equal attention to each class. upSample() is the right starting point before reaching for SMOTE or cost-sensitive learning, and the baseline against which more sophisticated balancing strategies should be measured.
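The accuracy trap described above can be reproduced in a few lines of base R. This is a minimal sketch with made-up labels (990 legitimate, 10 fraud); the "model" is a hypothetical one that always predicts the majority class.

```r
# Hypothetical labels: 990 legitimate transactions and 10 fraudulent ones.
actual <- factor(c(rep("legit", 990), rep("fraud", 10)),
                 levels = c("legit", "fraud"))

# A "model" that ignores its inputs and always predicts the majority class.
predicted <- factor(rep("legit", length(actual)), levels = levels(actual))

accuracy     <- mean(predicted == actual)
fraud_recall <- sum(predicted == "fraud" & actual == "fraud") /
  sum(actual == "fraud")

accuracy      # 0.99
fraud_recall  # 0
```

High accuracy and zero recall on the class that matters is the pattern that oversampling the training set is meant to counter.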

upSample() syntax and arguments

upSample() takes the predictors and the outcome separately, not a formula or a single data frame. Four arguments cover every option the function exposes.

R: Load caret and build an imbalanced dataset

library(caret)
set.seed(42)
df <- data.frame(
  x1 = c(rnorm(200, 0, 1), rnorm(30, 1.5, 1)),
  x2 = c(rnorm(200, 0, 1), rnorm(30, 1.5, 1)),
  Class = factor(c(rep("neg", 200), rep("pos", 30)))
)
table(df$Class)
#>
#> neg pos
#> 200  30

The signature is short:

R: upSample signature

upSample(x, y, list = FALSE, yname = "Class")

x: predictor data frame or matrix. Do not include the outcome column here.
y: factor outcome vector. The class with the largest count sets the target size for every other class.
list: if FALSE (the default), return a single data frame. Set to TRUE for a named list with $x and $y separately.
yname: name for the outcome column when list = FALSE. Defaults to "Class".

The returned object has (number of classes × majority count) rows, so a 200/30 binary problem becomes 400 rows. For multiclass, every minority class is independently sampled up to the majority size, so a 100/60/20 split produces 300 rows. The added rows are exact duplicates of existing minority observations.
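The mechanism can be sketched in base R without caret. The function below, `oversample_sketch()`, is a hypothetical helper written for illustration under the behaviour described above (keep every original row, add draws with replacement until each class reaches the majority count); it is not caret's source code.

```r
# Base R sketch of random oversampling; illustrative only, not caret's code.
oversample_sketch <- function(x, y) {
  stopifnot(is.factor(y), nrow(x) == length(y))
  target <- max(table(y))                    # majority count sets the target
  rows_by_class <- split(seq_along(y), y)    # row indexes for each class
  keep <- unlist(lapply(rows_by_class, function(rows) {
    extra <- target - length(rows)
    # majority class (or an empty level): nothing to add
    if (extra == 0 || length(rows) == 0) return(rows)
    # keep every original row, then add `extra` draws with replacement
    c(rows, rows[sample.int(length(rows), extra, replace = TRUE)])
  }), use.names = FALSE)
  out <- x[keep, , drop = FALSE]
  out$Class <- y[keep]
  rownames(out) <- NULL
  out
}

set.seed(42)
toy <- data.frame(x1 = rnorm(180))                      # made-up predictor
lab <- factor(rep(c("A", "B", "C"), times = c(100, 60, 20)))

bal <- oversample_sketch(toy, lab)
table(bal$Class)                                # every class at 100
nrow(bal) == nlevels(lab) * max(table(lab))     # TRUE: 3 * 100 = 300
```

In real work, call upSample() itself; the sketch only makes the row-count arithmetic visible.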

Note

Coming from Python imbalanced-learn? The closest equivalent is RandomOverSampler().fit_resample(X, y) from imblearn.over_sampling. Both draw minority rows with replacement until classes balance; upSample() returns one combined data frame while imblearn returns X and y as separate arrays.

upSample() examples by use case

Most oversampling workflows split first, then upsample only the training partition. The examples below build from the basic call up to a pipeline with caret::train() and a held-out test set.

A single binary oversampling call on the toy data:

R: Balance a binary classification frame

set.seed(1)
balanced <- upSample(x = df[, c("x1", "x2")], y = df$Class)
table(balanced$Class)
#>
#> neg pos
#> 200 200
nrow(balanced)
#> [1] 400

The minority class jumped from 30 rows to 200, matching the majority. The 170 added rows are bootstrap copies of the original 30 minority observations, so the predictor space is unchanged but each minority point now appears about 6.7 times on average.
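The "about 6.7 times" figure is just 200 / 30. The base R sketch below simulates the index draw directly, assuming (as the paragraph above describes) that each original minority row is kept once and the 170-row shortfall is drawn with replacement; the variable names are illustrative.

```r
set.seed(1)
n_minority <- 30
n_majority <- 200

# Keep each original minority row once, then draw the shortfall with replacement.
extra <- sample.int(n_minority, n_majority - n_minority, replace = TRUE)
idx   <- c(seq_len(n_minority), extra)

copies <- tabulate(idx, nbins = n_minority)  # how often each original row appears
length(idx)     # 200
mean(copies)    # 200 / 30, about 6.67
range(copies)   # individual rows vary around that average
```

The average is fixed by the class counts, but the spread in `copies` shows that some minority points end up weighted more heavily than others purely by chance.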

The same call with list = TRUE keeps predictors and labels separate:

R: List output for matrix workflows

set.seed(1)
balanced_list <- upSample(x = df[, c("x1", "x2")], y = df$Class, list = TRUE)
names(balanced_list)
#> [1] "x" "y"
dim(balanced_list$x)
#> [1] 400   2
length(balanced_list$y)
#> [1] 400

Use list = TRUE when downstream code expects X and y as separate arguments, for example glmnet() or xgboost::xgb.DMatrix(). glmnet() expects a numeric matrix, so convert a data-frame x with as.matrix() first.

A train/test split before oversampling, then balanced training:

R: Split first, upsample only the train half

set.seed(7)
train_idx <- createDataPartition(df$Class, p = 0.7, list = FALSE)
train <- df[train_idx, ]
test <- df[-train_idx, ]
set.seed(7)
train_bal <- upSample(x = train[, c("x1", "x2")], y = train$Class)
table(train_bal$Class)
#>
#> neg pos
#> 140 140
table(test$Class)
#>
#> neg pos
#>  60   9

The training set is balanced 140/140 while the test set keeps the natural imbalance. Evaluating on the imbalanced test set is what makes the metrics honest; balancing the test set hides the real deployment distribution.

Multiclass oversampling works the same way:

R: Multiclass upSample on a 3-level factor

set.seed(3)
multi <- data.frame(
  x1 = rnorm(180),
  Class = factor(c(rep("A", 100), rep("B", 60), rep("C", 20)))
)
multi_bal <- upSample(x = multi[, "x1", drop = FALSE], y = multi$Class)
table(multi_bal$Class)
#>
#>   A   B   C
#> 100 100 100

Every non-majority class is sampled with replacement up to the majority size. A 3-class problem with counts 100/60/20 becomes 300 rows after the call.

Tip

Let train() do it for you when tuning. Pass trainControl(sampling = "up") to caret's train() and the oversampling runs inside each resample, not on the full training set. This gives an honest performance estimate because the upsampling is recomputed per fold instead of leaking duplicated rows across the cross-validation splits.
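A minimal sketch of that setup follows. The toy data, the choice of 5-fold CV, the "glm" method, and the Kappa metric are all assumptions made for illustration; only trainControl(sampling = "up") comes from the tip above.

```r
library(caret)

set.seed(42)
# Made-up data with a 200/30 class split.
toy <- data.frame(
  x1 = c(rnorm(200, 0, 1), rnorm(30, 1.5, 1)),
  x2 = c(rnorm(200, 0, 1), rnorm(30, 1.5, 1)),
  Class = factor(c(rep("neg", 200), rep("pos", 30)))
)

# 5-fold CV; sampling = "up" asks caret to oversample within each resample.
ctrl <- trainControl(method = "cv", number = 5, sampling = "up")

set.seed(42)
fit <- train(Class ~ ., data = toy, method = "glm",
             metric = "Kappa", trControl = ctrl)

fit$results   # cross-validated Accuracy and Kappa
```

Note that `toy` is passed to train() unbalanced; the balancing happens inside the resampling loop, so held-out folds keep their natural class mix.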

upSample() vs downSample() and other balancing tools

upSample() adds minority copies; downSample() drops majority rows; SMOTE creates synthetic minority points. Pick by what you can afford to lose and how much variance you tolerate.

Method	Sample size	Risk	Best for
`upSample(x, y)`	`n_classes * majority_count`	overfits to repeated minority rows	small datasets where dropping majority hurts
`downSample(x, y)`	`n_classes * minority_count`	throws away majority information	very large datasets, fast training
`themis::step_smote()`	synthetic interpolations	distorts feature manifold	continuous features, large enough minority to interpolate
`trainControl(sampling = "up")`	resamples inside CV	none above resample noise	hyperparameter tuning with caret
`weights` in `train()`	unchanged	works only for models that accept case weights	log-loss or weighted likelihood models

For most first attempts, oversample with upSample(), tune with trainControl(sampling = "up"), then compare against downSample() on the same folds. The choice often comes down to dataset size: with 10,000 majority and 200 minority rows, downsampling drops the training frame to 400 and the majority signal collapses, while upsampling keeps the majority intact and inflates the minority to 10,000.
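The row counts in that example follow directly from the class sizes. A short base R check, using the 10,000/200 figures quoted above:

```r
counts <- c(majority = 10000, minority = 200)   # counts quoted in the text

up_rows   <- length(counts) * max(counts)   # every class raised to 10,000
down_rows <- length(counts) * min(counts)   # every class cut to 200

up_rows     # 20000
down_rows   # 400
1 - min(counts) / max(counts)   # 0.98: share of majority rows downsampling drops
```

Discarding 98% of the majority rows is what "the majority signal collapses" means in practice.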

Class weights are a third lever. Passing a per-row weight vector to train() tells the loss function to penalise minority errors more heavily without changing the row count. Weights are usually cheaper than oversampling, but only work when the model exposes a weights argument.
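One way to build such a weight vector is inverse class frequency. That scheme is an assumption chosen for this sketch (the text above does not prescribe one), and the labels are made up.

```r
y <- factor(c(rep("neg", 200), rep("pos", 30)))

# Inverse-frequency weights: one common scheme, assumed here for illustration.
class_counts <- table(y)
w <- as.numeric(1 / class_counts[as.character(y)])
w <- w * length(y) / sum(w)     # rescale so the weights sum to the row count

tapply(w, y, sum)               # neg 115, pos 115: equal total weight per class
length(w) == length(y)          # TRUE: the row count is unchanged
```

A vector like `w` could then be supplied through the weights argument of train() for models that accept case weights.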

Common pitfalls

Three mistakes cause most upsampling bugs.

The first is oversampling before the train/test split. If you call upSample() on the full data and then split, identical minority copies land on both sides of the split, so the model is scored on rows it has already seen in training. The test metric will look excellent and collapse in production. Always split first.
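The leak is easy to see by tracking row ids. The base R sketch below deliberately does things in the wrong order on made-up ids (200 majority, 30 minority) and uses a simple random 70/30 split rather than createDataPartition(); all names are illustrative.

```r
set.seed(7)
n_neg <- 200; n_pos <- 30
id  <- seq_len(n_neg + n_pos)                 # original row ids
pos <- id[(n_neg + 1):(n_neg + n_pos)]        # ids of the minority rows

# WRONG ORDER: oversample the full data first ...
extra    <- pos[sample.int(n_pos, n_neg - n_pos, replace = TRUE)]
full_ids <- c(id, extra)                      # 400 rows, minority ids repeated

# ... then split 70/30.
in_train  <- sample(length(full_ids), size = round(0.7 * length(full_ids)))
train_ids <- full_ids[in_train]
test_ids  <- full_ids[-in_train]

# Minority test rows whose original observation also sits in the training set.
test_pos <- test_ids[test_ids %in% pos]
mean(test_pos %in% train_ids)   # share of leaked minority test rows
```

Any value above zero means the test set contains observations the model was trained on; splitting first makes this share zero by construction.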

The second is reading test metrics on a balanced test set. After upSample(), accuracy on the balanced sample is meaningless because the prior distribution has been rewritten. Score on the natural-prevalence test partition and report precision, recall, F1, and AUPRC, not accuracy.
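Precision, recall, and F1 can be computed by hand from the predictions. The sketch below uses hypothetical predictions on a 60/9 test set; AUPRC is left out because it needs predicted probabilities rather than hard labels.

```r
# Hypothetical predictions on a natural-prevalence test set (60 neg, 9 pos).
actual    <- factor(c(rep("neg", 60), rep("pos", 9)), levels = c("neg", "pos"))
predicted <- factor(c(rep("neg", 54), rep("pos", 6),   # 6 false positives
                      rep("pos", 6),  rep("neg", 3)),  # 6 hits, 3 misses
                    levels = c("neg", "pos"))

tp <- sum(predicted == "pos" & actual == "pos")
fp <- sum(predicted == "pos" & actual == "neg")
fn <- sum(predicted == "neg" & actual == "pos")

precision <- tp / (tp + fp)                                 # 6 / 12
recall    <- tp / (tp + fn)                                 # 6 / 9
f1        <- 2 * precision * recall / (precision + recall)  # 4 / 7
accuracy  <- mean(predicted == actual)                      # 60 / 69

c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```

Accuracy here is about 0.87 while precision is only 0.5, which is why accuracy alone is a poor summary on imbalanced data.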

Warning

Random oversampling exaggerates noise in tiny minorities. A 30-row positive class oversampled to 200 rows is still only 30 distinct points. If the minority is below 50 to 100 rows, prefer SMOTE or collect more data.

The third is forgetting to set a seed. upSample() draws random row indexes, so two unseeded calls produce different oversampled frames. Call set.seed() immediately before upSample() and the same balanced training frame is rebuilt every run.
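The effect of the seed can be shown with a stand-in for the random draw. `draw_extra()` below is a hypothetical helper that mimics drawing 170 extra row indexes from 30 minority rows; it is not part of caret.

```r
# Stand-in for the random index draw behind oversampling (illustrative only).
draw_extra <- function() sample.int(30, 170, replace = TRUE)

set.seed(1); a <- draw_extra()
set.seed(1); b <- draw_extra()
c_unseeded <- draw_extra()       # continues the RNG stream without reseeding

identical(a, b)            # TRUE: same seed, same rows
identical(a, c_unseeded)   # a different draw, so the rows differ
```

The same logic applies to upSample(): reseeding right before the call pins the duplicated rows, while any intervening random call shifts them.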

Try it yourself

Try it: Take the built-in iris dataset, drop 40 rows from the virginica class to create imbalance, then use upSample() to rebalance and confirm all three species reach 50 rows.

R: Your turn: rebalance virginica

library(caret)
set.seed(11)
ex_iris <- iris[-sample(which(iris$Species == "virginica"), 40), ]
table(ex_iris$Species)
set.seed(11)
ex_bal <- # your code here
table(ex_bal$Class)
#> Expected: setosa 50, versicolor 50, virginica 50

R: Solution

library(caret)
set.seed(11)
ex_iris <- iris[-sample(which(iris$Species == "virginica"), 40), ]
table(ex_iris$Species)
#>
#>     setosa versicolor  virginica
#>         50         50         10
set.seed(11)
ex_bal <- upSample(x = ex_iris[, 1:4], y = ex_iris$Species)
table(ex_bal$Class)
#>
#>     setosa versicolor  virginica
#>         50         50         50

Explanation: setosa and versicolor already have 50 rows so they are passed through untouched. virginica jumps from 10 to 50 via sampling with replacement, leaving the balanced frame with 150 rows and a uniform class distribution.

Related caret functions

upSample() is one piece of caret's class-balancing toolkit.

downSample(x, y): random undersampling of the majority class to match the minority size.
createDataPartition(y, p = 0.7): stratified train/test split, runs before any sampling.
trainControl(sampling = "up"): runs upSample() inside each resample during train().
train(x, y, weights = w): class-weighted fitting without resampling.
confusionMatrix(predicted, actual): scores a classifier on the natural-prevalence test set.

See the caret documentation on subsampling for class imbalance for the authoritative reference.

FAQ

Why does upSample() rename my outcome column to Class?

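upSample() receives the outcome as a bare vector y, so it never sees the original column name and has to name the outcome column itself; yname supplies that name and defaults to "Class". The default gives every balanced frame a predictable outcome name for formulas such as train(Class ~ ., data = balanced). Pass yname = "Species" to keep the original name. Renaming is purely cosmetic; the values come straight from the input factor.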

Should I oversample the validation set too?

No. Oversample only the training partition. The validation and test sets should reflect the real-world class prior so metrics generalise to production. Scoring on a balanced validation set inflates accuracy and obscures the precision-recall trade-off at deployment.

How does upSample() compare with SMOTE?

upSample() duplicates existing minority rows, so the model sees the same points many times. SMOTE generates synthetic rows by interpolating between minority neighbours, expanding the predictor space in a way that can hurt high-cardinality categorical features. Start with upSample() for a baseline; switch to SMOTE only when interpolation is plausible.
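The difference between duplicating and interpolating can be sketched in base R. The interpolation step below is a simplified SMOTE-style illustration, not the SMOTE algorithm itself: real SMOTE picks the second point from the k nearest minority neighbours, while this sketch picks any other minority row at random. The data are made up.

```r
set.seed(3)
minority <- data.frame(x1 = c(1.0, 2.0, 4.0), x2 = c(0.5, 1.5, 1.0))

# Random oversampling: the new row is an exact copy of an existing one.
dup <- minority[sample.int(nrow(minority), 1), ]

# SMOTE-style step (simplified): pick two distinct minority rows and place a
# point at a random position on the segment between them. The nearest-neighbour
# search used by real SMOTE is skipped here.
pair <- sample.int(nrow(minority), 2)
a <- unlist(minority[pair[1], ])
b <- unlist(minority[pair[2], ])
u <- runif(1)
synthetic <- a + u * (b - a)

dup         # matches one of the three original rows
synthetic   # lies between rows `a` and `b`
```

Interpolation only makes sense when a point halfway between two rows is a plausible observation, which is why it suits continuous features better than categorical codes.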

Can I combine upSample() with cross-validation?

Yes, by passing trainControl(sampling = "up") to train(). caret then upsamples inside each CV fold so the resample estimates are honest. Calling upSample() on the full training set before train() leaks duplicated rows across folds.
