caret dummyVars() in R: One-Hot Encode Categorical Data
The dummyVars() function in caret turns categorical predictors into numeric indicator columns, a process called one-hot encoding. It defines the encoding once and replays it on any data with predict(), keeping train and test columns aligned.
    dv <- dummyVars(~ ., data = df)                     # define the encoder
    predict(dv, newdata = df)                           # apply it -> numeric matrix
    dummyVars(~ ., data = df, fullRank = TRUE)          # drop the reference level
    dummyVars(~ color, data = df)                       # encode one column only
    predict(dv, newdata = test)                         # apply to unseen data
    as.data.frame(predict(dv, newdata = df))            # matrix -> data frame
Need an explanation? Read on for examples and pitfalls.
What dummyVars() does in one sentence
dummyVars() defines an encoder; predict() applies it. You pass it a formula and a data frame, and it returns a dummyVars object that records every factor level it saw. The data itself is not transformed yet. Numeric columns are left alone, and each factor column is marked for expansion into one indicator column per level.
To produce the encoded numbers, you call predict() on that object. This two-step split matters because a model trained on encoded data must see new data encoded with the same set of columns. Storing the encoding rule once and replaying it everywhere is what stops a missing or extra category from silently breaking your test set.
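The two-step split can be sketched as follows. This is a minimal illustration, assuming caret is installed; the data frame df and its columns size and price are made up for the example.

```r
library(caret)

# Hypothetical toy data: one factor column and one numeric column
df <- data.frame(
  size  = factor(c("small", "large", "medium", "small")),
  price = c(10, 30, 20, 12)
)

# Step 1: define the encoder. Nothing is transformed yet.
dv <- dummyVars(~ ., data = df)
inherits(dv, "dummyVars")

# Step 2: apply it. predict() returns the numeric indicator matrix.
encoded <- predict(dv, newdata = df)
is.matrix(encoded)
colnames(encoded)
```

The same dv object can be passed to predict() again later with any data frame that has the same columns.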
dummyVars() syntax and arguments
dummyVars() needs only a formula and a data frame. The formula picks which columns to consider, and the data frame supplies the factor levels.
The main arguments are:
- formula: use ~ . for every column, or ~ col1 + col2 to encode a subset.
- data: the data frame used to learn the factor levels.
- fullRank: if TRUE, drop one level per factor (the reference level) to avoid collinearity. Defaults to FALSE.
- sep: the separator between the variable name and the level in output column names. Defaults to ".".
- levelsOnly: if TRUE, name columns by the level alone and drop the variable prefix.
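A short sketch of how sep and levelsOnly change the output column names, assuming the built-in warpbreaks data set (chosen here for illustration because its two factors have no level names in common):

```r
library(caret)

# Default naming: variable name, ".", level
dv_default <- dummyVars(~ ., data = warpbreaks)
colnames(predict(dv_default, newdata = warpbreaks))

# Custom separator between the variable name and the level
dv_sep <- dummyVars(~ ., data = warpbreaks, sep = "_")
names_sep <- colnames(predict(dv_sep, newdata = warpbreaks))
names_sep

# Level names only; this requires level names that are unique across factors
dv_lvl <- dummyVars(~ ., data = warpbreaks, levelsOnly = TRUE)
names_lvl <- colnames(predict(dv_lvl, newdata = warpbreaks))
names_lvl
```

levelsOnly = TRUE is only safe when no two factors share a level name, since the variable prefix is what normally keeps the columns apart.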
Coming from pandas? dummyVars() plus predict() is roughly the R counterpart of pd.get_dummies(df), and fullRank = TRUE matches drop_first=True.
dummyVars() examples by use case
Encode every column at once with the ~ . formula. Numeric columns pass straight through, and each factor expands into one indicator per level.
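The code for this example is not shown in the text; a minimal sketch, assuming the built-in warpbreaks data set (it has the breaks, wool, and tension columns this example describes), could look like this:

```r
library(caret)

# Define the encoder on every column of warpbreaks
dv <- dummyVars(~ ., data = warpbreaks)

# Apply it; the result is a numeric matrix
encoded <- predict(dv, newdata = warpbreaks)

head(encoded, 3)
colnames(encoded)
```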
The wool factor with two levels became wool.A and wool.B; tension with three levels became three columns. The numeric breaks column is untouched.
Use fullRank = TRUE to drop the reference level. This produces k - 1 columns per factor, which is what linear and logistic regression expect.
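A sketch of the full-rank encoding, again assuming the built-in warpbreaks data set as the example input:

```r
library(caret)

# fullRank = TRUE drops the first level of each factor
dv_fr <- dummyVars(~ ., data = warpbreaks, fullRank = TRUE)
encoded_fr <- predict(dv_fr, newdata = warpbreaks)

# The first levels (A for wool, L for tension) no longer have columns
colnames(encoded_fr)
ncol(encoded_fr)
```

A row with 0 in every remaining indicator of a factor belongs to the dropped reference level.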
With fullRank = FALSE, the k dummy columns of a factor sum to 1, so they are perfectly collinear with the intercept. Always set fullRank = TRUE for lm() or glm().
Encode a single column by naming it in the formula. Only the columns on the right side of ~ appear in the output.
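A sketch of single-column encoding, assuming warpbreaks as the example data and wool as the column to encode:

```r
library(caret)

# Only wool is named in the formula, so only wool is encoded
dv_wool <- dummyVars(~ wool, data = warpbreaks)
encoded_wool <- predict(dv_wool, newdata = warpbreaks)

head(encoded_wool, 3)
colnames(encoded_wool)
```

The other columns of warpbreaks are ignored here, so bind the result back to them with cbind() if they are still needed.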
Apply a fitted encoder to new rows. Because the dummyVars object stores every level, new data always comes back with the same columns in the same order.
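A sketch of replaying the encoder on new rows. The new_rows data frame is hypothetical and deliberately contains only one level of each factor, to show that the remaining columns still appear:

```r
library(caret)

# Fit the encoder on the full warpbreaks data
dv <- dummyVars(~ ., data = warpbreaks)

# Hypothetical new rows that contain only one level of each factor
new_rows <- data.frame(
  breaks  = c(20, 35),
  wool    = factor(c("A", "A")),
  tension = factor(c("H", "H"))
)

encoded_new <- predict(dv, newdata = new_rows)
encoded_new

# Same columns, same order as the encoded training data
same_cols <- identical(
  colnames(encoded_new),
  colnames(predict(dv, newdata = warpbreaks))
)
same_cols
```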
dummyVars() vs model.matrix() and recipes
dummyVars() is one of three common ways to build a numeric design matrix. The right choice depends on whether you want a saved, reusable encoder and how you handle the reference level.
| Tool | Reference level | Output type | Best for |
|---|---|---|---|
| dummyVars() | Kept unless fullRank = TRUE | Numeric matrix | Tree models, reusable encoders |
| model.matrix() | Dropped by default, adds an intercept | Numeric matrix | Quick design matrix for lm() |
| recipes::step_dummy() | Dropped by default | Step in a workflow | tidymodels pipelines |
Reach for model.matrix() when you just need a one-off matrix for a base R model. Reach for recipes::step_dummy() when the encoding is part of a tidymodels workflow that bundles preprocessing with the model. Reach for dummyVars() when you want a standalone object you can save, share, and replay on many data sets, especially with tree-based models that benefit from the full indicator set.
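To make the reference-level difference concrete, here is a small side-by-side sketch of base R's model.matrix() and dummyVars(), assuming the built-in warpbreaks data and R's default contrast settings. recipes::step_dummy() is left out to keep the example to one extra package.

```r
library(caret)

# model.matrix(): an intercept column plus k - 1 indicators per factor
mm <- model.matrix(~ ., data = warpbreaks)
colnames(mm)

# dummyVars() default: no intercept column, all k indicators per factor
dv <- dummyVars(~ ., data = warpbreaks)
dm <- predict(dv, newdata = warpbreaks)
colnames(dm)
```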
Common pitfalls
The output is a matrix, not a data frame. predict() on a dummyVars object returns a numeric matrix, so dplyr verbs and $ column access do not work until you convert it.
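A sketch of the conversion, assuming warpbreaks as the example data:

```r
library(caret)

dv <- dummyVars(~ ., data = warpbreaks)
encoded <- predict(dv, newdata = warpbreaks)
is.matrix(encoded)

# Convert once, then use $ access or dplyr verbs as usual
encoded_df <- as.data.frame(encoded)
is.data.frame(encoded_df)
head(encoded_df$wool.A)
```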
Numeric codes are not encoded. If a category is stored as numbers, such as grade 1, 2, 3, dummyVars() treats it as a numeric column and leaves it alone. Convert it to a factor first.
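A sketch of the difference, using a made-up data frame in which grade is a category stored as numbers:

```r
library(caret)

# Hypothetical data: grade is categorical but stored as numeric codes
df <- data.frame(grade = c(1, 2, 3, 2), score = c(55, 70, 88, 64))

# As a numeric column, grade passes through unchanged
before <- predict(dummyVars(~ ., data = df), newdata = df)
colnames(before)

# After conversion to a factor, grade expands to one column per level
df$grade <- factor(df$grade)
after <- predict(dummyVars(~ ., data = df), newdata = df)
colnames(after)
```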
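Unseen levels in new data never get their own column. If test data contains a category the encoder never saw, predict() cannot encode it: it passes the stored levels to model.frame(), which typically stops with a "factor has new levels" error rather than returning a usable row. Check new data for unexpected levels before encoding, and fit the encoder on the full training set so every level is captured.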
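Whatever predict() does with an unseen level, it is easier to catch one before encoding. A minimal sketch, assuming hypothetical train and test data frames with a color factor; dropping the affected rows is just one option chosen for illustration, not a caret feature:

```r
library(caret)

# Hypothetical data: "purple" appears only in the test set
train <- data.frame(color = factor(c("red", "green", "blue", "red")))
test  <- data.frame(color = factor(c("red", "purple")))

dv <- dummyVars(~ color, data = train)

# Levels present in the new data that the training data never contained
unseen <- setdiff(levels(test$color), levels(train$color))
unseen

# One option (an assumption): encode only the rows with known levels
known <- test[!(test$color %in% unseen), , drop = FALSE]
encoded_known <- predict(dv, newdata = known)
encoded_known
```

Other reasonable choices are to recode unseen values to an explicit "other" level that exists in training, or to refit the encoder once the new category is expected.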
Try it yourself
Try it: Use dummyVars to one-hot encode all columns of the iris data set, then save the encoded result to ex_encoded.
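One possible solution, sketched with the built-in iris data; the name dv_iris is arbitrary, while ex_encoded is the name the exercise asks for:

```r
library(caret)

# Define the encoder on all columns of iris
dv_iris <- dummyVars(~ ., data = iris)

# Apply it and save the encoded matrix
ex_encoded <- predict(dv_iris, newdata = iris)

dim(ex_encoded)
colnames(ex_encoded)
```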
Explanation: iris has four numeric columns that pass through unchanged plus the three-level Species factor, which expands into three indicator columns, giving 7 columns in total.
Fit the dummyVars object from your training set, then call predict() on both train and test. This guarantees identical columns and prevents leakage of test-set information.
Related caret functions
These caret functions pair naturally with dummyVars() in a preprocessing pipeline:
- preProcess() centers, scales, and imputes numeric predictors.
- nearZeroVar() drops constant or near-constant columns.
- findCorrelation() removes highly correlated predictors.
- createDataPartition() splits data into train and test sets.
- train() fits and tunes a model on the encoded data.
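A rough sketch of how some of these functions can be chained around dummyVars(). It assumes the built-in warpbreaks data with breaks as the outcome; the seed and the 80/20 split are arbitrary choices for the example, and preProcess(), findCorrelation(), and train() are omitted to keep it short.

```r
library(caret)

set.seed(42)  # arbitrary seed so the split is reproducible

# Split on the outcome: roughly 80% train, 20% test
idx   <- createDataPartition(warpbreaks$breaks, p = 0.8, list = FALSE)
train <- warpbreaks[idx, ]
test  <- warpbreaks[-idx, ]

# Fit the encoder on the training predictors only, then replay it on both sets
dv      <- dummyVars(~ wool + tension, data = train)
x_train <- predict(dv, newdata = train)
x_test  <- predict(dv, newdata = test)
identical(colnames(x_train), colnames(x_test))

# Indices of constant or near-constant training columns (possibly none)
nzv <- nearZeroVar(x_train)
nzv
```

The encoded x_train matrix and the outcome train$breaks could then be handed to train() to fit a model.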
FAQ
What is the difference between dummyVars() and model.matrix() in R?
Both build a numeric design matrix from factors, but they differ in two ways. With its default contrasts and an intercept in the formula, model.matrix() drops one reference level per factor and adds an intercept column, so it is rank-correct for regression by default. dummyVars() keeps every level unless you set fullRank = TRUE, and it returns a reusable object you can predict() on later. Use model.matrix() for a quick one-off matrix and dummyVars() when you need a saved, replayable encoder.
Does dummyVars() drop a reference level?
Not by default. With fullRank = FALSE, a factor with k levels expands into k indicator columns. Set fullRank = TRUE to drop the first level and get k - 1 columns, which avoids the dummy variable trap. Tree-based models such as random forests are fine with the full set, but linear and logistic regression need fullRank = TRUE.
How do I convert the dummyVars() output to a data frame?
predict() on a dummyVars object returns a numeric matrix. Wrap it in as.data.frame() to get a data frame you can use with dplyr or base $ access. For example, encoded_df <- as.data.frame(predict(dv, newdata = df)). You can then cbind() it back to other columns or pass it directly to a modeling function.
Can dummyVars() handle new factor levels in test data?
It handles missing levels gracefully but not brand-new ones. If test data lacks a level the encoder saw, that column still appears, filled with zeros. If test data contains a level the encoder never saw, no column is created for it, and predict() typically stops with a "factor has new levels" error raised by model.frame(). Always fit the encoder on a training set that covers all expected categories.