recipes step_novel() in R: Handle Unseen Factor Levels
The recipes step_novel() function adds a reserved placeholder level to a factor column so categories never seen during training do not break predictions on new data.
    step_novel(rec, var)                                     # add a "new" placeholder level
    step_novel(rec, var, new_level = "unseen")               # custom placeholder name
    step_novel(rec, all_nominal_predictors())                # apply to every factor column
    recipe(...) |> step_novel(x) |> step_dummy(x)            # use before dummy encoding
    rec_prep <- prep(rec); bake(rec_prep, new_data = test)   # train, then apply to new data
Need explanation? Read on for examples and pitfalls.
What step_novel() does
step_novel() reserves a new factor level for categories that appear only at prediction time. When you train a model on data where a factor column has the levels Chicago, LA, and NYC, the model has no representation for a row that later arrives with Boston. That mismatch produces missing values or an outright error during scoring.
step_novel() solves this by adding one extra level (named "new" by default) to the factor while the recipe is being prepared. Any value in fresh data that was not present in the training set is mapped to that reserved level instead of becoming NA. The model then treats every unseen category as a single known group.
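The mechanism can be illustrated without recipes at all. The base R sketch below imitates the idea by hand; it is an illustration, not the package's internal code, and the vectors train_levels and incoming are made-up example data.

```r
# Levels the model saw during training (made-up example data)
train_levels <- c("Chicago", "LA", "NYC")

# Values arriving at prediction time; "Boston" was never seen
incoming <- c("NYC", "Boston")

# Problem: coercing to the training levels turns the unseen value into NA
as_trained <- factor(incoming, levels = train_levels)
is.na(as_trained)

# Idea behind the fix: relabel unseen values to a reserved placeholder
# and extend the level set so the placeholder is a legal level
relabeled <- ifelse(incoming %in% train_levels, incoming, "new")
with_new  <- factor(relabeled, levels = c(train_levels, "new"))
as.character(with_new)
```

The second factor keeps Boston as a real, non-missing value, which is the behavior the step provides inside a recipe.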
step_novel() syntax and arguments
The function takes a recipe, one or more columns, and an optional name for the placeholder. It is a selection step, so you can name columns directly or use tidyselect helpers like all_nominal_predictors().
The argument you will change most often is new_level. It must be a string that does not already exist among the column's training levels, otherwise prep() raises an error. Everything else can stay at its default for typical workflows. The full reference lives in the recipes documentation.
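The rule about new_level can be checked directly. This sketch uses a made-up three-row train frame, deliberately reuses an existing level name, and captures the resulting prep() error with try() so the script keeps running.

```r
library(recipes)

# Hypothetical training data, for illustration only
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC")),
  price = c(310, 650, 720)
)

# "NYC" is already a training level, so it cannot serve as the placeholder
clash <- recipe(price ~ city, data = train) |>
  step_novel(city, new_level = "NYC")

# The clash is only detected when the recipe is prepared
result <- try(prep(clash, training = train), silent = TRUE)
inherits(result, "try-error")
```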
step_novel() examples
The pattern is always prep then bake. prep() learns the training levels; bake() applies the mapping to any new data frame. The examples below build a small housing-price dataset and process it step by step.
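A minimal sketch of the basic workflow follows. The train and test frames and their prices are invented for illustration; only the city names come from the surrounding text.

```r
library(recipes)

# Made-up housing data: training knows three cities
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC", "Chicago", "LA", "NYC")),
  price = c(310, 650, 720, 295, 640, 705)
)

# New data contains two cities that training never saw
test <- data.frame(
  city  = factor(c("NYC", "Boston", "Denver")),
  price = c(710, 560, 480)
)

rec <- recipe(price ~ city, data = train) |>
  step_novel(city)

# prep() records the training levels; bake() applies the mapping
rec_prep <- prep(rec, training = train)
baked    <- bake(rec_prep, new_data = test)

as.character(baked$city)
levels(baked$city)
```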
Boston and Denver were absent from training, so both collapse into the new level while NYC passes through unchanged. Without step_novel(), those two rows would be NA.
You can rename the placeholder when "new" could clash with a real category or when you want a clearer label in reports.
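A short sketch of renaming the placeholder, again on invented data. The label "unseen" is just an example; any string that is not a training level should work.

```r
library(recipes)

# Made-up data for illustration
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC")),
  price = c(310, 650, 720)
)
test <- data.frame(
  city  = factor(c("NYC", "Boston")),
  price = c(710, 560)
)

# Use "unseen" instead of the default "new"
rec_prep <- recipe(price ~ city, data = train) |>
  step_novel(city, new_level = "unseen") |>
  prep(training = train)

baked <- bake(rec_prep, new_data = test)
as.character(baked$city)
```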
The most common production use is pairing step_novel() with step_dummy(). The novel step must run first so the dummy encoder has a column to represent unseen categories.
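The pairing might look like the sketch below, which assumes the default step_dummy() settings and uses invented train and test frames. Column names such as city_new follow the default naming of step_dummy().

```r
library(recipes)

# Made-up data for illustration
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC", "Chicago", "LA", "NYC")),
  price = c(310, 650, 720, 295, 640, 705)
)
test <- data.frame(
  city  = factor(c("NYC", "Boston", "Denver")),
  price = c(710, 560, 480)
)

# step_novel() runs first so step_dummy() sees the reserved level
rec_prep <- recipe(price ~ city, data = train) |>
  step_novel(city) |>
  step_dummy(city) |>
  prep(training = train)

baked <- bake(rec_prep, new_data = test)

names(baked)
baked$city_new
```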
The city_new indicator carries every unseen city. A model fit on these columns can score the Boston and Denver rows without complaint.
To guard several factor columns at once, swap the bare column name for a tidyselect helper.
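As a sketch of the multi-column case, the example below adds a second, hypothetical factor column named type to an invented data set and guards both columns with one selector.

```r
library(recipes)

# Made-up data with two factor predictors
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC", "LA")),
  type  = factor(c("House", "Condo", "House", "Condo")),
  price = c(310, 650, 720, 640)
)

# Each column carries one category that training never saw
test <- data.frame(
  city  = factor(c("NYC", "Boston")),
  type  = factor(c("Loft", "Condo")),
  price = c(710, 560)
)

rec_prep <- recipe(price ~ ., data = train) |>
  step_novel(all_nominal_predictors()) |>
  prep(training = train)

baked <- bake(rec_prep, new_data = test)

as.character(baked$city)
as.character(baked$type)
```

Each column gets its own reserved level, so the unseen city and the unseen type are handled independently.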
The closest scikit-learn analogue is OneHotEncoder(handle_unknown="ignore"), which silently zeroes unseen categories. step_novel() differs by giving them an explicit, modelable level instead of dropping them.
step_novel() vs step_unknown() vs step_other()
Three recipes steps touch factor levels, but each fixes a different problem. Picking the wrong one either misses the case you care about or distorts data you wanted to keep.
| Step | Problem it solves | When the level is decided |
|---|---|---|
| step_novel() | Category unseen during training appears later | At bake() time, on new data |
| step_unknown() | Factor value is missing (NA) | At bake() time, on any data |
| step_other() | Infrequent training levels inflate the model | At prep() time, from training counts |
Use step_novel() for the future-proofing case: production data will contain categories your training sample never had. Use step_unknown() when missing factor values need an explicit bucket. Use step_other() to collapse rare-but-known levels before they add noise. Many pipelines chain all three.
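The contrast between the first two steps can be sketched on invented data. This example chains only step_novel() and step_unknown(), with step_novel() placed first; step_other() is left out to keep the output easy to read.

```r
library(recipes)

# Made-up data for illustration
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC")),
  price = c(310, 650, 720)
)

# One known city, one unseen city, one missing value
test <- data.frame(
  city  = factor(c("NYC", "Boston", NA)),
  price = c(710, 560, 480)
)

rec_prep <- recipe(price ~ city, data = train) |>
  step_novel(city) |>     # unseen categories get the reserved level
  step_unknown(city) |>   # missing values get their own bucket
  prep(training = train)

baked <- bake(rec_prep, new_data = test)
as.character(baked$city)
```

With this ordering the unseen city and the missing value should land in two separate levels rather than both becoming NA.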
Common pitfalls
Three mistakes account for most step_novel() bugs.
- Placing it after step_dummy(). The dummy step encodes whatever levels exist when it runs. If step_novel() runs later, the new column never gets created and unseen rows still break. Always put step_novel() first.
- Reusing an existing level name. Passing new_level = "NYC" when NYC is already a training level makes prep() fail. Choose a label that cannot appear in the data.
- Expecting it to work on character columns. step_novel() operates on factors. A character column is not selected by all_nominal_predictors() the same way; convert it with step_string2factor() first.
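The ordering pitfall is easiest to see by comparing a recipe that has no guard with one that applies step_novel() before step_dummy(). This is a sketch on invented data; suppressWarnings() is used only to keep the output quiet, since the unguarded bake is expected to warn about new levels.

```r
library(recipes)

# Made-up data for illustration
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC")),
  price = c(310, 650, 720)
)
test <- data.frame(
  city  = factor(c("NYC", "Boston")),
  price = c(710, 560)
)

# No guard: the dummy encoder only knows the training levels
unguarded <- recipe(price ~ city, data = train) |>
  step_dummy(city) |>
  prep(training = train)
baked_bad <- suppressWarnings(bake(unguarded, new_data = test))

# Guarded: the reserved level exists before encoding
guarded <- recipe(price ~ city, data = train) |>
  step_novel(city) |>
  step_dummy(city) |>
  prep(training = train)
baked_ok <- bake(guarded, new_data = test)

anyNA(baked_bad)  # missing values appear in the unguarded output
anyNA(baked_ok)   # the guarded output stays complete
```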
Baked factors always carry the reserved new level. Code that counts or loops over factor levels downstream will see one extra entry.
Try it yourself
Try it: Build a recipe on the training frame below, add step_novel() to the team column, and bake a new data frame containing the unseen team "Hawks". Save the result to ex_baked.
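The training frame for the exercise is not reproduced here, so the sketch below substitutes a hypothetical one. The names train_teams and new_teams, the score column, and all values are assumptions made for illustration; only the team column, the team names, and ex_baked come from the exercise text.

```r
library(recipes)

# Hypothetical training frame: only Bears and Lions appear
train_teams <- data.frame(
  team  = factor(c("Bears", "Lions", "Bears", "Lions")),
  score = c(21, 17, 28, 14)
)

# New data with the unseen team "Hawks"
new_teams <- data.frame(
  team  = factor(c("Lions", "Hawks")),
  score = c(20, 24)
)

# One possible solution: guard the team column, then prep and bake
ex_rec <- recipe(score ~ team, data = train_teams) |>
  step_novel(team)

ex_baked <- bake(prep(ex_rec, training = train_teams), new_data = new_teams)
as.character(ex_baked$team)
```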
Explanation: prep() records the training levels Bears and Lions plus the reserved new level. Because Hawks was never in training, bake() maps it to new.
Related recipes functions
step_novel() is one piece of a factor-handling toolkit. These steps commonly appear in the same recipe:
- step_unknown() assigns a placeholder to missing factor values.
- step_other() collapses infrequent training levels into a single other level.
- step_dummy() converts factors into 0/1 indicator columns.
- step_string2factor() turns character columns into factors so factor steps can select them.
- step_zv() drops columns with only one observed value.
FAQ
What value does step_novel() assign to unseen levels?
By default it assigns the string "new". Every category in fresh data that was not present during prep() is mapped to this reserved level. You can rename it with the new_level argument, for example step_novel(x, new_level = "unseen"). The chosen name must not already exist among the column's training levels, or prep() will stop with an error.
Does step_novel() change the training data?
It adds the reserved level to the factor's level set, but no training row is relabeled because every training value was, by definition, seen during training. Baked training output therefore contains an extra empty level. The actual relabeling only happens when bake() meets a category that the recipe never recorded.
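This can be checked by baking the training set itself with new_data = NULL. The sketch uses an invented train frame; the point is that the reserved level exists but has a count of zero.

```r
library(recipes)

# Made-up training data for illustration
train <- data.frame(
  city  = factor(c("Chicago", "LA", "NYC")),
  price = c(310, 650, 720)
)

rec_prep <- recipe(price ~ city, data = train) |>
  step_novel(city) |>
  prep(training = train)

# new_data = NULL returns the processed training set
baked_train <- bake(rec_prep, new_data = NULL)

levels(baked_train$city)
table(baked_train$city)
```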
Should step_novel() come before or after step_dummy()?
Before. step_dummy() builds indicator columns from the levels that exist when it runs. If step_novel() runs first, the dummy encoder includes a _new column that can absorb unseen categories. Reversing the order leaves no place for novel values and reintroduces the missing-data problem you were trying to fix.
What is the difference between step_novel() and step_other()?
step_other() works on the training data: it pools levels that appear too rarely into an other bucket, decided from training frequencies during prep(). step_novel() works on future data: it reserves a level for categories that training never saw at all. One reduces known noise, the other future-proofs against unknown categories.
Can step_novel() handle several columns at once?
Yes. Pass multiple column names separated by commas, or use a tidyselect helper such as all_nominal_predictors() to cover every factor predictor in the recipe. Each selected column gets its own reserved level, so a novel value in one column does not affect the others.