recipes step_nzv() in R: Drop Near-Zero-Variance Predictors
The recipes step_nzv() function in R removes near-zero-variance predictors, columns that are overwhelmingly one value but not perfectly constant, before they can destabilize a model.
    step_nzv(rec, all_predictors())                    # screen every predictor
    step_nzv(rec, all_numeric_predictors())            # numeric predictors only
    step_nzv(rec, x, y)                                # named columns only
    step_nzv(rec, all_predictors(), freq_cut = 5)      # stricter frequency rule
    step_nzv(rec, all_predictors(), unique_cut = 20)   # stricter uniqueness rule
    bake(prep(rec), new_data = NULL)                   # train, then apply
    tidy(prep(rec), number = 1)                        # list removed columns
Need explanation? Read on for examples and pitfalls.
What step_nzv() does
step_nzv() deletes columns that carry almost no information because one value dominates them. A column where 996 of 1,000 rows read "no" technically has variance, so step_zv() leaves it alone. Yet that column is nearly useless to a model and can break a cross-validation fold the moment the rare value disappears from a resample.
step_nzv() screens the columns you select while the recipe is prepared. During prep() it measures two quantities for each column and flags a column as near-zero variance only when both criteria are met. bake() then drops every flagged column from training data and new data alike.
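The two quantities can be computed by hand to see what the step is measuring. The sketch below uses base R only and a hypothetical vector matching the 996-of-1,000 column described above, assuming the four remaining rows all read "yes". It mirrors the rule as this article states it; how recipes treats a value sitting exactly on a cutoff is not covered here.

```r
# Hypothetical column: 996 "no" and 4 "yes" out of 1,000 rows.
x <- rep(c("no", "yes"), times = c(996, 4))

# Counts per value, most common first.
counts <- sort(table(x), decreasing = TRUE)

# Quantity 1: count of the most common value over the runner-up's count.
freq_ratio <- counts[[1]] / counts[[2]]

# Quantity 2: distinct values as a percentage of all rows.
pct_unique <- 100 * length(unique(x)) / length(x)

# Default thresholds described in this article.
freq_cut   <- 95 / 5
unique_cut <- 10

# Flag only when BOTH criteria are met.
flagged <- freq_ratio > freq_cut && pct_unique < unique_cut

freq_ratio  # 249
pct_unique  # 0.2
flagged     # TRUE
```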
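Both thresholds must agree: the ratio of the most common value's count to the runner-up's must exceed freq_cut, and the percentage of distinct values must fall below unique_cut. A column that trips just one test survives.
step_nzv() syntax and arguments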
step_nzv() is a selection step controlled by two numeric thresholds. You name columns directly or use tidyselect helpers such as all_predictors() or all_numeric_predictors(), then optionally tune how aggressive the filter is.
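Rather than copying a signature that may drift between releases, you can ask the installed version of recipes directly. This is a small sketch that inspects the function's formal arguments; the variable name nzv_args is just an illustrative choice.

```r
library(recipes)

# Formal arguments of step_nzv() as defined in the installed recipes version.
nzv_args <- formals(step_nzv)

# Argument names, including freq_cut, unique_cut, and removals.
names(nzv_args)

# freq_cut is stored as the unevaluated expression 95/5, so evaluate it.
eval(nzv_args$freq_cut)

# unique_cut is a plain number.
nzv_args$unique_cut
```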
The removals slot is filled automatically at prep() time with the names of the flagged columns. freq_cut defaults to 95/5, so a column is suspicious when its top value is more than 19 times as common as its runner-up. unique_cut defaults to 10, meaning the column must also have fewer than 10 percent distinct values. The full reference lives in the recipes documentation.
step_nzv() examples
The pattern is always prep then bake. prep() measures each column and decides what to remove; bake() applies that decision. The examples below build a 200-row frame where one predictor is heavily skewed.
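A minimal sketch of that workflow follows. The data frame and its column names (outcome, age, rare) are hypothetical, built deterministically so that rare holds 196 "no" and 4 "yes" while age cycles through 50 distinct values.

```r
library(recipes)

# Hypothetical 200-row frame with one heavily skewed predictor.
df <- data.frame(
  outcome = seq_len(200) / 10,
  age     = rep(21:70, times = 4),
  rare    = factor(rep(c("no", "yes"), times = c(196, 4)))
)

# Declare the step; nothing is measured yet.
rec <- recipe(outcome ~ ., data = df) |>
  step_nzv(all_predictors())

# prep() measures each predictor and records which ones to remove.
prepped <- prep(rec, training = df)

# bake() applies the recorded decision to the training data.
baked <- bake(prepped, new_data = NULL)

names(baked)  # rare is no longer among the columns
```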
The rare column is 196 "no" against 4 "yes", a frequency ratio of 49 and only 1 percent distinct values, so both thresholds fire and the column is dropped. age spreads across many values and survives. To confirm exactly what was removed, call tidy() on the prepared recipe.
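The check can be sketched as below, rebuilding the same hypothetical frame so the block runs on its own. The step is the first one in the recipe, hence number = 1.

```r
library(recipes)

# Same hypothetical frame: rare is 196 "no" against 4 "yes".
df <- data.frame(
  outcome = seq_len(200) / 10,
  age     = rep(21:70, times = 4),
  rare    = factor(rep(c("no", "yes"), times = c(196, 4)))
)

prepped <- recipe(outcome ~ ., data = df) |>
  step_nzv(all_predictors()) |>
  prep()

# One row per removed predictor for the first step in the recipe.
removed <- tidy(prepped, number = 1)

removed$terms  # "rare"
```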
The terms column lists every flagged predictor; an empty tibble means nothing tripped both thresholds. The freq_cut argument controls how skewed a column must be before it counts.
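To see the effect of the cutoff, the sketch below prepares the same recipe twice on a hypothetical frame whose mostly column splits 180 to 20, a 9-to-1 ratio. The frame and the names default_names and strict_names are illustrative assumptions.

```r
library(recipes)

# Hypothetical frame: mostly is 180 "a" against 20 "b" (ratio 9).
df <- data.frame(
  outcome = seq_len(200) / 10,
  age     = rep(21:70, times = 4),
  mostly  = factor(rep(c("a", "b"), times = c(180, 20)))
)

base_rec <- recipe(outcome ~ ., data = df)

# Default freq_cut of 95/5 = 19: a ratio of 9 does not exceed it.
default_names <- base_rec |>
  step_nzv(all_predictors()) |>
  prep() |>
  bake(new_data = NULL) |>
  names()

# freq_cut = 4: a ratio of 9 now exceeds the cutoff.
strict_names <- base_rec |>
  step_nzv(all_predictors(), freq_cut = 4) |>
  prep() |>
  bake(new_data = NULL) |>
  names()

"mostly" %in% default_names  # TRUE, the column is kept
"mostly" %in% strict_names   # FALSE, the column is dropped
```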
At the default cutoff the mostly column survives because its 9-to-1 split is not extreme enough. Lowering freq_cut to 4 makes the filter stricter and the column is dropped. step_nzv() is also valuable right after step_dummy(), where rare factor levels produce skewed indicator columns.
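The pairing with step_dummy() might look like the sketch below. The plan factor, with 195 "basic" rows and 5 "pro" rows, is a hypothetical example; step_dummy() uses the first level as the reference, so the only indicator created is plan_pro.

```r
library(recipes)

# Hypothetical frame: plan has a rare "pro" level (5 of 200 rows).
df <- data.frame(
  outcome = seq_len(200) / 10,
  age     = rep(21:70, times = 4),
  plan    = factor(rep(c("basic", "pro"), times = c(195, 5)))
)

# Dummy coding alone keeps the skewed indicator.
dummies_only <- recipe(outcome ~ ., data = df) |>
  step_dummy(plan) |>
  prep() |>
  bake(new_data = NULL)

# Adding step_nzv() after step_dummy() screens the new indicator too.
with_nzv <- recipe(outcome ~ ., data = df) |>
  step_dummy(plan) |>
  step_nzv(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)

table(dummies_only$plan_pro)     # 195 zeros, 5 ones
"plan_pro" %in% names(with_nzv)  # FALSE
```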
step_dummy() turns plan into plan_pro, a column of 195 zeros and 5 ones. That column has variance, so step_zv() would keep it, but step_nzv() recognizes the 39-to-1 skew and removes it before it reaches the model.
For Python users, scikit-learn's VarianceThreshold removes only features whose variance falls below a fixed threshold; the closest match to step_nzv() is a custom filter on value-frequency ratios, the approach caret originally provided in R through nearZeroVar().
step_nzv() vs step_zv()
Both steps remove uninformative columns, but they apply different tests. Picking the wrong one either keeps near-dead columns or, worse, deletes useful predictors.
| Step | Removes | Decided from |
|---|---|---|
| step_zv() | Columns with exactly one unique value | Count of distinct values |
| step_nzv() | Columns that are nearly constant | freq_cut ratio and unique_cut percentage |
| step_corr() | Columns highly correlated with another | Pairwise correlation matrix |
Use step_zv() as a cheap, guaranteed-safe filter for strictly constant columns. Reach for step_nzv() when a column is overwhelmingly one value but not perfectly constant, such as a rare-event flag. Many pipelines run step_zv() first, then step_nzv() for the borderline cases.
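Running the two steps in sequence makes the division of labor visible. In this sketch the frame is hypothetical: constant holds a single value, and rare is 196 "no" against 4 "yes".

```r
library(recipes)

# Hypothetical frame with one constant and one nearly constant predictor.
df <- data.frame(
  outcome  = seq_len(200) / 10,
  age      = rep(21:70, times = 4),
  constant = rep(1, 200),
  rare     = factor(rep(c("no", "yes"), times = c(196, 4)))
)

prepped <- recipe(outcome ~ ., data = df) |>
  step_zv(all_predictors()) |>   # step 1: strictly constant columns
  step_nzv(all_predictors()) |>  # step 2: nearly constant columns
  prep()

zv_removed  <- tidy(prepped, number = 1)$terms
nzv_removed <- tidy(prepped, number = 2)$terms

zv_removed   # "constant"
nzv_removed  # "rare"
```

Because step_zv() runs first, the constant column is already gone by the time step_nzv() is trained, so each step reports only what it removed itself.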
Common pitfalls
Three mistakes account for most step_nzv() confusion.
- Expecting it to fire on small datasets. With only 5 rows, a column with just 2 distinct values is already 40 percent unique, well above the default unique_cut of 10. The filter rarely triggers until you have a few hundred rows or you raise unique_cut.
- Assuming one threshold is enough. A column needs both a high frequency ratio and a low unique percentage. A skewed numeric column with many distinct values stays above unique_cut and survives.
- Treating the removal list as fixed. Near-zero variance is judged from the data given to prep(). Under resampling, a column flagged in one fold may be kept in another.
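The second pitfall is easy to reproduce. In the hypothetical frame below, spend is zero in 150 rows and takes 50 distinct positive values in the rest, while flag is a 196-to-4 indicator; both column names are illustrative.

```r
library(recipes)

# spend: frequency ratio 150 (150 zeros vs. 1 of each other value),
#        but 51 distinct values out of 200 rows = 25.5 percent unique.
# flag:  frequency ratio 49 and 2 distinct values = 1 percent unique.
df <- data.frame(
  outcome = seq_len(200) / 10,
  spend   = c(rep(0, 150), seq(1, 50)),
  flag    = rep(c(0, 1), times = c(196, 4))
)

prepped <- recipe(outcome ~ ., data = df) |>
  step_nzv(all_predictors()) |>
  prep()

removed <- tidy(prepped, number = 1)$terms

removed  # "flag"; spend trips only the frequency test and is kept
```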
Call tidy() on the prepared recipe to confirm what was dropped.
Try it yourself
Try it: Build a recipe on the frame below, add step_nzv() to all predictors, and bake the training data. The signup column is 198 "free" against 2 "paid" and should disappear. Save the result to ex_baked.
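One possible solution is sketched below. The exercise frame is rebuilt here as an assumption: the name ex_df, the spend outcome, and the exact values of visits are hypothetical, chosen only so that signup is 198 "free" against 2 "paid".

```r
library(recipes)

# Hypothetical exercise frame (200 rows).
ex_df <- data.frame(
  spend  = seq_len(200) * 1.5,
  visits = rep(1:40, times = 5),
  signup = factor(rep(c("free", "paid"), times = c(198, 2)))
)

# Screen every predictor, train the recipe, then bake the training data.
ex_baked <- recipe(spend ~ ., data = ex_df) |>
  step_nzv(all_predictors()) |>
  prep() |>
  bake(new_data = NULL)

names(ex_baked)  # signup is gone; visits and spend remain
```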
Explanation: signup has a 99-to-1 frequency ratio and 1 percent distinct values, so both thresholds fire during prep() and bake() drops it. visits spreads across many values and stays.
Related recipes functions
step_nzv() is one filter in a larger column-cleaning toolkit. These steps often appear together in the same recipe:
- step_zv() removes strictly constant columns with exactly one unique value.
- step_corr() drops predictors highly correlated with another column.
- step_lincomb() removes columns that form exact linear combinations of others.
- step_rm() deletes specific columns you name by hand.
- step_dummy() converts factors into indicator columns, often placed before step_nzv().
FAQ
What does step_nzv() do in R?
step_nzv() is a recipes step that removes near-zero-variance predictors, columns that are overwhelmingly one value but not perfectly constant. During prep() it measures each selected column's frequency ratio and percentage of unique values. A column is flagged only when the frequency ratio exceeds freq_cut and the unique percentage falls below unique_cut. bake() then drops the flagged columns from training and new data alike.
What is the difference between step_zv() and step_nzv()?
step_zv() removes only strictly constant columns, those with exactly one unique value. step_nzv() is more aggressive: it removes columns that are almost constant, judged by a frequency ratio and a unique-value percentage. A column with 999 identical entries and one different entry survives step_zv() but is caught by step_nzv(). Use step_zv() for truly dead columns and step_nzv() for skewed ones.
What do freq_cut and unique_cut mean?
freq_cut is the cutoff for the ratio of the most common value's frequency to the second most common value's frequency; the default 95/5 equals 19. unique_cut is the cutoff for the percentage of distinct values out of all rows, defaulting to 10. A column is flagged as near-zero variance only when its frequency ratio is above freq_cut and its unique percentage is below unique_cut. Lower freq_cut or raise unique_cut to make the filter stricter.
Why did step_nzv() not remove any columns?
The most common reason is a small dataset. With few rows, the percentage of distinct values stays above the default unique_cut of 10, so the second criterion never passes. Either prepare the recipe on more data or raise unique_cut. The other reason is that a column trips only one threshold; both the frequency ratio and the unique percentage must qualify before a column is removed.