caret sbf() in R: Filter-Based Feature Selection
The caret sbf() function runs selection by filtering, a resampling-based feature selection method that scores each predictor with a univariate filter and keeps only those that pass before fitting a model.
    sbf(x, y, sbfControl = sbfControl(functions = lmSBF))    # filter, then fit a linear model
    sbf(x, y, sbfControl = sbfControl(functions = rfSBF))    # filter, then fit a random forest
    sbfControl(functions = lmSBF, method = "cv", number = 5) # 5-fold resampling
    model$optVariables                                       # variables kept
    predict(model, newdata)                                  # filter then predict
    model$results                                            # resampled performance
Need explanation? Read on for examples and pitfalls.
What sbf() does in one sentence
sbf() applies a univariate filter to every predictor, then fits a model on the survivors. The name stands for Selection By Filtering. For each predictor, a scoring function computes a statistic that measures the predictor's relationship with the outcome on its own. A filter rule then decides which predictors to keep, and the chosen modeling function fits only on that subset.
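The score, filter, and fit steps can be sketched by hand in base R. This is only an illustration of the idea, not caret's internal code: the simulated data, the helper name score_one, and the choice of a slope p-value as the score are all assumptions made for the example.

```r
# Illustrative data: two informative predictors and two noise columns (assumed).
set.seed(1)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))
names(x) <- c("real1", "real2", "noise1", "noise2")
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

# Score: p-value of the slope in a one-predictor linear model.
# score_one is a hypothetical helper, not a caret function.
score_one <- function(col, y) {
  summary(lm(y ~ col))$coefficients[2, "Pr(>|t|)"]
}
p_values <- vapply(x, score_one, numeric(1), y = y)

# Filter: keep predictors whose p-value is at or below the cutoff.
keep <- names(p_values)[p_values <= 0.05]

# Fit: model only the survivors.
fit <- lm(y ~ ., data = x[, keep, drop = FALSE])
keep
```

Each predictor is judged without reference to the others, which is why this kind of filter is fast but cannot detect redundancy.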
The important detail is where the filtering happens. sbf() repeats the entire filter-then-fit cycle inside each resampling fold, so the predictor set is re-selected on every training split. That keeps the performance estimate honest.
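The per-fold cycle can be sketched as a manual cross-validation loop in base R. This is a simplified illustration under assumed data and a hypothetical p_value helper, not a copy of what caret runs internally; it also assumes at least one predictor survives the filter in every fold.

```r
# Illustrative data (assumed): two informative predictors, two noise columns.
set.seed(1)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))
names(x) <- c("real1", "real2", "noise1", "noise2")
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
# Hypothetical univariate score: slope p-value from a one-predictor lm.
p_value <- function(col, y) summary(lm(y ~ col))$coefficients[2, "Pr(>|t|)"]

fold_rmse <- numeric(k)
fold_vars <- vector("list", k)
for (i in seq_len(k)) {
  train <- fold_id != i
  # Re-run the filter on the training rows only; held-out rows play no part.
  p <- vapply(x[train, , drop = FALSE], p_value, numeric(1), y = y[train])
  keep <- names(p)[p <= 0.05]
  fold_vars[[i]] <- keep
  # Fit on the survivors, then evaluate on the held-out fold.
  train_df <- data.frame(y = y[train], x[train, keep, drop = FALSE])
  fit <- lm(y ~ ., data = train_df)
  pred <- predict(fit, newdata = x[!train, keep, drop = FALSE])
  fold_rmse[i] <- sqrt(mean((y[!train] - pred)^2))
}
mean(fold_rmse)
```

Because the selected set can differ from fold to fold, the averaged RMSE reflects the uncertainty of the selection step as well as the model fit.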
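Filtering once on the full data set and only then resampling would let the held-out rows influence which predictors are kept, which inflates the performance estimate. sbf() avoids that leak by design.
sbf() syntax and arguments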
The core call passes predictors, an outcome, and a control object. The most common form uses the x / y interface, though a formula interface also exists.
| Argument | Description |
|---|---|
| x | Predictor data frame or matrix |
| y | Outcome vector; numeric triggers a regression filter, a factor triggers a classification filter |
| sbfControl | Control object built by sbfControl() |
| ... | Extra options passed to the modeling function |
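A minimal sketch of both interfaces, using the built-in swiss dataset as an assumed example (it is not part of the original walkthrough). Fertility is treated as the numeric outcome and the remaining columns as predictors.

```r
library(caret)

ctrl <- sbfControl(functions = lmSBF, method = "cv", number = 5)

# x / y interface: predictors and outcome passed separately.
set.seed(10)
fit_xy <- sbf(x = swiss[, -1], y = swiss$Fertility, sbfControl = ctrl)

# Formula interface: outcome on the left, predictors on the right.
set.seed(10)
fit_formula <- sbf(Fertility ~ ., data = swiss, sbfControl = ctrl)

fit_xy$optVariables
fit_formula$optVariables
```

Both calls describe the same problem, so the variables selected on the full training set should match.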
The behaviour of sbf() is driven almost entirely by sbfControl(). Its key arguments:
| Argument | Description |
|---|---|
| functions | A filter-and-fit function set: lmSBF, rfSBF, caretSBF, nbSBF, ldaSBF, treebagSBF |
| method | Resampling scheme: "boot", "cv", "repeatedcv", "LOOCV" |
| number | Number of folds or bootstrap resamples |
| repeats | Repeats when method = "repeatedcv" |
| saveDetails | Keep the per-resample selections for inspection |
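As an illustration, a control object that combines these arguments might look like the sketch below. The specific values (5 folds, 3 repeats) are arbitrary choices for the example.

```r
library(caret)

# Repeated 5-fold cross-validation with the linear-model function set.
ctrl <- sbfControl(
  functions   = lmSBF,
  method      = "repeatedcv",
  number      = 5,
  repeats     = 3,
  saveDetails = TRUE
)

# The control object is a plain list, so settings can be checked directly.
ctrl$method
ctrl$number
ctrl$repeats
```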
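Each function set bundles a score, filter, fit, and pred function. lmSBF scores each predictor with a linear-model p-value and fits a linear model, so it suits numeric outcomes. rfSBF fits a random forest on the surviving predictors; its filter step is still a univariate p-value test rather than forest importance, and it accepts numeric or factor outcomes. The default filter keeps predictors whose p-value is at or below 0.05.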
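Because a function set is an ordinary list, one way to change the cutoff is to copy a set and replace its filter element. The sketch below assumes a stricter threshold of 0.01; strictSBF is a name invented for the example.

```r
library(caret)

# Inspect the pieces bundled in the set.
names(lmSBF)

# Copy the set and swap in a stricter filter rule (0.01 is an assumed value).
strictSBF <- lmSBF
strictSBF$filter <- function(score, x, y) score <= 0.01

ctrl <- sbfControl(functions = strictSBF, method = "cv", number = 5)
```

The filter function receives the vector of scores and returns a logical vector marking which predictors to keep, so any rule expressible on the scores can be substituted here.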
A worked sbf() example
Start with data where you know the answer. The example below builds 8 predictors but only real1 and real2 actually drive the outcome. A good feature selector should recover those two and discard the noise.
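The original data-generating code is not reproduced in this text, so the block below is a reconstruction under assumptions: 200 rows, standard normal predictors, and coefficients of 3 and -2 on real1 and real2. Only the column names real1 and real2 and the count of 8 predictors come from the description above.

```r
set.seed(42)
n <- 200

# Eight predictors: two that drive the outcome and six pure-noise columns.
x <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(x) <- c("real1", "real2", paste0("noise", 1:6))

# The outcome depends only on real1 and real2 (coefficients are assumed).
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

str(x)
```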
With the data ready, build a control object and run sbf(). The lmSBF set fits a linear model, which matches the numeric outcome.
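A sketch of that call, again using an assumed simulation (200 rows, coefficients of 3 and -2) since the original code is not included here. Exact numbers in the output depend on the seed and the simulated data.

```r
library(caret)

# Assumed simulation: real1 and real2 drive y, the rest are noise.
set.seed(42)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(x) <- c("real1", "real2", paste0("noise", 1:6))
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

# 5-fold cross-validation with the linear-model function set.
ctrl <- sbfControl(functions = lmSBF, method = "cv", number = 5)

set.seed(42)
model <- sbf(x, y, sbfControl = ctrl)

model               # printed summary of selection and resampling
model$optVariables  # predictors kept on the full training set
model$results       # resampled performance
```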
The summary reports two things at once. The selection on the full training set kept exactly real1 and real2. The resampling section shows how stable that choice was: both true predictors survived in 100% of folds, while each noise variable slipped through only occasionally by chance.
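The per-resample selections can also be tabulated directly from the fitted object's variables component. The sketch below assumes the same simulated data as the reconstruction above; the rates it prints will vary with the seed.

```r
library(caret)

# Assumed simulation: real1 and real2 drive y, the rest are noise.
set.seed(42)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(x) <- c("real1", "real2", paste0("noise", 1:6))
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

set.seed(42)
model <- sbf(x, y, sbfControl = sbfControl(functions = lmSBF, method = "cv", number = 5))

# model$variables holds one character vector of surviving predictors per resample.
selection_rate <- sort(
  table(unlist(model$variables)) / length(model$variables),
  decreasing = TRUE
)
selection_rate
```

A predictor with a rate near 1 was selected in almost every resample; low rates point to variables that pass the filter only by chance.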
set.seed() makes the selected variables and the reported RMSE reproducible across runs.
To score new observations, call predict(). It applies the same filter and then the fitted model in one step.
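A short sketch of prediction on new rows, under the same assumed simulation. The new data frame must contain at least the selected columns; new_x here is made-up data with the same column names.

```r
library(caret)

# Assumed simulation: real1 and real2 drive y, the rest are noise.
set.seed(42)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(x) <- c("real1", "real2", paste0("noise", 1:6))
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

model <- sbf(x, y, sbfControl = sbfControl(functions = lmSBF, method = "cv", number = 5))

# Five new observations with the same columns as the training predictors.
new_x <- as.data.frame(matrix(rnorm(5 * 8), ncol = 8))
names(new_x) <- names(x)

# predict() subsets new_x to the selected variables before applying the model.
preds <- predict(model, newdata = new_x)
preds
```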
sbf() vs rfe(): choosing a feature selector
Use sbf() when predictors can be judged one at a time, and rfe() when they cannot. Both are caret wrappers that run feature selection inside resampling, but they search very differently.
| Aspect | sbf() | rfe() |
|---|---|---|
| Strategy | Univariate filter, each predictor scored alone | Recursive elimination, predictors ranked together |
| Speed | Fast, one pass per predictor | Slower, refits across subset sizes |
| Redundancy | Ignores it; keeps correlated predictors | Handles it through joint importance |
| Best for | Quick screening, many predictors | Final subset selection when interactions matter |
The decision rule is short. Reach for sbf() as a fast first screen, especially with a wide predictor matrix. Reach for caret rfe() when predictors interact or are correlated and you need the smallest subset that holds accuracy.
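To make the comparison concrete, the sketch below runs both wrappers on the same assumed simulated data. The candidate subset sizes passed to rfe() and the use of lmFuncs are choices made for the example, not recommendations.

```r
library(caret)

# Assumed simulation: real1 and real2 drive y, the rest are noise.
set.seed(42)
n <- 200
x <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(x) <- c("real1", "real2", paste0("noise", 1:6))
y <- 3 * x$real1 - 2 * x$real2 + rnorm(n)

# Univariate filter: each predictor is scored on its own.
set.seed(42)
sbf_fit <- sbf(x, y, sbfControl = sbfControl(functions = lmSBF, method = "cv", number = 5))

# Recursive elimination: predictors are ranked jointly at each subset size.
set.seed(42)
rfe_fit <- rfe(x, y, sizes = c(2, 4, 6),
               rfeControl = rfeControl(functions = lmFuncs, method = "cv", number = 5))

sbf_fit$optVariables
rfe_fit$optVariables
```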
Common pitfalls
The outcome type silently changes the filter. A numeric y runs a regression filter; a factor y runs a classification filter. Encoding classes as 0 and 1 leaves them numeric, so sbf() quietly applies the wrong filter.
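A quick check before calling sbf() avoids this trap. The class labels "no" and "yes" below are placeholders chosen for the example.

```r
# A 0/1 outcome stored as numbers is treated as a regression target.
y_num <- c(0, 1, 1, 0, 1, 0)
is.factor(y_num)   # FALSE

# Convert to a factor so the classification path is used.
y_fac <- factor(y_num, levels = c(0, 1), labels = c("no", "yes"))
is.factor(y_fac)   # TRUE
levels(y_fac)      # "no" "yes"
```

Passing y_fac rather than y_num as the outcome tells sbf() that the problem is classification.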
Correlated predictors are not removed. If two predictors carry the same signal and both pass the filter, sbf() will keep both. Pair it with findCorrelation() when collinearity is a concern.
Two more traps to avoid. First, do not pre-filter predictors against the outcome before calling sbf(); that reintroduces the selection bias the function exists to prevent. Second, the default p-value cutoff of 0.05 is a convention, not a law. With many predictors, a stricter threshold or a different functions set often yields a cleaner subset.
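For the correlated-predictor pitfall, findCorrelation() works on the predictor correlation matrix alone and never looks at the outcome. The sketch below uses made-up data with one near-duplicate column and an assumed cutoff of 0.9.

```r
library(caret)

# Made-up predictors: a_copy is almost identical to a, b is independent.
set.seed(7)
n <- 200
a <- rnorm(n)
x <- data.frame(a = a, a_copy = a + rnorm(n, sd = 0.05), b = rnorm(n))

# Column indices flagged for removal at the chosen correlation cutoff.
drop_idx <- findCorrelation(cor(x), cutoff = 0.9)
names(x)[drop_idx]

# Drop the flagged columns, guarding against the case where none are flagged.
x_reduced <- if (length(drop_idx) > 0) x[, -drop_idx, drop = FALSE] else x
names(x_reduced)
```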
Try it yourself
Try it: Use sbf() with the lmSBF filter to select predictors of mpg from the mtcars dataset. Save the fitted object to ex_sbf.
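One possible solution is sketched below. The seed and the 5-fold cross-validation settings are assumptions; the exercise only fixes the dataset, the lmSBF filter, and the object name ex_sbf.

```r
library(caret)

set.seed(123)
ex_sbf <- sbf(
  x = mtcars[, setdiff(names(mtcars), "mpg")],  # all columns except the outcome
  y = mtcars$mpg,
  sbfControl = sbfControl(functions = lmSBF, method = "cv", number = 5)
)

ex_sbf$optVariables
```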
Explanation: Every mtcars predictor is individually correlated with mpg, so the univariate filter keeps all ten. That is the expected behaviour: sbf() scores one variable at a time and never removes redundant predictors.
Related caret functions
sbf() sits inside a wider caret feature-selection toolkit. These functions cover the tasks sbf() deliberately leaves out.
- caret rfe(): recursive feature elimination for subset search
- caret varImp(): rank predictors inside an already-fitted model
- caret nearZeroVar(): drop predictors with almost no variance
- caret findCorrelation(): remove highly correlated predictors
- caret preProcess(): centre, scale, and transform predictors
See the official caret feature selection documentation for the full list of sbfControl function sets.
FAQ
What does sbf stand for in caret?
sbf stands for Selection By Filtering. It is a feature selection wrapper in the caret package that screens predictors with a univariate filter. Each predictor is scored on its own relationship with the outcome, a filter rule decides which ones to keep, and a model is fitted on the survivors. The whole cycle runs inside resampling so the reported performance is not biased by the selection step.
What is the difference between sbf and rfe in caret?
sbf() filters predictors univariately: each one is judged alone, which is fast but blind to redundancy. rfe() runs recursive feature elimination, ranking predictors jointly and removing the weakest in repeated passes. Use sbf() as a quick screen on wide data, and rfe() when predictors interact or correlate and you need the smallest accurate subset.
Does caret sbf prevent selection bias?
Yes, when used as intended. sbf() re-runs the filter inside every resampling fold, so the held-out data never influences which predictors are selected. That produces an honest performance estimate. The bias returns only if you filter predictors manually before calling sbf(), which defeats the purpose of the wrapper.
Which sbfControl functions should I use?
Match the function set to the outcome and to the model you want to fit. lmSBF suits numeric regression outcomes, while ldaSBF and nbSBF suit classification. rfSBF and treebagSBF fit tree ensembles (a random forest and bagged trees) and accept either outcome type. caretSBF is the most flexible because it delegates fitting to train(). Start with lmSBF or rfSBF and switch only if a different model better fits your data.