recipes step_bs() in R: B-Spline Basis Expansion
The recipes step_bs() function builds a B-spline basis for a numeric predictor in a tidymodels recipe, expanding one column into several that let a model fit a smooth, flexible curve. A B-spline puts no constraint on the curve beyond its outer knots, so it stays free to bend near the edges of the data.
    step_bs(rec, x)                              # default cubic B-spline, 3 df
    step_bs(rec, x, deg_free = 5)                # 5 columns, more flexible
    step_bs(rec, x, degree = 2)                  # quadratic pieces, not cubic
    step_bs(rec, x, y, deg_free = 4)             # expand several predictors
    step_bs(rec, x, keep_original_cols = TRUE)   # keep the input column
    step_bs(rec, all_numeric_predictors())       # expand every numeric predictor
Need explanation? Read on for examples and pitfalls.
What step_bs() does
step_bs() expands one column into a B-spline basis. It is a recipe step from the recipes package that replaces a numeric predictor with several columns describing a smooth, piecewise-polynomial curve. A model fed those columns can bend its prediction without you constructing spline terms by hand.
A B-spline, short for basis spline, splits the predictor range at interior knots. On each interval the curve is a polynomial of a chosen degree, and neighbouring pieces join smoothly where they meet. Unlike a natural spline, a B-spline places no constraint on the curve beyond the outer knots, so it keeps full freedom at the edges of the data.
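To make the knots-and-pieces picture concrete, here is a small sketch that calls splines::bs() directly, outside any recipe. The mtcars disp column and the choice of df = 5 are assumptions made purely for illustration.

```r
library(splines)

# Illustrative input: engine displacement from the built-in mtcars data
x <- mtcars$disp

# Cubic B-spline basis with 5 columns; with no knots supplied,
# bs() places the interior knots at quantiles of x
basis <- bs(x, df = 5, degree = 3)

dim(basis)             # rows = observations, columns = basis functions
attr(basis, "knots")   # interior knots: df - degree = 2 of them
```

Each column is one basis function, and a linear model fitted on those columns produces the piecewise-polynomial curve described above.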
degree sets the polynomial order of each piece, cubic by default. deg_free sets how many columns the predictor expands into. Together they decide how flexible the fitted curve can be.

step_bs() syntax and arguments
Most calls only set deg_free, and sometimes degree. You add step_bs() to a recipe pipeline after declaring variable roles with recipe().
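As a rough sketch, a call with the main arguments spelled out at their default values might look like this. The mtcars data and the mpg ~ disp formula are assumptions used for illustration.

```r
library(recipes)

# recipe() declares roles: mpg is the outcome, disp the predictor
rec <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(
    disp,                       # predictor(s) to expand
    deg_free = NULL,            # NULL falls back to the value of degree
    degree = 3,                 # cubic pieces
    options = list(),           # extra arguments for splines::bs()
    keep_original_cols = FALSE  # drop disp after expansion
  )

rec
```

Nothing is computed yet at this point; the basis is estimated when the recipe is passed to prep().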
The ... argument takes one or more predictors, named directly or through selectors like all_numeric_predictors(). deg_free controls how many spline columns each predictor produces. When it is left at NULL, step_bs() falls back to the value of degree, so a default call yields three columns.
degree sets the polynomial order on each interval. The default 3 gives cubic pieces, the most common choice; lower it to 2 or 1 for stiffer, simpler segments. options is a list forwarded to splines::bs(), where you can pass knots to place interior knots yourself instead of at the default quantiles. keep_original_cols defaults to FALSE, so step_bs() drops the input column and keeps only the basis.
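The following sketch shows one way to pass hand-picked knots through options while keeping the input column. The knot locations 150 and 300 are hypothetical values chosen only because they fall inside the range of disp in mtcars.

```r
library(recipes)

# Hypothetical interior knots at 150 and 300 (illustrative values only)
rec_knots <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(
    disp,
    options = list(knots = c(150, 300)),
    keep_original_cols = TRUE   # retain disp alongside the basis
  )

baked_knots <- bake(prep(rec_knots), new_data = NULL)
names(baked_knots)
```

Because keep_original_cols is TRUE here, disp should appear in the output next to the disp_bs_ columns.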
step_bs() examples by use case
Start with the default cubic expansion. Load recipes, declare a recipe, add the step, then prep() and bake() to inspect the result.
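A minimal sketch of that workflow, assuming the built-in mtcars data with mpg as the outcome and disp as the predictor:

```r
library(recipes)

# Default call: deg_free = NULL, degree = 3
rec <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(disp)

# prep() estimates the basis; bake(new_data = NULL) returns the training data
baked <- rec |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```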
The disp column is gone and three new columns take its place. They are named with the predictor, a _bs_ separator, and an index. Three columns appear because deg_free is NULL and degree is 3.
Raise deg_free for a more flexible curve. A higher value adds more spline columns and lets the fitted curve bend more often.
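One way to try this, again assuming mtcars and the disp predictor:

```r
library(recipes)

# Ask for five basis columns instead of the default three
rec5 <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(disp, deg_free = 5)

baked5 <- bake(prep(rec5), new_data = NULL)
names(baked5)
```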
Five degrees of freedom produce five columns, whatever the polynomial degree. Each extra column buys local flexibility, at the cost of more parameters for the model to estimate.
Lower the degree for stiffer polynomial pieces. The degree argument changes the shape of each segment without changing the column count, which deg_free still controls.
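A sketch with quadratic pieces and four degrees of freedom, using the same assumed mtcars setup:

```r
library(recipes)

# Quadratic segments; deg_free still decides how many columns are produced
rec_quad <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(disp, deg_free = 4, degree = 2)

baked_quad <- bake(prep(rec_quad), new_data = NULL)
names(baked_quad)
```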
Four columns appear because deg_free is 4. Setting degree = 2 makes each piece a quadratic rather than a cubic, so the curve is a little less wiggly between knots.
Expand several predictors in one call. List predictors in ... and step_bs() applies the same settings to each.
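For example, assuming disp and hp from mtcars as the two predictors and three degrees of freedom for each:

```r
library(recipes)

# Both predictors are expanded with the same deg_free and degree
rec_multi <- recipe(mpg ~ disp + hp, data = mtcars) |>
  step_bs(disp, hp, deg_free = 3)

baked_multi <- bake(prep(rec_multi), new_data = NULL)
names(baked_multi)
```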
Each predictor becomes three columns, so two predictors at three degrees of freedom yield six new features.
step_bs() vs step_ns() and step_poly()
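Three recipe steps add a non-linear basis for a single predictor, and they differ in how the curve behaves at the edges. step_bs() is the choice when you want maximum flexibility, including near the smallest and largest predictor values. step_interact() is listed for contrast: it combines predictors rather than bending one.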
| Step | Basis | Edge behavior |
|---|---|---|
| step_bs(x, deg_free = 5) | B-spline | unconstrained, free to swing at the edges |
| step_ns(x, deg_free = 4) | natural cubic spline | linear beyond boundary knots, stable tails |
| step_poly(x, degree = 4) | global polynomial | one curve, high degree oscillates near min and max |
| step_interact(~ a:b) | product of predictors | not a curve, captures combined effects |
The practical contrast is at the boundaries. A B-spline is free to follow the data right up to the outer knots, which helps when the relationship really does curve near the extremes. A natural spline trades that freedom for a straight-line tail that extrapolates more safely. step_poly() fits a single global polynomial, simple but prone to oscillation at high degree.
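To see the three curve-building steps side by side, the sketch below bakes each one on the same predictor and prints the resulting column names. The helper baked_names() is a hypothetical convenience function defined here, not part of recipes, and the mtcars setup is an assumption.

```r
library(recipes)

base_rec <- recipe(mpg ~ disp, data = mtcars)

# Hypothetical helper: prep a recipe and return the baked column names
baked_names <- function(rec) names(bake(prep(rec), new_data = NULL))

bs_names   <- baked_names(base_rec |> step_bs(disp, deg_free = 5))
ns_names   <- baked_names(base_rec |> step_ns(disp, deg_free = 4))
poly_names <- baked_names(base_rec |> step_poly(disp, degree = 4))

bs_names    # disp_bs_* columns
ns_names    # disp_ns_* columns
poly_names  # disp_poly_* columns
```

The column names only show which basis was built; the edge behavior itself becomes visible when you fit a model on each set and plot predictions near the smallest and largest disp values.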
In Python, the closest equivalents are SplineTransformer from scikit-learn and the bs() B-spline term in patsy. Both build a B-spline basis you drop into a modeling pipeline, just as step_bs() feeds a tidymodels workflow.

Common pitfalls
Most step_bs() surprises come from the dropped column and from over-flexing the basis. Three mistakes show up repeatedly.
First, expecting the original predictor to remain. Because keep_original_cols defaults to FALSE, a later step that references disp by name fails after step_bs() runs. Set keep_original_cols = TRUE or reorder the recipe so dependent steps run first.
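The sketch below reproduces this failure and one fix. step_log() stands in for any later step that selects disp by name, and the mtcars setup is an assumption.

```r
library(recipes)

# Fails at prep(): step_bs() has dropped disp before step_log() looks for it
broken <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(disp) |>
  step_log(disp)
result <- tryCatch(prep(broken), error = function(e) e)
inherits(result, "error")

# Fix: keep the original column so the later step can still select it
fixed <- recipe(mpg ~ disp, data = mtcars) |>
  step_bs(disp, keep_original_cols = TRUE) |>
  step_log(disp)
fixed_names <- names(bake(prep(fixed), new_data = NULL))
fixed_names
```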
Second, setting deg_free too high. A B-spline has no edge constraint, so with a large deg_free the fitted curve can swing wildly near the extremes and chase noise. Start at 3 or 4 and raise it only if validation error improves.
Third, applying step_bs() to a predictor with very few unique values. splines::bs() cannot place distinct interior knots when a column takes only two or three distinct values, so prep() can fail or return a degenerate basis. Check the distinct count before choosing deg_free.
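A quick base R check along these lines can flag risky columns before the step is added. The three mtcars columns and the cutoff of 5 distinct values are arbitrary assumptions for illustration.

```r
# Count distinct values per candidate predictor
candidates <- mtcars[c("disp", "hp", "cyl")]
distinct_counts <- sapply(candidates, function(col) length(unique(col)))
distinct_counts

# Flag columns below the (assumed) threshold of 5 distinct values
flagged <- names(distinct_counts)[distinct_counts < 5]
flagged
```

Here cyl takes only the values 4, 6, and 8, so it is the kind of column to leave out of step_bs().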
Try it yourself
Try it: Build a recipe on mtcars predicting mpg from hp, expand hp into a B-spline with five degrees of freedom, and bake the data. Save the result to ex_baked.
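One possible solution, sketched directly from the exercise statement:

```r
library(recipes)

# Expand hp into five B-spline columns and bake the training data
ex_baked <- recipe(mpg ~ hp, data = mtcars) |>
  step_bs(hp, deg_free = 5) |>
  prep() |>
  bake(new_data = NULL)

names(ex_baked)
```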
Explanation: A deg_free of 5 expands hp into five B-spline columns. step_bs() drops the original hp column because keep_original_cols defaults to FALSE.
Related recipes functions
These steps pair naturally with step_bs() in a preprocessing recipe. Each one handles a different feature-engineering need:
- step_ns() builds a natural spline basis when you want stable, linear tails.
- step_poly() adds polynomial terms, a smoother global alternative to splines.
- step_interact() multiplies predictors into interaction terms when combined effects matter.
- step_normalize() centers and scales predictors, useful before other transformations.
- recipe() defines the variable roles every step operates on.
See the official step_bs() reference for the full argument list.
FAQ
What does step_bs() do in a recipes pipeline?
step_bs() replaces a numeric predictor with a B-spline basis: several columns that together describe a smooth, piecewise-polynomial curve. A model that receives those columns can fit a non-linear relationship while still estimating ordinary linear coefficients. The step drops the original predictor by default and appends the new columns after the outcome variable in the baked data.
What is the difference between step_bs() and step_ns()?
Both expand a predictor into a spline basis, but they differ at the edges. step_bs() builds an unconstrained B-spline that is free to curve right up to the outer knots. step_ns() builds a natural spline that is forced to be linear beyond its boundary knots. Use step_bs() when the relationship genuinely bends near the extremes, and step_ns() when you want safer, more stable extrapolation.
How many columns does step_bs() create?
The column count equals deg_free. When deg_free is left at its NULL default, step_bs() falls back to the value of degree, so a plain step_bs(x) call produces three columns. Setting deg_free = 5 always yields five columns, whatever the degree. The new columns are named with the predictor, a _bs_ separator, and an index running from 1.
What value of deg_free should I use with step_bs()?
Start with 3 or 4, which captures gentle curvature, and raise it only when cross-validation shows a real improvement. A B-spline has no edge constraint, so with a large deg_free the fitted curve can overfit and swing near the extremes. When accuracy matters, treat deg_free as a tunable hyperparameter and let a resampling search pick it rather than fixing the value by eye.
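As a rough illustration of such a search, the sketch below runs a hand-rolled 4-fold cross-validation over a few deg_free values using only recipes and lm(). The fold assignment, the candidate values 3 to 6, and the helper cv_rmse() are all assumptions made for this example; a tidymodels workflow would normally delegate this to its tuning tools.

```r
library(recipes)

set.seed(42)
n_folds <- 4
folds <- sample(rep(seq_len(n_folds), length.out = nrow(mtcars)))

# Hypothetical helper: mean held-out RMSE for one deg_free value
cv_rmse <- function(n_df) {
  errs <- sapply(seq_len(n_folds), function(k) {
    train <- mtcars[folds != k, ]
    test  <- mtcars[folds == k, ]
    # Estimate the basis on the training fold only
    prepped <- recipe(mpg ~ disp, data = train) |>
      step_bs(disp, deg_free = n_df) |>
      prep()
    fit <- lm(mpg ~ ., data = bake(prepped, new_data = NULL))
    # Held-out values may fall outside the training range, which can warn
    test_baked <- suppressWarnings(bake(prepped, new_data = test))
    pred <- suppressWarnings(predict(fit, newdata = test_baked))
    sqrt(mean((test$mpg - pred)^2))
  })
  mean(errs)
}

candidates <- 3:6
scores <- sapply(candidates, cv_rmse)
data.frame(deg_free = candidates, rmse = scores)
```

With only 32 rows the estimates are noisy, so treat the ranking as a sanity check rather than a definitive choice.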