dials learn_rate() in R: Tune Boosting Learning Rate
The dials learn_rate() function in R defines the numeric hyperparameter for the per-iteration shrinkage applied in boosted tree and neural network models. It defaults to a log10-transformed range of 10^-10 to 10^-1, a deliberately wide search space that you will usually narrow before tuning.
    learn_rate()                                          # default log10 range -10 to -1
    learn_rate(range = c(-3, -1))                         # narrower band, still log10
    learn_rate(range = c(0.001, 0.3), trans = NULL)       # raw scale instead of log
    update(params, learn_rate = learn_rate(c(-4, -1)))    # override in param set
    grid_regular(learn_rate(c(-3, -1)), levels = 5)       # 5 log-spaced points
    boost_tree(trees = tune(), learn_rate = tune())       # tune the pair together
    finalize(params, train)                               # no-op for learn_rate, still safe
Need explanation? Read on for examples and pitfalls.
What learn_rate() does in one sentence
learn_rate() returns a dials parameter object describing the shrinkage applied to each boosting iteration or the step size used in gradient descent. It is the knob you tune when you mark learn_rate = tune() inside boost_tree() or mlp(). Smaller values force the model to take many small steps and usually need more trees to compensate. Larger values converge faster but risk overshooting the minimum and overfitting.
The function is part of the same dials family as trees(), tree_depth(), and min_n(). The one thing that sets it apart is the default log10 transform, which means the range you pass and the values the model actually sees live on different scales.
learn_rate() syntax and arguments
The signature has two arguments, and the default transform is easy to overlook.
| Argument | Description |
|---|---|
| range | Two-element numeric vector. Defaults to c(-10, -1) on the log10 scale, i.e. 10^-10 to 10^-1 on the natural scale. |
| trans | A transformation from the scales package. Default transform_log10(). Pass NULL to search on the raw scale instead. |
The return value is a quant_param S3 object. Print it to inspect the range, call value_seq() to draw points, or hand it to grid_*() helpers.
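As a minimal sketch of inspecting the object (assuming the dials package is installed; the variable names are illustrative), the snippet below builds a narrowed parameter and draws three points on both scales:

```r
library(dials)

# Narrow the default range; the bounds are log10 exponents
lr <- learn_rate(range = c(-3, -1))

print(lr)                      # summary of the parameter and its range
inherits(lr, "quant_param")    # the S3 class described above

# Three evenly spaced points in log10 units...
exponents <- value_seq(lr, n = 3, original = FALSE)
# ...and the same points back-transformed to the natural scale
rates <- value_seq(lr, n = 3, original = TRUE)

exponents
rates
```

The original argument is what switches between the transformed scale you pass to range and the natural scale the model receives.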
The default range c(-10, -1) expands to [10^-10, 0.1]. Few boosting problems benefit from learning rates below 0.001, so most authors tighten the range to c(-3, -1) (0.001 to 0.1) before tuning.
Examples by use case
Start with a tunable boosted tree, pair learn_rate with trees, then build a small grid.
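A sketch of that starting spec might look like the following. It assumes the tidymodels meta-package is installed, R 4.1 or later for the native pipe, and the xgboost engine in regression mode; the name spec is illustrative.

```r
library(tidymodels)

# Both arguments are placeholders; no values are chosen yet
spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

spec
```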
Extract the parameter set and tighten the learn_rate range from the wide default to something practical.
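One way to do this, sketched under the same assumptions (tidymodels installed, xgboost engine, illustrative variable names), is to extract the set and swap in a narrower learn_rate() object with update():

```r
library(tidymodels)

spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

# Collect every tune() placeholder into a parameter set
params <- extract_parameter_set_dials(spec)

# Replace the learn_rate entry; bounds are log10 exponents (0.001 to 0.1)
params <- update(params, learn_rate = learn_rate(range = c(-3, -1)))

# Confirm the new bounds on the transformed scale
lr_range <- range_get(
  extract_parameter_dials(params, "learn_rate"),
  original = FALSE
)
lr_range
```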
A regular grid over the pair samples five learn_rate values on a log scale and three tree counts on a linear scale.
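The grid step can be sketched directly from two parameter objects. The tree range of 100 to 1000 is an assumption chosen for illustration, not a value from the original example.

```r
library(dials)

# Assumed ranges: learn_rate from 10^-3 to 10^-1, trees from 100 to 1000
grid <- grid_regular(
  learn_rate(range = c(-3, -1)),
  trees(range = c(100, 1000)),
  levels = c(5, 3)   # five learn_rate values, three tree counts
)

nrow(grid)                      # 5 x 3 candidates
sort(unique(grid$learn_rate))   # log-spaced, reported on the natural scale
sort(unique(grid$trees))        # evenly spaced integers
```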
The grid's learn_rate column is on the natural scale, even though we passed the range on the log10 scale. dials handles the back-transform for you.
learn_rate() versus the raw boost_tree(learn_rate = 0.1) argument
Choose the form based on whether you are tuning or fitting a single model.
| Form | What it does | When to use |
|---|---|---|
| learn_rate() | Returns a parameter object for tune_grid() to sample | Hyperparameter tuning, parameter set construction |
| boost_tree(learn_rate = 0.1) | Fixes the rate at 0.1 for a single fit | Production, after tuning has chosen a value |
| boost_tree(learn_rate = tune()) | Marks the slot as tunable, leaves the range to dials | Inside a workflow you will pass to tune_grid() |
Behind the scenes, tune() is a placeholder. extract_parameter_set_dials() walks the spec, sees the placeholder, and asks dials for the default learn_rate() object. You only call learn_rate() directly when you want to override the default range or transform.
Common pitfalls
Four mistakes trip up most first attempts at tuning learn_rate.
- Reading the default range as natural-scale. c(-10, -1) looks like a tiny set of rates; it actually spans nine orders of magnitude on the natural scale. A 5-point regular grid hits 10^-10, 10^-7.75, 10^-5.5, 10^-3.25, 10^-1, and most of those values are too small to be useful for a real model.
- Tuning learn_rate without tuning trees. A fixed trees = 100 plus a small learn_rate yields a model that barely moves off the intercept. Always tune the pair together, or fix trees high enough (say 1500) and use stop_iter to find the right ensemble size during fit.
- Mixing log and raw scales. The range argument is read on the transformed scale, so learn_rate(range = c(0.001, 0.1)) with the default log10 transform is interpreted as exponents and silently samples between 10^0.001 and 10^0.1 (about 1.002 to 1.26). Natural-scale bounds work only with trans = NULL or when wrapped in log10(). A hand-built tibble grid is different: tune_grid() uses its values as given, on the natural scale, so learn_rate = c(0.001, 0.01, 0.1) in a tibble means exactly those rates.
- Calling finalize() and expecting it to do something. finalize() only fills in unknown() bounds. learn_rate has both bounds set by default, so finalize is a no-op. Harmless to call, but it is not the missing piece if your tune_grid still errors.
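The scale-related pitfalls can be reproduced in a few lines. This is a sketch that assumes only the dials package; the variable names are illustrative.

```r
library(dials)

# The default range, viewed as log10 exponents
value_seq(learn_rate(), n = 5, original = FALSE)   # -10, -7.75, -5.5, -3.25, -1

# Natural-scale numbers passed to `range` are read as log10 exponents
wrong <- value_seq(learn_rate(range = c(0.001, 0.1)), n = 3)
wrong   # every value exceeds 1, i.e. 10^0.001 up to 10^0.1

# Either convert the bounds to exponents...
right_log <- value_seq(learn_rate(range = log10(c(0.001, 0.1))), n = 3)
# ...or drop the transform and search on the raw scale
right_raw <- value_seq(learn_rate(range = c(0.001, 0.1), trans = NULL), n = 3)

right_log
right_raw
```

Both corrected versions cover 0.001 to 0.1; they differ only in whether the interior points are spaced geometrically or arithmetically.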
[0, 1]. lightgbm allows higher values up to 2 or 3 in practice. h2o has its own scaling. Always check the engine's accepted range before pushing the upper bound above -1 on the log10 scale.
Try it yourself
Try it: Build a tunable boost_tree on mtcars (regression on mpg), tighten learn_rate to the practical range 0.005 to 0.2 with log10 transform, and produce a regular grid with 4 learn_rate values and 3 tree counts. Print the grid.
Solution
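One possible solution is sketched below. It assumes tidymodels is installed and leaves trees() at its default range, since the exercise does not specify one; mtcars itself is only needed once the workflow is fitted or tuned.

```r
library(tidymodels)

# Tunable boosted tree for regression on mpg
spec <- boost_tree(trees = tune(), learn_rate = tune()) |>
  set_engine("xgboost") |>
  set_mode("regression")

wf <- workflow() |>
  add_model(spec) |>
  add_formula(mpg ~ .)

# log10() converts the natural-scale bounds to the exponents dials expects
params <- extract_parameter_set_dials(wf) |>
  update(learn_rate = learn_rate(range = log10(c(0.005, 0.2))))

# Named levels: 4 learn_rate values crossed with 3 tree counts
grid <- grid_regular(params, levels = c(trees = 3, learn_rate = 4))
print(grid)
```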
Explanation: Wrapping the raw bounds in log10() keeps the parameter on the log10 scale so the four sampled values are spaced geometrically between 0.005 and 0.2. Crossed with three tree counts, the grid has 12 candidates.
Related tidymodels functions
learn_rate() is rarely tuned alone; it usually appears alongside a small cluster of related functions.
- trees() to set the ensemble size that pairs with it. Small learn_rate needs many trees.
- tree_depth() to control individual tree complexity in boosting.
- stop_iter() to halt boosting once validation stops improving, which protects you from oversized tree counts.
- loss_reduction() to tune the minimum gain required to split a node (gamma in xgboost).
- extract_parameter_set_dials() to pull every tunable parameter from a workflow at once.
- grid_regular(), grid_space_filling(), grid_max_entropy() to expand the parameter set into a candidate tibble.
External reference: the official dials documentation at dials.tidymodels.org.
FAQ
What is a good default learning rate for xgboost in R?
For most tabular regression and classification problems, a learn_rate between 0.01 and 0.05 paired with 500 to 2000 trees lands close to the optimum. Smaller rates like 0.005 work when the signal is subtle and you can afford the compute. Rates above 0.1 tend to overfit on small datasets and skip past the minimum on noisy data. xgboost's own default is 0.3, which favors speed over accuracy; lower it before you tune anything else.
Why does learn_rate() use a log scale by default?
Because the practically useful range spans two to three orders of magnitude, from roughly 0.001 to 0.3. A linear grid would waste most candidates in the upper half of the range, where models overfit or fail to converge. A log10 grid spaces candidates geometrically, so a 5-point grid over that range hits roughly 0.001, 0.004, 0.017, 0.072, 0.3 instead of 0.001, 0.076, 0.15, 0.225, 0.3. The log transform keeps the search efficient even when the optimal rate is close to the lower end.
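To see the difference in spacing, the sketch below draws five points between 0.001 and 0.3 on each scale. It assumes only the dials package, and the 0.05 cutoff is an arbitrary threshold chosen for illustration.

```r
library(dials)

# Same endpoints (0.001 to 0.3), two different scales
log_grid <- value_seq(learn_rate(range = log10(c(0.001, 0.3))), n = 5)
lin_grid <- value_seq(learn_rate(range = c(0.001, 0.3), trans = NULL), n = 5)

signif(log_grid, 2)   # geometric spacing
signif(lin_grid, 2)   # arithmetic spacing

# Count the candidates at or below 0.05 on each scale
sum(log_grid <= 0.05)
sum(lin_grid <= 0.05)
```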
How is dials learn_rate() different from setting learn_rate in xgboost directly?
dials learn_rate() builds a parameter object that tune_grid() can sample across many candidate values. Setting boost_tree(learn_rate = 0.1) fixes the rate at one value for a single fit. The dials version is what you reach for during hyperparameter search. The direct value is what you write into your production spec once tuning has chosen a winner.
Do I need finalize() with learn_rate() like I do with mtry()?
No. learn_rate() ships with both bounds set, so is_unknown() returns FALSE on both endpoints. finalize() only replaces unknown bounds with data-derived values, so calling it on learn_rate is harmless but does nothing. mtry() is different because its upper bound depends on the predictor count, which dials cannot know until you hand it data.
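A short sketch of the contrast, assuming the dials package and using the ten non-mpg columns of mtcars as a stand-in predictor set:

```r
library(dials)

has_unknowns(learn_rate())   # both bounds are already set
has_unknowns(mtry())         # the upper bound is unknown()

# finalize() leaves the learn_rate bounds as they were...
lr_final <- finalize(learn_rate(), mtcars[, -1])
lr_bounds <- range_get(lr_final, original = FALSE)
lr_bounds

# ...but fills in mtry's upper bound from the number of predictor columns
mtry_final <- finalize(mtry(), mtcars[, -1])
mtry_bounds <- range_get(mtry_final)
mtry_bounds
```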
Can I tune learn_rate for a neural network instead of boosting?
Yes. mlp(learn_rate = tune()) with set_engine("brulee") accepts the same dials learn_rate() parameter. The parameter is still searched on the log10 scale, though neural networks often want a narrower band like c(-4, -2). The mechanics are identical: extract the parameter set, optionally update the range, build a grid, fit through tune_grid().
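The same steps for a neural network might look like this. The sketch assumes tidymodels is installed and uses the brulee engine, which documents learn_rate as a tunable argument; the names nn_spec, nn_params, and nn_grid are illustrative.

```r
library(tidymodels)

# Assumption: brulee engine in regression mode
nn_spec <- mlp(learn_rate = tune()) |>
  set_engine("brulee") |>
  set_mode("regression")

# Narrow the band to 10^-4 through 10^-2
nn_params <- extract_parameter_set_dials(nn_spec) |>
  update(learn_rate = learn_rate(range = c(-4, -2)))

nn_grid <- grid_regular(nn_params, levels = 3)
nn_grid
```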