parsnip cubist_rules() in R: Rule-Based Regression Models
The parsnip cubist_rules() function defines a Cubist rule-based regression model for tidymodels. It gives you one interface for a model that splits the data into rules and fits a separate linear regression inside each one, through the Cubist engine.
cubist_rules()                             # default spec, Cubist engine
cubist_rules() |> set_mode("regression")   # the only supported mode
cubist_rules(committees = 10)              # boost with 10 committees
cubist_rules(neighbors = 5)                # instance-based correction
cubist_rules(max_rules = 20)               # cap the number of rules
cubist_rules() |> set_engine("Cubist")     # name the engine explicitly
fit(spec, mpg ~ ., data = mtcars)          # train on a dataset
Need explanation? Read on for examples and pitfalls.
What cubist_rules() does
cubist_rules() is a model specification, not a fitted model. It records your choice of a Cubist rule-based regression model and its hyperparameters, but no data touches it until you call fit(). This separation lets you reuse one specification across many datasets or resampling folds.
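The split between specification and fit can be sketched as follows. This is a minimal illustration, not the article's original code: the two halves of mtcars and the names spec, fit_a and fit_b are assumptions chosen only to show one specification being reused.

```r
library(tidymodels)
library(rules)   # registers the Cubist engine

# A specification only: no data is involved yet
spec <- cubist_rules() |>
  set_engine("Cubist") |>
  set_mode("regression")

# Reuse the same specification on two different subsets (illustrative split)
fit_a <- fit(spec, mpg ~ wt + hp, data = mtcars[1:16, ])
fit_b <- fit(spec, mpg ~ wt + hp, data = mtcars[17:32, ])

class(spec)    # a parsnip model specification
class(fit_a)   # a fitted parsnip model
```

The specification object stays unchanged after each fit() call, which is why it can be passed to resampling tools later.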
A Cubist model first grows a tree, then collapses that tree into a flat list of if-then rules. What sets it apart is the leaf: instead of predicting a single constant, each rule holds its own linear regression model. A prediction is the output of the rule the new row matches, and when more than one rule matches, the outputs of the matching rules are averaged.
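To see the rules and their linear models, one option is to pull the underlying engine object out of a fitted model and summarize it. The sketch below assumes the rules and Cubist packages are installed; the names cub_fit and engine_fit are illustrative.

```r
library(tidymodels)
library(rules)

cub_fit <- cubist_rules() |>
  set_engine("Cubist") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

# Extract the underlying Cubist object from the parsnip fit
engine_fit <- extract_fit_engine(cub_fit)

# Prints each rule's conditions and the linear model attached to it
summary(engine_fit)
```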
The function belongs to the tidymodels framework. Because parsnip standardizes the interface, cubist_rules() shares the same fit() and predict() verbs used by every other parsnip model.
cubist_rules() syntax and arguments
cubist_rules() takes three hyperparameters, and two setup verbs complete the specification. The arguments shape how many rule committees the model builds and how predictions are adjusted, while set_engine() and set_mode() name the engine and the mode.
The committees argument sets how many sequential rule sets the model builds, much like boosting iterations, where each committee corrects the errors of the last. The neighbors argument turns on instance-based correction: at prediction time the model nudges its estimate toward the known outcomes of similar training rows. The max_rules argument caps how many rules survive, so a smaller value gives a simpler model.
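A specification that sets all three hyperparameters might look like the sketch below. The values 10, 5 and 20 are only examples taken from the cheat sheet at the top, and translate() is used here to show how parsnip would map the arguments to the engine call.

```r
library(tidymodels)
library(rules)

spec <- cubist_rules(committees = 10, neighbors = 5, max_rules = 20) |>
  set_engine("Cubist") |>
  set_mode("regression")

spec              # prints the specification and its arguments
translate(spec)   # shows the engine-level call parsnip would build
```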
Load library(rules) alongside library(tidymodels) to register the Cubist engine. The fitting itself happens in the Cubist engine package, so install Cubist before you call fit() or R reports that the engine is not available. When the hyperparameters are left NULL, the Cubist engine falls back to its own defaults, such as a single committee and no neighbor correction.
Fit a Cubist model: four examples
Every example below uses the built-in mtcars dataset. Cubist supports regression only, so all four examples predict the numeric mpg column and the code runs anywhere with no downloads.
Example 1: Fit a regression Cubist on mtcars
Build the specification, then fit it to data. Leaving the hyperparameters unset lets the Cubist engine choose the rules and their linear models.
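The original listing is not preserved here, so the following is a minimal sketch of the step described: build the default specification, fit it to mtcars, and compute the training R-squared as the squared correlation between predicted and actual mpg. The object names are assumptions.

```r
library(tidymodels)
library(rules)

cubist_spec <- cubist_rules() |>
  set_engine("Cubist") |>
  set_mode("regression")

cub_fit <- fit(cubist_spec, mpg ~ ., data = mtcars)

# Training R-squared: squared correlation of predicted and actual mpg
train_pred <- predict(cub_fit, new_data = mtcars)
cor(train_pred$.pred, mtcars$mpg)^2
```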
Squaring the correlation between predicted and actual mpg gives a training R-squared near 0.87. On data this small Cubist forms a single rule, so the result is one linear model over the strongest predictors.
Example 2: Predict mpg for new rows
predict() returns a tidy tibble with one row per input row. Each prediction comes from the linear model inside the matching rule.
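A minimal sketch of the prediction step, assuming sample_rows is simply the first three rows of mtcars (the original choice of rows is not preserved):

```r
library(tidymodels)
library(rules)

cub_fit <- cubist_rules() |>
  set_engine("Cubist") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

# Assumed stand-in for "new" data: the first three cars
sample_rows <- mtcars[1:3, ]

preds <- predict(cub_fit, new_data = sample_rows)
preds

# Row order is preserved, so predictions can be bound back to the inputs
bind_cols(sample_rows, preds)
```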
The .pred column holds the predicted miles per gallon as a number. The output keeps the same row order as the input, so you can bind it back to sample_rows with bind_cols().
Example 3: Boost accuracy with committees
Raise committees and Cubist builds several rule sets in sequence. Each new committee focuses on the rows the previous ones predicted poorly.
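As a sketch of this example, the code below fits a specification with committees = 10 and computes the same training R-squared as before. Object names are illustrative.

```r
library(tidymodels)
library(rules)

committee_fit <- cubist_rules(committees = 10) |>
  set_engine("Cubist") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

# Training R-squared with ten committees
committee_pred <- predict(committee_fit, new_data = mtcars)
cor(committee_pred$.pred, mtcars$mpg)^2
```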
With ten committees the training R-squared climbs to about 0.91. The committees behave like boosting rounds, so accuracy usually rises before it plateaus.
Example 4: Adjust predictions with neighbors
Set neighbors and Cubist corrects each prediction using similar training rows. The model blends its rule-based estimate with the outcomes of the nearest neighbors.
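A minimal sketch of the neighbor correction, assuming neighbors = 5 and comparing against a fit without the correction on the same rows. The comparison is an addition for illustration, not part of the original example.

```r
library(tidymodels)
library(rules)

plain_fit <- cubist_rules() |>
  set_engine("Cubist") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

neighbor_fit <- cubist_rules(neighbors = 5) |>
  set_engine("Cubist") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

new_rows <- mtcars[1:3, ]

# Side by side: rule-based estimate vs neighbor-adjusted estimate
data.frame(
  rules_only     = predict(plain_fit, new_data = new_rows)$.pred,
  with_neighbors = predict(neighbor_fit, new_data = new_rows)$.pred
)
```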
With neighbors = 5, each prediction shifts toward the average outcome of the five most similar cars. This often sharpens accuracy when the rule-based estimate alone is slightly off.
Tuning committees together with neighbors values of 0, 5 and 9 usually finds a better model than tuning either argument on its own.
cubist_rules() vs boost_tree() vs decision_tree()
parsnip ships several tree-based ways to fit a numeric target. They share the same verbs, so swapping between them is a one-line change.
| Function | What each leaf predicts | Output style | Use when |
|---|---|---|---|
| decision_tree() | One constant value | A single readable tree | You need a simple, explainable model |
| cubist_rules() | A linear regression model | A flat list of rules | Rules plus sloped predictions help |
| boost_tree() | Many trees summed | An ensemble score | Raw accuracy matters most |
Start with decision_tree() when explainability is the goal, reach for cubist_rules() when you want rules that still respond smoothly to the predictors, and use boost_tree() when only predictive accuracy counts.
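The swap between model types can be sketched by keeping the fitting code fixed and changing only the specification. This sketch assumes the rpart package is available for decision_tree(); boost_tree() is left out because its default engine needs the separate xgboost package.

```r
library(tidymodels)
library(rules)

specs <- list(
  tree   = decision_tree() |> set_engine("rpart") |> set_mode("regression"),
  cubist = cubist_rules() |> set_engine("Cubist") |> set_mode("regression")
)

# Same fit() call for every specification
fits <- lapply(specs, function(s) fit(s, mpg ~ ., data = mtcars))

# Same predict() call as well; returns one training R-squared per model
sapply(fits, function(f) {
  cor(predict(f, new_data = mtcars)$.pred, mtcars$mpg)^2
})
```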
Common pitfalls
Three mistakes catch most newcomers to cubist_rules(). Each one below shows the problem and the fix.
The most common is asking for classification. Cubist is a regression-only algorithm, so set_mode("classification") fails. Reach for C5_rules() from the same rules package when you need rule-based classification.
The second pitfall is forgetting library(rules). The Cubist engine is registered by the rules extension package, so with a plain library(tidymodels) the model cannot be fit: depending on the parsnip version, R reports either that no engine implementation is available or that the function is not found. The third is passing a neighbors value outside 0 to 9, since the Cubist engine only supports that range.
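The first pitfall can be reproduced safely by catching the error instead of letting it stop the script. This is a small sketch: the exact wording of the message depends on the installed parsnip version, so it is printed rather than quoted here.

```r
library(tidymodels)
library(rules)

# Asking for classification raises an error; capture its message
msg <- tryCatch(
  cubist_rules() |> set_mode("classification"),
  error = function(e) conditionMessage(e)
)
msg

# The supported setup: regression mode with the Cubist engine
ok_spec <- cubist_rules() |>
  set_engine("Cubist") |>
  set_mode("regression")
```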
With many committees on a dataset as small as mtcars the model can start memorizing noise. Keep committees modest and tune it across resampling folds with vfold_cv() rather than trusting training R-squared.
Try it yourself
Try it: Fit a Cubist regression model on mtcars with committees = 5, then predict mpg for the first row. Save the prediction to ex_pred.
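One possible solution, written as a sketch (the name ex_fit is an assumption; ex_pred comes from the exercise):

```r
library(tidymodels)
library(rules)

ex_fit <- cubist_rules(committees = 5) |>
  set_engine("Cubist") |>
  set_mode("regression") |>
  fit(mpg ~ ., data = mtcars)

# Predict mpg for the first row of mtcars
ex_pred <- predict(ex_fit, new_data = mtcars[1, ])
ex_pred
```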
Explanation: Setting committees = 5 builds five sequential rule sets that each correct the last. Row 1 of mtcars is the Mazda RX4, whose true mpg is 21, so the Cubist prediction lands close.
Related parsnip functions
cubist_rules() works alongside the rest of the parsnip model family. These functions cover the neighboring tasks in a tidymodels project.
- C5_rules() defines a C5.0 rule-based model for classification problems.
- rule_fit() defines a RuleFit model that mixes ensemble rules with a lasso fit.
- decision_tree() defines a single tree of axis-aligned splits.
- boost_tree() defines a gradient-boosted ensemble of trees.
- set_engine() chooses the computational backend for any specification.
FAQ
What package is cubist_rules() in?
cubist_rules() is tied to the rules package, a parsnip extension, so you need library(rules) in addition to library(tidymodels). In recent parsnip releases the specification function itself is exported by parsnip, while rules registers the Cubist engine for it. The function only describes the model; the actual fitting happens in an engine package. The standard registered engine is Cubist, which implements Quinlan's Cubist algorithm, so install the Cubist package separately before you call fit().
What is the difference between cubist_rules() and decision_tree()?
decision_tree() predicts a single constant value in each leaf, so its surface is a set of flat steps. cubist_rules() instead fits a separate multiple linear regression inside every rule, so predictions still slope with the predictors within a rule. Cubist also builds committees, a boosting-style ensemble of rule sets. Choose decision_tree() for a simple, readable model and cubist_rules() when sloped, rule-based predictions improve accuracy.
Does cubist_rules() support classification?
No. Cubist is a regression-only algorithm, so cubist_rules() accepts only set_mode("regression") and fit() fails on a factor outcome. For rule-based classification, use C5_rules() from the same rules package, which wraps the C5.0 algorithm.
What do committees do in cubist_rules()?
The committees argument sets how many rule sets Cubist builds in sequence. The first committee fits the data, and each later committee adjusts its training targets to focus on the rows the previous committees predicted poorly, much like boosting rounds. More committees usually raise accuracy until the gain plateaus.
How do I tune cubist_rules() hyperparameters?
Mark the arguments with tune(), as in cubist_rules(committees = tune(), neighbors = tune()), then pass the specification to tune_grid() with a resampling object such as vfold_cv(). The framework scores a grid of committee counts and neighbor values with cross-validation. Use select_best() to pick the winner, then finalize_workflow() to lock the values before the final fit.
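The tuning steps above can be sketched as follows. The grid values, the five folds, the seed, and the choice of RMSE as the selection metric are assumptions made for illustration; the specification is wrapped in a workflow so that finalize_workflow() applies.

```r
library(tidymodels)
library(rules)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

tune_spec <- cubist_rules(committees = tune(), neighbors = tune()) |>
  set_engine("Cubist") |>
  set_mode("regression")

wf <- workflow() |>
  add_formula(mpg ~ .) |>
  add_model(tune_spec)

# Illustrative grid: every combination of committees and neighbors
grid <- expand.grid(
  committees = c(1L, 5L, 10L),
  neighbors  = c(0L, 5L, 9L)
)

res <- tune_grid(wf, resamples = folds, grid = grid)

# Pick the combination with the lowest cross-validated RMSE
best <- select_best(res, metric = "rmse")

# Lock the chosen values and refit on the full data
final_fit <- finalize_workflow(wf, best) |>
  fit(data = mtcars)
```

On a dataset this small the fold-level metrics are noisy, so the selected values are best read as a demonstration of the mechanics rather than a recommendation.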
For the full argument reference, see the rules cubist_rules() documentation.