parsnip predict() in R: Score New Data With a Fit
The parsnip predict() function in R scores new data from a fitted model and returns a tidy tibble. You pass a model_fit object and a new_data frame, and predict() returns one row per input row with standardized .pred columns.
predict(model_fit, new_data = df)                    # default prediction
predict(model_fit, new_data = df, type = "class")    # predicted class label
predict(model_fit, new_data = df, type = "prob")     # class probabilities
predict(model_fit, new_data = df, type = "conf_int") # confidence interval
predict(model_fit, new_data = df, type = "pred_int") # prediction interval
predict(model_fit, new_data = df, type = "raw")      # raw engine output
augment(model_fit, new_data = df)                    # predictions joined to data
Need explanation? Read on for examples and pitfalls.
What predict() does
predict() turns a fitted model into predictions on new data. It takes a model_fit object produced by fit() and a data frame of rows to score, then returns a tibble of predicted values. The parsnip method behind it is predict.model_fit().
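The headline feature is consistency. The underlying prediction methods each return a different shape: predict() on an lm gives a named vector, on a glm it gives link-scale values unless you ask for type = "response", and ranger gives a list-like object. parsnip wraps all of them so predict() returns a tibble with the same number of rows as new_data, in the same order. The one exception is type = "raw", which passes the engine output through unchanged.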
That tidy output is what makes the result safe to bind back onto your data with bind_cols(). No row scrambling, no length mismatches, no guessing which column holds the answer.
Because the rows of the output match new_data exactly, you can column-bind the result straight onto the input frame. This single guarantee removes a common class of prediction bugs in base R.
predict() syntax and arguments
predict() needs a fitted model and new data, with type controlling the output. Every other argument is optional.
The object argument is the result of fit(). The new_data argument holds the rows you want scored, and it must contain every predictor column the model was trained on. When type is NULL, parsnip picks a sensible default: "numeric" for regression and "class" for classification.
The type argument decides which prediction you get. Setting type = "prob" returns class probabilities, while type = "conf_int" returns interval bounds. The column names in the returned tibble follow from the type, so the output is predictable before you run it.
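The default-type rule described above can be checked with a small sketch. The model formula (mpg ~ wt + hp) and the object names are illustrative assumptions, not part of the parsnip API.

```r
library(parsnip)

# Illustrative regression fit; the formula is an assumption for this sketch.
reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

new_rows <- head(mtcars, 3)

# type is left as NULL, so parsnip falls back to the mode's default.
default_pred <- predict(reg_fit, new_data = new_rows)

# The same request with the type spelled out.
numeric_pred <- predict(reg_fit, new_data = new_rows, type = "numeric")

identical(default_pred, numeric_pred)
names(default_pred)
```

Spelling out the type is optional for point predictions, but it can make the intent clearer in scripts that mix several prediction types.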
Predict from a model: four examples
Each example below uses a built-in R dataset. The mtcars data drives regression, and a factor version of its am column drives classification, so the code runs anywhere with no downloads.
Example 1: Predict numeric values
For a regression fit, predict() returns a single .pred column. This is the default, so no type argument is needed.
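A minimal sketch of this case, assuming a linear model of mpg on wt and hp scored on the first five rows of mtcars (the formula and object names are illustrative choices):

```r
library(parsnip)

# Fit a linear regression with the lm engine.
reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

# Score five rows; no type argument, so the default for regression applies.
new_cars <- mtcars[1:5, ]
reg_pred <- predict(reg_fit, new_data = new_cars)
reg_pred
```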
The result is a five-row tibble with one .pred column, matching the five rows passed in. Predicted fuel economy falls as weight and horsepower rise, which tracks the data.
Example 2: Predict the class label
For classification, predict() returns a .pred_class factor. The fitted spec must be a classifier and the outcome a factor.
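A minimal sketch, assuming a logistic regression of a factor version of am on wt and hp (the predictors and object names are illustrative choices):

```r
library(parsnip)

# The outcome must be a factor for a classification fit.
cars <- mtcars
cars$am <- factor(cars$am)

cls_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ wt + hp, data = cars)

# Ask for hard class labels on five rows.
cls_pred <- predict(cls_fit, new_data = cars[1:5, ], type = "class")
cls_pred
```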
The .pred_class column holds the predicted label for each row. For a two-class logistic regression, parsnip applies the standard 0.5 probability cutoff to turn the model output into a class.
Example 3: Predict class probabilities
Set type = "prob" to get one probability column per class. The columns are named .pred_<level> using the factor levels.
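A minimal sketch, assuming the same kind of logistic regression on a factor version of am (predictors and object names are illustrative). With factor levels "0" and "1", the columns come out as .pred_0 and .pred_1.

```r
library(parsnip)

cars <- mtcars
cars$am <- factor(cars$am)

cls_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ wt + hp, data = cars)

# One probability column per factor level.
prob_pred <- predict(cls_fit, new_data = cars[1:5, ], type = "prob")
names(prob_pred)

# Each row's probabilities should add up to one.
rowSums(prob_pred)
```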
Each row's probabilities sum to one. Use this output when you need a score rather than a hard label, for example to set a custom decision threshold or to compute roc_auc().
Example 4: Join predictions back to the data
Because predict() preserves row order, bind_cols() lines predictions up with actuals. This is the standard way to build a results table for plotting or scoring.
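A minimal sketch of that pattern, assuming dplyr is installed and using an illustrative mpg ~ wt + hp fit:

```r
library(parsnip)
library(dplyr)

reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

new_cars <- mtcars[1:5, ]

# Keep the actual outcome and predictors, then attach the prediction column.
results <- bind_cols(
  select(new_cars, mpg, wt, hp),
  predict(reg_fit, new_data = new_cars)
)
results
```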
The augment() function does this same join in one call and, for regression, also adds a .resid column when new_data contains the outcome. Use bind_cols() when you want only the prediction, and augment() when you want the full annotated frame.
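The augment() route can be sketched as follows, again with an illustrative mpg ~ wt + hp fit. Column order in the result may differ between parsnip versions, so the sketch checks names rather than positions.

```r
library(parsnip)

reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

# new_data includes mpg, so a residual can be computed alongside the prediction.
augmented <- augment(reg_fit, new_data = mtcars[1:5, ])

c(".pred", ".resid") %in% names(augmented)
```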
Prediction types in parsnip
The type argument is the same across every engine, which is the whole point of parsnip. A decision_tree() and a rand_forest() accept the identical type values even though their engines differ.
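One way to see this is to request the same type from two different engines. The sketch below swaps in logistic_reg() with glm alongside decision_tree() with rpart, since both ship with a standard R installation; the formula and object names are illustrative assumptions.

```r
library(parsnip)

cars <- mtcars
cars$am <- factor(cars$am)

# Two different model types and engines, same outcome and predictors.
glm_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(am ~ wt + hp, data = cars)

tree_fit <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification") |>
  fit(am ~ wt + hp, data = cars)

# The same type value yields the same column names from both fits.
glm_prob  <- predict(glm_fit,  new_data = cars[1:5, ], type = "prob")
tree_prob <- predict(tree_fit, new_data = cars[1:5, ], type = "prob")

identical(names(glm_prob), names(tree_prob))
```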
| type | Returns columns | Works with |
|---|---|---|
| "numeric" | .pred | Regression models |
| "class" | .pred_class | Classification models |
| "prob" | .pred_<level> | Classification models |
| "conf_int" | .pred_lower, .pred_upper | Regression, engine permitting |
| "pred_int" | .pred_lower, .pred_upper | Regression, engine permitting |
| "raw" | Engine-native output | Any model |
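The two interval rows in the table can be tried with the lm engine, which supports both. This is a sketch with an illustrative mpg ~ wt + hp fit; it relies on the default interval level rather than setting one.

```r
library(parsnip)

reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

new_cars <- mtcars[1:5, ]

# Interval for the mean response at each row.
conf <- predict(reg_fit, new_data = new_cars, type = "conf_int")

# Interval for a single new observation at each row.
pred <- predict(reg_fit, new_data = new_cars, type = "pred_int")

names(conf)

# Prediction intervals are wider than confidence intervals for a linear model.
all((pred$.pred_upper - pred$.pred_lower) > (conf$.pred_upper - conf$.pred_lower))
```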
The decision rule is simple. Use "numeric" or "class" for point predictions, "prob" when you need scores, and the interval types when you need uncertainty. Reach for "raw" only when an engine offers something parsnip does not standardize, since "raw" gives back the unwrapped engine result.
Interval types such as "conf_int" need an engine that can produce them, and predict() raises a clear error if you ask for an unsupported type. The lm engine supports intervals; many tree engines do not.
Common pitfalls
Two mistakes catch most newcomers to predict(). Each one below shows the problem and the fix.
The most common is naming the data argument data instead of new_data. Base R predict() methods use newdata, and parsnip deliberately renamed it to new_data, so the wrong name fails.
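The failure and the fix can be sketched as below. The exact error message depends on the parsnip version, so the sketch only captures the error instead of matching its text; the fit itself is an illustrative assumption.

```r
library(parsnip)

reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

# Wrong argument name: capture the error instead of stopping the script.
wrong <- tryCatch(
  predict(reg_fit, data = head(mtcars, 3)),
  error = function(e) e
)
inherits(wrong, "error")

# Correct argument name.
right <- predict(reg_fit, new_data = head(mtcars, 3))
right
```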
The second pitfall is a new_data frame missing a predictor column. Every column used in the training formula must be present, or prediction stops with an error naming the missing variable. Build new_data with the same predictor columns as the training data; the outcome column is optional.
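A sketch of the missing-column case, using an illustrative mpg ~ wt + hp fit and dropping hp from the scoring frame. The wording of the error varies by engine and version, so it is printed rather than assumed.

```r
library(parsnip)

reg_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = mtcars)

# Scoring frame without the hp predictor.
incomplete <- mtcars[1:3, c("mpg", "wt")]

missing_col <- tryCatch(
  predict(reg_fit, new_data = incomplete),
  error = function(e) e
)
inherits(missing_col, "error")
conditionMessage(missing_col)

# Fix: supply every predictor; the outcome column is not required.
complete <- mtcars[1:3, c("wt", "hp")]
fixed <- predict(reg_fit, new_data = complete)
```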
If predict() reports that the type is not supported, check that your spec calls set_mode("classification") and that the outcome column is a factor.
Try it yourself
Try it: Fit a linear_reg() model with the lm engine predicting mpg from disp on mtcars, then predict the first three rows. Save the prediction tibble to ex_pred.
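One possible solution, written as a sketch; ex_fit is an illustrative name, while ex_pred is the name the exercise asks for.

```r
library(parsnip)

# Train a linear regression of mpg on disp with the lm engine.
ex_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ disp, data = mtcars)

# Score the first three rows and keep the prediction tibble.
ex_pred <- predict(ex_fit, new_data = mtcars[1:3, ])
ex_pred
```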
Explanation: fit() trains the model and predict() scores the three rows. The result is a tibble with one .pred column and three rows, matching the rows passed to new_data.
Related parsnip functions
predict() is the scoring step of the parsnip workflow. These functions cover the neighboring steps in a tidymodels project.
- fit() trains a model specification and returns the model_fit that predict() consumes.
- augment() predicts and binds the result onto new_data in one call.
- set_mode() sets classification or regression, which decides the default prediction type.
- last_fit() fits on the training split and predicts the test split together.
- extract_fit_engine() pulls the raw engine object when you need its native predict method.
FAQ
What does parsnip predict() return?
predict() returns a tibble, never a bare vector. It has the same number of rows as new_data and keeps the original row order, so you can bind it back onto your data safely. The column names depend on the prediction type: .pred for regression, .pred_class for class labels, and .pred_<level> columns for probabilities. This consistent shape is the main reason parsnip wraps the many different base R predict methods.
What is the difference between data and new_data in predict()?
parsnip predict() uses the argument name new_data, with an underscore. Base R methods use newdata and some functions use data, so it is easy to type the wrong one. If you pass data = or newdata =, parsnip does not accept it and stops with an error instead of scoring the rows. Always name the scoring frame new_data when calling the parsnip method.
How do I get class probabilities from predict()?
Pass type = "prob" to predict(). The model must be a classification spec, meaning its mode is "classification" and the outcome column is a factor. The result has one column per class, named .pred_<level>, and each row's probabilities sum to one. Use probability output to compute metrics like roc_auc() or to apply a custom decision threshold instead of the default 0.5 cutoff.
Why does predict() say a column is missing?
parsnip checks that new_data contains every predictor used in the training formula. If a column is absent, predict() stops and names the missing variable. Build new_data with the same predictor columns as the training data. The outcome column does not need to be present, but every predictor does, and the column types should match the training data.
Can I predict from a workflow the same way?
Yes. When you fit() a workflow, the result is a fitted workflow, and predict() works on it identically. The workflow first applies its recipe or formula preprocessing to new_data, then runs the model prediction. You do not re-apply the recipe yourself, which is one of the main reasons to wrap a model and preprocessing in a workflow.
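A minimal sketch of the workflow case, assuming the workflows package is installed. It uses add_formula() as a stand-in for preprocessing to stay short; a recipe would be attached with add_recipe() instead. The formula and object names are illustrative.

```r
library(parsnip)
library(workflows)

# Bundle the preprocessing (here just a formula) with the model spec.
wf_fit <- workflow() |>
  add_formula(mpg ~ wt + hp) |>
  add_model(linear_reg() |> set_engine("lm")) |>
  fit(data = mtcars)

# predict() is called the same way as on a parsnip model_fit.
wf_pred <- predict(wf_fit, new_data = mtcars[1:5, ])
wf_pred
```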
For the full argument reference, see the parsnip predict() documentation.