caret classDist() in R: Distance to Class Centroids
The caret classDist() function in R computes the Mahalanobis distance from every sample to each class centroid, turning the geometry of your classes into a compact set of numeric predictors you can feed straight into a model.
    classDist(x, y)                           # build a distance model from predictors and labels
    classDist(x, y, pca = TRUE)               # decorrelate predictors with PCA first
    classDist(x, y, pca = TRUE, keep = 3)     # retain only the first 3 PCA components
    classDist(x, y_numeric, groups = 5)       # bin a numeric outcome into 5 pseudo-classes
    predict(dist_model, newdata)              # log Mahalanobis distance to each centroid
    predict(dist_model, newdata, trans = I)   # raw distance, skip the log transform
Need explanation? Read on for examples and pitfalls.
What classDist() does
classDist() is a feature-engineering helper, not a classifier. It splits the training data by class, computes each class centroid and covariance matrix, and stores them in a small model object. The companion predict() method then measures how far any new row sits from each of those centroids using Mahalanobis distance.
The output is one new numeric column per class. A row that clearly belongs to a class will have a small distance to that centroid and large distances to the others. Those distances encode class structure that a linear model cannot see on its own, which is why classDist features often help logistic regression and other simple learners pick up curved class boundaries.
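To make the mechanics concrete, here is a minimal base R sketch of the same idea: one centroid and one covariance matrix per class, then a distance from each new row to every centroid. It is an illustration of the computation described above, not a copy of caret's internals; the names by_class, centroids, covs, and new_rows are made up for this example. Note that base R's mahalanobis() returns the squared distance.

```r
# Base R only: per-class centroids, covariances, and distances.
x <- iris[, 1:4]
y <- iris$Species

# One centroid and one covariance matrix per class
by_class  <- split(x, y)
centroids <- lapply(by_class, colMeans)
covs      <- lapply(by_class, cov)

# Squared Mahalanobis distance from each new row to each class centroid
new_rows <- x[c(1, 101), ]   # row 1 is a setosa flower, row 101 a virginica
d2 <- sapply(names(by_class), function(cl) {
  mahalanobis(new_rows, center = centroids[[cl]], cov = covs[[cl]])
})

d2                       # one column per class, one row per flower
apply(d2, 1, which.min)  # index of the nearest centroid for each row
```

Each row ends up with one number per class, which is the shape of feature that predict() hands back from a classDist model.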
Syntax and arguments
classDist() needs only predictors and labels to run. Everything else has a sensible default. The model is built with classDist() and queried with predict().
| Argument | Belongs to | Purpose |
|---|---|---|
| x | classDist | Matrix or data frame of numeric predictors |
| y | classDist | Factor of class labels, or a numeric outcome |
| groups | classDist | Number of bins when y is numeric |
| pca | classDist | Apply PCA before splitting by class |
| keep | classDist | Number of PCA components to retain |
| newdata | predict | New rows to score against the stored centroids |
| trans | predict | Transform applied to each distance (log by default) |
Build a distance model and predict
Start by splitting your data, then fit on the training rows only. The iris dataset gives four numeric predictors and a clean three-class factor.
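A minimal sketch of that workflow follows. The split rule (hold out every fifth row) is an assumption chosen so the example is deterministic and class-balanced; any train/test split would do. The names test_idx, train_x, train_y, test_x, dist_model, and test_dist are illustrative.

```r
library(caret)

x <- iris[, 1:4]
y <- iris$Species

# Hold out every 5th row: 10 test rows and 40 training rows per species
test_idx <- seq(5, nrow(iris), by = 5)
train_x  <- x[-test_idx, ]
train_y  <- y[-test_idx]
test_x   <- x[test_idx, ]

# Fit on the training rows only, then score the held-out rows
dist_model <- classDist(train_x, train_y)
test_dist  <- predict(dist_model, test_x)

head(test_dist)   # the first held-out rows are setosa flowers
```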
Each test row now has three numbers: its log Mahalanobis distance to the setosa, versicolor, and virginica centroids. For a setosa flower, dist.setosa is small and the other two distances are large, which shows the row sitting tightly inside that class.
Use the distances as model features
The distances make a strong nearest-centroid classifier on their own. Assign each row to the class with the smallest distance and compare against the true labels.
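The rule can be sketched in a few lines. This assumes the same every-fifth-row split as an illustrative holdout; pred_class and accuracy are names invented for the example, and the class labels are recovered by stripping the dist. prefix from the column names.

```r
library(caret)

test_idx   <- seq(5, nrow(iris), by = 5)
dist_model <- classDist(iris[-test_idx, 1:4], iris$Species[-test_idx])
test_dist  <- predict(dist_model, iris[test_idx, 1:4])
truth      <- iris$Species[test_idx]

# Pick the class whose centroid is closest for each row
class_names <- sub("^dist\\.", "", colnames(test_dist))
pred_class  <- factor(class_names[apply(test_dist, 1, which.min)],
                      levels = levels(truth))

table(predicted = pred_class, actual = truth)  # confusion table
accuracy <- mean(pred_class == truth)
accuracy
```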
More often you keep all three columns and pass them to a real model as engineered features, alongside or instead of the raw predictors. Because the columns are already class-aware, a plain glm() or train() call tends to converge fast and generalize well.
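As a rough sketch of the "alongside the raw predictors" option, the example below binds the distance columns to the original measurements and hands the combined frame to train(). The choice of method = "lda" (which needs the MASS package) and 5-fold cross-validation are assumptions made for illustration, not something the text prescribes; combined_train and combined_test are made-up names.

```r
library(caret)

test_idx <- seq(5, nrow(iris), by = 5)
train_x  <- iris[-test_idx, 1:4]
train_y  <- iris$Species[-test_idx]
test_x   <- iris[test_idx, 1:4]

# Distance features for both splits, from a model fit on training rows only
dist_model     <- classDist(train_x, train_y)
combined_train <- cbind(train_x, predict(dist_model, train_x))
combined_test  <- cbind(test_x,  predict(dist_model, test_x))

# Resample over the combined frame (raw predictors + distance columns)
set.seed(1)
fit <- train(x = combined_train, y = train_y, method = "lda",
             trControl = trainControl(method = "cv", number = 5))

fit$results                                   # cross-validated performance
test_pred <- predict(fit, newdata = combined_test)
mean(test_pred == iris$Species[test_idx])     # held-out accuracy
```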
Let train() resample over the combined frame so the new features are validated like any other.
Decorrelate predictors with PCA
Set pca = TRUE when predictors are strongly correlated. Mahalanobis distance inverts each class covariance matrix, and that inversion is unstable when columns are nearly collinear. Rotating the data to principal components first removes the correlation and stabilizes the calculation.
The keep argument caps how many components feed the distance, which is a quick guard against high-dimensional, low-sample data where the covariance matrix would otherwise be singular.
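A short sketch of both arguments together, again on an illustrative every-fifth-row split of iris. Keeping two components is an arbitrary choice for the example, and pca_model and pca_dist are made-up names.

```r
library(caret)

test_idx <- seq(5, nrow(iris), by = 5)
train_x  <- iris[-test_idx, 1:4]
train_y  <- iris$Species[-test_idx]

# Rotate to principal components first, then keep only the first two
pca_model <- classDist(train_x, train_y, pca = TRUE, keep = 2)

# predict() applies the stored rotation to new rows before measuring distance
pca_dist <- predict(pca_model, iris[test_idx, 1:4])
head(pca_dist)
```

The output still has one column per class; only the space in which the distances are measured has changed.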
Tune the transform and handle numeric outcomes
The default trans = log compresses the long right tail of raw distances. Pass trans = I to keep raw Mahalanobis values, or any function you like.
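The sketch below compares the default with two alternatives on the same model. Using sqrt as the custom transform is only an example of "any function you like"; log_dist, raw_dist, and sqrt_dist are illustrative names.

```r
library(caret)

test_idx   <- seq(5, nrow(iris), by = 5)
dist_model <- classDist(iris[-test_idx, 1:4], iris$Species[-test_idx])
test_x     <- iris[test_idx, 1:4]

log_dist  <- predict(dist_model, test_x)                # default: trans = log
raw_dist  <- predict(dist_model, test_x, trans = I)     # untransformed values
sqrt_dist <- predict(dist_model, test_x, trans = sqrt)  # a custom transform

# The raw values span a much wider range than the logged ones
range(as.numeric(raw_dist))
range(as.numeric(log_dist))
```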
When y is numeric, classDist() bins it into roughly equal-sized classes, with the groups argument (5 by default) setting how many, and treats each bin as a pseudo-class, so the same distance machinery works for regression problems too.
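To illustrate the numeric case, the sketch below treats Sepal.Length as the outcome and the other three measurements as predictors. That framing is an assumption made purely for the example, and x_num, y_num, reg_model, and reg_dist are made-up names.

```r
library(caret)

# Hypothetical regression setup: predict Sepal.Length from the other columns
x_num <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
y_num <- iris$Sepal.Length

test_idx <- seq(5, nrow(iris), by = 5)

# A numeric y is binned into pseudo-classes before the centroids are computed
reg_model <- classDist(x_num[-test_idx, ], y_num[-test_idx], groups = 5)
reg_dist  <- predict(reg_model, x_num[test_idx, ])

dim(reg_dist)    # one column per outcome bin
head(reg_dist)
```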
classDist() vs other distance approaches
classDist() is the only option here that produces per-class features. The alternatives solve narrower problems.
| Approach | Output | Best for |
|---|---|---|
| classDist() | One distance column per class | Class-aware features for any model |
| mahalanobis() | One distance per row | Outlier scoring against a single centroid |
| preProcess(method = "pca") | Rotated components | Dimension reduction without class labels |
| knn3() | Class predictions | Direct classification, not features |
Reach for classDist() when you want the class structure available as ordinary numeric columns; reach for mahalanobis() when you only need a single distance and have no class split.
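For contrast, here is what the single-centroid case looks like with base R's mahalanobis(), which returns squared distances. Scoring the setosa rows against their own centroid is just an illustrative choice, and setosa and d2 are made-up names.

```r
# Base R only: one squared distance per row, measured from a single centroid
setosa <- iris[iris$Species == "setosa", 1:4]
d2 <- mahalanobis(setosa, center = colMeans(setosa), cov = cov(setosa))

length(d2)                            # one value per row
head(sort(d2, decreasing = TRUE), 3)  # the rows farthest from the centroid
```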
Common pitfalls
Three mistakes account for most classDist() errors. Each has a direct fix.
- Too few rows per class. Each class needs more rows than predictors, or its covariance matrix is singular and the inversion fails. Set pca = TRUE with a small keep value to shrink the dimension.
- Non-numeric columns in x. classDist() expects numeric predictors only. Run categorical columns through dummyVars() first, or drop them.
- Fitting on the full dataset. Building the model on rows you later score leaks information. Always fit on a training split and predict() on held-out rows.
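The first pitfall is easy to reproduce. The sketch below deliberately keeps only three rows per species, fewer than the four predictors, catches the resulting error with tryCatch(), and then applies the PCA fix. The tiny subset and the names small_idx, small_x, small_y, attempt, and fixed are contrived for illustration.

```r
library(caret)

# Three rows per species, four predictors: too few rows per class
small_idx <- c(1:3, 51:53, 101:103)
small_x   <- iris[small_idx, 1:4]
small_y   <- iris$Species[small_idx]

# The plain call fails; capture the error instead of stopping the script
attempt <- tryCatch(classDist(small_x, small_y), error = function(e) e)
inherits(attempt, "error")
conditionMessage(attempt)

# Shrinking to two principal components leaves more rows than columns
fixed      <- classDist(small_x, small_y, pca = TRUE, keep = 2)
fixed_dist <- predict(fixed, iris[c(4, 54, 104), 1:4])
fixed_dist
```

With so few rows the resulting distances are not trustworthy; the point is only that the fit now completes.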
If classDist() stops with an error such as "system is computationally singular" or a complaint that the covariance matrix cannot be inverted, you have collinear predictors or too few rows in a class. PCA with keep set low is the standard cure.
Try it yourself
Try it: Shuffle the rows of iris (the dataset is stored sorted by species, so its first 100 rows contain only setosa and versicolor), build a classDist() model on the first 100 shuffled rows, predict distances for the remaining 50 rows, and save the result to ex_dist.
Explanation: classDist() fits on 100 shuffled rows, which cover all three species, and stores three class centroids. predict() then returns one log-distance column per class for the 50 held-out rows.
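One possible solution is sketched below. It shuffles the rows before splitting, because iris is stored sorted by species and an unshuffled first 100 rows would leave virginica out of the training data entirely. The seed value and the names shuffled and ex_model are arbitrary.

```r
library(caret)

# Shuffle so the 100 training rows cover all three species
set.seed(42)
shuffled <- iris[sample(nrow(iris)), ]

ex_model <- classDist(shuffled[1:100, 1:4], shuffled$Species[1:100])
ex_dist  <- predict(ex_model, shuffled[101:150, 1:4])

dim(ex_dist)   # 50 held-out rows, one column per species
head(ex_dist)
```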
Related caret functions
Build a full feature-engineering pipeline by combining classDist() with these caret helpers:
- preProcess() to center, scale, and PCA-transform predictors
- dummyVars() to convert factors into numeric columns classDist() can use
- findCorrelation() to drop redundant predictors before fitting
- nearZeroVar() to remove low-variance columns
- train() to resample and validate the engineered features
FAQ
What does caret classDist() actually return? classDist() returns a small model object holding each class centroid, its covariance matrix, and an optional PCA rotation. It does not return distances directly. You get the distances by calling predict() on the object with new data, which produces one numeric column per class.
Why are the distances logged by default? Raw Mahalanobis distances have a long right tail because far-away points produce very large values. The default trans = log compresses that tail so the columns behave better as model inputs. Pass trans = I if you want the raw distances instead.
When should I set pca = TRUE? Use pca = TRUE when your predictors are strongly correlated or when some classes have fewer rows than predictors. PCA rotates the data so the covariance matrix is easier to invert, and the keep argument lets you cap the number of components to avoid a singular matrix.
Can classDist() handle a numeric outcome? Yes. When y is numeric, classDist() splits it into roughly equal-sized bins, with the groups argument setting how many, and treats each bin as a class. The distance features then describe which part of the outcome range a row resembles, which can help regression models.
Is classDist() a classifier? No. classDist() is a feature-engineering tool. It produces distance columns you feed into a separate model. You can build a quick nearest-centroid classifier by picking the smallest distance per row, but that is a side use, not its purpose.