caret findLinearCombos() in R: Detect Linear Dependencies
caret findLinearCombos() in R finds exact linear dependencies among predictor columns and tells you which ones to drop, so your model matrix is full rank before you fit a model.
findLinearCombos(x)                 # x must be a numeric matrix
findLinearCombos(as.matrix(df))     # convert a data frame first
findLinearCombos(x)$remove          # column indices to drop
findLinearCombos(x)$linearCombos    # which columns form each dependency
x[, -findLinearCombos(x)$remove]    # drop the redundant columns
findLinearCombos(scale(x))          # works on standardized data too
Need explanation? Read on for examples and pitfalls.
What findLinearCombos() does in one sentence
findLinearCombos() finds exact linear dependencies in a numeric matrix. It runs a QR decomposition to measure the matrix rank, then enumerates every set of columns that can be written as a linear combination of the others. A linearly dependent column carries no new information, so it makes the matrix rank-deficient and can cause model fitting to fail or return NA coefficients.
The return value has two parts. $linearCombos is a list describing each dependency, and $remove is a vector of column indices you can drop to make the matrix full rank.
Take a matrix with four columns, a, b, c, and d, in which column d is the sum of columns a and b. It has only three independent columns even though it has four.
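To make the rank deficiency concrete, here is a minimal sketch that uses base R's qr() rather than caret. The matrix x and its values are invented for illustration; the only structural assumption is that d is built as a + b.

```r
# Hypothetical example matrix: three independent columns plus d = a + b
x <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 6), c = c(5, 3, 1, 2, 4))
x <- cbind(x, d = x[, "a"] + x[, "b"])

ncol(x)      # number of columns in the matrix
qr(x)$rank   # number of linearly independent columns found by QR
```

When the rank is smaller than the column count, at least one column is an exact combination of the others.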
findLinearCombos() syntax and arguments
findLinearCombos() takes a single argument, x, which the documentation describes as a numeric matrix, and it returns a list. Give it purely numeric input: convert an all-numeric data frame with as.matrix(), or build a numeric design matrix with model.matrix() when the data contains factors.
The return list always has the same two components.
| Component | Type | Meaning |
|---|---|---|
| $linearCombos | list of integer vectors | Each vector: the first index is the dependent column, the rest combine to produce it |
| $remove | integer vector or NULL | Column indices to drop so the matrix becomes full rank |
findLinearCombos() examples by use case
Detect and read the result
Run findLinearCombos() on the matrix to see the dependency. The output names both the redundant column and the columns it depends on.
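A minimal sketch of that call, assuming a hypothetical four-column matrix x in which d is built as a + b (the values are made up for illustration):

```r
library(caret)

# Hypothetical example matrix: three independent columns plus d = a + b
x <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 6), c = c(5, 3, 1, 2, 4))
x <- cbind(x, d = x[, "a"] + x[, "b"])

combo <- findLinearCombos(x)
combo$linearCombos  # list of index vectors, one per dependency
combo$remove        # column indices suggested for removal
```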
The vector 4 1 2 reads as "column 4 is a linear combination of columns 1 and 2". The $remove slot recommends dropping column 4 to resolve it.
Drop the recommended columns
Use $remove to subset the matrix. Negative indexing keeps every column except the redundant ones.
The reduced matrix has three independent columns and is now full rank.
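The subsetting step might look like the sketch below, again on a hypothetical matrix where d = a + b. It relies on $remove being non-empty, which holds here; the pitfalls section covers the empty case.

```r
library(caret)

# Hypothetical example matrix: three independent columns plus d = a + b
x <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 6), c = c(5, 3, 1, 2, 4))
x <- cbind(x, d = x[, "a"] + x[, "b"])

combo <- findLinearCombos(x)

# Negative indexing drops the flagged columns; drop = FALSE keeps a matrix
x_reduced <- x[, -combo$remove, drop = FALSE]

colnames(x_reduced)
qr(x_reduced)$rank == ncol(x_reduced)  # TRUE when the matrix is full rank
```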
A full-rank matrix returns nothing
A matrix with no dependencies returns empty results. $linearCombos is an empty list and $remove is NULL.
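As a quick check of the empty case, this sketch uses a 3 x 3 identity matrix as a stand-in for any full-rank input (the choice of matrix is an assumption for illustration):

```r
library(caret)

# The identity matrix has independent columns, so it is full rank
x_full <- diag(3)

res <- findLinearCombos(x_full)
length(res$linearCombos)  # number of dependencies reported
is.null(res$remove)       # whether there is anything to drop
```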
Catch the dummy variable trap
The most common real-world dependency is one-hot encoding plus an intercept. When every factor level gets its own column, those columns sum to the intercept, producing an exact linear combination.
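The trap can be reproduced with a small sketch. The factor f and the explicit intercept column are hypothetical; the formula `~ f - 1` is used only to force one indicator column per level.

```r
library(caret)

# Hypothetical factor with three levels
f <- factor(c("a", "b", "c", "a", "b", "c"))

# "~ f - 1" yields one indicator column per level (full one-hot encoding)
one_hot <- model.matrix(~ f - 1)

# Adding an explicit intercept makes the indicators sum to that column
x_trap <- cbind(intercept = 1, one_hot)

trap <- findLinearCombos(x_trap)
trap$linearCombos  # the dependency among intercept and indicator columns
trap$remove        # the column suggested for removal
```

By contrast, the default model.matrix(~ f) uses treatment contrasts, which drop one reference level and so avoid this dependency.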
findLinearCombos() vs findCorrelation() and nearZeroVar()
Three caret functions screen problematic predictors, but each targets a different problem. findLinearCombos() catches exact rank deficiency, findCorrelation() catches strong pairwise correlation, and nearZeroVar() catches near-constant columns.
| Function | Targets | Detection method | Use when |
|---|---|---|---|
| findLinearCombos() | Exact linear dependencies | QR decomposition | Dummy traps, engineered sums, redundant ratios |
| findCorrelation() | High pairwise correlation | Correlation matrix | Two predictors nearly duplicate each other |
| nearZeroVar() | Near-constant columns | Variance and frequency ratio | A column has one dominant value |
The decision rule is simple. Run findLinearCombos() first to remove columns that break the math outright, then run findCorrelation() to thin out redundancy that only hurts stability.
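One way to sketch that two-step screen is shown below. The simulated data, the column names, and the 0.9 cutoff are all assumptions chosen for illustration: c is a noisy near-copy of a, and d is an exact sum.

```r
library(caret)

set.seed(1)
a <- rnorm(50)
b <- rnorm(50)
x <- cbind(a = a,
           b = b,
           c = a + rnorm(50, sd = 0.01),  # near-duplicate of a, not exact
           d = a + b)                     # exact linear combination

# Step 1: remove exact dependencies (guard against an empty result)
lc <- findLinearCombos(x)
if (length(lc$remove) > 0) x <- x[, -lc$remove, drop = FALSE]

# Step 2: thin out strong pairwise correlation; findCorrelation() takes
# a correlation matrix, and the 0.9 cutoff here is an arbitrary choice
high <- findCorrelation(cor(x), cutoff = 0.9)
if (length(high) > 0) x <- x[, -high, drop = FALSE]

colnames(x)
```

Both subsets are guarded because negative indexing with an empty vector would drop every column.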
Common pitfalls
Three mistakes account for most findLinearCombos() trouble. Each has a quick fix.
First, passing a data frame that contains factor or character columns fails. Coercing it to a matrix turns every value, numbers included, into text, so encode factors numerically (for example with model.matrix()) before the call.
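The coercion problem and one possible fix can be seen in this sketch. The data frame df and its columns are hypothetical.

```r
library(caret)

# Hypothetical data frame mixing a factor with numeric columns
df <- data.frame(
  group = factor(c("a", "b", "c", "a", "b", "c")),
  x1    = c(1.2, 3.4, 2.2, 5.1, 4.0, 0.7),
  x2    = c(10, 8, 12, 7, 9, 11)
)

is.character(as.matrix(df))  # TRUE: the numbers have become strings

# model.matrix() encodes the factor as numeric indicator columns
mm <- model.matrix(~ ., data = df)
is.numeric(mm)               # TRUE: safe to pass to findLinearCombos()

findLinearCombos(mm)$remove
```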
Second, findLinearCombos() only reports. It never modifies your data, so you must subset with $remove yourself.
Third, an empty $remove is NULL, and x[, -NULL] silently drops every column. Always guard the subset.
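A guarded subset could be wrapped in a small helper, as in this sketch. The function name drop_linear_combos is hypothetical and not part of caret.

```r
library(caret)

# Hypothetical helper (not part of caret): drop flagged columns only if any exist
drop_linear_combos <- function(x) {
  to_drop <- findLinearCombos(x)$remove
  if (length(to_drop) > 0) {
    x <- x[, -to_drop, drop = FALSE]
  }
  x
}

x_full <- diag(3)                     # full rank, so $remove is NULL
ncol(x_full[, -NULL, drop = FALSE])   # unguarded subset: every column is lost
ncol(drop_linear_combos(x_full))      # guarded subset: the matrix is unchanged
```

Checking length() rather than is.null() also covers a zero-length index vector, which behaves the same way under negative indexing.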
Try it yourself
Try it: Build a matrix where column 3 equals column 1 minus column 2, then use findLinearCombos() to find which column to drop. Save the result to ex_combo.
Solution
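One possible solution is sketched below. The values in c1 and c2 and the name ex_mat are arbitrary; only ex_combo comes from the exercise statement.

```r
library(caret)

# Hypothetical values; column 3 is constructed as column 1 minus column 2
c1 <- c(4, 7, 1, 9, 3)
c2 <- c(2, 5, 6, 1, 8)
ex_mat <- cbind(c1, c2, c1 - c2)

ex_combo <- findLinearCombos(ex_mat)
ex_combo$remove  # index of the column to drop
```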
Explanation: Column 3 is an exact linear combination of columns 1 and 2, so the QR step flags it as redundant and $remove returns its index, 3.
Related caret functions
findLinearCombos() is one step in caret's predictor-screening toolkit. These functions handle the adjacent tasks.
- findCorrelation() flags highly correlated predictor pairs.
- nearZeroVar() flags columns with almost no variance.
- preProcess() centers, scales, and imputes predictors.
- dummyVars() builds one-hot encoded design matrices.
- train() fits a model once the predictors are clean.
FAQ
What does findLinearCombos() return when there are no linear combinations?
It returns a list where $linearCombos is an empty list (list()) and $remove is NULL. That means the matrix is already full rank and no columns need to be dropped. Always test length(combo$remove) before subsetting, because indexing with -NULL drops every column instead of none.
Does findLinearCombos() detect correlated variables?
No. It only detects exact linear dependencies, where one column equals a precise weighted sum of others. Two predictors correlated at 0.95 are not an exact combination, so findLinearCombos() ignores them. Use findCorrelation() on a correlation matrix to screen for strong but inexact relationships.
Why does findLinearCombos() need a matrix instead of a data frame?
The function relies on a QR decomposition, which operates on a purely numeric matrix. A data frame with factor or character columns coerces to a character matrix and the math fails. Convert numeric frames with as.matrix(), or build a design matrix with model.matrix() to encode factors first.
How does findLinearCombos() decide which column to remove?
It works through the dependencies and, for each one, marks the dependent column (the first index in that $linearCombos vector) for removal. It then rechecks the reduced matrix and repeats until no dependencies remain, so the final $remove vector is a minimal set that makes the whole matrix full rank.
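A sketch with two separate dependencies shows the effect. The matrix is hypothetical: d is built as a + b and e as b + c.

```r
library(caret)

# Hypothetical matrix with two dependencies: d = a + b and e = b + c
x <- cbind(a = c(1, 2, 3, 4, 5), b = c(2, 1, 4, 3, 6), c = c(5, 3, 1, 2, 4))
x <- cbind(x, d = x[, "a"] + x[, "b"], e = x[, "b"] + x[, "c"])

res <- findLinearCombos(x)
res$linearCombos  # one index vector per dependency
res$remove        # one flagged column per dependency

# After dropping the flagged columns, the rank equals the column count
x_clean <- x[, -res$remove, drop = FALSE]
qr(x_clean)$rank == ncol(x_clean)
```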