Variable Selection in R with AIC, BIC, and Lasso: Choose Your Strategy Wisely
Variable selection picks the smallest subset of predictors that still explains the outcome well. In R, the three practical strategies are stepwise AIC/BIC via stepAIC(), exhaustive best subset via regsubsets(), and Lasso shrinkage via glmnet(), each with different trade-offs between speed, interpretability, and statistical honesty.
What does "variable selection" actually solve in regression?
You have 10 candidate predictors and a fitted model. Two questions nag: do all of them belong, and which ones can you drop without losing predictive power? Variable selection answers both. The block below fits a full linear model on mtcars and surfaces the first clue that something is wrong: most predictors fail to reach significance.
Look at the p-values. With 10 predictors fighting for explanatory power, only wt comes close to significance, and even that is only at the 0.06 level. This does not mean the others are useless, but it does mean the full model is overfit and hard to interpret. That is the problem variable selection is designed to fix: find the subset of predictors that carry real signal and discard the rest.
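The fit in question can be reproduced as follows; mt is a working copy of the built-in mtcars data, matching the name used in the Try it prompts:

```r
# Fit a full linear model: mpg on all 10 remaining columns of mtcars
mt <- mtcars                        # working copy of the built-in dataset
full_mod <- lm(mpg ~ ., data = mt)
summary(full_mod)                   # note how few predictors reach p < 0.05
```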
Try it: Fit a smaller model lm(mpg ~ wt + hp + qsec, data = mt) and look at the coefficient table. Which coefficient has the highest p-value?
Click to reveal solution
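One way to check, assuming mt is the working copy of mtcars from the earlier fit:

```r
mt <- mtcars                        # the tutorial's working copy of mtcars
small_mod <- lm(mpg ~ wt + hp + qsec, data = mt)
round(summary(small_mod)$coefficients, 3)   # compare the Pr(>|t|) column
```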
Explanation: qsec has the highest p-value here (0.25), well above 0.05. With only 3 predictors the model is simpler but hp and qsec are now doing less work than they appeared to in the full 10-predictor fit. Variable selection helps you decide formally which of these to keep.
How does stepwise selection with AIC and BIC work in R?
Stepwise selection is the classic, fast answer. The algorithm adds or drops one predictor at a time, greedily, choosing whichever move improves an information criterion the most. Two popular criteria are AIC and BIC; both balance model fit against model size:
$$\text{AIC} = 2k - 2\ln(\hat{L}) \qquad \text{BIC} = k\ln(n) - 2\ln(\hat{L})$$
Where:
- $k$ = number of parameters in the model
- $\hat{L}$ = the maximised likelihood
- $n$ = the sample size
Both penalise extra parameters. The difference is the multiplier. For any sample size larger than about 7, $\ln(n) > 2$, so BIC penalises each added predictor more heavily than AIC does. BIC tends to pick smaller, more parsimonious models.
In R, MASS::stepAIC() handles both. The default uses AIC (k = 2). Switching to BIC means passing k = log(n).
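Concretely, the two runs can look like this (trace = FALSE suppresses the step-by-step log):

```r
library(MASS)

full_mod <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

aic_mod <- stepAIC(full_mod, direction = "both", trace = FALSE)
bic_mod <- stepAIC(full_mod, direction = "both", k = log(n), trace = FALSE)

formula(aic_mod)   # mpg ~ wt + qsec + am
formula(bic_mod)   # the same three predictors on this dataset
```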
Stepwise AIC kept three predictors: wt, qsec, am. It tried adding or removing each variable and stopped when no single swap improved AIC any further. Now compare what BIC picks on the same data.
On this dataset AIC and BIC land on the same three variables. That is not always the case: when predictors carry weak but non-zero signal, BIC tends to drop them and AIC tends to keep them. Try a larger predictor set (we do this in the first capstone exercise below) and you will see the two criteria diverge.
step() works identically. We use MASS::stepAIC() because its returned object exposes the selection trace ($anova) in a tidy format. For everyday use, step() gives the same final model.

Try it: Run stepAIC() with direction = "backward" only, starting from full_mod. Does it land on the same three variables?
Click to reveal solution
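A sketch of the backward-only run, starting from the full fit:

```r
library(MASS)

full_mod <- lm(mpg ~ ., data = mtcars)
back_mod <- stepAIC(full_mod, direction = "backward", trace = FALSE)
formula(back_mod)   # compare against the two-way stepwise result
```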
Explanation: On mtcars, backward elimination alone reaches the same model as two-way stepwise. On larger or more collinear data you may see divergence: backward elimination sometimes gets trapped because it cannot undo an early mistake.
How does best subset selection find the optimal model?
Stepwise is greedy. Best subset is exhaustive. With p candidate predictors, it fits all $2^p$ possible models, then reports the best one of each size. You then pick a final size using an information criterion.
This matters because stepwise can get stuck in a local optimum: it accepts the best next move, not the globally best subset. Best subset always finds the global winner for each size.
The catch is cost. 10 predictors means 1,024 models; 30 predictors is already over a billion. The leaps package keeps it practical by using a branch-and-bound search that avoids enumerating every combination.
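A minimal regsubsets() call on mtcars, producing the summary object discussed next:

```r
library(leaps)

# Best subset search over all model sizes from 1 to 10 predictors
bs_mod <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
bs_sum <- summary(bs_mod)
bs_sum$which   # one row per model size; TRUE marks an included predictor
```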
bs_sum$which is the core output: a logical matrix where row k shows which predictors the best k-variable model includes. The single-best 1-variable model uses wt. The best 2-variable model is wt + qsec. The best 3-variable model adds am. Now we need to pick which size is best.
Adjusted R² likes a 5-variable model; BIC and Mallows's Cp both prefer 3 variables. That is the classic tension: adjusted R² is relatively lenient about complexity, while BIC punishes it harder. If you care about parsimony, BIC wins. If you care about squeezing out the last drop of explained variance, adjusted R² wins.
The curve drops sharply to size 3, then flattens. Adding a fourth predictor barely nudges BIC; a fifth actually makes it worse. This "elbow" is what to look for visually. When the curve flattens, the marginal predictor is not carrying enough signal to justify its complexity cost.
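The elbow plot described above can be reproduced in a few lines (the fit is repeated here so the snippet stands alone):

```r
library(leaps)

bs_sum <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10))
plot(bs_sum$bic, type = "b",
     xlab = "Number of predictors", ylab = "BIC")
# Mark the size with the lowest BIC
points(which.min(bs_sum$bic), min(bs_sum$bic), pch = 19, col = "red")
```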
regsubsets() uses branch-and-bound to skip obviously-bad combinations, but you still pay a heavy cost above ~30 predictors. Above that, switch to Lasso or a forward-only search.

Try it: Extract the 3-variable best model from bs_sum$which and refit it with lm(). Confirm the coefficients match what stepAIC() gave above.
Click to reveal solution
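One way to do it, pulling the row for size 3 out of bs_sum$which:

```r
library(leaps)

bs_sum <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10))
vars3  <- names(which(bs_sum$which[3, -1]))   # drop the intercept column
vars3                                         # "wt" "qsec" "am"
refit  <- lm(reformulate(vars3, response = "mpg"), data = mtcars)
coef(refit)
```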
Explanation: Best subset and stepwise AIC agree on this problem because mtcars is small enough that stepwise does not get stuck. On larger data the two methods often disagree; that is when best subset's guarantee of global optimality becomes valuable.
How does Lasso select variables by shrinking coefficients to zero?
Lasso takes a completely different route. Instead of picking variables one at a time, it fits all of them at once, but with a penalty that pushes weak coefficients to exactly zero:
$$\underset{\beta}{\text{minimise}} \; \sum_{i=1}^{n}(y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
Where:
- $\lambda$ = the penalty strength (larger = more shrinkage)
- $|\beta_j|$ = the absolute value of coefficient $j$ (the "L1 norm")
When $\lambda$ is zero, Lasso is ordinary least squares. As $\lambda$ grows, coefficients shrink. Because of the absolute-value shape of the penalty (the "diamond" rather than a circle), the optimum often lands exactly on a zero-coefficient axis, which is how Lasso selects variables.

Figure 1: How Lasso's L1 penalty shrinks weak coefficients to exactly zero.
glmnet expects a numeric predictor matrix, not a formula, so we build one with model.matrix() and drop the intercept column.
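A minimal sketch of that setup:

```r
library(glmnet)

# Build a numeric predictor matrix and drop the intercept column
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg

lasso_mod <- glmnet(X, y, alpha = 1)   # alpha = 1 selects the Lasso penalty
plot(lasso_mod, xvar = "lambda", label = TRUE)   # coefficient paths
```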
Each line in the plot traces one predictor's coefficient as $\lambda$ increases (moving right). Near the right edge, where $\lambda$ is huge, every coefficient collapses to zero. Near the left edge, where $\lambda$ is small, Lasso mimics OLS. Somewhere in the middle is the sweet spot. Cross-validation finds it.
Two $\lambda$ values matter. lambda.min gives the lowest cross-validated error. lambda.1se gives the simplest model within one standard error of the minimum. For interpretation and reporting, lambda.1se is usually the right choice: it buys extra sparsity at a tiny prediction-error cost.
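A cross-validated fit that exposes both values; the seed is an arbitrary choice, and the exact set of survivors can shift between CV runs:

```r
library(glmnet)

X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
set.seed(42)                               # CV fold assignment is random
cv_mod <- cv.glmnet(X, mtcars$mpg, alpha = 1)

cv_mod$lambda.min                          # lowest CV error
cv_mod$lambda.1se                          # simplest model within 1 SE
coef(cv_mod, s = "lambda.1se")             # "." marks exact zeros
```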
The dots are exact zeros. At lambda.1se, Lasso keeps just two predictors: cyl and wt. Every other coefficient has been shrunk out of the model. This is the kind of sparse, interpretable output you cannot get from plain OLS, no matter how much you squint at p-values.
Use lambda.1se when you want a sparser, easier-to-explain model; use lambda.min when prediction is all that matters. On mtcars, lambda.1se kept two variables while lambda.min would keep four. In production forecasting, lambda.min is the conservative choice. For reporting, refit lm() on only the predictors Lasso kept.

Try it: Fit the same CV model with alpha = 0 (Ridge regression). Inspect the coefficients at lambda.min. How many are exactly zero?
Click to reveal solution
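One possible solution; note that alpha = 0 switches glmnet to the Ridge penalty:

```r
library(glmnet)

X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
set.seed(42)                                      # seed is arbitrary
ridge_cv <- cv.glmnet(X, mtcars$mpg, alpha = 0)   # alpha = 0 -> Ridge

ridge_coefs <- as.matrix(coef(ridge_cv, s = "lambda.min"))
ridge_coefs          # shrunken, but none exactly zero
```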
Explanation: Ridge uses an L2 penalty ($\beta^2$ instead of $|\beta|$). Because the L2 penalty is smooth, it shrinks coefficients toward zero but never hits zero exactly. That is why Ridge regularises but does not select: it is a predictive tool, not a variable-selection tool.
Which variable selection strategy should you use?
The honest answer is "it depends on your predictor count, your goal, and your audience." Here is the quick decision map:

Figure 2: A quick decision guide for picking a variable selection method.
To see how they actually differ on the same data, let's build a comparison table.
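One way to build it, refitting each method from scratch so the snippet stands alone; the Lasso column can vary slightly with the CV seed:

```r
library(MASS)
library(leaps)
library(glmnet)

full_mod <- lm(mpg ~ ., data = mtcars)
n <- nrow(mtcars)

aic_vars <- names(coef(stepAIC(full_mod, trace = FALSE)))[-1]
bic_vars <- names(coef(stepAIC(full_mod, k = log(n), trace = FALSE)))[-1]

bs_sum  <- summary(regsubsets(mpg ~ ., data = mtcars, nvmax = 10))
bs_vars <- names(which(bs_sum$which[which.min(bs_sum$bic), -1]))

X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
set.seed(42)                                   # CV folds are random
cv_mod <- cv.glmnet(X, mtcars$mpg, alpha = 1)
co <- as.matrix(coef(cv_mod, s = "lambda.1se"))
lasso_vars <- setdiff(rownames(co)[co[, 1] != 0], "(Intercept)")

comparison <- data.frame(
  variable     = colnames(X),
  stepwise_AIC = colnames(X) %in% aic_vars,
  stepwise_BIC = colnames(X) %in% bic_vars,
  best_subset  = colnames(X) %in% bs_vars,
  lasso_1se    = colnames(X) %in% lasso_vars
)
comparison
```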
Every method agrees that wt matters. All three stepwise-style methods add qsec and am. Lasso with lambda.1se is more aggressive: it drops qsec and am but keeps cyl instead. That happens because Lasso penalises coefficient magnitude, not p-value, and cyl has a larger raw effect relative to its standard deviation.
To get honest uncertainty estimates after selection, run boot::boot() on the full selection + fit pipeline.

Try it: From comparison, count how many variables each method kept. Which is the most parsimonious?
Click to reveal solution
Explanation: Lasso at lambda.1se is the sparsest; stepwise AIC, stepwise BIC, and best subset all land on the same 3 variables. If you used lambda.min instead of lambda.1se, Lasso would keep 4 or 5 and the ranking would flip.
Practice Exercises
Exercise 1: AIC vs BIC on a wider candidate set
Extend the mtcars model by adding the interaction wt:hp. Fit the full model, then run stepAIC() with default AIC, then with BIC (k = log(n)). Report both formulas and explain why BIC's is smaller. Save the two fits as my_aic and my_bic.
Click to reveal solution
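One possible solution, saving the two fits under the requested names:

```r
library(MASS)

ext_mod <- lm(mpg ~ . + wt:hp, data = mtcars)   # full model plus interaction
n <- nrow(mtcars)

my_aic <- stepAIC(ext_mod, direction = "both", trace = FALSE)
my_bic <- stepAIC(ext_mod, direction = "both", k = log(n), trace = FALSE)

formula(my_aic)
formula(my_bic)   # expect a smaller right-hand side under BIC
```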
Explanation: Both keep the wt:hp interaction because the data supports it, but AIC also retains cyl and am; BIC's stricter penalty ($\ln(32) \approx 3.47$ vs AIC's 2) drops them. On larger $n$ that ratio grows, so BIC becomes even more conservative.
Exercise 2: Cross-validated Lasso on airquality
On the airquality dataset, predict Ozone. Drop rows with missing values, build the predictor matrix from the remaining numeric columns, and run cv.glmnet(). Report lambda.min and lambda.1se, and list the surviving predictors at lambda.1se. Save the CV fit as my_cv.
Click to reveal solution
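One possible solution (the seed is arbitrary):

```r
library(glmnet)

aq <- na.omit(airquality)                       # drop rows with missing values
X  <- as.matrix(aq[, setdiff(names(aq), "Ozone")])
set.seed(1)
my_cv <- cv.glmnet(X, aq$Ozone, alpha = 1)

c(min = my_cv$lambda.min, one_se = my_cv$lambda.1se)
coef(my_cv, s = "lambda.1se")                   # survivors at lambda.1se
```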
Explanation: At lambda.1se, Lasso keeps Solar.R, Wind, and Temp, which matches physical intuition (ozone responds to sun, wind dispersion, and temperature). Month and Day are zeroed out; they carry seasonality that is already captured by Temp.
Exercise 3: Stability of best subset under resampling
Run regsubsets() once on mtcars. Then resample mtcars with replacement (bootstrap) and run regsubsets() again. Compare the 3-variable model's chosen predictors across the two fits. Save results as my_bs and my_bs_boot.
Click to reveal solution
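One possible solution; whether the 3-variable set actually changes depends on the resample drawn:

```r
library(leaps)

my_bs <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)

set.seed(7)                                      # seed is arbitrary
boot_idx   <- sample(nrow(mtcars), replace = TRUE)
my_bs_boot <- regsubsets(mpg ~ ., data = mtcars[boot_idx, ], nvmax = 10)

names(which(summary(my_bs)$which[3, -1]))        # original 3-variable model
names(which(summary(my_bs_boot)$which[3, -1]))   # bootstrap 3-variable model
```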
Explanation: A single bootstrap resample changed the 3-variable "best" model. That instability is exactly why you should report selected variables from many resamples, not from a single fit. Repeated resampling and tallying how often each variable survives (the "selection probability") is a practical robustness check.
Complete Example
Let's run the full workflow end-to-end on mtcars, then fit a final reporting model on the variables Lasso kept.
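A condensed sketch of that workflow, under the same CV-seed caveat as before:

```r
library(glmnet)

# 1. Cross-validated Lasso on the full predictor matrix
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]
y <- mtcars$mpg
set.seed(42)                                   # CV folds are random
cv_mod <- cv.glmnet(X, y, alpha = 1)

# 2. Predictors surviving at lambda.1se
co   <- as.matrix(coef(cv_mod, s = "lambda.1se"))
keep <- setdiff(rownames(co)[co[, 1] != 0], "(Intercept)")

# 3. OLS refit on the survivors for reportable coefficients
final_mod <- lm(reformulate(keep, response = "mpg"), data = mtcars)
summary(final_mod)
```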
Two predictors, cyl and wt, give an R² of 0.83. That is remarkably close to the full 10-predictor model's 0.87, with a fraction of the complexity. The takeaway: when you only need the variables that matter most, Lasso's 1se rule often gives you the shortest story that still fits the data.
The lm() refit gives unbiased coefficients on the surviving set. Lasso's original coefficients were shrunk toward zero for prediction. Once you have selected variables, refitting OLS on just those gives you standard errors and CIs you can actually report, provided you caveat that selection happened on the same data.

Summary

Figure 3: Side-by-side strengths and trade-offs across stepwise, best subset, and Lasso.
| Method | R function | Best when | Watch out for |
|---|---|---|---|
| Stepwise AIC | MASS::stepAIC(model) | 15 to 40 predictors, prediction focus | Biased p-values in the final model |
| Stepwise BIC | stepAIC(model, k = log(n)) | Want parsimonious truth recovery | Can under-select when n is small |
| Best subset | leaps::regsubsets(formula, nvmax) | Under 30 predictors, strong interpretability | O(2^p) cost, stability under resampling |
| Lasso | glmnet::cv.glmnet(X, y, alpha = 1) | Many predictors, collinearity | Biased coefficients, scale-sensitive |
Three practical habits separate careful analysts from the rest. First, always standardise predictors before running Lasso, otherwise the L1 penalty acts asymmetrically across scales. Second, re-run selection on a held-out set or bootstrap resamples to see which variables are stable. Third, never report the final model's p-values as if selection had not happened, either refit on new data or use bootstrap confidence intervals.
References
- Hastie, T., Tibshirani, R., Friedman, J. The Elements of Statistical Learning, 2nd ed. Chapters 3.3 and 3.4. Link
- James, G., Witten, D., Hastie, T., Tibshirani, R. An Introduction to Statistical Learning with Applications in R, 2nd ed. Chapter 6. Link
- Venables, W.N., Ripley, B.D. stepAIC() reference. MASS package documentation. Link
- Lumley, T., Miller, A. regsubsets() reference. leaps package documentation. Link
- Friedman, J., Hastie, T., Tibshirani, R. glmnet vignette, "An Introduction to glmnet." Link
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society B, 58(1), 267-288. Link
- Harrell, F. Regression Modeling Strategies, 2nd ed. (2015). Chapter 4.3 on stepwise bias.
- Zhang, Y., Shen, X. (2010). "Regularization Parameter Selections via Generalized Information Criterion." Journal of the American Statistical Association. Link
Continue Learning
- Linear Regression Assumptions in R (Link), what to check before you believe any selected model, especially residual diagnostics on the final refit.
- Multiple Regression in R (Link), how the full-model fit you hand to stepAIC() is built in the first place, plus interpreting multi-predictor coefficients.
- Which Regression Model in R (Link), a broader decision framework that picks which kind of regression to fit before you worry about variables.