Model Selection Approaches
Many different models can be built from a given set of X variables, and choosing a good one can make all the difference. Here, we explore various approaches to building and evaluating regression models.
Data Prep
Let's prepare the data on which the various model selection approaches will be applied. A dataframe containing only the predictors and another containing the response variable are created for use in the model selection algorithms.
inputData <- read.csv("http://rstatistics.net/wp-content/uploads/2015/09/ozone2.csv", stringsAsFactors=F)
response_df <- inputData['ozone_reading'] # Y variable
predictors_df <- inputData[, !names(inputData) %in% "ozone_reading" ] # X variables
head(inputData)
#=> Month Day_of_month Day_of_week ozone_reading pressure_height Wind_speed Humidity
#=> 1 1 4 3.01 5480 8 20.00000
#=> 1 2 5 3.20 5660 6 48.41432
#=> 1 3 6 2.70 5710 4 28.00000
#=> 1 4 7 5.18 5700 3 37.00000
#=> 1 5 1 5.34 5760 3 51.00000
#=> 1 6 2 5.77 5720 4 69.00000
#=>
#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height Pressure_gradient
#=> 37.78175 35.31509 5000.000 -15
#=> 38.00000 45.79294 4060.589 -14
#=> 40.00000 48.48006 2693.000 -25
#=> 45.00000 49.19898 590.000 -24
#=> 54.00000 45.32000 1450.000 25
#=> 35.00000 49.64000 1568.000 15
#=>
#=> Inversion_temperature Visibility
#=> 30.56000 200
#=> 46.86914 300
#=> 47.66000 250
#=> 55.04000 100
#=> 57.02000 60
#=> 53.78000 60
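Before moving on, a quick structural check is useful. The dimensions in the comments below are inferred from the outputs shown later in this section (366 observations, 12 predictors plus the response), so treat this as a sanity check rather than a specification.
dim(inputData)       # expect 366 rows and 13 columns (12 predictors + ozone_reading)
dim(predictors_df)   # expect 12 predictor columns
nrow(response_df)    # same 366 rows as the predictors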
Stepwise Regression
In stepwise regression, we pass the full model to the step function. If scope is not given, it searches the full set of variables in the backward direction by default, performing multiple iterations and dropping one X variable at a time. In each iteration, several candidate models are built, each with one of the current X variables removed; the AIC of each candidate is computed and the model with the lowest AIC is retained for the next iteration. In simpler terms, the variable whose removal gives the minimum AIC is dropped, and this repeats until no further drop in AIC is achieved.
The code below shows how stepwise regression can be done. Because we provide the full model, a backward stepwise search is performed, which means variables will only be removed. In forward stepwise, variables are progressively added instead (a forward-selection sketch is shown after the output below).
lmMod <- lm(ozone_reading ~ ., data = inputData)  # full model with all predictors
selectedMod <- step(lmMod)  # backward stepwise search, dropping one variable at a time by AIC
summary(selectedMod)
#=> Call:
#=> lm(formula = ozone_reading ~ Month + pressure_height + Wind_speed +
#=> Humidity + Temperature_Sandburg + Temperature_ElMonte + Inversion_base_height,
#=> data = inputData)
#=>
#=> Residuals:
#=> Min 1Q Median 3Q Max
#=> -13.5219 -2.6652 -0.1885 2.5702 12.7184
#=>
#=> Coefficients:
#=> Estimate Std. Error t value Pr(>|t|)
#=> (Intercept) 97.9206462 27.5285900 3.557 0.000425 ***
#=> Month -0.3632285 0.0752403 -4.828 2.05e-06 ***
#=> pressure_height -0.0218974 0.0051670 -4.238 2.87e-05 ***
#=> Wind_speed -0.1738621 0.1207299 -1.440 0.150715
#=> Humidity 0.0817383 0.0132480 6.170 1.85e-09 ***
#=> Temperature_Sandburg 0.1532862 0.0403667 3.797 0.000172 ***
#=> Temperature_ElMonte 0.5149553 0.0686170 7.505 4.92e-13 ***
#=> Inversion_base_height -0.0003529 0.0001743 -2.025 0.043629 *
#=> ---
#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#=>
#=> Residual standard error: 4.233 on 358 degrees of freedom
#=> Multiple R-squared: 0.7186, Adjusted R-squared: 0.7131
#=> F-statistic: 130.6 on 7 and 358 DF, p-value: < 2.2e-16
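For completeness, here is a minimal sketch of the forward direction mentioned above: start from an intercept-only model and let step() add variables from the full scope one at a time. The object names minMod and fwdMod are illustrative, not part of the original workflow.
minMod <- lm(ozone_reading ~ 1, data = inputData)  # intercept-only starting model
fwdMod <- step(minMod, scope = formula(lmMod), direction = "forward")  # add variables one at a time by AIC
summary(fwdMod)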
all_vifs <- car::vif(selectedMod)
print(all_vifs)
#=> Month pressure_height Wind_speed Humidity
#=> 1.377397 5.995937 1.330647 1.386716
#=>
#=> Temperature_Sandburg Temperature_ElMonte Inversion_base_height
#=> 6.781597 11.616208 1.926758
Multicollinearity and Statistical Significance
Say one of the methods discussed above or below has given us a best model based on a criterion such as Adj R-sq. It is not guaranteed that the model will be free of multicollinearity (checked using car::vif) or even be statistically significant. To satisfy these two conditions, the approach below can be taken.
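As a reminder of what car::vif() reports: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the other predictors in the model. A quick manual check against the Temperature_ElMonte value printed above (a sanity check only, not part of the selection procedure):
# Regress Temperature_ElMonte on the other predictors of the stepwise-selected model
# and compute 1/(1 - R^2); this should reproduce the ~11.6 reported by car::vif() above.
r2 <- summary(lm(Temperature_ElMonte ~ Month + pressure_height + Wind_speed + Humidity +
                 Temperature_Sandburg + Inversion_base_height, data = inputData))$r.squared
1 / (1 - r2)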
Recursively remove variables with VIF > 4
signif_all <- names(all_vifs)
# Remove the var with the highest VIF and re-build the model until no VIF exceeds 4.
while(any(all_vifs > 4)){
  var_with_max_vif <- names(which(all_vifs == max(all_vifs)))  # get the var with max VIF
  signif_all <- signif_all[!(signif_all %in% var_with_max_vif)]  # drop it
  myForm <- as.formula(paste("ozone_reading ~ ", paste(signif_all, collapse=" + "), sep=""))  # new formula
  selectedMod <- lm(myForm, data=inputData)  # re-build model with new formula
  all_vifs <- car::vif(selectedMod)
}
summary(selectedMod)
# Call:
# lm(formula = myForm, data = inputData)
#
# Residuals:
# Min 1Q Median 3Q Max
# -15.5859 -3.4922 -0.3876 3.1741 16.7640
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -2.007e+02 1.942e+01 -10.335 < 2e-16 ***
# Month -2.322e-01 8.976e-02 -2.587 0.0101 *
# pressure_height 3.607e-02 3.349e-03 10.773 < 2e-16 ***
# Wind_speed 2.346e-01 1.423e-01 1.649 0.1001
# Humidity 1.391e-01 1.492e-02 9.326 < 2e-16 ***
# Inversion_base_height -1.122e-03 1.975e-04 -5.682 2.76e-08 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 5.172 on 360 degrees of freedom
# Multiple R-squared: 0.5776, Adjusted R-squared: 0.5717
# F-statistic: 98.45 on 5 and 360 DF, p-value: < 2.2e-16
car::vif(selectedMod)
# Month pressure_height Wind_speed Humidity Inversion_base_height
# 1.313154 1.687105 1.238613 1.178276 1.658603
The VIFs of all the X variables are now below 2, so the multicollinearity condition is satisfied. But Wind_speed, with a p-value above 0.1, is not statistically significant. For this specific case, we could simply re-build the model without Wind_speed and check that all variables are statistically significant. But what if a different dataset selected a model with two or more non-significant variables? And what if you had to select models for many such datasets? So, let's write generic code for this.
Recursively remove non-significant variables
all_vars <- names(coef(selectedMod))[-1]  # names of all X variables (drop the intercept)
# Get the non-significant vars
summ <- summary(selectedMod)  # model summary
pvals <- summ$coefficients[, 4]  # get all p values
not_significant <- character()  # init variables that aren't statistically significant
not_significant <- names(which(pvals > 0.1))
not_significant <- not_significant[!not_significant %in% "(Intercept)"]  # remove 'intercept'. Optional!

# As long as there are non-significant variables, drop the first one and re-fit.
while(length(not_significant) > 0){
  all_vars <- all_vars[!all_vars %in% not_significant[1]]
  myForm <- as.formula(paste("ozone_reading ~ ", paste(all_vars, collapse=" + "), sep=""))  # new formula
  selectedMod <- lm(myForm, data=inputData)  # re-build model with new formula

  # Get the non-significant vars.
  summ <- summary(selectedMod)
  pvals <- summ$coefficients[, 4]
  not_significant <- character()
  not_significant <- names(which(pvals > 0.1))
  not_significant <- not_significant[!not_significant %in% "(Intercept)"]
}
summary(selectedMod)
#=> Call:
#=> lm(formula = myForm, data = inputData)
#
#=> Residuals:
#=> Min 1Q Median 3Q Max
#=> -15.1537 -3.5541 -0.2294 3.2273 17.0106
#=>
#=> Coefficients:
#=> Estimate Std. Error t value Pr(>|t|)
#=> (Intercept) -1.989e+02 1.944e+01 -10.234 < 2e-16 ***
#=> Month -2.694e-01 8.709e-02 -3.093 0.00213 **
#=> pressure_height 3.589e-02 3.355e-03 10.698 < 2e-16 ***
#=> Humidity 1.466e-01 1.426e-02 10.278 < 2e-16 ***
#=> Inversion_base_height -1.047e-03 1.927e-04 -5.435 1.01e-07 ***
#=> ---
#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#=>
#=> Residual standard error: 5.184 on 361 degrees of freedom
#=> Multiple R-squared: 0.5744, Adjusted R-squared: 0.5697
#=> F-statistic: 121.8 on 4 and 361 DF, p-value: < 2.2e-16
car::vif(selectedMod)
#=> Month pressure_height Humidity Inversion_base_height
#=> 1.230346 1.685245 1.071214 1.570431
In the resulting model, all predictors are statistically significant and multicollinearity is within acceptable limits.
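A quick programmatic check that the final model satisfies both conditions (a small sketch reusing the objects created above; the 0.1 and 4 cut-offs are the ones used in the loops):
final_pvals <- summary(selectedMod)$coefficients[-1, 4]  # p-values of the predictors (drop the intercept)
all(final_pvals < 0.1)           # TRUE: every retained predictor is significant at the 10% level
all(car::vif(selectedMod) < 4)   # TRUE: no VIF exceeds 4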
Best subsets
Best subsets evaluates regression models built from all possible combinations of the predictors and lets you search, compare and visualise them. Unlike stepwise regression, you have more options: you can see which variables were included in the various shortlisted models, force in or force out some of the explanatory variables, and visually inspect each model's performance w.r.t. Adj R-sq.
library(leaps)
regsubsetsObj <- regsubsets(x=predictors_df, y=response_df, nbest = 2, really.big = T)
plot(regsubsetsObj, scale = "adjr2")  # regsubsets plot based on Adj R-sq
How to interpret the regsubsets plot?
The regsubsets plot shows the adjusted R-sq along the Y-axis for the many models created from combinations of the variables shown along the X-axis. To read it, draw an imaginary horizontal line across the plot from a point on the Y-axis: that line corresponds to a linear model whose X variables are the black boxes it touches. For example, the red line in the image touches the black boxes belonging to Intercept, Month, pressure_height, Humidity, Temperature_Sandburg and Temperature_ElMonte. The Adjusted R-sq for that model is the value at which the red line meets the Y-axis.
The caveat, however, is that these models are not guaranteed to be statistically significant.
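Rather than reading the plot, you can also pull out the subset with the highest Adj R-sq directly from the regsubsets object. A minimal sketch (the summary() method for regsubsets objects exposes the which matrix and the adjr2 values):
subsetSummary <- summary(regsubsetsObj)
bestIdx <- which.max(subsetSummary$adjr2)   # index of the shortlisted model with the highest Adj R-sq
subsetSummary$which[bestIdx, ]              # logical vector: which variables that model includes
coef(regsubsetsObj, bestIdx)                # its coefficients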
Leaps
The leaps() function is similar to best subsets, but it shortlists models using an efficient branch-and-bound search rather than fitting every possible subset.
library(leaps)
leapSet <- leaps(x=predictors_df, y=inputData$ozone_reading, nbest = 1, method = "adjr2") # criterion could be one of "Cp", "adjr2", "r2". Works for max of 32 predictors.
#=> $which
#=> 1 2 3 4 5 6 7 8 9 A B C
#=> 1 FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#=> 2 FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
#=> 3 TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
#=> 4 TRUE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
#=> 5 TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#=> 6 TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
#=> 7 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
#=> 8 TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
#=> 9 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE
#=> 10 TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
#=> 11 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
#=> 12 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#=>
#=> $adjr2
#=> [1] 0.5945612 0.6544828 0.6899196 0.6998209 0.7079506 0.7122214 0.7130796 0.7134627 0.7130404 0.7125416
#=> [11] 0.7119148 0.7112852
# Suppose we want to choose a model with 4 variables.
selectVarsIndex <- leapSet$which[4, ] # pick selected vars
newData <- cbind(response_df, predictors_df[, selectVarsIndex]) # new data for building selected model
selectedMod <- lm(ozone_reading ~ ., data=newData) # build model
summary(selectedMod)
#=> Call:
#=> lm(formula = ozone_reading ~ ., data = newData)
#=>
#=> Residuals:
#=> Min 1Q Median 3Q Max
#=> -13.9636 -2.8928 -0.0581 2.8549 12.6286
#=>
#=> Coefficients:
#=> Estimate Std. Error t value Pr(>|t|)
#=> (Intercept) 74.611786 27.188323 2.744 0.006368 **
#=> Month -0.426133 0.069892 -6.097 2.78e-09 ***
#=> pressure_height -0.018478 0.005137 -3.597 0.000366 ***
#=> Humidity 0.096978 0.012529 7.740 1.01e-13 ***
#=> Temperature_ElMonte 0.704866 0.049984 14.102 < 2e-16 ***
#=> ---
#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#=>
#=> Residual standard error: 4.33 on 361 degrees of freedom
#=> Multiple R-squared: 0.7031, Adjusted R-squared: 0.6998
#=> F-statistic: 213.7 on 4 and 361 DF, p-value: < 2.2e-16
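Instead of fixing the number of variables at 4, you could let the adjusted R-sq decide the cardinality. A sketch reusing the leapSet object above:
bestSize <- which.max(leapSet$adjr2)          # cardinality with the highest Adj R-sq (8 here, per the output above)
selectVarsIndex <- leapSet$which[bestSize, ]  # variables of that model
newData <- cbind(response_df, predictors_df[, selectVarsIndex])
summary(lm(ozone_reading ~ ., data = newData))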
RegBest() from FactoMineR
library(FactoMineR)
regMod <- RegBest(y=inputData$ozone_reading, x = predictors_df)
regMod$all # summary of the best model of each size, based on Adj R-sq
regMod$best # best model
#=> Call:
#=> lm(formula = as.formula(as.character(formul)), data = don)
#=>
#=> Residuals:
#=> Min 1Q Median 3Q Max
#=> -13.6805 -2.6589 -0.1952 2.6045 12.6521
#=>
#=> Coefficients:
#=> Estimate Std. Error t value Pr(>|t|)
#=> (Intercept) 88.8519747 26.8386969 3.311 0.001025 **
#=> Month -0.3354044 0.0728259 -4.606 5.72e-06 ***
#=> pressure_height -0.0202670 0.0050489 -4.014 7.27e-05 ***
#=> Humidity 0.0784813 0.0130730 6.003 4.73e-09 ***
#=> Temperature_Sandburg 0.1450456 0.0400188 3.624 0.000331 ***
#=> Temperature_ElMonte 0.5069526 0.0684938 7.401 9.65e-13 ***
#=> Inversion_base_height -0.0004224 0.0001677 -2.518 0.012221 *
#=> ---
#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#=>
#=> Residual standard error: 4.239 on 359 degrees of freedom
#=> Multiple R-squared: 0.717, Adjusted R-squared: 0.7122
#=> F-statistic: 151.6 on 6 and 359 DF, p-value: < 2.2e-16
Simulated Annealing
Given a set of variables, a simulated annealing algorithm seeks a k-variable subset that is optimal, as a surrogate for the whole set, with respect to a given criterion. Annealing thus offers another way of finding the best subsets of predictor variables. Since a correlation or covariance matrix is the input to the anneal() function, only continuous variables can be used to compute the best subsets.
library(subselect)
results <- anneal(cor(predictors_df), kmin=1, kmax=ncol(predictors_df)-1, nsol=4, niter=10, setseed=TRUE) # perform annealing
print(results$bestsets)
#=> Var.1 Var.2 Var.3 Var.4 Var.5 Var.6 Var.7 Var.8 Var.9 Var.10 Var.11
#=> Card.1 11 0 0 0 0 0 0 0 0 0 0
#=> Card.2 7 10 0 0 0 0 0 0 0 0 0
#=> Card.3 5 6 8 0 0 0 0 0 0 0 0
#=> Card.4 1 2 6 11 0 0 0 0 0 0 0
#=> Card.5 1 3 5 6 11 0 0 0 0 0 0
#=> Card.6 2 3 5 6 9 11 0 0 0 0 0
#=> Card.7 1 2 3 5 10 11 12 0 0 0 0
#=> Card.8 1 2 3 4 5 6 8 12 0 0 0
#=> Card.9 1 2 3 4 5 6 9 10 12 0 0
#=> Card.10 1 2 3 4 5 6 8 9 10 12 0
#=> Card.11 1 2 3 4 5 6 7 8 9 10 12
The bestsets component of the output reveals the best variables to select for each cardinality (number of predictors). The values inside results$bestsets are the column index positions of predictors_df, that is, which variables are selected for each cardinality.
num_vars <- 3
selectVarsIndex <- results$bestsets[num_vars, 1:num_vars]
newData <- cbind(response_df, predictors_df[, selectVarsIndex]) # new data for building selected model
selectedMod <- lm(ozone_reading ~ ., data=newData) # build model
summary(selectedMod)
#=> Call:
#=> lm(formula = ozone_reading ~ ., data = newData)
#=>
#=> Residuals:
#=> Min 1Q Median 3Q Max
#=> -14.6948 -2.7279 -0.3532 2.9004 13.4161
#=>
#=> Coefficients:
#=> Estimate Std. Error t value Pr(>|t|)
#=> (Intercept) -23.98819 1.50057 -15.986 < 2e-16 ***
#=> Wind_speed 0.08796 0.11989 0.734 0.464
#=> Humidity 0.11169 0.01319 8.468 6.34e-16 ***
#=> Temperature_ElMonte 0.49985 0.02324 21.506 < 2e-16 ***
#=> ---
#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#=>
#=> Residual standard error: 4.648 on 362 degrees of freedom
#=> Multiple R-squared: 0.6569, Adjusted R-squared: 0.654
#=> F-statistic: 231 on 3 and 362 DF, p-value: < 2.2e-16
Like the other methods, anneal() does not guarantee that the selected model will be statistically significant.
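To help decide how many variables are worth keeping, the anneal() result also reports the best criterion value attained at each cardinality; inspecting it shows where the gains flatten out (a sketch using the results object from the anneal() call above):
print(results$bestvalues)  # best criterion value achieved for each cardinality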
Comparing Models Using ANOVA
If you have two or more models that are subsets of a larger model, you can use anova() to check whether the additional variable(s) contribute to the predictive ability of the model. In the example below, baseMod is a model built with 7 explanatory variables, while each of mod1 through mod4 drops one more predictor than the model before it.
# ANOVA
baseMod <- lm(ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg + Temperature_ElMonte + Inversion_base_height + Wind_speed, data=inputData)
mod1 <- lm(ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg + Temperature_ElMonte + Inversion_base_height, data=inputData)
mod2 <- lm(ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg + Temperature_ElMonte, data=inputData)
mod3 <- lm(ozone_reading ~ Month + pressure_height + Humidity + Temperature_ElMonte, data=inputData)
mod4 <- lm(ozone_reading ~ Month + pressure_height + Temperature_ElMonte, data=inputData)
anova(baseMod, mod1, mod2, mod3, mod4)
#=> Model 1: ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg +
#=> Temperature_ElMonte + Inversion_base_height + Wind_speed
#=> Model 2: ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg +
#=> Temperature_ElMonte + Inversion_base_height
#=> Model 3: ozone_reading ~ Month + pressure_height + Humidity + Temperature_Sandburg +
#=> Temperature_ElMonte
#=> Model 4: ozone_reading ~ Month + pressure_height + Humidity + Temperature_ElMonte
#=> Model 5: ozone_reading ~ Month + pressure_height + Temperature_ElMonte
#=>
#=> Res.Df RSS Df Sum of Sq F Pr(>F)
#=> 1    358 6414.4
#=> 2    359 6451.5 -1    -37.16  2.0739  0.150715
#=> 3    360 6565.5 -1   -113.98  6.3616  0.012095 *
#=> 4    361 6767.0 -1   -201.51 11.2465  0.000883 ***
#=> 5    362 7890.0 -1  -1123.00 62.6772 3.088e-14 ***
#=> ---
#=> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For each row of the output, anova() tests a hypothesis comparing two models. For instance, row 2 compares baseMod (Model 1) and mod1 (Model 2). The null hypothesis is that the two models fit the data (i.e. the Y variable) equally well, while the alternative hypothesis is that the larger model is better (i.e. the additional X variable improves the model).
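The row-2 test can also be run as a direct two-model comparison, which makes the hypothesis being tested explicit:
anova(mod1, baseMod)  # does adding Wind_speed to mod1 significantly improve the fit? (F = 2.07, p = 0.15 above)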
So what's the inference? Except for row 2, every row with a test has a significant p-value. This means the extra variable in each of Models 2, 3 and 4 (relative to the next smaller model) is contributing to that model. From row 2, however, Wind_speed is not making baseMod (Model 1) any better than mod1. So the best model we have amongst this set is mod1 (Model 2).