r-statistics.co by Selva Prabhakaran


Ridge Regression

Ridge regression is a commonly used technique for addressing multicollinearity among predictors. Whether it actually outperforms a carefully specified linear model is, however, debatable.
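For context, ridge regression adds a penalty on the size of the coefficients to the least-squares fit, which shrinks them and stabilizes the estimates when predictors are strongly correlated. The sketch below computes the textbook closed-form estimate on standardized predictors from R's built-in longley data. The helper name ridgeCoef and the penalty lambda = 1 are assumptions made for illustration; this is not necessarily how the ridge package computes or selects its penalty.

```r
data(longley, package = "datasets")

# Standardize the predictors and center the response so no intercept is needed
X <- scale(as.matrix(longley[, -1]))
y <- longley$GNP.deflator - mean(longley$GNP.deflator)

# Hypothetical helper: closed-form ridge estimate (X'X + lambda * I)^-1 X'y
ridgeCoef <- function(X, y, lambda) {
  p <- ncol(X)
  drop(solve(crossprod(X) + lambda * diag(p), crossprod(X, y)))
}

olsBeta   <- ridgeCoef(X, y, lambda = 0)  # lambda = 0 is ordinary least squares
ridgeBeta <- ridgeCoef(X, y, lambda = 1)  # lambda = 1 is an arbitrary illustrative value

# Coefficients are on the standardized scale; the ridge column is shrunk toward zero
round(cbind(ols = olsBeta, ridge = ridgeBeta), 3)
```

Any positive lambda gives a coefficient vector with a smaller overall norm than the least-squares solution; larger values shrink more.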

Introduction

Let us apply ridge regression to the longley dataset. We will first predict GNP.deflator with lm(), using the remaining variables as predictors, and then compare that model and its results with a model built using ridge regression.

library(car)  # for vif()
library(ridge)  # for linearRidge()
data(longley, package="datasets")  # initialize data
head(longley, 4)  # show top 4 rows of data
#>      GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
#> 1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
#> 1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
#> 1949         88.2 258.054      368.2        161.6    109.773 1949   60.171
#> 1950         89.5 284.599      335.1        165.0    110.929 1950   61.187
inputData <- data.frame(longley)  # plug in your data here
colnames(inputData)[1] <- "response"  # rename response var

Calculate Correlations

XVars <- inputData[, -1] # X variables
round(cor(XVars), 2)  # correlation matrix of the predictors
#>               GNP Unemployed Armed.Forces Population Year Employed
#> GNP          1.00       0.60         0.45       0.99 1.00     0.98
#> Unemployed   0.60       1.00        -0.18       0.69 0.67     0.50
#> Armed.Forces 0.45      -0.18         1.00       0.36 0.42     0.46
#> Population   0.99       0.69         0.36       1.00 0.99     0.96
#> Year         1.00       0.67         0.42       0.99 1.00     0.97
#> Employed     0.98       0.50         0.46       0.96 0.97     1.00
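To pull the strongly correlated pairs out of a matrix like this one programmatically, something along the following lines may help. The 0.9 cutoff and the variable names corMat, cutoff and highPairs are assumptions chosen for illustration, not part of the original analysis.

```r
data(longley, package = "datasets")
XVars <- longley[, -1]  # predictors only (drops GNP.deflator)

corMat <- cor(XVars)
corMat[upper.tri(corMat, diag = TRUE)] <- NA  # keep each pair once, ignore the diagonal

cutoff <- 0.9  # assumed threshold for "highly correlated"
idx <- which(abs(corMat) > cutoff, arr.ind = TRUE)  # NA entries are skipped by which()

highPairs <- data.frame(
  var1 = rownames(corMat)[idx[, "row"]],
  var2 = colnames(corMat)[idx[, "col"]],
  cor  = round(corMat[idx], 2)
)
highPairs
```

With this cutoff, every pair drawn from GNP, Population, Year and Employed is flagged, while Unemployed and Armed.Forces are not, which matches the matrix printed above.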

Prepare Training And Test Data

set.seed(100) # set seed to replicate results
trainingIndex <- sample(1:nrow(inputData), 0.8*nrow(inputData)) # indices for 80% training data
trainingData <- inputData[trainingIndex, ] # training data
testData <- inputData[-trainingIndex, ] # test data
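Since 0.8 * 16 is 12.8, the split above relies on sample() truncating the size to 12. A slightly more explicit variant, sketched below with an assumed helper variable nTrain, makes the sizes visible. Note that R 3.6.0 changed the default sampling algorithm, so set.seed(100) may select different rows than the ones shown in this article if its output was produced with an older R version.

```r
data(longley, package = "datasets")
inputData <- data.frame(longley)
colnames(inputData)[1] <- "response"

nTrain <- floor(0.8 * nrow(inputData))  # 12 of the 16 rows

set.seed(100)  # rows chosen may differ across R versions (see note above)
trainingIndex <- sample(seq_len(nrow(inputData)), nTrain)
trainingData  <- inputData[trainingIndex, ]
testData      <- inputData[-trainingIndex, ]

c(train = nrow(trainingData), test = nrow(testData))
```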

Predict Using Linear Regression

lmMod <- lm(response ~ ., trainingData)  # the linear reg model
summary(lmMod)  # get summary
vif(lmMod) # get VIF

#>  VIF
#>        GNP   Unemployed Armed.Forces   Population         Year     Employed 
#> 1523.74714     93.07635     10.74587    350.58472   2175.29221    182.93609

#>  Coefficients:
#>    (Intercept)           GNP    Unemployed  Armed.Forces    Population          Year    Employed
#>     7652.25192       0.39214       0.06462       0.01573      -2.33550      -3.83113     0.53060

There is severe multicollinearity among the predictors: the VIFs of GNP, Year, Population and Employed are all far above the usual rule-of-thumb threshold of 10, and the coefficients of Population and Year come out negative. These variables may not contribute much to explaining the dependent variable. Nevertheless, let's see what this model predicts.
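To make the VIF figures less of a black box: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on the full longley data (not the training subset used above, so the numbers will differ from the printed ones). The name manualVif is an assumption for illustration.

```r
data(longley, package = "datasets")
XVars <- longley[, -1]  # predictors only

# For each predictor, regress it on the remaining predictors and convert R^2 to a VIF
manualVif <- sapply(names(XVars), function(v) {
  others <- setdiff(names(XVars), v)
  fit <- lm(reformulate(others, response = v), data = XVars)
  1 / (1 - summary(fit)$r.squared)
})

round(manualVif, 2)
```

A predictor that is almost a linear combination of the others has an R² close to 1 and therefore a very large VIF, which is what happens with GNP and Year here.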

predicted <- predict(lmMod, testData)  # predict on test data
compare <- cbind(actual=testData$response, predicted)  # combine actual and predicted
#>      actual predicted
#> 1949   88.2  88.45501
#> 1953   99.0  96.67492
#> 1957  108.4 106.59672
#> 1959  112.6 113.31106
mean(apply(compare, 1, min)/apply(compare, 1, max))  # min-max accuracy: mean of min(actual, predicted)/max(actual, predicted)
#>  98.76%
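The accuracy measure used here is the mean, over test rows, of the smaller of actual and predicted divided by the larger, so a perfect prediction scores 1. As a sketch, it can be wrapped in a small function; minMaxAccuracy is a hypothetical name, and the input values are copied from the comparison table printed above.

```r
# Values copied from the printed lm() comparison above
actual    <- c(88.2, 99.0, 108.4, 112.6)
predicted <- c(88.45501, 96.67492, 106.59672, 113.31106)

# Hypothetical helper: row-wise min/max ratio, averaged over all rows
minMaxAccuracy <- function(actual, predicted) {
  mean(pmin(actual, predicted) / pmax(actual, predicted))
}

minMaxAccuracy(actual, predicted)  # roughly 0.988
```

This measure only makes sense when actual and predicted values are all positive, as they are for GNP.deflator.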

Apply Ridge Regression On Same Data

linRidgeMod <- linearRidge(response ~ ., data = trainingData)  # the ridge regression model
#>  No more negative coefficients on the predictors (only the intercept is negative)
#>   (Intercept)           GNP    Unemployed  Armed.Forces    Population          Year     Employed
#> -1.015385e+03  3.715498e-02  1.328002e-02  1.707769e-02  1.294903e-01  5.318930e-01 5.976266e-01

predicted <- predict(linRidgeMod, testData)  # predict on test data
compare <- cbind(actual=testData$response, predicted)  # combine actual and predicted
#>      actual predicted
#> 1949   88.2  88.68584
#> 1953   99.0  99.26104
#> 1957  108.4 106.99370
#> 1959  112.6 110.95450
mean(apply(compare, 1, min)/apply(compare, 1, max))  # min-max accuracy
#>  99.10%

In this case, ridge regression improves the accuracy slightly, from 98.76% to 99.10%. With only four test observations, this difference should be read as indicative rather than conclusive.
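As a cross-check with a second metric, the root mean squared error (RMSE) can be computed from the two comparison tables printed above. The rmse helper and the vector names are assumptions for illustration; the numbers are copied from the article's output.

```r
# Actual values and predictions copied from the two printed comparison tables
actual    <- c(88.2, 99.0, 108.4, 112.6)
lmPred    <- c(88.45501, 96.67492, 106.59672, 113.31106)
ridgePred <- c(88.68584, 99.26104, 106.99370, 110.95450)

# Hypothetical helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

round(c(lm = rmse(actual, lmPred), ridge = rmse(actual, ridgePred)), 2)
```

On these four rows the ridge model also has the lower RMSE (about 1.12 versus about 1.52), which points in the same direction as the min-max accuracy.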

Predicting With A Re-calibrated Linear Model

newlmMod <- lm(response ~ ., trainingData[, -c(2, 5, 6)]) # without "GNP", "Population" & "Year"
summary(newlmMod)  # get summary
vif(newlmMod) # get VIF
#> Coefficients:
#>  (Intercept)    Unemployed  Armed.Forces      Employed  
#>    -62.19771       0.03248       0.02714       2.24039 
#>  VIF
#>   Unemployed Armed.Forces     Employed 
#>     2.124153     1.452648     2.592474
predicted <- predict(newlmMod, testData) # predict on test data
compare <- cbind(actual=testData$response, predicted)  # for comparison
mean(apply(compare, 1, min)/apply(compare, 1, max))  # min-max accuracy
#>  99.21%

Once the collinear predictors are removed, the re-calibrated linear model yields the best accuracy of the three models (99.21%). This analysis alone is not sufficient to draw conclusions about the effectiveness of ridge regression. The intention, rather, is to open up new modeling options to consider when solving a problem.