6  Ridge Regression

For variable selection, lasso regression is generally preferred over Ridge regression, but Ridge came first historically. The Ridge modification adds bias to the standard least-squares solution in the form of a weighted sum of the squared values of the regression coefficients.

Ridge regression: Estimate the coefficients of the model by minimizing the sum of the squared residuals plus a weighted sum of the squared slope coefficients, which shrinks the larger, traditional OLS regression coefficients.

The function that is minimized is the standard least-squares function with a modification: add a penalty, the weighted sum of the squared slope coefficients. The intercept is not included in the penalty.

\[\textrm{minimize:} \;\; \sum(y_i - \hat y_i)^2 + \lambda \sum b^2_j\]
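The penalized criterion is straightforward to compute directly. As a minimal illustration (a hypothetical helper written for this explanation, not part of any package), the objective for a given intercept b0, slope vector b, and penalty \(\lambda\) could be coded as:

ridge_objective <- function(b0, b, X, y, lambda) {
  y_hat <- b0 + as.vector(X %*% b)        # fitted values
  sum((y - y_hat)^2) + lambda * sum(b^2)  # SSE plus the penalty on the slopes
}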

Ridge regression prevents extreme values of the coefficients. This method may be preferred when all or most features contribute to the overall fit and predictive accuracy.

6.1 Step 2: Search for the Optimal Value of \(\lambda\)

The same principles apply as with lasso regression for running glmnet(): find the best value of \(\lambda\).

6.1.1 Cross-Validation across a Range of \(\lambda\)s

Set parameter alpha to 0 to specify Ridge regression.

set.seed(123) 
out <- cv.glmnet(X, y, alpha=0, nfolds=10, standardize=TRUE)
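This step assumes the setup carried over from the lasso analysis: the glmnet package loaded, y the response, and X the matrix of predictors. Judging from the variable names that follow, the data are mtcars, so the assumed setup is presumably along these lines:

library(glmnet)               # cv.glmnet(), glmnet()
X <- as.matrix(mtcars[, -1])  # the 10 predictors, cyl through carb
y <- mtcars$mpg               # response: fuel economy (mpg)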

6.1.2 Plot MSE Fit for the Trial \(\lambda\)s

Plot the cross-validation results with the specialized glmnet function plot.cv.glmnet(), invoked below simply as plot(). The visualization also lists, along the top, the number of predictor variables retained in the model as log(\(\lambda\)) increases.

class(out)
[1] "cv.glmnet"
plot(out)

We can get the exact values of \(\lambda_{min}\) and \(\lambda_{1se}\) with print() applied to the output from cv.glmnet().

print(out)

Call:  cv.glmnet(x = X, y = y, nfolds = 10, alpha = 0, standardize = TRUE) 

Measure: Mean-Squared Error 

    Lambda Index Measure    SE Nonzero
min  2.747    82   7.554 1.690      10
1se 11.089    67   9.178 2.512      10

There is no variable selection: all 10 predictors remain in the model. The estimated coefficients shrink as \(\lambda\) increases, and fit ultimately deteriorates as \(\lambda\) grows much beyond exp(2.5) = 12.182.
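To view that shrinkage numerically, the coefficients at the two reference values of \(\lambda\) can be extracted directly from the cross-validation object (an optional check, not part of the original analysis):

cf <- cbind(as.matrix(coef(out, s="lambda.min")),
            as.matrix(coef(out, s="lambda.1se")))
colnames(cf) <- c("lambda.min", "lambda.1se")
round(cf, 3)   # coefficients shrink at the larger lambda.1se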

6.1.3 View Range of Trial \(\lambda\)s

lambdas <- out$lambda
as.numeric(.fmt(lambdas, 3))
  [1] 5146.981 4689.737 4273.114 3893.502 3547.614 3232.454 2945.292 2683.640
  [9] 2445.233 2228.005 2030.075 1849.729 1685.404 1535.678 1399.252 1274.947
 [17] 1161.684 1058.483  964.450  878.771  800.704  729.571  664.758  605.703
 [25]  551.894  502.865  458.192  417.488  380.399  346.605  315.814  287.758
 [33]  262.194  238.902  217.678  198.340  180.720  164.666  150.037  136.708
 [41]  124.564  113.498  103.415   94.228   85.857   78.230   71.280   64.948
 [49]   59.178   53.921   49.130   44.766   40.789   37.165   33.864   30.855
 [57]   28.114   25.617   23.341   21.267   19.378   17.657   16.088   14.659
 [65]   13.357   12.170   11.089   10.104    9.206    8.388    7.643    6.964
 [73]    6.345    5.782    5.268    4.800    4.374    3.985    3.631    3.309
 [81]    3.015    2.747    2.503    2.280    2.078    1.893    1.725    1.572
 [89]    1.432    1.305    1.189    1.083    0.987    0.899    0.820    0.747
 [97]    0.680    0.620    0.565    0.515

Ridge regression works with larger values of \(\lambda\) than does the lasso version.

head(lambdas, n=3)
[1] 5146.981 4689.737 4273.114
tail(lambdas, n=3)
[1] 0.6199557 0.5648805 0.5146981

The trial values of \(\lambda\) range from over 5000 to about 1/2.
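glmnet() generates this sequence automatically. If desired, a comparable grid can also be supplied explicitly through the lambda parameter, as in this sketch of a log-spaced sequence covering roughly the same range:

lambda_grid <- exp(seq(log(5000), log(0.5), length.out=100))  # decreasing, log-spaced
out2 <- cv.glmnet(X, y, alpha=0, lambda=lambda_grid, nfolds=10, standardize=TRUE)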

6.1.4 Examine the Slopes for Various \(\lambda\)s

Another optional but useful step visualizes how the slope coefficients shrink toward 0 as the value of \(\lambda\) increases.

for (i in 1:ncol(X)) cat(i, colnames(X)[i], "\n")
1 cyl 
2 disp 
3 hp 
4 drat 
5 wt 
6 qsec 
7 vs 
8 am 
9 gear 
10 carb 
fit <- glmnet(X, y, alpha=0, lambda=lambdas, standardize=TRUE)
colors <- getColors("distinct", n=ncol(X))
plot(fit, xvar="lambda", col=colors, lwd=1.5, label=TRUE)
legend("bottomright", lwd=3, col=colors, legend=colnames(X), cex=.7)

From the plot, for the range of \(\lambda\)s examined, and unlike lasso regression, none of the coefficients are driven all the way to 0. The largest trial value of \(\lambda\) is 5146.981, so the horizontal axis extends to about log(5146.981) = 8.5.
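The same pattern can be read numerically by interpolating the coefficient path at a few trial values of \(\lambda\) (values chosen here only for illustration):

# larger penalties produce smaller slope coefficients, but none become exactly 0
round(as.matrix(coef(fit, s=c(5000, 100, 11, 0.5))), 3)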

6.1.5 Retrieve the Optimal \(\lambda\) for These Data

The value of \(\lambda\) that minimized MSE, and the value that resulted in a not-much-larger MSE with stronger regularization, are included in the output structure out, named lambda.min and lambda.1se, respectively. Here, also show the logarithm of the chosen value just to locate it on the previous visualization. Let’s choose the \(\lambda\) that provides the more strongly regularized model.

lambdaBest <- out$lambda.1se
lambdaBest
[1] 11.08883
log(lambdaBest)
[1] 2.405939

\(\lambda_{Best}=\) 11.089.
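Because the horizontal axis of the cross-validation plot is log(\(\lambda\)), the chosen value can be marked directly on that plot if desired (plot() of the cv.glmnet output already draws reference lines at the two candidate \(\lambda\)s):

plot(out)
abline(v=log(lambdaBest), lty="dotted")  # locate the chosen lambda on the log scale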

6.2 Step 3: Do the Ridge Regression with Chosen \(\lambda\)

6.2.1 Fit the Model

Use lambdaBest to fit the final model with glmnet(). The default value of standardize for the X variables is TRUE, but it is included here for clarity. If the predictor variables are already scaled in the same units, then it is perhaps best not to standardize.

Again, store the results of glmnet() in a list object, here named model. List the model’s estimated coefficients with coef(), which reports the intercept along with the slope coefficients stored in the beta component of the list.

model <- glmnet(X, y, alpha=0, lambda=lambdaBest, standardize=TRUE)
coef(model)
11 x 1 sparse Matrix of class "dgCMatrix"
                      s0
(Intercept) -0.399505099
cyl         -0.350861273
disp        -0.005087196
hp          -0.009188777
drat         0.952590375
wt          -0.815179900
qsec         0.148931830
vs           0.860501125
am           1.107400626
gear         0.479472234
carb        -0.342755367

Some estimated coefficients are close to 0 but, unlike lasso regularization, none of the predictor variables have disappeared from the model.
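To list only the slope coefficients, without the intercept, access the beta component of the fitted model directly:

model$beta                        # slopes only, as a sparse matrix
# equivalently: coef(model)[-1, , drop=FALSE]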

6.2.2 Evaluate Fit

Get the fitted values from which to compute R-squared to evaluate model fit. Function predict.glmnet() calculates the fitted values, \(\hat y\), from the model and the given values of X. Because the fitted model is of class glmnet, refer to the predict function simply as predict() and R dispatches the appropriate version of the function.

\(R^2\) equals the squared correlation of the actual data values, \(y\), with the fitted data values, \(\hat y\). Compute the correlation with the R function cor(), then square it.

y_hat <- predict(model, X)
rsq <- cor(y, y_hat)^2
rsq
           s0
mpg 0.8424074

\(R^2=\) 0.842 for the Ridge regression model with \(\lambda=\) 11.089.

When comparing to standard OLS regression on the training data, the Ridge regression \(R^2=\) 0.842 is a little lower than the OLS \(R^2=0.869\), as it must be, because the OLS solution minimizes SSE, the sum of squared errors.
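For reference, that OLS comparison can be reproduced on the same training data (again assuming the mtcars data implied by the variable names):

ols <- lm(mpg ~ ., data=mtcars)   # all 10 predictors, no penalty
summary(ols)$r.squared            # about 0.869, the value quoted above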