6 Ridge Regression
For variable selection, lasso regression is generally preferred over Ridge regression, but Ridge is historically prior. The Ridge modification to the standard least-squares solution includes bias in the form of a weighted sum of the squared values of the regression coefficients.
Ridge regression: Estimate the coefficients of the model by minimizing the sum of the squares of the residuals plus a weighted sum of the squared slope coefficients, which shrinks the larger, traditional OLS regression coefficients.
The function that is minimized is the standard least squares function with a modification: add a penalty, the weighted sum of the squared slope coefficients. The intercept is not penalized.
\[\textrm{minimize:} \;\; \sum(y_i - \hat y_i)^2 + \lambda \sum b^2_j\]
Ridge regression prevents extreme values of the coefficients. This method may be preferred when all or most features contribute to overall fit and predictability.
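As a concrete sketch of this objective, the quantity minimized can be written directly in R. The function name ridge_loss is hypothetical and is not used elsewhere in this chapter; glmnet() performs the actual minimization and scales these two terms slightly differently internally.
# ridge objective following the formula above: sum of squared residuals plus
# lambda times the sum of squared slope coefficients; b[1] is the intercept,
# which is not penalized
ridge_loss <- function(b, X, y, lambda) {
  y_hat <- b[1] + X %*% b[-1]
  sum((y - y_hat)^2) + lambda * sum(b[-1]^2)
}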
6.1 Step 2: Search for the Optimal Value of \(\lambda\)
Apply the same principles as with lasso regression for running glmnet(): first find the best value of \(\lambda\).
6.1.1 Cross-Validation across a Range of \(\lambda\)s
Set parameter alpha to 0 to specify Ridge regression.
set.seed(123)
out <- cv.glmnet(X, y, alpha=0, nfolds=10, standardize=TRUE)
6.1.2 Plot MSE Fit for the Trial \(\lambda\)s
Plot the cross-validation results with the specialized glmnet function plot.cv.glmnet(), invoked simply as plot() on the cv.glmnet output. The visualization also lists along the top the number of predictor variables retained in the model as log(\(\lambda\)) increases.
class(out)
[1] "cv.glmnet"
plot(out)
We can get the exact values of \(\lambda_{min}\) and \(\lambda_{1se}\) with print() applied to the output from cv.glmnet().
print(out)
Call: cv.glmnet(x = X, y = y, nfolds = 10, alpha = 0, standardize = TRUE)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 2.747 82 7.554 1.690 10
1se 11.089 67 9.178 2.512 10
There is no variable selection: all 10 predictors remain in the model. There is shrinkage of the estimated coefficients as \(\lambda\) increases, and ultimately a decrease in fit as \(\lambda\) grows much above exp(2.5) = 12.182.
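The two rows of this table can also be pulled directly from components of out; the component names used below (lambda, cvm, lambda.min, lambda.1se) are part of the standard cv.glmnet output.
# cross-validated MSE at the two marked lambdas, matching the table above
out$cvm[out$lambda == out$lambda.min]
out$cvm[out$lambda == out$lambda.1se]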
6.1.3 View Range of Trial \(\lambda\)s
lambdas <- out$lambda
as.numeric(.fmt(lambdas, 3))
[1] 5146.981 4689.737 4273.114 3893.502 3547.614 3232.454 2945.292 2683.640
[9] 2445.233 2228.005 2030.075 1849.729 1685.404 1535.678 1399.252 1274.947
[17] 1161.684 1058.483 964.450 878.771 800.704 729.571 664.758 605.703
[25] 551.894 502.865 458.192 417.488 380.399 346.605 315.814 287.758
[33] 262.194 238.902 217.678 198.340 180.720 164.666 150.037 136.708
[41] 124.564 113.498 103.415 94.228 85.857 78.230 71.280 64.948
[49] 59.178 53.921 49.130 44.766 40.789 37.165 33.864 30.855
[57] 28.114 25.617 23.341 21.267 19.378 17.657 16.088 14.659
[65] 13.357 12.170 11.089 10.104 9.206 8.388 7.643 6.964
[73] 6.345 5.782 5.268 4.800 4.374 3.985 3.631 3.309
[81] 3.015 2.747 2.503 2.280 2.078 1.893 1.725 1.572
[89] 1.432 1.305 1.189 1.083 0.987 0.899 0.820 0.747
[97] 0.680 0.620 0.565 0.515
Ridge regression works with larger values of \(\lambda\) than does the lasso version.
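As a rough check of that claim, not part of the original analysis, the default \(\lambda\) sequence that glmnet() generates for the lasso version (alpha=1) of the same model tops out far lower; the fit below is only illustrative and is not used elsewhere.
# default lasso lambda sequence for the same data, for comparison
lasso_lambdas <- glmnet(X, y, alpha=1, standardize=TRUE)$lambda
range(lasso_lambdas)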
head(lambdas, n=3)
[1] 5146.981 4689.737 4273.114
tail(lambdas, n=3)
[1] 0.6199557 0.5648805 0.5146981
The trial values of \(\lambda\) range from over 5000 down to about 0.5.
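The sequence itself is not arbitrary: glmnet() generates trial values equally spaced on the log scale, which can be verified from the stored lambdas. This check is an added illustration, not part of the original output.
# successive lambdas differ by a constant factor, i.e., equal spacing on the log scale
head(lambdas[-1] / lambdas[-length(lambdas)], 3)
range(diff(log(lambdas)))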
6.1.4 Examine the Slopes for Various \(\lambda\)s
Another optional but useful step visualizes how the slope coefficients shrink toward 0 as the value of \(\lambda\) increases.
for (i in 1:ncol(X)) cat(i, colnames(X)[i], "\n")
1 cyl
2 disp
3 hp
4 drat
5 wt
6 qsec
7 vs
8 am
9 gear
10 carb
fit <- glmnet(X, y, alpha=0, lambda=lambdas, standardize=TRUE)
colors <- getColors("distinct", n=ncol(X))
plot(fit, xvar="lambda", col=colors, lwd=1.5, label=TRUE)
legend("bottomright", lwd=3, col=colors, legend=colnames(X), cex=.7)
From the plot, for the range of \(\lambda\)s examined, unlike lasso regression, none of the coefficients reach 0. The largest trial value of \(\lambda\) is 5146.981, so the horizontal axis extends to about log(\(\lambda\)) = 8.5.
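To see the shrinkage numerically as well as graphically, the slope coefficients stored in fit can be extracted at the smallest and largest trial values of \(\lambda\). This side-by-side listing is an added illustration, not part of the original output.
# slope coefficients at the two extremes of the trial lambda sequence:
# heavily shrunk at the largest lambda, but none exactly zero
cbind(smallest=coef(fit, s=min(lambdas))[-1, 1],
      largest=coef(fit, s=max(lambdas))[-1, 1])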
6.1.5 Retrieve the Optimal \(\lambda\) for These Data
The value of \(\lambda\) that minimizes MSE and the value that results in a not-much-larger MSE but more shrinkage of the coefficients are included in the output structure out, named lambda.min and lambda.1se, respectively. Here, also show the logarithm of the value to place it on the previous visualization. Let's choose lambda.1se, the \(\lambda\) that provides the simpler, more heavily regularized model.
lambdaBest <- out$lambda.1se
lambdaBest
[1] 11.08883
log(lambdaBest)
[1] 2.405939
\(\lambda_{Best}=\) 11.089.
6.2 Step 3: Do the Ridge Regression with Chosen \(\lambda\)
6.2.1 Fit the Model
Use lambdaBest to fit the final model with glmnet(). The default value of standardize for the X variables is TRUE, but it is included here for clarity. If the predictor variables are already scaled in the same units, then it is perhaps best not to standardize.
Again, store the results of glmnet() in a list object, here named model. List the model's estimated coefficients with coef(), which returns the intercept along with the slope coefficients stored in the beta component of the list.
model <- glmnet(X, y, alpha=0, lambda=lambdaBest, standardize=TRUE)
coef(model)
11 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.399505099
cyl -0.350861273
disp -0.005087196
hp -0.009188777
drat 0.952590375
wt -0.815179900
qsec 0.148931830
vs 0.860501125
am 1.107400626
gear 0.479472234
carb -0.342755367
Some estimated coefficients are close to 0 but, unlike lasso regularization, none of the predictor variables have disappeared from the model.
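To make the shrinkage concrete, the ridge estimates can be placed next to the corresponding unpenalized OLS estimates from lm(). This comparison is an added sketch, not part of the original analysis, and assumes X and y as defined earlier.
# OLS slope estimates for comparison with the ridge estimates above;
# as.numeric() guards against y stored as a one-column matrix
ols <- coef(lm(as.numeric(y) ~ X))[-1]
names(ols) <- colnames(X)
cbind(OLS=ols, Ridge=coef(model)[-1, 1])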
6.2.2 Evaluate Fit
Obtain the fitted values needed to compute R-squared to evaluate model fit. The function predict.glmnet() calculates the fitted values, \(\hat y\), from the model and the given values of X. Because the fitted model is of class glmnet, refer to the predict function simply as predict() and R will dispatch the appropriate method.
\(R^2\) equals the squared correlation of the actual data values, \(y\), with the fitted data values, \(\hat y\). Compute the correlation with the R function cor(), then square it.
y_hat <- predict(model, X)
rsq <- cor(y, y_hat)^2
rsq
          s0
mpg 0.8424074
\(R^2=\) 0.842 for the ridge regression model with \(\lambda=\) 11.089.
Compared to standard OLS regression on the training data, the Ridge regression \(R^2=\) 0.842 is a little lower than the OLS \(R^2=0.869\), as it must be, given that the OLS model minimizes SSE, the sum of squared errors.
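As a quick check of this comparison, the OLS \(R^2\) can be recomputed directly. This verification is an addition to the original text and assumes X and y as defined earlier.
# R-squared from the unpenalized OLS fit on the same training data;
# should match the 0.869 cited above
summary(lm(as.numeric(y) ~ X))$r.squared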