5  Compare Lasso and OLS

As we saw, Lasso does an excellent job of variable selection, but so does best-subsets selection, available for example in the lessR Regression() function. The advantage of Lasso lies in the analysis of very large models with hundreds of predictor variables, models that do occur in the era of big data.
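For comparison, here is a minimal sketch of a best-subsets run with lessR Regression(); the mtcars data and this particular predictor set are assumptions for illustration, not the analysis reported above.

```r
# Best-subsets variable selection via lessR Regression()
# (sketch only; mtcars and the predictor list are illustrative assumptions)
library(lessR)

d <- mtcars   # lessR functions reference the data frame d by default
Regression(mpg ~ cyl + disp + hp + drat + wt + qsec)
```

With multiple predictors, the Regression() output includes a best-subsets comparison of candidate models alongside the usual fit statistics.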

The most intriguing advantage of Lasso regression is improved fit, MSE and \(R^2\), on new data relative to OLS regression, though that advantage was not demonstrated in this analysis. The value of \(\lambda\) was derived from \(k\)-fold cross-validation, assessed against the corresponding values of MSE. The reported \(R^2\) statistics, however, describe the fit of the model to the training data. What we do not see from the glmnet() output is the value of \(R^2_{PRESS}\), in which each observation is predicted from a model fit to the remaining \(n-1\) observations, that is, leave-one-out cross-validation.
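The sketch below shows how such a \(\lambda\) is typically selected with cv.glmnet(), which evaluates cross-validated MSE across a sequence of \(\lambda\) values; the data and predictor set are again illustrative assumptions.

```r
# Selecting lambda by k-fold cross-validation with cv.glmnet()
# (sketch only; mtcars and the predictor set are illustrative assumptions)
library(glmnet)

X <- as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")])
y <- mtcars$mpg

set.seed(123)                     # cross-validation folds are random
cv <- cv.glmnet(X, y, alpha = 1)  # alpha = 1 specifies the Lasso penalty

cv$lambda.min   # lambda with the smallest cross-validated MSE
cv$lambda.1se   # largest lambda within one SE of that minimum
plot(cv)        # cross-validated MSE as a function of log(lambda)
```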

The model comparison follows.

The OLS model retains three predictors.

Of course, for the training data, the OLS model is guaranteed to minimize SSE, the sum of squared errors, and therefore to maximize \(R^2\). That result holds regardless of which alternative optimization criterion is chosen, including the Lasso penalty. Of greater interest is a comparison of their \(R^2_{PRESS}\) values, or, better still, a comparison on testing data.
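Because glmnet() does not report it, \(R^2_{PRESS}\) for the OLS model can be computed directly from the leave-one-out (PRESS) residuals, as in the following sketch; the three-predictor formula shown is an assumption for illustration.

```r
# R^2_PRESS for an OLS model from its leave-one-out (PRESS) residuals
# (sketch only; this three-predictor formula is an illustrative assumption)
fit <- lm(mpg ~ cyl + wt + hp, data = mtcars)

press_resid <- residuals(fit) / (1 - hatvalues(fit))  # leave-one-out residuals
PRESS <- sum(press_resid^2)                           # sum of squared LOO errors
SST   <- sum((mtcars$mpg - mean(mtcars$mpg))^2)       # total sum of squares

1 - PRESS / SST          # R^2_PRESS, fit to the left-out observations
summary(fit)$r.squared   # training R^2, always at least as large as R^2_PRESS
```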

The Lasso model retains two predictors.

Interpretation: Two variables are needed to predict miles per gallon: the number of cylinders in the engine and the weight of the vehicle.

The real questions of interest are comparisons across the Lasso and OLS models of:

- their \(R^2_{PRESS}\) values
- their fit, MSE and \(R^2\), to new, testing data

We have no answers to these questions from the output of glmnet(), and so no basis for choosing which model to use to predict fuel mileage. The Lasso technique should provide a better fit to new data, but to demonstrate that result empirically we would need a larger data set and a more complete example with a training/test data split. We do know, however, that OLS wins the training-data fit award, yet typically loses to Lasso on the more important test-data fit award. To optimize the bias-variance trade-off, add a little bias to reduce the variability of the fitted value, \(\hat y\), across data sets.
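A sketch of such a training/test comparison follows; the data set, split proportion, and predictor set are all assumptions for illustration (and a data set as small as mtcars is really too small for the split to be convincing).

```r
# Compare OLS and Lasso fit to new data with a training/test split
# (sketch only; data, split, and predictors are illustrative assumptions)
library(glmnet)

set.seed(123)
n     <- nrow(mtcars)
train <- sample(n, round(0.7 * n))   # 70% training, 30% testing

X <- as.matrix(mtcars[, c("cyl", "disp", "hp", "drat", "wt", "qsec")])
y <- mtcars$mpg

# OLS fit on the training data only
ols      <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars[train, ])
ols_pred <- predict(ols, newdata = mtcars[-train, ])

# Lasso fit on the training data, lambda chosen by cross-validation
cv         <- cv.glmnet(X[train, ], y[train], alpha = 1)
lasso_pred <- predict(cv, newx = X[-train, ], s = "lambda.min")

# Test-set MSE for each model: the smaller value indicates better fit to new data
mean((y[-train] - ols_pred)^2)
mean((y[-train] - lasso_pred)^2)
```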