1 Regularization
1.1 Definition
There are several different types of regression analysis that modify the standard least-squares procedure, often called OLS for ordinary least squares. One modification is regularization. The principles of regularization presented here are in the context of regression analysis but apply to supervised machine learning procedures in general.
Regularization is a method that deliberately introduces bias into the least-squares estimation procedure.
Bias: The estimated values, here the slope coefficients of a regression model, are not the optimal estimates for the data on which the model is estimated, the training data.
The result is that a regularized regression model is not optimized to the training data, so its fit to the training data is worse than that of an OLS model.
Why introduce bias? The reason is overfitting, in which the estimated model learns random sampling fluctuations peculiar to the data on which the model was trained. We care about fit on the training data only because some level of training-data fit is necessary for what we ultimately care about: fit on new, previously unseen data. It is on new data that predictions are made. A successful supervised machine learning model optimizes fit on new data, such as the test sets created by \(k\)-fold cross-validation.
The larger a slope coefficient, the more important the corresponding variable is in predicting the target value. The issue is that large slope coefficients can lead to overfitting that results from too much focus on random sampling fluctuations. A regularization method shrinks the slope coefficients, so the model is less likely to rely upon idiosyncratic aspects of the particular sample of training data. Although performance on the training data may be slightly reduced by the added bias, performance on new data should be enhanced, which means the variance across new data sets is reduced.
Variance: The amount by which the estimated model, and consequently its fitted values, changes across different samples of data.
As always with supervised machine learning procedures, search for a solution that optimizes the trade-off between bias and variance. A regularization method adds bias to the minimization procedure, but with the benefit of reducing the size of the estimated slope coefficients which reduces the variance. [See Section 12.5 of the textbook, particularly 12.5.2, for a discussion of the bias-variance trade-off and overfitting.]
Two primary regularization methods are Lasso regression and Ridge regression. Elastic net regression represents a continuum that combines both methods. The historically prior Ridge regression gained widespread use during the last quarter of the 20th century. Lasso regression was not fully developed with accessible software until almost the second decade of the 21st century.
Although both Ridge regression and Lasso regression tend to reduce the size of the regression coefficients, the primary practical distinction between the methods is that Lasso regression potentially reduces some coefficients to zero. Coefficients of zero effectively drop the corresponding feature from the model, so Lasso regression becomes a method of feature selection, retaining only the most valuable predictors. In many situations, Lasso regression is preferred because it tends to retain one of a set of collinear predictor variables and drop the rest. Ridge regression, instead, tends to shrink the estimated coefficients of a collinear set together.
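The difference in behavior with collinear predictors can be illustrated with a small simulation. The sketch below is not part of the formal development; it assumes scikit-learn and NumPy are available, and its simulated data and penalty values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(seed=0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1: a collinear pair
x3 = rng.normal(size=n)                    # an independent predictor
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 2 * x3 + rng.normal(scale=1.0, size=n)

# Standardize so the penalty treats each slope coefficient equally
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 3))  # typically zeros one of x1, x2
print("Ridge coefficients:", np.round(ridge.coef_, 3))  # shrinks x1, x2 toward similar values
```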
Regularization methods minimize the Sum of the Squared Errors (SSE) of standard Ordinary Least Squares (OLS) regression, but with a constraint. Regularization regression is constrained by a term, added to the least-squares minimization function, that includes the weighting parameter lambda, \(\lambda\). The larger the value of \(\lambda\), the more constrained the estimated regression coefficients, and so the greater the impact of the regularization. When performing either method of regularization, generally experiment with different values of \(\lambda\) to identify an optimal value in terms of maximizing fit, subject to validation on test data.
To penalize all the estimated regression coefficients equally, standardize, or at least mean-center, the predictor variables and the response variable before doing the regularization regression.
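As a minimal sketch of this preprocessing step, assuming the predictors and response are held in NumPy arrays named X and y, the standardization can be computed directly:

```python
import numpy as np

def standardize(a):
    """Center each column to mean 0 and scale to standard deviation 1."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

# Illustrative data: X is the n x p matrix of predictors, y the response vector
X = np.array([[2.0, 50.0], [4.0, 60.0], [6.0, 55.0], [8.0, 70.0]])
y = np.array([10.0, 14.0, 15.0, 21.0])

X_std = standardize(X)   # every predictor now on the same scale
y_std = standardize(y)   # response standardized as well (mean-centering alone also works)
```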
1.2 Lasso Regression
Lasso abbreviates the phrase Least Absolute Shrinkage and Selection Operator. Lasso modifies the standard least-squares solution, called OLS for ordinary least squares, by adding bias in the form of a sum of the absolute values of the regression coefficients, weighted by a coefficient called lambda, \(\lambda\).
Lasso regression: Estimate the coefficients of the model by minimizing the sum of the squares of the residuals simultaneously with a weighted sum of the magnitude of the slope coefficients, which shrinks the larger, traditionally derived OLS regression coefficients.
Lasso regression minimizes the sum of squared errors but with a modification: add the constraint of the weighted sum of the magnitude of the slope coefficients to the minimization equation. [See Section 11.3.1 for a discussion of the least-squares criterion.]
\[\textrm{minimize:} \;\; \sum(y_i - \hat y_i)^2 + \lambda \sum |b_j|\]
Machine learning with Lasso regression minimizes the sum of the squared residuals, SSE, but simultaneously with a second constraint: A weighted function of the sum of the magnitude of the slope coefficients. Lasso regression selects slope coefficients as small as possible while also minimizing SSE. There is no direct solution for the Lasso regression slope coefficients, \(b_j\)’s, so Lasso regression implements an iterative approach.
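To make the criterion concrete, the short sketch below evaluates this Lasso objective for one candidate set of coefficients; the function name, data, and coefficient values are assumptions for illustration only.

```python
import numpy as np

def lasso_objective(b0, b, X, y, lam):
    """SSE plus lambda times the sum of the absolute values of the slope coefficients."""
    y_hat = b0 + X @ b                   # fitted values from intercept b0 and slopes b
    sse = np.sum((y - y_hat) ** 2)       # sum of squared residuals
    penalty = lam * np.sum(np.abs(b))    # L1 penalty applied to the slope coefficients only
    return sse + penalty

# Illustrative data and one candidate solution
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 4.0, 8.0, 9.0])
print(lasso_objective(b0=0.5, b=np.array([1.2, 0.8]), X=X, y=y, lam=1.0))
```

In practice an iterative optimizer, typically coordinate descent, searches over the coefficient values to make this quantity as small as possible.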
To do Lasso regression analysis, choose an optimal value of \(\lambda\), a value that results in some shrinkage of the coefficients toward 0 but retains the more effective features.
- If \(\lambda = 0\), the constraint on the sum of the magnitude of the slope coefficients disappears, resulting in the usual least-squares linear regression
- If \(\lambda\) is very large, all regression coefficients shrink toward zero
The higher the value of \(\lambda\), the more the slope coefficients shrink toward 0.
High values of \(\lambda\) can shrink all of the slope coefficients to zero. In practice, usually choose an intermediate value of \(\lambda\) that drops some coefficients and retains others, according to the properties of the specific model estimated. Dropping underperforming predictor variables retains only those coefficients that do contribute to overall fit and predictability.
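As a sketch of this tuning process with scikit-learn, where \(\lambda\) is called alpha and the SSE term is scaled slightly differently, the following simulation (an illustrative assumption, not part of the text) shows coefficients dropping to zero as the penalty grows, then picks a value of alpha by 5-fold cross-validation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LassoCV

# Simulated data: 10 predictors, only 4 of which actually relate to the response
X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=10.0, random_state=1)
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize predictors
y = y - y.mean()                            # mean-center the response

# Larger alpha --> stronger penalty --> more coefficients shrink to exactly zero
for alpha in [0.01, 1.0, 10.0, 100.0]:
    fit = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:7.2f}  nonzero coefficients: {np.sum(fit.coef_ != 0)}")

# Choose alpha by 5-fold cross-validation over a grid of candidate values
cv_fit = LassoCV(alphas=np.logspace(-2, 2, 50), cv=5).fit(X, y)
print("cross-validated alpha:", cv_fit.alpha_)
```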
1.3 Ridge Regression
For variable selection, Lasso regression is generally preferred over Ridge regression, but Ridge is historically prior, so a discussion of the topic is included here for completeness. The Ridge modification to the standard least-squares solution adds bias in the form of a weighted sum of the squared values of the regression coefficients.
Ridge regression: Estimate the coefficients of the model by minimizing the sum of the squares of the residuals along with a weighted sum of the squared slope coefficients, which shrinks the larger, traditional OLS regression coefficients.
The function that is minimized is the standard least-squares function but with a modification: add the constraint of the weighted sum of the squared slope coefficients.
\[\textrm{minimize:} \;\; \sum(y_i - \hat y_i)^2 + \lambda \sum b^2_j\]
Ridge regression prevents extreme values of the coefficients. This method may be preferred when all or most features contribute to overall fit and predictability.
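As a brief sketch with scikit-learn's Ridge estimator, again on assumed simulated data, increasing the penalty shrinks the coefficients toward zero together, but in general none is dropped entirely:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=150, n_features=6, n_informative=6,
                       noise=10.0, random_state=2)
X = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize predictors
y = y - y.mean()                            # mean-center the response

# Coefficients shrink as the penalty grows, but are not set exactly to zero
for alpha in [1.0, 10.0, 100.0, 1000.0]:
    fit = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:8.1f}  coefficients:", np.round(fit.coef_, 2))
```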