Lasso and Ridge regression are illustrated here with a data set that ships with R.
2.1 Step 0: Access Needed Packages
The glmnet package is (currently) not included in lessR, so it must be installed and loaded separately. The package provides the function of the same name, glmnet(), which fits both Lasso and Ridge regressions. As with any package that does not ship with R, install it once before first use.
install.packages("glmnet")
Load the libraries needed for the analysis. lessR would ordinarily be used to read data with Read(), although no data is read with any read-type function in this particular example. lessR is also used later in this analysis for the standard OLS (ordinary least squares) regression with Regression(), and its getColors() function provides a broader range of colors for one of the visualizations than the default R colors.
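A minimal sketch of the loading step. The guard with requireNamespace() is an addition for illustration and not part of the original analysis; when both packages are installed, the plain form is simply library(glmnet) followed by library(lessR).

```r
# Packages this walkthrough relies on
pkgs <- c("glmnet", "lessR")

# Which of them are already installed on this machine?
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Attach the installed packages and report any that are missing
for (p in pkgs[installed]) library(p, character.only = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}
```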
2.2 Step 1: Prepare the Data
2.2.1 Get the Data
Instead of reading the data into R from an external data file, get the data from the mtcars data set, included as part of the R download. Access the built-in R data sets with the data() function.
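As a small sketch of this step, data() loads a built-in data set by name into the workspace, and head() displays its first rows. Showing three rows is an arbitrary choice for illustration.

```r
# Load the built-in data set into the workspace
data(mtcars)

# Display the first three rows to see the variables
head(mtcars, n = 3)
```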
Variable  Meaning
--------  ----------------------------------------
mpg       Miles/(US) gallon
cyl       Number of cylinders
disp      Displacement (cu.in.)
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (1000 lbs)
qsec      1/4 mile time
vs        Engine (0 = V-shaped, 1 = straight)
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors
nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11
There are 32 rows of data and 11 columns of data in the mtcars data frame. All variables, continuous or categorical, are numeric.
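The claim that every variable is numeric can be checked directly. The following is one possible check; the name is_num is introduced here only for illustration.

```r
# TRUE for each column that is stored as a numeric variable
is_num <- sapply(mtcars, is.numeric)
is_num

# TRUE only if every column is numeric
all(is_num)

# Compact overview of each variable's type and first values
str(mtcars)
```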
2.2.2 Transform the Data
Because glmnet() is not incorporated into lessR, we need to do the data preparation work. The response variable \(y\) is mpg from the data frame mtcars, to be stored in its own data structure. Store the predictor variables in the X data structure, which includes all the variables in mtcars except mpg.
To implement these data transformations, we optionally use the native R pipe operator, |> (available in R 4.1.0 and later), which passes the data through the transformations from left to right in a single statement. The pipe is more compact and readable than writing each transformation as a separate statement and creating an intermediate data frame at each step, though either approach gets the job done. The result of the expression on the left of a |> becomes the first argument of the function call on the right, so that argument is omitted from the call. The first parameter of both the scale() and as.matrix() functions is the input data, a convention that works well with the pipe operator.
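To make the left-to-right flow concrete, here is a small sketch that computes the same quantity in nested form and in piped form. Rounding the mean of mpg is an arbitrary computation chosen for illustration; it is not part of the analysis.

```r
# Nested form: read from the inside out
nested <- round(mean(mtcars$mpg), digits = 1)

# Piped form: read from left to right
piped <- mtcars$mpg |> mean() |> round(digits = 1)

# Both forms compute the same value
identical(nested, piped)
```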
Obtain the data for the \(y\) and X data structures by extracting the information from columns of the mtcars data frame using the R function subset(). The parameter select indicates the columns to select. The - indicates all variables except the specified variable.
The values of \(y\) should be centered about 0 for the regularization analysis. Center the values of \(y\) by calculating the mean deviations with the R scale() function so that the transformed \(y\) has a mean of 0. Unless all the predictor variables are expressed on the same scale, they should be standardized. The predictor variables, X, are standardized within the modeling function glmnet(), which does so by default (standardize = TRUE). [See Section 11.3.2 for a discussion of standardization.]
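The centering step is ordinary mean deviation. As a sketch, the following compares a hand computation with scale(); the names mpg, y_manual, and y_scaled are introduced here only for illustration.

```r
mpg <- mtcars$mpg

# Mean deviations computed by hand
y_manual <- mpg - mean(mpg)

# Same centering with scale(); scale = FALSE leaves the spread unchanged
y_scaled <- scale(mpg, center = TRUE, scale = FALSE)

# The two results agree, and centering does not change the standard deviation
all.equal(y_manual, as.vector(y_scaled))
all.equal(sd(mpg), sd(y_scaled))
```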
Input the data to glmnet() in the form of the X and \(y\) data structures. These structures must be of type matrix, not data frame. A matrix is a simpler structure than a data frame: it still has rows and columns, but all data values are of the same type, in this case, numeric. Use the as.matrix() function to convert X and \(y\) from type data frame to type matrix.
y <- subset(mtcars, select=mpg) |> scale(center=TRUE, scale=FALSE) |> as.matrix()
X <- subset(mtcars, select=-mpg) |> as.matrix()
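The single-type constraint of a matrix can be seen with a small, made-up data frame. The objects df_mixed, m_mixed, and m_num below are hypothetical and serve only as a sketch of how as.matrix() behaves.

```r
# A small, made-up data frame with one numeric and one character column
df_mixed <- data.frame(size = c(1.5, 2.5), label = c("a", "b"))

# Converting to a matrix forces a single type: every value becomes character
m_mixed <- as.matrix(df_mixed)
class(m_mixed[1, "size"])

# An all-numeric data frame, such as mtcars, converts to a numeric matrix
m_num <- as.matrix(mtcars)
is.numeric(m_num)
```

Because every column of mtcars is numeric, the conversion in this analysis loses no information.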
When selecting the predictor variables, the features, several different expressions can serve as the value of the select parameter. The -mpg indicates to select all variables except mpg. To be more selective, specify the range of adjacent X variables, cyl:carb. Or, for the most customized selection, form the vector of selected variables manually with the R c() function, for example, c(cyl, disp, carb) to select just three of the variables.
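The three forms of select described above can be sketched side by side. The object names X_all, X_range, and X_three are introduced here only for illustration.

```r
# All variables except mpg
X_all <- subset(mtcars, select = -mpg)

# A range of adjacent columns, from cyl through carb
X_range <- subset(mtcars, select = cyl:carb)

# A hand-picked set of columns
X_three <- subset(mtcars, select = c(cyl, disp, carb))

# The first two forms select the same ten columns; the third keeps three
identical(names(X_all), names(X_range))
names(X_three)
```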
The advantage of submitting the data as a set of matrices is that the specific equation for the model need not be written out, with each predictor name joined to the next by a + sign. Instead, tens if not hundreds of predictor variables can be entered by collecting them in one matrix with a few basic data wrangling instructions. The same logic would apply if data frames could be submitted instead of matrices.
2.2.3 Verify the Transformations
As always, before plunging into the analysis, confirm that you are analyzing the data you intended to analyze.
Check that \(y\) was properly mean-deviated to have a mean of 0. Expect a value that differs from 0 only by floating-point rounding error.
mean(y)
[1] 0.000000000000004302114
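The tiny nonzero value is floating-point rounding error, not a failure of the centering. One way to make the check explicit, as a sketch, is to compare with 0 up to a numerical tolerance using all.equal() instead of testing for exact equality.

```r
# Recreate the centered response as in the text
y <- subset(mtcars, select = mpg) |>
  scale(center = TRUE, scale = FALSE) |>
  as.matrix()

# Compare with 0 up to numerical tolerance rather than exactly
isTRUE(all.equal(mean(y), 0))
```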
We now have two data structures of R type matrix. The function class() reveals the type of an R object.
class(y)
[1] "matrix" "array"
class(X)
[1] "matrix" "array"
Use class() frequently when doing a more complex data analysis to always understand the types of current data structures in an R session.
Matrices look like data frames, with the constraint that all data values are of the same type. The as.matrix() conversion even retains the row names of the input data frame. As a final check before beginning the analysis, inspect the data values themselves to confirm they are the values you intended to analyze.
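A brief sketch of such a final check on X: view the first few rows and confirm that the row names carried over and that mpg is no longer among the columns. Showing three rows is an arbitrary choice.

```r
# Recreate the predictor matrix as in the text
X <- subset(mtcars, select = -mpg) |> as.matrix()

# First three rows; the car names are carried over as row names
head(X, n = 3)

# Row names match the original data frame, and mpg is not a column of X
identical(rownames(X), rownames(mtcars))
"mpg" %in% colnames(X)
```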