2  Data

Lasso and Ridge regression are illustrated here with a data set included in the standard R installation.

2.1 Step 0: Access Needed Packages

Currently, glmnet is not included in lessR, so it must be installed and loaded separately. The glmnet package provides a function of the same name, glmnet(), that performs both Lasso and Ridge regression. As with any package that is not part of the standard R installation, glmnet must be installed once before it can be accessed.

install.packages("glmnet")

Load the libraries needed for the analysis. lessR would ordinarily be used to read data with Read(), although this example does not read data with any read-type function. lessR is used later in this analysis for the standard OLS (ordinary least squares) regression with Regression(), and its getColors() function provides a broader range of colors for one of the visualizations than the default R colors do.
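
The load step can be sketched as follows, assuming both packages are already installed. The two library() calls are inferred from the packages named in the text rather than copied from the original analysis.

```r
# Attach the packages for the current R session (assumes prior installation).
library(glmnet)  # glmnet() for the Lasso and Ridge regressions
library(lessR)   # Regression() for OLS and getColors() for color palettes
```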

2.2 Step 1: Prepare the Data

2.2.1 Get the Data

Instead of reading the data into R from an external data file, get the data from the mtcars data set, included as part of the R download. Access the built-in R data sets with the data() function.

data(mtcars)
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Variable  Meaning
--------  ----------------------------------------
mpg       Miles/(US) gallon
cyl       Number of cylinders
disp      Displacement (cu.in.)
hp        Gross horsepower
drat      Rear axle ratio
wt        Weight (1000 lbs)
qsec      1/4 mile time
vs        Engine (0 = V-shaped, 1 = straight)
am        Transmission (0 = automatic, 1 = manual)
gear      Number of forward gears
carb      Number of carburetors

nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11

The mtcars data frame has 32 rows and 11 columns. All variables, whether continuous or categorical, are stored as numeric.
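
One way to confirm that every column is numeric is sketched below, using only base R. The name numeric_cols is hypothetical, chosen for this illustration.

```r
data(mtcars)

# TRUE for each column stored as a numeric vector
numeric_cols <- sapply(mtcars, is.numeric)
all(numeric_cols)

# Count the distinct values in each column; a small count suggests
# a categorical variable stored with numeric codes, such as vs or am
sapply(mtcars, function(col) length(unique(col)))
```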

2.2.2 Transform the Data

Because glmnet() is not incorporated into lessR, we do the data preparation ourselves. The response variable \(y\) is mpg from the data frame mtcars, to be stored in its own data structure. Store the predictor variables in the X data structure, which includes all the variables in mtcars except mpg.

To implement these data transformations, we optionally use the R pipe operator, |>, which passes the data through the transformations from left to right in a single statement. The pipe provides a more compact and readable expression than listing each transformation separately and creating a separate data frame at each step, though either approach gets the job done. The value of the expression on the left of a |> becomes the first argument of the function call on the right, so that argument is omitted from the function call itself. The first parameter of both the scale() and as.matrix() functions is the input data, a convention that works well with the pipe operator.
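
The equivalence of the different styles can be sketched as follows, assuming R 4.1.0 or later, the first release with the native pipe |>. The object names y_nested, y_steps, and y_piped are hypothetical, introduced only for this comparison.

```r
data(mtcars)

# Nested calls: read from the inside out
y_nested <- as.matrix(scale(subset(mtcars, select = mpg),
                            center = TRUE, scale = FALSE))

# Separate steps: one intermediate object per transformation
step1   <- subset(mtcars, select = mpg)
step2   <- scale(step1, center = TRUE, scale = FALSE)
y_steps <- as.matrix(step2)

# Piped: read from left to right, with no intermediate objects
y_piped <- subset(mtcars, select = mpg) |>
  scale(center = TRUE, scale = FALSE) |>
  as.matrix()

# All three approaches produce the same object
identical(y_nested, y_piped)
identical(y_steps, y_piped)
```

The pipe only rewrites how the call is written, so both comparisons are expected to return TRUE.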

Obtain the data for the \(y\) and X data structures by extracting the information from columns of the mtcars data frame using the R function subset(). The parameter select indicates the columns to select. The - indicates all variables except the specified variable.

The values of \(y\) should be centered about 0 for the regularization analysis. Center the values of \(y\) by calculating the mean deviations with the R scale() function, with center=TRUE and scale=FALSE, so that the transformed \(y\) has a mean of 0. Unless all the predictor variables are expressed on the same scale, they should be standardized. The predictor variables, X, are standardized within the modeling function glmnet(), which does so by default. [See Section 11.3.2 for a discussion of standardization.]
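
As a small illustration of what centering does, the sketch below compares scale() with a manual subtraction of the mean. The names y_centered and y_manual are hypothetical.

```r
data(mtcars)

# Centering with scale(): subtract the mean, do not divide by the sd
y_centered <- scale(mtcars$mpg, center = TRUE, scale = FALSE)

# The same mean deviations computed by hand
y_manual <- mtcars$mpg - mean(mtcars$mpg)

# Equal within numerical tolerance
all.equal(as.numeric(y_centered), y_manual)

# scale() records the mean it removed as an attribute
attr(y_centered, "scaled:center")
```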

Input the data to glmnet() in the form of the X and \(y\) data structures. These structures must be of type matrix, not data frame. A matrix is a simpler structure than a data frame: it still has rows and columns, but all data values are of the same type, in this case numeric. Use the as.matrix() function to convert X and \(y\) from type data frame to type matrix.
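
The single-type constraint of a matrix can be seen in a minimal sketch with two small, made-up data frames, df_numeric and df_mixed, which are not part of the analysis.

```r
# Hypothetical toy data frames, for illustration only
df_numeric <- data.frame(a = 1:2, b = c(0.5, 1.5))
df_mixed   <- data.frame(a = 1:2, label = c("x", "y"))

# All-numeric columns convert to a numeric matrix
is.numeric(as.matrix(df_numeric))

# One character column forces every value to character
is.character(as.matrix(df_mixed))
```

Because every column of mtcars is numeric, the conversion of X and \(y\) yields numeric matrices.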

y <- subset(mtcars, select=mpg) |> scale(center=TRUE, scale=FALSE) |> as.matrix() 
X <- subset(mtcars, select=-mpg) |> as.matrix()

When selecting the predictor variables, the features, several different expressions can serve as the value of the select parameter. The -mpg indicates to select all variables except mpg. To be more selective, indicate a range of adjacent X variables, such as cyl:carb. Or, for the most customized selection, form the vector of selected variables manually with the R c() function, for example, c(cyl, disp, carb) to select just three of the variables.
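
The three forms of the select parameter can be compared in a short sketch, assuming R 4.1.0 or later for the pipe. The names X_all, X_range, and X_three are hypothetical.

```r
data(mtcars)

# All variables except mpg
X_all <- subset(mtcars, select = -mpg) |> as.matrix()

# A range of adjacent variables, here the same ten predictors
X_range <- subset(mtcars, select = cyl:carb) |> as.matrix()

# A hand-picked vector of variables
X_three <- subset(mtcars, select = c(cyl, disp, carb)) |> as.matrix()

identical(X_all, X_range)  # the first two forms select the same columns
colnames(X_three)
```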

The advantage of submitting the analysis as a set of matrices is that the specific equation for the model, in which each variable name is preceded by a + sign, need not be entered. Instead, tens if not hundreds of predictor variables can be entered by including them in one matrix defined by some basic data wrangling instructions. The same logic would apply if data frames could be submitted instead of matrices.
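
For contrast, the sketch below shows the explicit model equation that the matrix input avoids typing. It uses the base R functions reformulate() and lm() rather than the lessR Regression() function, an assumption made only to keep the example free of package dependencies. The names predictors, explicit_formula, and fit_ols are hypothetical.

```r
data(mtcars)

# Names of all the predictor variables
predictors <- setdiff(names(mtcars), "mpg")

# Build the formula with every predictor written out and joined by +
explicit_formula <- reformulate(predictors, response = "mpg")
explicit_formula

# An OLS fit with the explicit formula: intercept plus ten slopes
fit_ols <- lm(explicit_formula, data = mtcars)
length(coef(fit_ols))
```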

2.2.3 Verify the Transformations

As always, before plunging into the analysis, confirm that you are analyzing the data you intended to analyze.

Check that \(y\) was properly mean-deviated to have a mean of 0. The computed mean equals 0 to within floating-point rounding error.

mean(y)
[1] 0.000000000000004302114
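
Because the computed mean is a tiny non-zero number, an exact comparison with 0 is unreliable. A tolerance-based check with the base R function all.equal() is sketched below.

```r
data(mtcars)
y <- subset(mtcars, select = mpg) |>
  scale(center = TRUE, scale = FALSE) |>
  as.matrix()

# Compare with 0 within a numerical tolerance instead of using ==
isTRUE(all.equal(mean(y), 0))
```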

We now have two data structures of R type matrix. The function class() reveals the type of an R object.

class(y)
[1] "matrix" "array" 
class(X)
[1] "matrix" "array" 

In a more complex data analysis, use class() frequently to keep track of the types of the data structures in the current R session.

Matrices look like data frames, with the constraint that all data values are of the same type. Converting the input data frame with as.matrix() even retains the data frame row names. View the first rows of each matrix to confirm that the data values are as intended.

head(y)
                        mpg
Mazda RX4          0.909375
Mazda RX4 Wag      0.909375
Datsun 710         2.709375
Hornet 4 Drive     1.309375
Hornet Sportabout -1.390625
Valiant           -1.990625
head(X)
                  cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1
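
The visual checks above can be complemented with programmatic ones. The following is a minimal sketch, using base R only, of assertions that stop with an error if X and \(y\) are not in the form described in this section.

```r
data(mtcars)
y <- subset(mtcars, select = mpg) |>
  scale(center = TRUE, scale = FALSE) |>
  as.matrix()
X <- subset(mtcars, select = -mpg) |> as.matrix()

# Each condition must be TRUE, otherwise stopifnot() raises an error
stopifnot(
  is.matrix(X), is.matrix(y),           # matrices, not data frames
  is.numeric(X), is.numeric(y),         # all values numeric
  nrow(X) == nrow(y),                   # one response per row of predictors
  identical(rownames(X), rownames(y)),  # rows refer to the same cars
  !anyNA(X), !anyNA(y)                  # no missing values
)

dim(X)
dim(y)
```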