7  Pre-Process Features

7.1 Linear Transformation

7.1.1 Definition

Before conducting the regression analysis, the data may be processed in various ways to enhance the analysis. Pre-processing typically includes an evaluation of missing data: either drop columns or rows with too much missing data, or impute the missing values when they occur infrequently. Pre-processing should also include a search for outliers, which may involve preliminary regression analyses to obtain influence statistics such as Cook's Distance.

Another type of data pre-processing involves data transformations. One such transformation is re-scaling the metric of a variable. Re-scaling is a linear transformation of the data.

Linear transformation: Express the transformed variable as a linear function of the input variable: multiply each data value by a constant weight, then add another constant.

A linear function transforms the input \(x\) to the output \(y\) with \(y=b_0 + b_1x\). An example is converting a measurement of length from inches to feet, in which each data value is divided by 12, that is, \(b_1= 1/12\) and \(b_0=0\). Another example is the conversion of Fahrenheit temperature to Celsius: Subtract 32 from each data value, then divide by 1.8, that is, \(b_1=1/1.8\) and \(b_0=-32/1.8\). A key aspect of a linear transformation is that the shape of the data remains unchanged.
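To illustrate the temperature conversion with R, here is a minimal sketch with a small vector of hypothetical Fahrenheit temperatures.

fahrenheit <- c(32, 68, 98.6, 212)
celsius <- (fahrenheit - 32) / 1.8   # linear transformation: b1 = 1/1.8, b0 = -32/1.8
celsius
[1]   0  20  37 100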

7.1.2 Impact on Regression

Consider the impact of a linear transformation of a predictor variable on the regression analysis. In general, the predictor variables are measured in different units. Consider the example of predicting MPGhiway from various physical dimensions of the cars, here Weight in lbs and Wheelbase in inches. Which predictor variable is the most important? The estimated slope coefficients cannot be compared directly because of the different units of measurement.

In this example, save the output of reg_brief() to an R data structure called a list, which is the form of output of the lessR Regression() function. Here name the output list r. The advantage of writing output to the list structure is the ability to display only one component of that list. Here display only the estimated model output, referenced by its component name out_estimates. (To see all the available components, enter names(r). To access more details, view the manual by entering help(reg).)

Because we are only interested here in the estimated model, we can save a little computation time by using reg_brief(), the brief form of reg().

d <- Read("Cars93", quiet=TRUE)
r <- reg_brief(MPGhiway ~ Weight + Wheelbase, graphics=FALSE)
r$out_estimates
             Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
(Intercept)    29.850      7.013    4.256    0.000      15.918      43.783
     Weight    -0.010      0.001   -9.569    0.000      -0.012      -0.008
  Wheelbase     0.298      0.093    3.192    0.002       0.113       0.484

For this (training) data set, for each additional pound of Weight, with Wheelbase held constant, on average, MPGhiway decreases 0.010 MPG. For each additional inch of Wheelbase, with Weight held constant, MPGhiway increases 0.298.
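As an illustration of these estimates, consider a hypothetical car that weighs 3000 lbs with a 100 inch wheelbase. The rounded coefficients yield a fitted value of

\[\hat{Y}_{MPGhiway} = 29.850 - 0.010(3000) + 0.298(100) = 29.65\]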

These results are useful, but the sizes of the corresponding partial slope coefficients are arbitrary because the units of each variable are arbitrary. Consider re-scaling Weight from lbs to tons. Copy the d data frame to the d_tons data frame, leaving the original data frame unmodified. Divide Weight by 2000 in the new data frame to convert to tons.

d_tons <- d
d_tons$Weight <- d_tons$Weight / 2000

Re-do the regression analysis with the new data frame, in which Weight is measured in tons. To isolate just the regression estimates part of the output, save the output of reg_brief() into an object called r_tons and then display only the named component out_estimates. Also, turn off graphics.

r_tons <- reg_brief(MPGhiway ~ Weight + Wheelbase, graphics=FALSE, data=d_tons)
r_tons$out_estimates
             Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
(Intercept)    29.850      7.013    4.256    0.000      15.918      43.783
     Weight   -20.663      2.159   -9.569    0.000     -24.953     -16.373
  Wheelbase     0.298      0.093    3.192    0.002       0.113       0.484

With the re-scaling of Weight, its slope coefficient changed from -0.010 to -20.663. That is, for each additional ton of Weight, with Wheelbase held constant, on average, MPGhiway decreases 20.663 MPG. The reason for the dramatically larger magnitude of the partial slope coefficient for Weight measured in tons is that 1 ton is dramatically heavier than 1 lb. Fewer tons than lbs are needed to accomplish the same reduction in MPGhiway.

The transformation of Weight from lbs to tons is an example of a linear transformation. Express the transformation from lbs to tons as,

\[X_{tons} = 0 + (\dfrac{1}{2000})X_{lbs}\]

When converting lbs to tons, the linear transformation coefficients are \(b_0=0\) and \(b_1 = \dfrac{1}{2000}\). The useful property of a linear transformation is that fundamental properties of the variable as it relates to other variables are preserved. For example, notice that the \(t\)-values, and so also the corresponding \(p\)-values of the hypothesis tests of \(\beta=0\), are the same regardless of the units in which the predictor variables are expressed. For either scaling of Weight, the \(t\)-value is \(-9.569\).
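The invariance follows because rescaling a predictor rescales its slope estimate and its standard error by the same factor, which then cancels from the \(t\)-ratio. For the conversion of lbs to tons, each value of Weight is divided by 2000, so the estimate and its standard error are both multiplied by 2000:

\[t = \dfrac{2000 \, b_{lbs}}{2000 \, SE_{lbs}} = \dfrac{b_{lbs}}{SE_{lbs}} = -9.569\]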

Other linear transformations rescale all the predictor variables so that they are all assessed on the same scale, as explained next.

7.2 Standardization

7.2.1 Standardize with \(z\)-values

One popular method for expressing the predictor variables (features) in the same metric converts each value of a feature to the number of standard deviations the value is from the feature’s mean. How many standard deviations is a value from its mean? To express the distance of the \(i^{th}\) data value from its mean in terms of standard deviations:

  • population: \(z_i = \dfrac{Y_i - \mu}{\sigma}\), for a given distribution, with constant mean, \(\mu\), and standard deviation, \(\sigma\)
  • sample: \(z_i = \dfrac{Y_i - m}{s}\), for a given distribution, with sample mean, \(m\), and standard deviation, \(s\)

Regardless of whether the original measures of Y are in dollars or kilograms, the corresponding \(z\)-values are expressed in terms of standard deviations.

Standardized values are expressed in terms of standard deviations in place of the original units of measurements.

The \(z\)-value for a data value is the same regardless of the scale of measurement in which the data value was measured.

Each individual value from any distribution for generic variable Y can be rescaled to a \(z\)-value, that is, standardized, providing two measurement scales, Y and Z. The concept of standardization applies to any distribution, but the standardized normal distribution is particularly useful because of the well-known normal curve probabilities.
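For example, the well-known normal curve probabilities are directly available in R. Assuming a standardized normal distribution, the probability of obtaining a \(z\)-value less than 2.24 follows from the base R pnorm() function.

pnorm(2.24)   # probability of z < 2.24, approximately 0.987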

To illustrate, compare scores from two different tests in two different ways.

Absolute position: Assessment of the position of one value in a distribution of values in terms of its magnitude, irrespective of the other values within the distribution.

Relative position: Assessment of the position of one value in a distribution of values compared to the position of the other values within the distribution.

Each of two groups of 18 newly hired employees was administered a different performance evaluation test, each test with a different number of items and a different standard deviation. Call the tests Test A and Test B. Scores on Test A, Variable \(Y_A\), ranged from 54 to a perfect score of 60, with \(m=56\) and \(s=1.782\). Scores on Test B, Variable \(Y_B\), ranged from 23 to a perfect score of 80, with the same \(m=56\), but a much larger standard deviation, \(s=16.606\).

Data: http://web.pdx.edu/~gerbing/data/TestScores.csv

How can scores be compared across the two tests? Get standardized values, \(z\)-values, with the R scale() function. Create the variable of \(z\)-values, YA.z, within the d data frame.

d <- Read("http://web.pdx.edu/~gerbing/data/TestScores.csv")
d$YA.z <- scale(d$YA)

Consider the test scores from Test A, Variable YA.

   YA  YA.z
 ----  ----
 1 60  2.24
 2 59  1.68
 3 58  1.12
 4 57  0.56
 5 57  0.56
 6 57  0.56
 7 56  0.00
 8 56  0.00
 9 56  0.00
10 56  0.00
11 56  0.00
12 55 -0.56
13 55 -0.56
14 54 -1.12
15 54 -1.12
16 54 -1.12
17 54 -1.12
18 54 -1.12

The first, and highest, score is \(Y_1=60\), with a corresponding \(z\)-score of,

\[z_1 = \dfrac{Y_1-m}{s}= \dfrac{60-56}{1.782} = 2.24\]

The test score of 60 is 2.24 standard deviations above the mean. Similarly, the lowest score, \(Y_{18}=54\), is 1.12 standard deviations below the mean. Everyone did well in absolute terms, with scores ranging from 90% to 100% correct. For the \(z\)-scores, however, the lowest scores of 90% correct lie more than a full standard deviation below the mean. Relative to the other test takers, a score of 90% indicates poor performance.
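To verify this computation directly in R, assuming the test scores remain in the d data frame:

(60 - mean(d$YA)) / sd(d$YA)   # 2.24, the same value as from scale()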

Now consider the scores of Test B, Variable \(Y_B\).

d$YB.z <- scale(d$YB)
   YB  YB.z
 ----  ----
 1 80  1.45
 2 80  1.45
 3 78  1.32
 4 74  1.08
 5 69  0.78
 6 65  0.54
 7 62  0.36
 8 60  0.24
 9 55 -0.06
10 54 -0.12
11 52 -0.24
12 52 -0.24
13 47 -0.54
14 45 -0.66
15 43 -0.78
16 36 -1.20
17 33 -1.39
18 23 -1.99

These test scores are quite variable, with two people attaining perfect scores of 80 and the lowest performer answering only 23 of the 80 items, for 28.75%. Although the low scores were dramatically low, even the lowest score lies less than two standard deviations below the mean, with \(z_{18}=-1.99\).

Comparing the absolute scores in the two distributions demonstrated two different patterns, one in which everyone did well vs one with considerably more variability. Despite the differences in absolute scores, however, the ranges of \(z\)-scores are reasonably comparable across the two distributions. Converting to standard scores shifts the metric from absolute to relative performance.

The \(z\)-value indicates the relative position of the corresponding data value of variable Y within the distribution.

Assess the absolute position of a value with the original distribution of measurements, \(Y_i\). Or, express a value in terms of the standardized or \(z\)-value, the transformed measurement.

7.2.2 Regression with Standardized Variables

How to compare partial slope coefficients? Standardization provides a way to express all variables in the same unit: the standard deviation. Manually transform each predictor variable with the R function scale(). Or, with less work, set the lessR Regression() function parameter new_scale to "z".
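For reference, here is a sketch of the manual route, assuming the Cars93 data remain in the d data frame. Standardize each predictor with scale(), coerce the result back to a numeric vector, and fit the model as before.

d_z <- d
d_z$Weight <- as.numeric(scale(d_z$Weight))         # z-values of Weight
d_z$Wheelbase <- as.numeric(scale(d_z$Wheelbase))   # z-values of Wheelbase
r_manual <- reg_brief(MPGhiway ~ Weight + Wheelbase, graphics=FALSE, data=d_z)
r_manual$out_estimates

With less work, the new_scale parameter produces the same standardized estimates in a single call.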

r_z <- reg_brief(MPGhiway ~ Weight + Wheelbase, new_scale="z", graphics=FALSE)

Rescaled Data, First Six Rows
        MPGhiway Weight Wheelbase
Integra       31 -0.624    -0.285
Legend        25  0.826     1.621
90            26  0.512    -0.285
100           26  0.563     0.301
535i          30  0.961     0.741
Century       31 -0.327     0.155
r_z$out_estimates
             Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
(Intercept)    29.086      0.310   93.749    0.000      28.469      29.702
     Weight    -6.094      0.637   -9.567    0.000      -7.359      -4.828
  Wheelbase     2.032      0.637    3.190    0.002       0.767       3.297

The interpretation follows the same pattern as before standardization, but with different units. For this (training) data set, for each additional standard deviation of Weight, with Wheelbase held constant, on average, MPGhiway decreases 6.094 MPG. For each additional standard deviation of Wheelbase, with Weight held constant, MPGhiway increases 2.032. Now the coefficients may be compared to each other because they are expressed in terms of the same unit: standard deviations.

To illustrate further, add a predictor variable of little relevance, RPM, to the model.

r_z2 <- reg_brief(MPGhiway ~ Weight + Wheelbase + RPM, new_scale="z", graphics=FALSE)

Rescaled Data, First Six Rows
        MPGhiway Weight Wheelbase    RPM
Integra       31 -0.624    -0.285  1.708
Legend        25  0.826     1.621  0.368
90            26  0.512    -0.285  0.368
100           26  0.563     0.301  0.368
535i          30  0.961     0.741  0.703
Century       31 -0.327     0.155 -0.135
r_z2$out_estimates
             Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
(Intercept)    29.086      0.312   93.229    0.000      28.466      29.706
     Weight    -6.092      0.641   -9.501    0.000      -7.366      -4.818
  Wheelbase     2.039      0.656    3.110    0.003       0.736       3.342
        RPM     0.018      0.355    0.051    0.959      -0.688       0.724

With the standardized solution, the corresponding RPM coefficient, 0.018, is considerably smaller in magnitude than the more impactful coefficients of Weight and Wheelbase. An increase of one standard deviation of RPM, with Weight and Wheelbase held constant, has little impact on average MPGhiway.

7.3 Other Rescalings

7.3.1 Min-Max Scaling

A common rescaling used in machine learning transforms the variables to have the same minimum and maximum values.

Range of a variable: Maximum value minus the minimum value.

Min-Max Scaling: Convert all data values for a variable so that the minimum value is 0 and the maximum value is 1.

Write this transformation of the values of variable \(x\) as:1 \[y = \frac{x-\min(x)}{\max(x)-\min(x)} = \frac{x-\min(x)}{\operatorname{range}(x)}\] Because \(y = x/\operatorname{range}(x) - \min(x)/\operatorname{range}(x)\), this is a linear function, \(y = b_0 + b_1x\), with \(b_0=-\min(x)/\operatorname{range}(x)\) and \(b_1=1/\operatorname{range}(x)\).

1 Python sklearn: The MinMaxScaler provides the needed transformation from the minimum and maximum values of each variable to which the transformation is applied. To accomplish the min-max rescaling with the lessR Regression() function, set parameter new_scale to "0to1".
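To express the transformation directly in R, here is a minimal sketch that defines a min-max function and applies it to Weight in a copy of the d data frame, assuming the Cars93 data remain loaded.

min_max <- function(x) (x - min(x)) / (max(x) - min(x))
d_mm <- d
d_mm$Weight <- min_max(d_mm$Weight)   # minimum Weight becomes 0, maximum becomes 1
range(d_mm$Weight)
[1] 0 1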

7.3.2 Robust Standardization

Robust scaling resembles standardization, except that it is more robust to the presence of outliers. That is, outliers do not change the resulting scaled values as dramatically as with standardization, in which an outlier can have a large impact on the mean and an even larger impact on the standard deviation (which depends on squared deviations from the mean).

Robust scaling accomplishes this robustness by replacing the mean with the more robust median, and the standard deviation with the more robust interquartile range.2 The median is the second quartile, and the IQR is the difference between the third and first quartiles. Unlike the mean and standard deviation, the quartiles remain the same no matter how extreme a few values in the distribution become.

2 To accomplish the robust re-scaling with the lessR Regression() function, set parameter new_scale to "robust".

\[zr_i = \dfrac{Y_i - \text{median}}{\text{IQR}}\]

For a given distribution, the interquartile range, IQR, remains the same if the largest value of the distribution triples in size, while the standard deviation increases, dramatically so for a small data set.
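As with min-max scaling, the computation reduces to a one-line R function, sketched here with the base R median() and IQR() functions and applied to Weight in a copy of the d data frame, assuming the Cars93 data remain loaded.

robust_scale <- function(x) (x - median(x)) / IQR(x)
d_rb <- d
d_rb$Weight <- robust_scale(d_rb$Weight)   # robust z-values of Weight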