Before conducting the regression analysis, data may be processed in various ways to potentially enhance the analysis. Pre-processing typically includes an evaluation of missing data: dropping columns or rows with too much missing data, or imputing missing values when they are infrequent. Pre-processing should also include a search for outliers, which may involve some preliminary regression analyses to obtain influence statistics such as Cook's Distance.
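As a brief illustration of these checks, consider the following R sketch, which counts missing values per variable and extracts the largest Cook's Distance values from a preliminary fit. The data frame d and the model formula shown here are placeholders for whatever analysis is under study.

colSums(is.na(d))  # count of missing values for each variable
fit <- lm(MPGhiway ~ Weight + Wheelbase, data=d)  # preliminary fit for influence statistics
head(sort(cooks.distance(fit), decreasing=TRUE))  # most influential cases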
Another type of data pre-processing involves data transformations. One such transformation is re-scaling the metric of a variable. Rescaling is a linear transformation of the data.
Linear transformation: Express the transformed variable as a linear function of the input variable: multiply each data value by one constant, the weight, then add another constant.
A linear function transforms the input \(x\) to the output \(y\) with \(y=b_0 + b_1x\). An example is converting a measurement of length from inches to feet, in which the data values are divided by 12, that is, \(b_1= 1/12\) and \(b_0=0\). Another example is the conversion of Fahrenheit temperature to Celsius: Subtract 32 from each data value, then divide by 1.8, that is, \(b_1=1/1.8\) and \(b_0=-32/1.8\). A key aspect of a linear transformation is that the shape of the data remains unchanged.
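To see the Fahrenheit-to-Celsius transformation in R, apply it to a few sample temperatures (the values here are hypothetical, chosen only to show the conversion):

fahrenheit <- c(32, 68, 98.6, 212)
celsius <- (-32/1.8) + (1/1.8)*fahrenheit  # y = b0 + b1*x with b0=-32/1.8, b1=1/1.8
celsius  # 0 20 37 100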
7.1.2 Impact on Regression
Consider the impact of a linear transformation of a predictor variable on the regression analysis. In general, the predictor variables are measured in different units. Consider the example of predicting MPGhiway from various physical dimensions of the cars, here Weight in lbs and Wheelbase in inches. Which predictor variable is the most important? The estimated slope coefficients cannot be compared directly because of the different units of measurement.
In this example, save the output of reg_brief() to an R data structure called a list, which is the form of output of the lessR Regression() function. Here name the output list r. The advantage of writing output to the list structure is the ability to display only one component of that list. Here display only the estimated model output, referenced by its component name out_estimates. (To see all the available components, enter names(r). To access more details, view the manual by entering help(reg).)
Because we are only interested here in the estimated model, we can save a little computation time by using reg_brief() instead of reg().
d <- Read("Cars93", quiet=TRUE)
r <- reg_brief(MPGhiway ~ Weight + Wheelbase, graphics=FALSE)
r$out_estimates
For this (training) data set, for each additional pound of Weight, with Wheelbase held constant, on average, MPGhiway decreases 0.010 MPG. For each additional inch of Wheelbase, with Weight held constant, MPGhiway increases 0.298 MPG.
These results are useful, but the sizes of the corresponding partial slope coefficients are arbitrary because the units of each variable are arbitrary. Consider re-scaling Weight from lbs to tons. Copy the d data frame to the d_tons data frame, leaving the original data frame unmodified. Divide Weight by 2000 in the new data frame to convert to tons.
d_tons <- d
d_tons$Weight <- d_tons$Weight / 2000
Re-do the regression analysis with the new data frame with Weight measured in tons. To isolate just the regression estimates part of the output, save the output of reg_brief() into an object called r_tons and then display only the named component out_estimates. Also, turn off graphics.
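A sketch of the call just described, assuming the standard lessR data parameter to point the analysis at the new data frame:

r_tons <- reg_brief(MPGhiway ~ Weight + Wheelbase, data=d_tons, graphics=FALSE)
r_tons$out_estimates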
With the re-scaling of Weight, the value of its slope coefficient changed from -0.010 to -20.663. That is, for each additional ton of Weight, with Wheelbase held constant, on average, MPGhiway decreases 20.663 MPG. The reason for the dramatically larger partial slope coefficient for the measurement of Weight in tons is that 1 ton is dramatically heavier than 1 lb. Fewer tons than lbs are needed to accomplish the same reduction in MPGhiway.
The transformation of Weight from lbs to tons is an example of a linear transformation. Express the transformation from lbs to tons as,
\[X_{tons} = 0 + (\dfrac{1}{2000})X_{lbs}\]
When converting lbs to tons, the linear transformation coefficients are \(b_0=0\) and \(b_1 = \dfrac{1}{2000}\). The useful property of a linear transformation is that fundamental properties of the variable as it relates to other variables are preserved. For example, notice that the \(t\)-values, and so also the corresponding \(p\)-values, of the hypothesis tests of \(\beta=0\) are the same regardless of the units in which the predictor variables are expressed. For either scaling of Weight, the \(t\)-value is \(-9.569\).
Other linear transformations rescale all the predictor variables so that they are all assessed on the same scale, as explained next.
7.2 Standardization
7.2.1 Standardize with \(z\)-values
One popular method for expressing the predictor variables (features) in the same metric converts each value of a feature to the number of standard deviations the value is from the feature’s mean. How many standard deviations is a value from its mean? To express the distance of the \(i^{th}\) data value from its mean in terms of standard deviations:
population: \(z_i = \dfrac{Y_i - \mu}{\sigma}\), for a given distribution with mean \(\mu\) and standard deviation \(\sigma\)
sample: \(z_i = \dfrac{Y_i - m}{s}\), for a given distribution with sample mean \(m\) and sample standard deviation \(s\)
Regardless of whether the original measures of Y are in dollars or kilograms, the corresponding \(z\)-values are expressed in terms of standard deviations.
Standardized values are expressed in terms of standard deviations in place of the original units of measurement.
The \(z\)-value for a data value is the same regardless of the scale of measurement in which the data value was measured.
Each individual value from any distribution for generic variable Y can be rescaled to a \(z\)-value, that is, standardized, providing two measurement scales, Y and Z. The concept of standardization applies to any distribution, but the standardized normal distribution is particularly useful because of the well-known normal curve probabilities.
To illustrate, compare scores from two different classroom tests in two different ways.
Absolute position: Assessment of the position of one value in a distribution of values in terms of its magnitude, irrespective of the other values within the distribution.
Relative position: Assessment of the position of one value in a distribution of values compared to the position of the other values within the distribution.
Each of two groups of 18 newly hired employees was administered a different performance evaluation test, each test with a different number of items and standard deviation. Call the tests Test A and Test B. Scores on Test A, Variable \(Y_A\), ranged from 54 to a perfect score of 60, with \(m=56\) and \(s=1.782\). Scores on Test B, Variable \(Y_B\), ranged from 23 to a perfect score of 80, with the same \(m=56\), but a much larger standard deviation, \(s=16.606\).
How can scores be compared across the two tests? Get standardized values, \(z\)-values, with the R scale() function. Create the variable of \(z\)-values, YA.z, within the d data frame.
> d$YA.z <- scale(d$YA)
Consider the test scores from Test A, Variable YA.
The test score of 60 is 2.24 standard deviations above the mean. Similarly, the lowest score, \(Y_{18}\), is 1.12 standard deviations below the mean. Everyone did well in absolute scores, with scores ranging from 90% to 100%. However, for the \(z\)-scores, the lowest scores of 90% correct were over a full standard deviation below the mean. Relative to the other test takers, a score of 90% indicates poor performance.
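These \(z\)-values follow directly from the sample formula with \(m=56\) and \(s=1.782\):

\[z = \dfrac{60 - 56}{1.782} \approx 2.24, \qquad z_{18} = \dfrac{54 - 56}{1.782} \approx -1.12\]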
Now consider the scores of Test B, Variable \(Y_B\).
These test scores are quite variable, with two people getting perfect scores of 80 and the lowest performer only achieving 23 out of 80 items, for 28.75%. Even though the low scores were dramatically low, even the lowest score is less than two standard deviations below the mean, with \(z_{18}=-1.99\).
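The same computation verifies this value, now with \(s=16.606\):

\[z_{18} = \dfrac{23 - 56}{16.606} \approx -1.99\]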
Comparing the absolute scores in the two distributions demonstrated two different patterns, one in which everyone did well vs one with considerably more variability. Despite the differences in absolute scores, however, the ranges of \(z\)-scores are reasonably comparable across the two distributions.
Converting to standard scores shifts the metric from absolute to relative performance.
The \(z\)-value indicates the relative position of the corresponding data value of variable Y within the distribution.
Assess the absolute position of a value with the original distribution of measurements, \(Y_i\). Or, express a value in terms of the standardized or \(z\)-value, the transformed measurement.
7.2.2 Regression with Standardized Variables
How to compare partial slope coefficients? Standardization provides a way to express all variables in the same unit: the standard deviation. Manually transform each predictor variable with the R function scale(). Or, with less work, set the lessR Regression() function parameter new_scale to "z".
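A sketch of the standardized analysis, using the new_scale parameter named above (the object name r_z is arbitrary):

r_z <- reg_brief(MPGhiway ~ Weight + Wheelbase, new_scale="z", graphics=FALSE)
r_z$out_estimates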
The interpretation follows the same pattern as before standardization, but with different units. For this (training) data set, for each additional standard deviation of Weight, with Wheelbase held constant, on average, MPGhiway decreases 6.094 MPG. For each additional standard deviation of Wheelbase, with Weight held constant, MPGhiway increases 2.032 MPG. Now the coefficients may be compared to each other because they are expressed in terms of the same unit: the standard deviation.
To illustrate further, add a non-relevant predictor variable to the model, RPM.
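A sketch of the expanded model, again with standardized predictors (the object name r_z3 is arbitrary):

r_z3 <- reg_brief(MPGhiway ~ Weight + Wheelbase + RPM, new_scale="z", graphics=FALSE)
r_z3$out_estimates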
With the standardized solution, the size of the corresponding RPM coefficient, 0.018, is considerably smaller than the more impactful coefficients of Weight and Wheelbase. An increase of one standard deviation of RPM, with Weight and Wheelbase held constant, has little impact on average MPGhiway.
7.3 Other Rescalings
7.3.1 Min-Max Scaling
A common rescaling used in machine learning transforms the variables to have the same minimum and maximum values.
Range of a variable: Maximum value minus the minimum value.
Min-Max Scaling: Convert all data values for a variable so that the minimum value is 0 and the maximum value is 1.
Write this transformation of the values of variable \(x\) as:1

\[y = \frac{x-\min(x)}{\max(x)-\min(x)} = \frac{x-\min(x)}{\text{range}(x)}\]

Express this as a linear function of \(x\) by setting the parameters to \(b_0=-\dfrac{\min(x)}{\text{range}(x)}\) and \(b_1=\dfrac{1}{\text{range}(x)}\).
1 Python sklearn: The MinMaxScaler provides the needed transformation from the minimum and maximum values of each variable to which the transformation is applied. To accomplish the min-max rescaling with the lessR Regression() function, set parameter new_scale to "0to1".
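The same rescaling can also be done directly in R. A minimal sketch, applied to Weight in a copy of the data frame (the name d_mm is arbitrary):

d_mm <- d
d_mm$Weight <- (d_mm$Weight - min(d_mm$Weight)) / diff(range(d_mm$Weight))
range(d_mm$Weight)  # now 0 to 1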
7.3.2 Robust Standardization
Robust scaling resembles standardization, except that it is more resistant to the presence of outliers. That is, outliers do not change the resulting scaled values as dramatically as with standardization, in which an outlier can have a big impact on the mean and an even bigger impact on increasing the size of the standard deviation (which depends on squared deviation scores).
Robust scaling accomplishes this robustness by replacing the mean with the more robust median, and the standard deviation with the more robust interquartile range.2 The median is the second quartile, and the IQR is the difference between the third and first quartiles. Unlike the mean and standard deviation, the quartiles remain the same no matter how extreme a few values in a distribution become.
2 To accomplish the robust re-scaling with the lessR Regression() function, set parameter new_scale to "robust".
\[zr_i = \dfrac{Y_i - \text{median}}{\text{IQR}}\]
For a given distribution, the inter-quartile range, IQR, remains the same if the largest value of the distribution triples in size, while the standard deviation will increase, dramatically for a small distribution.
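A minimal sketch of robust scaling in R, again on a copy of the data frame (the name d_rb is arbitrary), using the base R median() and IQR() functions:

d_rb <- d
d_rb$Weight <- (d_rb$Weight - median(d_rb$Weight)) / IQR(d_rb$Weight)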