9  Assumptions and Transformations

9.1 Assumptions

As with any statistical procedure, the validity of the analysis requires satisfying the underlying assumptions. If the assumptions are violated, the output of the analysis cannot be trusted. The estimated coefficients may be biased, the fit indices may be overly generous, or the significance test of each coefficient may have the wrong \(p\)-value.

The assumptions for regression analysis focus on the properties of the residuals.

Core assumption of least-squares regression: Residuals only reflect random, chance fluctuations.

The assumption implies that the model has accounted for all the structure in the data. After the fitted value is computed, all that remains is random, unaccountable error.

Each way in which the residuals can exhibit non-randomness corresponds to a specific assumption. Any systematic content of the residual variable violates one or more of the assumptions. If so, explicitly revise the model to account for this systematic information instead of relegating it to the error term. Often this correction includes adding one or more predictor variables, accounting for a nonlinear relationship, or using an estimation procedure other than least-squares.

The least-squares estimation procedure requires the following three conditions.

  • the average residual value should be zero for each value of the predictor variable
  • the standard deviation of the residuals should be the same, to within sampling error, for each value of the predictor variable
  • residuals do not correlate with any other variable, including, for data values collected over time, with residuals at other time periods

A fourth assumption pertains to the appropriateness of the significance tests of the estimated coefficients.

  • the residuals are normally distributed

Next, examine the meaning and detection of violations of these assumptions.

9.1.1 Assumption 1

For each value of \(\hat y_i\), that is, for each vertical line drawn through the scatter plot, the residuals about the corresponding point on the line should be approximately evenly distributed in the positive and negative regions. Moving across the values of \(x\) from low to high, the patterning of positive and negative residuals should be random, without any discernible structure. As shown below, lack of this randomness can indicate non-linearity in the data.

To facilitate this comparison the graph contains a dotted horizontal line drawn through the origin. If the residuals for individual values of \(\hat y_i\) are not evenly balanced about the horizontal zero line, the relationship between response and predictor variables is likely not linear as specified.

To visually evaluate this assumption for a model with a single feature, examine a scatter plot of the target and feature variables. Figure @ref(fig:quad) shows some very nonlinear data; this figure and the next plot two different summaries through the same data, first a curve of best fit and then the linear regression line.

Plot(x, y, fit="loess", fit_se=0)

Non-linear data with an approximate quadratic curve of best fit.

Figure @ref(fig:quadreg) shows the regression line plotted through these data. The line is flat, \(b_1 = 0\) to within two decimal digits. Note that the analysis correctly shows that there is no linear relationship between the variables \(x\) and \(y\). Note also that no linear relationship does not imply no relationship.

Plot(x, y, fit="lm", fit_se=0)

Non-linear data with best-fitting linear regression line.

Figure @ref(fig:quadreg2) illustrates the non-random structure of the residuals. A residual is defined as \(y_i - \hat y_i\), where \(\hat y_i\) is a point on the regression line. Accordingly, points below the regression line have negative residuals, and points above the line have positive residuals.

Non-linear data with best-fitting linear regression line that displays non-random regions of positive and negative residuals.

Negative residuals relative to the flat regression line predominate for values of \(x\) from a to b in Figure @ref(fig:quadreg2). Positive residuals predominate for all values outside of this region. If the residuals were only due to random influences, their ordering in terms of + and - would be random. Instead, a block of mostly positive residuals precedes a block of mostly negative residuals, which in turn precedes another block of mostly positive residuals.

When there is more than a single feature, how to evaluate the randomness of residuals across the values of the features? Once beyond two dimensions, the scatterplot of the data requires more than a flat surface. Instead of a scatterplot of the data, construct the scatterplot of the fitted values, \(\hat y\), directly against the residuals.

This fitted-residual scatterplot is the standard reference for evaluating the distribution of the residuals. The points in the plot should be symmetrically distributed around a horizontal line through zero. A “bowed” pattern indicates that the model makes systematic errors for unusually large or small fitted values.
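For reference, below is a minimal base R sketch of how such a plot can be constructed by hand, assuming the data are in the data frame d with target y and a feature x (the names are illustrative). As described next, the lessR Regression() function provides this plot automatically.

# Minimal base R sketch of a fitted-residual plot (assumes data frame d
# with target y and feature x; names are illustrative)
fit <- lm(y ~ x, data=d)
plot(fitted(fit), residuals(fit), xlab="fitted values", ylab="residuals")
abline(h=0, lty="dotted")   # horizontal reference line at zero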

Figure @ref(fig:yhat) is standard output from the lessR Regression() function. The output includes not only the scatterplot, but also the best-fitting non-linear curve that summarizes the relationship, as well as a label for the point with the largest Cook’s Distance. The “bowed” pattern is clearly evident in the scatterplot.

Scatterplot of fitted values (yhat) with the residuals.

The pattern of residuals in the more general Figure @ref(fig:yhat) is the same as the scatterplot of the data about the flat regression line in Figure @ref(fig:quadreg2). Accordingly, the data scatterplot need not be separately plotted. Instead, rely upon Regression() to provide the more general scatterplot for evaluation of the patterning of the residuals, such as for the detection of non-linearity.

Of course, another way to detect non-linearity between a feature and the target is to examine the corresponding scatterplot. Even with multiple features, individual feature-target scatterplots can indicate non-linearity. The overall scatterplot of \(\hat y\) with the residuals, however, may expose more general issues with the multiple regression that simultaneously considers all variables.

If non-linearity is an issue, it can be addressed by transforming the values of the affected variables, discussed in the transformation section.

9.1.2 Assumption 2

The second assumption of least-squares regression is a constant population standard deviation of the estimation errors at all values of \(x\), the equal variances assumption. There is one standard deviation of the residuals, \(\sigma_e\), in the population that applies to the variability about all points on the regression line. Estimate this one value, \(\sigma_e\), from the data with \(s_e\).

The one value of the standard deviation of the residuals implies that the value of \(y\) should be no more or less difficult to accurately predict for different values of \(x\). Any difference in the standard deviation of residuals for different values of \(x\) should be attributable only to sampling error.

Homoscedasticity: The standard deviation of the residuals is the same, to within sampling error, for any value of the predictor variable.

Violation of homoscedasticity also has its own name.

Heteroscedasticity: The standard deviation of the residuals differs depending on the value of the predictor variable.

If heteroscedasticity is present, there is no one standard deviation of residuals, but different (population) values at different points along the regression line.

The typical pattern exhibited by heteroscedasticity is a gradually increasing or decreasing variability as \(x\) gets larger or smaller. The scatterplot in Figure @ref(fig:spHetero) illustrates the scenario of increasing variability of \(y\) as the value of \(x\) increases. Accordingly, the size of the residuals about the regression line also tends to increase as the value of \(x\) increases.

Data scatterplot that indicates heteroscedasticity for the case of increasing variability as the value of \(x\) increases.
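To see this pattern in a self-contained way, the following base R sketch simulates data in which the error standard deviation grows with the value of \(x\); all names and numeric values here are illustrative, not part of the example data.

# Simulate heteroscedastic data: the error standard deviation grows with x
set.seed(1)
x_sim <- runif(100, min=1, max=10)
y_sim <- 5 + 2*x_sim + rnorm(100, sd=0.5*x_sim)   # larger x, noisier y
plot(x_sim, y_sim)                                # fan-shaped scatter
abline(lm(y_sim ~ x_sim))                         # least-squares line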

Beyond the scatterplot of the data, Figure @ref(fig:spHetRes) illustrates the scatterplot of the fitted values (\(\hat y\)) with the residuals. This scatterplot reveals the pattern of heteroscedasticity even more starkly than does the data scatterplot.

Scatterplot of fitted values with the residuals that indicates heteroscedasticity.

When heteroscedasticity occurs, the standard errors of the regression coefficients are not correctly estimated, so the associated \(p\)-values and confidence intervals are also incorrect. Further, the standard deviation of the residuals is not a single value, though the estimation algorithm assumes that it is. As a result, some estimated prediction intervals are too wide, and others are too narrow.

There are statistical adjustments that, when applied, can lessen the impact of heteroscedasticity, usually some version of what is called weighted least squares. However, the primary issue, in my opinion, is to understand the reason for the heteroscedasticity. For example, maybe a variable is missing from the model, and its omission results in the heteroscedasticity. To apply statistical adjustments without understanding what contributes to the anomaly avoids the most important issue. Still, if the regression model must be used in the presence of heteroscedasticity, correction methods such as weighted least squares can be appropriate.
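As a sketch only, base R's lm() function accepts a weights argument for weighted least squares. Assuming, for illustration, that the residual variance grows roughly in proportion to \(x^2\), weights of \(1/x^2\) down-weight the noisier observations.

# Hedged sketch of weighted least squares, assuming the residual variance
# increases roughly with x^2 (data frame d with y and x is illustrative)
fit_wls <- lm(y ~ x, data=d, weights=1/x^2)
summary(fit_wls)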

9.1.3 Assumption 3

The third assumption of least-squares estimation is that the residuals are uncorrelated with any other variable, including with each other. Randomness correlates with nothing. For time-series data, the residuals at one set of time periods should not correlate with the same residuals shifted down one or more time periods.

The size of one residual provides no information about the size of a second residual. There should be no trend or pattern exhibited by the residuals, and the residuals should not correlate with the predictor variable.

This randomness of the residuals is the same concept illustrated by flipping a coin. If the first flip of a fair coin is a head, the second flip is just as likely to be a head or a tail.
The accuracy of this uncorrelated residuals assumption merits particular attention for the analysis of data collected over time.

Time-series data: Data collected for a variable over successive time points.

A company’s gross sales on a monthly basis over several years represent time-series data.

How to predict next month’s gross sales? One method is a special form of regression analysis in which the only variable in the analysis is the variable of interest, such as monthly gross sales. The predictive information is the variation of that variable over time.

Time-series regression: Express the target variable as a function of its values at earlier time points.

Usually the time variable is named \(t\).

\[y_{t+1} = b_1 y_{t} + b_2 y_{t-1} + b_3 y_{t-2} + \ldots + e\] In this example, \(y_{t+1}\) is next month’s gross sales, \(y_t\) is the current month’s gross sales, \(y_{t-1}\) is last month’s gross sales, and so on. Each time period also has its own residual, how far the forecast for that time period’s value departed from the actual value.
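As a minimal sketch, a version of this model with a single lag can be estimated with base R by pairing each value with its predecessor; here sales is an assumed name for the vector of monthly gross sales, not a variable from the example data.

# One-lag time-series regression sketch (sales is an assumed numeric vector)
n <- length(sales)
d_lag <- data.frame(y_next=sales[2:n],        # value at time t+1
                    y_now =sales[1:(n-1)])    # value at time t
summary(lm(y_next ~ y_now, data=d_lag))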

The issue is that time series regressions often have correlated adjacent residuals, which represent adjacent time points.

Autocorrelation: Correlation of successive residuals over time.

The reason for autocorrelation in time-series data is that successive values may follow a specific pattern, such as gradually increasing or gradually decreasing. In data such as these, if one predicted value of \(y\) is an underestimate, then the prediction at the next time point is also likely to be an underestimate. Knowledge of the sign of one residual yields predictive information about the sign of the next residual, indicating autocorrelation.

For example, sales of swimwear peak in Spring and Summer and decrease in Fall and Winter. The residuals around a regression line over time would reflect this seasonality, systematically decreasing and increasing depending on the time of year. Analysis of time-oriented data typically requires more sophisticated procedures than simply fitting a regression line to the data.

Autocorrelation in time-series data can often be detected visually by plotting the residuals against time. Any emergent pattern indicates a violation of the assumption. Autocorrelation can be quantified by correlating the residuals with a new variable defined by shifting the residuals up or down one or more time points.

For example, the residual for the 1st month is paired with the residual for the 2nd month, the residual for the 2nd month is paired with the residual for the 3rd month, and so forth. For the assumption not to be violated, the resulting correlation of the two columns of residuals should be approximately zero.

Figure @ref(fig:lag1) shows two variables, the residuals for six time points and their corresponding Lag 1 residuals.

Residuals of six time points and their Lag 1 version.

The same values are in the two columns, except that the values in one column are shifted down one time period. If there is no autocorrelation, the correlation of the two columns will be zero to within sampling error. If the correlation is larger than zero beyond sampling error, then the residuals autocorrelate, in this example a Lag 1 autocorrelation.
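A minimal sketch of this calculation with base R, assuming fit_ts is a previously estimated model object for the time-series data:

# Lag 1 autocorrelation of the residuals (fit_ts is an assumed model object)
res <- residuals(fit_ts)
n <- length(res)
cor(res[1:(n-1)], res[2:n])   # near zero when there is no Lag 1 autocorrelation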

9.1.4 Assumption 4

A fourth assumption of regression is that the residuals are normally distributed for each value of \(x\). This assumption is not needed for the estimation procedure, but is required for the hypothesis tests and confidence intervals previously described. To facilitate this evaluation Regression() provides a density plot and histogram of the residuals, which appears in Figure @ref(fig:norm). The residuals appear to be at least approximately normal, satisfying the assumption.

Distribution of the residuals with best-fitting normal curve and best-fitting general curve.

To assist in the visual evaluation of normality, two density plots are provided. Both the general density curve and the curve that presumes normality are plotted over the histogram.
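For reference, a base R sketch of this check follows, assuming fit is a previously estimated lm() model object; lessR's Regression() produces the comparable plot automatically.

# Histogram of residuals with general and normal density curves
res <- residuals(fit)
hist(res, probability=TRUE, main="", xlab="residuals")
lines(density(res))                              # general density curve
curve(dnorm(x, mean=mean(res), sd=sd(res)),      # normal curve for comparison
      add=TRUE, lty="dashed")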

9.2 Transformations

A linear regression model expresses a linear relationship between the features and the target. What if the relationship between a feature and the target is not linear? We can often still proceed with linear regression by transforming the relevant feature(s), then running the linear regression on the transformed variable(s). Alternatively, transform the target variable, or transform both the features and the target.

How to choose a transformation to obtain linearity? Usually the answer follows from trial and error. Try the different transformations discussed below, searching for the one that most increases fit, usually assessed with \(R^2\).

Consider two fundamental types of transformations.

Linear transformation: Transform the values of a variable with a linear function, multiplying each value by the same constant and adding a constant.

Of course, division is multiplying by the reciprocal, and subtraction is adding a negative number.

One example of a linear transformation is conversion from one measurement scale to another, such as feet to inches, or degrees in Fahrenheit to degrees in Celsius. Another example is standardization, the computation of \(z\)-scores from the variables expressed in their measured units.

A key property of linear transformations is that they preserve the relationships between variables. The relationships are the same before and after a linear transformation; only the units of measurement change. In a scatterplot of the variable with another variable, the plot after a linear transformation would look exactly the same as before, just with different units on the transformed variable’s axis. The correlation of the variable of interest with another variable would also not change.
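A quick illustration with simulated data, where all names and values are illustrative: converting a variable from feet to inches is a linear transformation and leaves its correlation with another variable unchanged.

# A linear transformation (feet to inches) does not change the correlation
set.seed(1)
feet <- rnorm(50, mean=6, sd=1)
other <- 3*feet + rnorm(50)
inches <- 12 * feet        # linear transformation
cor(feet, other)
cor(inches, other)         # same value as cor(feet, other)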

Other transformations do not preserve relationships, and can be useful for that reason.

Nonlinear transformation: Apply a non-linear function to the values of a variable.

Examples of a nonlinear transformation include taking the square root of the values of a variable, or converting the values to their logarithms. Our interest in nonlinear transformations is that the relevant transformation may convert a nonlinear relationship into a linear relationship.

Once transformed to linearity, apply the standard linear regression analysis. The slope coefficients from the analysis of one or more transformed variables necessarily are expressed in the metric of the transformed variables. If studying Gross Sales, for example, we usually care less about the impact of a predictor variable on the square root or the logarithm of Gross Sales in USD than about its impact on Gross Sales in USD directly. For the final interpretation, transform back to the original metric.

9.2.1 Quadratic Transformation

Read and plot the data in Figure @ref(fig:q2), returning to the same data from Figure @ref(fig:quad).

Plot(x, y)

Quadratic data.

The data are extremely non-linear, resembling a quadratic relationship. There is a strong relationship between the variables, but not a linear relationship. As the values of \(x\) increase to about 3.75, the values of \(y\) decrease. As the values of \(x\) increase beyond 3.75, the values of \(y\) increase.

Figure @ref(fig:q1) (and Figure @ref(fig:quadreg)) illustrate the futility of fitting a linear function to these data.

Plot(x, y, fit="lm", fit_se=0)

Scatterplot of quadratic data with a regression line.

The slope of the regression line is 0, as is \(R^2\).

r <- reg(y ~ x, graphics=FALSE)
r$out_fit
Standard deviation of y: 1.831

Standard deviation of residuals:  1.900 for 13 degrees of freedom
95% range of residual variation:  8.210 = 2 * (2.160 * 1.900)

R-squared:  0.000    Adjusted R-squared:  -0.077    PRESS R-squared:  -0.357

Null hypothesis of all 0 population slope coefficients:
  F-statistic: 0.001     df: 1 and 13     p-value:  0.981

How to address this non-linear data to build a predictive model? Continue to use linear regression, but on transformed data. Because the data appear to be related by a quadratic function, define a new variable that is the square of the feature variable.

d$x2 <- d$x^2
head(d)
    x    y   x2
1 1.1 12.9 1.21
2 1.2 11.9 1.44
3 1.9 10.7 3.61
4 2.1 10.9 4.41
5 2.8  7.8 7.84
6 2.8  9.6 7.84

Add that new quadratic variable to the regression equation.

r <- reg(y ~ x + x2, graphics=FALSE)
r$out_fit
Standard deviation of y: 1.831

Standard deviation of residuals:  0.747 for 12 degrees of freedom
95% range of residual variation:  3.253 = 2 * (2.179 * 0.747)

R-squared:  0.857    Adjusted R-squared:  0.834    PRESS R-squared:  0.783

Null hypothesis of all 0 population slope coefficients:
  F-statistic: 36.105     df: 2 and 12     p-value:  0.000

The enhanced model shows dramatically improved fit, from \(R^2 \approx 0\) to \(R^2=0.857\).

Figure @ref(fig:qr) plots the fitted values with the residuals. Unlike the corresponding plot from the linear regression of these data, shown in Figure @ref(fig:yhat), this plot displays a more or less random pattern.

Scatterplot of the fitted values with the residuals with quadratic predictor.

Adding the squared value of \(x\) to the model clearly improves fit, with approximately randomly distributed residuals. The squared term by itself, however, does not yield a viable regression. Both \(x\) and \(x^2\) need to be included.
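As a quick check, re-run the regression with only the squared term, following the same reg() call pattern as above; as just noted, this model does not provide a viable fit.

# The squared term alone, without x, for comparison
r_sq_only <- reg(y ~ x2, graphics=FALSE)
r_sq_only$out_fit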

9.2.2 Exponential Transformation

An exponential function expresses the variable, \(x\), as an exponent of a given constant, \(b\), called the base.

\[y=b^x\]

The same exponential relationship can usually be expressed as an exponential function regardless of the choice of the base. From calculus, the generally most convenient base is Euler’s constant, \(e \approx 2.7182818\). To obtain values of \(e^x\) for given values of \(x\), R provides the exp() function.
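For example, a few values of the exponential function computed with exp():

exp(1)    # Euler's constant e, 2.718282
exp(2)    # e squared, 7.389056
exp(0)    # any base raised to the power 0 is 1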

Exponential functions are characterized by explosive growth, often applied to time. The system grows slowly at first, then grows rapidly, explosively. In particular, the system is non-linear.

Read some sample data and plot.

Plot(x, y)

An exponential relationship can be approximated by a line, but clearly only as a crude approximation.

Plot(x, y, fit="lm", fit_se=0)

How to model this exponential relationship? The issue is that linear regression is only applicable to linear relationships. To perform a least-squares regression analysis on this exponential data, the data must first be transformed to a linear relationship.

Linearize exponential data by taking the logarithm of the data. A logarithm is an exponent. The logarithm of a given number \(a\) is the exponent to which another fixed number, the base \(b\), must be raised to produce that number \(a\). For example:

\[e^2 = 7.389056, \;\; \textrm{so} \;\; \log(7.389056) = 2\]

The value of the exponent that transforms \(e\) to 7.389056 is 2, the logarithm. When the exponential function is based on \(e\), the logarithm is called the natural logarithm. The corresponding R function is log(), which defaults to the base of \(e\). Here, transform the target variable.

d$y.ln <- log(d$y)

The relationship of \(x\) with the logarithmic version of \(y\) is linear, amenable to linear regression. A good predictive model is obtained with the transformed data.

Plot(x, y.ln, fit="lm", fit_se=0)

Of course, this analysis expresses the predicted values from this regression equation in terms of the logarithm of \(y\), not the value of \(y\) directly. After transforming \(y\) to obtain linearity, next transform the obtained solution back to the original metric.

Back Transformation: Transform the values back to the original metric of the data with the inverse of the transformation used.

The back transformation expresses the estimated relationships in the original metrics of the variables as entered into the analysis.

To accomplish the back transformation, first obtain the values of \(b_0\) and \(b_1\) from the regression of the transformed data.

r <- reg_brief(y.ln ~ x, quiet=FALSE, graphics=FALSE)
b0 <- r$coefficients[1]
b1 <- r$coefficients[2]

Then undo the logarithmic transformation by setting the entire regression equation as the exponent. The exponential function is the inverse of the logarithmic function.

d$y_back <- exp(b0 + (b1*d$x))

This back transformation expresses the results of the linear regression directly as an exponential relationship.

Plot(x, y_back)

Even better, plot the obtained curve superimposed on the data.

Plot(x, y, fit="exp")

Instead of generating the entire curve, a specific value of \(y\) can also be predicted with the back transformation, here for \(x=7\).

exp(b0 + (b1*7))
(Intercept) 
   1406.411 

9.2.3 Square Root Transformation

Exponential data show explosive growth as \(x\) increases. A more moderate pattern of increase can often be linearized with a square root transformation.

The scatterplot indicates that \(y\) increases as \(x\) increases, but not as severely as for an exponential relationship.

Plot(x, y)

Still, the relationship is non-linear.

Plot(x, y, fit="lm", fit_se=0)

To proceed with linear regression, first do the square root transformation of \(y\).

d$y.sr <- sqrt(d$y)

With the transformed target variable, linear regression is appropriate.

Plot(x, y.sr, fit="lm", fit_se=0)

To express the obtained relationship in terms of the original metrics of the variables, first obtain the values of \(b_0\) and \(b_1\) from the regression of the transformed data.

r <- reg_brief(y.sr ~ x, quiet=FALSE, graphics=FALSE)
b0 <- r$coefficients[1]
b1 <- r$coefficients[2]

Express results in the original metric by undoing the square root transformation. Square the entire regression equation.

d$y_back <- (b0 + (b1*d$x))^2

Plot just the regression version of the data.

Plot(x, y_back)

Or, plot the regression output as well as the data.

Plot(x, y, fit="sqrt")

Applying the back transformation yields fitted values in the original metric of the non-linear data. This example calculates the fitted value for \(x=6\).

(b0 + (b1*6))^2
(Intercept) 
   32.62516 

9.2.4 Transformation Summary

The process of choosing the most appropriate transformation of the data to achieve linearity usually proceeds according to trial and error. Choose a transformation and see how well it works. Transform the target variable, or transform a feature variable.

Table @ref(tab:regeg) lists several frequently encountered transformations.

Some transformations.

model                 transform        linear regression             fitted
standard              none             y = b0 + (b1)x                ŷ = b0 + (b1)x
Transform y
quadratic             sqrt(y)          sqrt(y) = b0 + (b1)x          ŷ = [b0 + (b1)x]^2
exponential           log(y)           log(y) = b0 + (b1)x           ŷ = exp[b0 + (b1)x]
logarithmic           exp(y)           xxx                           xxx
Transform x
semi-logarithmic      log(x)           y = b0 + (b1)log(x)           ŷ = b0 + (b1)log(x)
Transform x and y
double logarithmic    log(y), log(x)   log(y) = b0 + (b1)log(x)      ŷ = exp[b0 + (b1)log(x)]

Other transformations are possible, but those listed in Table @ref(tab:regeg) are the most frequently encountered in regression analysis. Although the choice of transformation is often based on trial and error, with an optimal transformation, \(R^2\) can increase dramatically. Of course, as always, explore different models with the training data but always verify the choice on testing data.
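As a sketch of this trial-and-error search with base R, assume the data are in the data frame d with positive-valued y and x so that the logarithm and square root are defined; each \(R^2\) describes fit on the scale of its own transformed target, so the comparison is only a rough guide.

# Compare candidate transformations by R-squared (data frame d is assumed)
fits <- list(
  standard = lm(y ~ x,           data=d),
  sqrt_y   = lm(sqrt(y) ~ x,     data=d),
  log_y    = lm(log(y) ~ x,      data=d),
  log_x    = lm(y ~ log(x),      data=d),
  log_log  = lm(log(y) ~ log(x), data=d)
)
round(sapply(fits, function(m) summary(m)$r.squared), 3)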