Introduction to Machine Learning [Sec 1 (e.g., Linear Regression)]
Section 1 is an overview of supervised machine learning. The most relevant material for this week comes at the beginning of the reading, Subsections 1.1, 1.2, and 1.3, which directly set up the following material on linear models and one form of machine learning, regression analysis, the topic of Section 2. Ultimately, all of Section 1 is relevant to modern machine learning, but feel free to focus on the first three subsections as a basis for understanding the following videos on the linear regression model.
Linear Regression [Sec 2.1 - 2.4]
Linear Models [Sec 2.1, 2.2, 2.3; 21:38]
2.1 The form of the model, a linear prediction equation with slope, b1, and intercept, b0. [1:27]
2.2 Example: two columns of data, for predictor variable x and the variable to be predicted, y. [7:39]
2.3 The estimated model, that is, estimates for the slope and intercept from running computer software to do the regression analysis. [12:28]
Residuals and Estimation [Sec 2.4, 9:17]
How the machine learns: by choosing the values of b0 and b1 that minimize the sum of squared residuals over all rows of data. The learning process is all about how far each forecasted value fitted (computed) by the prediction equation, ŷi, is from the actual value, yi. That difference, yi - ŷi, is the residual for each row of data. A brief sketch of the computation follows.
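Here is a minimal sketch of this least squares computation in R, using toy vectors x and y rather than course data (reg() carries out the equivalent computation internally):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
y_hat <- b0 + b1 * x      # fitted value from the prediction equation, each row
e <- y - y_hat            # residual for each row: y_i - y_hat_i
sum(e^2)                  # the quantity that least squares minimizes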
This week extends the Week 4 material, completing our introduction to regression analysis of one-predictor models. After you estimate the model, now what? Do the following two analyses.
The first topic this week is model fit. It is possible to put a line through any scatterplot, but does the line adequately summarize the data? An excellent-fitting model summarizes the data with little scatter about the line; a poorly fitting model is characterized by much scatter about the line. Always evaluate the fit of the model after the model is estimated, as with the fit indices sketched below.
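As a preview of the Section 2.5 material, here is a minimal sketch of two standard fit indices, continuing with the toy x, y, and y_hat vectors from the sketch above (an illustration, not course output):

sse <- sum((y - y_hat)^2)       # scatter about the line
sst <- sum((y - mean(y))^2)     # scatter about the mean of y
1 - sse / sst                   # R-squared: proportion of variance accounted for
sqrt(sse / (length(y) - 2))     # standard deviation of the residuals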
The second topic of the week illustrates a fundamental statistical and data analysis principle: Always distinguish between describing the sample and generalizing to the population. Accompany every sample estimate with a range of values that most likely includes the true, corresponding population value; it is this range upon which your conclusions are based. Statistical estimates, sample values, do not equal the actual population value. Instead, sample estimates are computed from the somewhat arbitrary sample gathered from the entire population of possibilities. Flip a coin 10 times and get six heads. Flip the same coin another 10 times and get four heads. Each sample contains sampling error, which means each sample estimate fluctuates from sample to sample, as the brief simulation below illustrates. The range of values placed around a sample estimate is applied here in two different ways, which correspond to the two primary reasons for performing a regression analysis.
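The fluctuation of sample estimates is easy to see with a short simulation (illustrative only, not course data):

set.seed(1)                        # arbitrary seed, for reproducibility
rbinom(5, size = 10, prob = 0.5)   # head counts for five samples of 10 flips
# five different estimates from the same fair coin, fluctuating about 5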
Goal #1 of Regression Analysis: Predict the Unknown with ŷ. Virtually every point prediction is literally false, so an interval should accompany every forecasted value. The goal is not to predict the future value exactly but instead to provide the range of values that will likely contain the value when it occurs. This range of values for y is the prediction interval.
Goal #2 of Regression Analysis: Understand Relationships with b1. What are the relationships of the variables in the model to each other and, particularly, to the response variable? The slope coefficient in the model expresses how changes in the predictor variable lead to changes in the response variable, on average. If you increase the shipment weight carried by a vehicle by 1000 pounds, what is the expected fuel mileage decrease? The slope coefficient is another statistical estimate computed from a sample, not the true population value. The interval placed around the sample slope coefficient that likely contains the true population value is the confidence interval of the slope coefficient. Or, use the hypothesis test to evaluate the likelihood that there is no relationship between predictor and response, that is, that the population value of the slope is 0. The logic of the hypothesis test is that if the obtained sample value of the slope coefficient is far from zero, in terms of standard errors, then it is unlikely that zero is the true value.
When we complete our analysis and report our results to the primary stakeholders, we no longer care about the sample values obtained in our one arbitrary sample. Instead, we report results according to the relevant intervals that account for sampling error: the prediction interval for a predicted value, and the confidence interval for a slope coefficient. The computer calculates not only the sample estimates but also the corresponding intervals that likely contain the true value. The confidence interval is also often reported in conjunction with a related analysis called the hypothesis test. Both the confidence interval and the hypothesis test address the issue of the true population value of the slope coefficient, generalizing and moving beyond the obtained sample results.
Note: Pay attention to the above paragraphs. They provide the overall framework for this week's material. Use the subsequent readings and videos to fill in the details within that framework.
Note: The highlighted paragraphs with the pointing hand show how to answer questions from the Regression Template, which becomes your guide for doing regression analysis.
Model Fit [Sec 2.5]
Inference for the slope coefficient [Sec 3]
Prediction intervals [Sec 4.3]
Now, for the first time ever, you have your choice of narrator for Section 2.5, thanks to the wonders of AI. The content is the same for the following three videos, only the narrator differs for each.
This transformation to other voices was not a simple push-the-button process. I had to devise a workflow across multiple technologies and then integrate the results, consuming many CPU cycles. I will likely not be doing any more of these conversions in the near future, but I am interested in your preference.
Generating the AI narratives requires working with the written transcript, obtained in various ways from these technologies. However, that gives me the opportunity to revise the transcript and regenerate the narrative with AI, a totally new process now available with the technology to which I have access. I did this procedure for the AI Nancy narrative, so I have listed that first.
Model Fit with AI Nancy [Sec 2.5, 22:05]
Model Fit with Gerbing [Sec 2.5, 22:40]
Model Fit with AI Eric [Sec 2.5, 22:21]
Evaluate the fit of the model to the data from which the model was estimated (trained). (However, ultimately, fit is assessed on new data, previously unseen by the model, as discussed at the end of the presentation and in Section 1.6.)
Vote for your preference
Inference [Sec 3.2, 12:29]
Explains the conceptual basis of statistical inference, how we generalize from the sample we observe to the population as a whole. The explanation includes the concept of a standard error, by which we evaluate how well our sample estimates represent the corresponding population values of interest.
HT and CI [Sec 3.3, 15:56]
Do the inference with the Hypothesis Test and the Confidence Interval of the slope coefficient.
Example [Sec 3.4, 11:32]
Presents a complete example of statistical inference.
Prediction Intervals [Sec 4.3, 17:52]
A forecast is not a single specific value but an interval about that value, the prediction interval, explained in this video.
See the posted material for more detail and explanation. Below we have the executive summary.
Use the full lessR Regression() function, here abbreviated reg(), for response variable y and one predictor variable x:
reg(y ~ x)
For example, to estimate a regression model of a person's weight as a function of their height, enter the following expression into R, given a data frame d that contains the variables Weight and Height.
reg(Weight ~ Height)
Obtain the following output for the 95% confidence intervals for the intercept and slope coefficient, the same as presented in Week 4. We have the sample estimates b0 and b1, the corresponding standard errors, the corresponding t-tests (t-value and p-value under the null hypothesis that b = 0), and the lower and upper bounds of the corresponding confidence intervals.
Estimated Model for Weight

               Estimate    Std Err   t-value   p-value   Lower 95%   Upper 95%
(Intercept)    -171.152    145.362    -1.177     0.273    -506.358     164.053
     Height       4.957      2.141     2.315     0.049       0.019       9.894
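To see where the Height row comes from, here is a hand-verification sketch. The degrees of freedom, df = n - 2 = 8 (i.e., 10 rows of data), are an inference from the reported intervals, an assumption for this illustration:

b1 <- 4.957;  se_b1 <- 2.141;  df <- 8   # df = n - 2, assuming n = 10
b1 / se_b1                               # t-value: 2.315
2 * pt(-abs(b1 / se_b1), df)             # two-sided p-value: about 0.049
t_crit <- qt(0.975, df)                  # 2.306, for 95% confidence
b1 - t_crit * se_b1                      # lower bound: about 0.02 (table: 0.019)
b1 + t_crit * se_b1                      # upper bound: 9.894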
From the data, the machine learned the following: b0 = -171.152 and b1 = 4.957, which defines the estimated regression model for calculating the fitted value, ŷi, that describes the relationship only in this particular sample:
ŷi = -171.152 + 4.957(xi)
where xi is a person's Height and ŷi is the fitted value for that person's Weight. Always note the distinction between the value fitted by the model, ŷ, for a given row of data and the actual data value, y.
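For example, applying the estimated equation to a hypothetical height of 70 inches (an illustration, not a value from the data table):

b0 <- -171.152;  b1 <- 4.957
b0 + b1 * 70     # fitted Weight: 175.838 lbs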
Sample Interpretation of the slope coefficient: In this sample, for each additional inch of height, on average, weight increases 4.957 lbs.
But we always interpret the results of an analysis in terms of the population, not the one, arbitrary sample from which the model was estimated (machine learning). The 95% confidence interval for the slope coefficient is 0.019 to 9.894, which is a statement of the confidence interval -- not the interpretation of the confidence interval, which follows.
Interpretation of the slope coefficient: In the population, on average, for each additional inch of height, with 95% confidence, weight increases somewhere between 0.019 lbs and 9.894 lbs.
Can you see why we always interpret a statistical result in terms of the population? Otherwise we have no idea as to the generality of the result. In this case we have a very wide CI, and therefore a lousy estimate of the relation between height and weight. As managers, we would know not to use this estimate of the slope coefficient in any decision making.
In practice, we estimate the forecasting model, then obtain new data to which we apply the model to compute the forecast. When we want to predict, we apply the model to new data, one or more new values of x. By coincidence, a new data value may equal a value already in the original data table from which the model was estimated. Default prediction intervals are provided for all values of x in the training data, the data from which the model is estimated.
To specify custom values of x from which to compute the predicted values and associated prediction intervals, use the X1_new parameter, which specifies the values of the first predictor variable, here the only predictor variable, for which to generate prediction intervals.
To specify more than a single value of x to obtain prediction intervals, enter the multiple values as a vector, a variable with multiple values. In R, the most general way to create a vector is the c() function, for combine. For example, consider predicting weight from height. To predict weight from two specific values of Height, 67.5 and 71.0 inches, use the expression X1_new=c(67.5, 71.0) to generate prediction intervals for values in the following function call.
d <- Read("http://web.pdx.edu/~gerbing/data/bodyfat100.csv")
reg(Wt ~ Ht, X1_new=c(67.5, 71.0))
The value of the X1_new parameter is one vector, which consists of multiple values.
The forecasted value and the lower and upper bounds of the prediction interval about that value are provided for each value of X1_new. A verification sketch follows the output legend below.
Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals
   [sorted by lower bound of prediction interval]
----------------------------------------------------------------------
       Ht    Wt      pred        sf    pi:lwr    pi:upr     width
1  67.500         157.656    15.148   122.725   192.587    69.862
2  71.000         175.861    14.420   142.608   209.114    66.506
Ht: x-variable value
Wt: y-variable value, missing because it is unknown when a true forecast is being made
pred: predicted or forecasted value, ŷ
sf: standard error of forecast
pi:lwr : lower bound of the prediction interval
pi:upr : upper bound of the prediction interval
width: width of the prediction interval
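As a check on how each interval is formed, the bounds are the predicted value plus or minus a t multiplier times the standard error of forecast. The multiplier implied by this output is 2.306, which corresponds to df = 8 (an inference from the output, not a documented fact about the data):

pred <- 157.656;  sf <- 15.148     # first row of the output above
t_crit <- qt(0.975, df = 8)        # 2.306; df = n - 2 inferred from output
pred - t_crit * sf                 # pi:lwr: 122.725
pred + t_crit * sf                 # pi:upr: 192.587
2 * t_crit * sf                    # width:   69.862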