Introduction to Regression Models

Overview

There are two primary types of forecasting models, a crucial distinction for understanding how to create a forecast. We previously studied time series models at an intuitive level.

Time Series model forecasting: Forecast the value of a variable for the next time period based on the values of that variable at previous time periods.

A time series model bases a forecast on the pattern of variability of one variable across previous time periods. Identify the pattern, then forecast by extending that pattern into the future.

Later in the course, we introduce some analytic methods for generating time series forecasts. For the next three weeks (not including the midterm week), however, we apply an analytic technique that works for both time series and explanatory models.

Explanatory Models

Explanatory model forecasting: Predict the values of a variable based on patterns of relationships between that variable and one or more other variables.

The explanatory model forecast is based on the pattern of co-variation of one or more predictor variables with each other and, particularly, with the variable to be forecasted. We still search for patterns, but for these models the patterns are the correlations among a set of different variables.

There is a convention of reserving the term forecasting for time series models and the term prediction for explanatory, regression-type models. That convention is not universally applied, but I do tend to follow it throughout this course. The words forecast and predict have about the same meaning, so using them differently is just a convention.

For explanatory models, we express the (linear, i.e., straight-line) relation between two variables visually with a scatterplot and statistically with the correlation coefficient. From that information, we construct regression models for statistical forecasting by relating the variable whose value we wish to forecast to one or more other variables. This week's content introduces correlation and simple regression, in which we relate the forecasted variable, called the target, to only one other variable, the predictor variable. In practice, most regression models are constructed with multiple predictor variables, but we have to start somewhere.

Constructing models with regression analysis is an essential statistical technique with a wide variety of applications, perhaps the most important, useful, and general statistical procedure that exists. For example, regression models apply to the analysis of mean differences, such as evaluating the mean difference in salaries between men and women at a company. Regression models can do both time series forecasting and explanatory model prediction. Predictive models from regression analysis are also one type of machine learning analysis, a big deal in business and a skill in demand.

Content

Correlation

Concept

The value of the correlation coefficient indicates the direction and extent of a linear relationship between two variables. The correlation coefficient does not itself provide a forecast, but it is the basis for the regression analysis that does. If there is no correlation between variables X and Y, then the value of Y cannot be forecasted from X. Before doing a regression analysis, it is worthwhile to examine the scatterplot that shows the relation, or lack thereof, between the variables of interest. This advice becomes even more relevant when we build more realistic models that contain multiple predictor variables.
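As a minimal sketch, the correlation coefficient can be computed in base R with cor() for two small, hypothetical vectors of data values (the lessR scatterplot function appears in the next subsection):

    x <- c(1, 2, 3, 4, 5)              # hypothetical data values for X
    y <- c(2.0, 3.5, 4.1, 5.8, 6.9)    # hypothetical data values for Y
    cor(x, y)    # sign gives the direction, magnitude the extent of linearity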

8.1 Correlation slides [a, b only] 8.1a [8:57] 8.1b [8:16]

Computer: lessR Plot()

For a scatterplot of two variables, categorical or continuous, generically X and Y: Plot(X, Y).

The ellipse option provides a visual aid to help identify any underlying pattern.

Plot(X, Y, ellipse=0.95)

This function call displays the 0.95 data ellipse, which, if the variables are bivariate normal, will contain, on average, 95% of the data values.

You can also obtain the best-fitting regression line by adding the parameter value fit="lm", for "linear model", to the Plot() function call. More generally, use the regression analysis described below to obtain much more detailed statistical output as well as the same scatterplot with the regression line.
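As a minimal sketch of these calls together, assuming the lessR package is installed and a data table with variables X and Y has been read into the default data frame d, such as with Read():

    library(lessR)
    # d <- Read("")            # browse for a data file, stored in data frame d

    Plot(X, Y)                 # basic scatterplot of Y against X
    Plot(X, Y, ellipse=0.95)   # add the 0.95 data ellipse
    Plot(X, Y, fit="lm")       # add the least-squares (linear model) line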

Regression Line Forecast

Concept

As with the stable process, in which the forecast is the mean of the process output, forecasts based on the more general linear model with slope b1 are still based on means. Now, however, the forecast is the mean of Y for each specific value of X, because the mean of Y changes with the value of X. Refer to the overall mean of the system as the unconditional mean. The mean of Y for a given value of X is the conditional mean. For a stable process, b1 = 0, that is, no slope, so the value of Y is unrelated to the value of X.

Begin with the data, the values of predictor X and response Y. The variables X and Y are each originally represented by corresponding columns of data values in a data table. A linear model with a single predictor variable is of the form,

    Y = b0 + b1(X) + e

This model explains a data value of response variable Y in terms of a linear function of the corresponding value of predictor variable X, plus a residual or error term that accounts for the departure from perfect linearity. Of course, there is no specific function until the values of b0 and b1 are estimated from the data values for X and Y with a regression analysis on the computer, such as my Regression() function detailed below. Given specific values of b0 and b1, plug a specific value of predictor variable X into the estimated equation to calculate the fitted value, the forecast computed from the model, of response variable Y, without the error term (the residual),

    Ŷ = b0 + b1(X)

Applying the least-squares estimation algorithm on the computer, such as with the Regression() function, we estimate the values of intercept b0 and slope b1. The estimates are specific numbers, which yield a specific forecasting equation, such as Ŷi = 2.5 + 10.6(Xi), where Ŷi is the fitted value for the ith row of data. Plug a specific value of X, Xi, into the model to generate a forecast for Y, Ŷi.
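As a minimal sketch of this computation in R, using the hypothetical estimates b0 = 2.5 and b1 = 10.6 from the example equation above:

    b0 <- 2.5
    b1 <- 10.6
    x  <- 4          # a specific value of predictor X
    b0 + b1 * x      # fitted value: 2.5 + 10.6(4) = 44.9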

Online Textbook

Introduction to Machine Learning [Sec 1.1, 1.2, 1.3]
Section 1 is an overview of supervised machine learning. The most relevant sections for this week are at the beginning of this reading, Subsections 1.1, 1.2, and 1.3, which directly set up the following material on linear models and one form of machine learning, regression analysis, the topic of Section 2. Ultimately, it is all relevant to modern machine learning but feel free to focus on the first three sub-sections.

Linear Regression [Sec 2.1 - 2.4 (2.5 is after the midterm)]

Video

This video for Section 1 [22:15] is presented in chapters. Click on the 3 bars at top-left to navigate to a specific chapter.

The Prediction Equation [Sec 1.1, 1.2, 1.3]
Section 1 is an overview of supervised machine learning, the same reading described above. Focus on Subsections 1.1, 1.2, and 1.3 as a basis for understanding the following videos on the linear regression model.

Linear Models [Sec 2.1, 2.2, 2.3; 21:38]
2.1 The form of the model, a linear prediction equation with slope, b1, and intercept, b0. [1:27]
2.2 An example with two columns of data, for predictor variable x and the variable to be predicted, y. [7:39]
2.3 The estimated model, that is, estimates for the slope and intercept from running computer software to do the regression analysis. [12:28]

Residuals and Estimation [Sec 2.4, 9:17]
How the machine learns by choosing the values of b0 and b1 that minimize the sum of the squared residuals over all rows of data. The learning process is all about how far each fitted value computed by the prediction equation, ŷi, is from the actual value, yi; the residual for each row of data is yi - ŷi.
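As a minimal sketch of the criterion being minimized, with small, hypothetical vectors x and y (base R's lm() is shown only to verify the least-squares estimates):

    x <- c(1, 2, 3, 4, 5)               # hypothetical predictor values
    y <- c(2.1, 4.3, 5.9, 8.2, 9.8)     # hypothetical response values

    sse <- function(b0, b1) sum((y - (b0 + b1*x))^2)   # sum of squared residuals

    sse(0, 2)          # squared error for one candidate line
    coef(lm(y ~ x))    # the estimates that minimize the squared error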

Computer: lessR Regression()

The lessR regression function Regression(), abbreviated reg(), produces the full regression output including the estimated values of b0 and b1 to construct the forecasting model and data visualizations such as the scatterplot with the regression line. This week we use the brief version, here with response variable Y (to be predicted) and predictor variable X:

reg_brief(Y ~ X)

Note the use of the tilde, ~, instead of a comma in the regression specification. As you might infer, the tilde and the comma refer to entirely different concepts in these R function calls. The tilde indicates the specification of a model, which explains or accounts for the variation of the variable on the left, the response variable Y, according to the variation of one or more predictor variables X, the variable(s) on the right of the tilde.

The output includes the estimated values of y-intercept, b0, and slope, b1, under the column labeled Estimate toward the beginning of the output of reg() or reg_brief(). Consider an analysis that uses a person's Height to predict their Weight. In that model, Height is the predictor variable, X, and Weight is the response variable, Y.

> reg_brief(Weight ~ Height)

Estimated Model for Weight
             Estimate    Std Err  t-value  p-value   Lower 95%   Upper 95%
(Intercept)  -171.152    145.362   -1.177    0.273    -506.358     164.053
     Height     4.957      2.141    2.315    0.049       0.019       9.894

Here, b0 = -171.152 and b1 = 4.957, which defines the estimated regression equation for calculating the fitted value, Ŷi :

    Ŷi = -171.152 + 4.957(Xi),    where Xi is a person's Height, Ŷi is the fitted value for that person's Weight.

Section 2.3.2.2 in the Online Textbook shows a more detailed example of computing a forecasted value from a regression equation. Here is a briefer example of computing a forecasted value of Y, Weight, from the model estimated from the data with a regression analysis.

For example, if someone is 70 inches tall, then we can compute their forecasted weight, the value fitted by the estimated model:
    Ŷi = -171.152 + 4.957(Xi) = -171.152 + 4.957(70) = -171.152 + 346.99 = 175.84 lbs.

Suppose the person weighs 180 lbs. How close is the forecasted value, Ŷi, to what actually occurred, Yi? The answer is the residual or error, ei:
    ei = Yi - Ŷi = 180 - 175.84 = 4.16 lbs
The person weighs 4.16 lbs more than what the model predicted.
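As a minimal sketch, the same computation in R, using the estimates reported above:

    b0 <- -171.152
    b1 <- 4.957
    y_hat <- b0 + b1*70      # fitted Weight for a Height of 70: about 175.84
    180 - y_hat              # residual for an actual Weight of 180: about 4.16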

See also the short video on obtaining and reading the regression model:

Estimate the Model with reg_brief() [2:39]

Two General Applications

Regression analysis generates forecasts from two classes of models:

Time series models, in which the predictor variable is time, so the regression line is a trend line extended into the future.

Explanatory models, in which the predictor variables are other variables related to the response.

Each application is illustrated in this week's homework. For the time series application, this week we only consider trend lines for data that have no underlying seasonality. Trend lines imposed upon underlying seasonality are discussed later in the term.
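As a minimal sketch of the two applications, with hypothetical variable names (Sales over consecutive time periods for the trend model, and the Weight and Height variables from the example above for the explanatory model), assuming the relevant data are in the default data frame d:

    # Explanatory model: predict Weight from the related variable Height
    reg_brief(Weight ~ Height)

    # Time series trend: predict Sales from a simple time index
    d$Index <- 1:nrow(d)       # consecutive time periods as the predictor
    reg_brief(Sales ~ Index)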