Advantages of Multiple Regression
In the real world, when we use regression models to forecast the target variable, we almost always have more than a single predictor variable. We end up with multiple predictor variables because, chosen carefully, they further the two goals of regression analysis:
- Make predictions (forecast).
- Understand the relationships between the predictor variables and the target variable.
Statistical Control
We introduce a new concept when we have multiple predictor variables, one of the most important concepts in data analysis: statistical control. The only unequivocal means to isolate cause (predictor) and effect (target) is experimental control, where we run true experiments with randomization and manipulation of the environment. However, experimental control is typically not available for the kinds of questions that interest us the most. Instead, we use statistical control: we analyze the effect of one predictor variable on the target (response) variable via its slope coefficient, holding the values of the other variables constant to statistically control for their effects.
With statistical control, the direct contributions of all other predictor variables in the model are held constant. Always add that qualifying phrase to the interpretation of these slope coefficients from a multiple regression model. The impact of one predictor variable on the target variable depends on the other predictor variables in the model; it is not an independent effect. For that reason, slope coefficients from a multiple regression model are called partial slope coefficients.
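As a concrete illustration, here is a minimal sketch in Python using statsmodels, assuming a hypothetical file houses.csv with predictors sqft and age and target price (these names are only for illustration). Each fitted slope is a partial slope coefficient, interpreted with the other predictor held constant.

```python
# Minimal sketch of partial slope coefficients (hypothetical data and column names).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("houses.csv")            # hypothetical dataset
X = sm.add_constant(df[["sqft", "age"]])  # two predictors plus an intercept
y = df["price"]                           # target (response) variable

model = sm.OLS(y, X).fit()
print(model.params)
# Each slope estimates the change in price for a one-unit change in that
# predictor, with the direct contribution of the other predictor held constant.
```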
Predictor Variable Selection
We also need a technique for selecting the best set of predictors. Say we have 15 candidate predictor variables. My experience has been that we end up with between 4 and 6 predictor variables in the final model. After that, it’s difficult to gain any real advantage by adding more, although it is certainly possible. We use what we call best subsets analysis, which takes advantage of computing power to evaluate all the possible subsets of the predictor variables.
We want the best subset of predictor variables, not necessarily the one that gives the highest \(R^2\) value.
Of course, the fit should be relatively high compared to most alternatives. We want a decently high \(R^2\) compared to the other candidate models, achieved with the fewest predictors. We call that parsimony, and we seek it when we build these models.
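As a rough sketch of how a best subsets search can work, assuming a hypothetical DataFrame with a target column and a list of candidate predictor names, the helper below fits every subset up to a maximum size and ranks the subsets by adjusted \(R^2\), which penalizes extra predictors and therefore rewards parsimony:

```python
# Best subsets sketch: enumerate predictor subsets and rank by adjusted R^2.
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

def best_subsets(df, target, candidates, max_size=6):
    results = []
    for k in range(1, max_size + 1):
        for subset in combinations(candidates, k):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[target], X).fit()
            results.append((subset, fit.rsquared_adj))
    # Adjusted R^2 penalizes additional predictors, so the leaders tend
    # to be parsimonious rather than simply the largest subsets.
    return sorted(results, key=lambda r: r[1], reverse=True)

# Example with hypothetical column names:
# top = best_subsets(df, "y", ["x1", "x2", "x3", "x4", "x5"])
# print(top[:5])
```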
Correlations Among the Variables
When we select new predictor variables or delete existing ones, we follow the principle that each predictor variable should add new information to the model. Each selected variable in a multiple regression model should satisfy two criteria. A predictor variable selected to be added to an existing model should provide:
- Unique Information: the predictor variable is relatively uncorrelated with the predictor variables already in the model.
- Relevant Information: the predictor variable correlates well with the response Y.
We want some correlations to be small and others to be large. Always examine the correlations among all variables in the analysis to get a sense of how the regression will work and to understand and interpret the results when we run it. A general guideline, though far from an absolute imperative, is to have correlations among the predictor variables below 0.3 and each predictor variable’s correlation with the response above 0.5.
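A quick way to apply this guideline is to screen the correlation matrix before fitting anything. The sketch below assumes a hypothetical file data.csv whose columns are the response y and the candidate predictors:

```python
# Correlation screen: predictor-response correlations should be fairly large,
# predictor-predictor correlations fairly small (hypothetical data and names).
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical dataset
corr = df.corr(numeric_only=True)

# Correlation of each predictor with the response y (want roughly > 0.5 in absolute value).
print(corr["y"].drop("y").abs().sort_values(ascending=False))

# Correlations among the predictors themselves (want roughly < 0.3 in absolute value).
print(corr.drop(index="y", columns="y").abs())
```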
Examine all scatter plots between variables. If some predictor variables are too highly correlated, we call that collinearity. Collinearity makes it difficult to separate the variables’ effects, and we end up with high standard errors, implying that neither of two collinear predictor variables contributes to the model, when in fact either one by itself would contribute well. Although collinearity does not hurt the model’s predictive ability, it prevents us from correctly interpreting the contributions of the collinear predictor variables.
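Beyond scatter plots and the correlation matrix, one standard diagnostic for collinearity (not covered above, but widely used) is the variance inflation factor. A sketch with hypothetical column names:

```python
# Variance inflation factors: large values (rules of thumb vary, often > 5 or 10)
# flag predictors whose information is largely duplicated by the other predictors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("data.csv")                 # hypothetical dataset
X = sm.add_constant(df[["x1", "x2", "x3"]])  # hypothetical predictor names

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
).drop("const")  # the intercept's VIF is not meaningful
print(vif)
```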
Basic Knowledge
Multiple regression, along with what we’ve done in previous weeks, gives us a fairly decent grasp of regression analysis. In our data science program I teach a whole course on regression analysis that covers much more, and even that does not cover everything. But with these three weeks of regression analysis, we have a basic foundation and, presumably, enough understanding to apply it in the real world, on the job.