Approximately 8 homework assignments, always due Thursdays by 11:59p. Homework will feature a combination of theoretical derivations and empirical implementation of models using R or other software.
Both exams will be take-home format, with about 5 days between assignment and due date. Midterm will be due during week 6, final will be due by the end of finals week.
Regression is a model investigating the relationship between variables. In the most general mathematical form,
\[Y = f(X; \beta) + \epsilon\]
Where \(X\) is data, \(\beta\) are unknown parameters, and \(\epsilon\) is an error term with no relationship to \(X\).
Two primary components of regression analysis are finding \(\beta\), known as estimation, and interpreting \(\beta\), known as inference.
These models can be mechanistic or empirical. A mechanistic model is informed by background knowledge or theory that dictates the functional specification, e.g., a physical, biological, or economic model. Empirical models begin without this background information and only look at the relationship between variables as expressed in the data. Our course will focus on empirical models.
It is important to distinguish between causal vs relational models. Regression analysis does not uncover causation, only association or relationships between variables.
Recall, the basic specification for simple linear regression:
\[Y= \beta_0 + \beta_1 X + \epsilon\]
For the error term, \(\epsilon\), we have \[E[\epsilon] = 0\] \[V(\epsilon)= \sigma^2\] Ideally, \(\sigma^2\) is known, but typically it is not, so it needs to be estimated along with the \(\beta\) coefficients.
Say we have a scatter plot in two variables. We can draw a number of linear functions through the data. So our question is, what is the best line of fit?
For a single data point under our specification,
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
\[\epsilon_i = y_i- \beta_0 - \beta_1 x_i\]
We want the errors to be small overall, meaning we want to minimize the size of \(\epsilon\) generally, and we might also want to penalize large outliers more heavily. This is the idea behind least squares estimation. Our objective function becomes
\[\min\limits_{\beta_0, \beta_1} \sum_{i=1}^n[y_i - \beta_0-\beta_1 x_i]^2\]
\[S = \sum_{i=1}^n[y_i - \beta_0-\beta_1 x_i]^2\]
\[\frac{∂S}{∂\beta_0} = \sum_{i=1}^n -2[y_i - \beta_0-\beta_1 x_i]\]
\[\frac{∂S}{∂\beta_1} = \sum_{i=1}^n -2x_i[y_i - \beta_0-\beta_1 x_i]\]
To optimize, we set both equal to zero:
\[\sum_{i=1}^n y_i - \sum_{i=1}^n \beta_0 - \sum_{i=1}^n \beta_1 x_i = 0\]
\[\sum_{i=1}^n x_iy_i - \sum_{i=1}^n \beta_0x_i - \sum_{i=1}^n \beta_1 x_i^2 = 0 \]
Then, simplifying some of the summations:
\[\overline{y} = \frac{1}{n}\sum_{i=1}^ny_i\]
\[\overline{x} = \frac{1}{n}\sum_{i=1}^nx_i\]
\[\overline{xy} = \frac{1}{n}\sum_{i=1}^nx_iy_i\]
\[\overline{x^2} = \frac{1}{n}\sum_{i=1}^nx_i^2\]
we can rewrite our solutions as:
\[\overline{y} - \beta_0 - \beta_1 \overline{x} = 0\]
\[\overline{xy} - \beta_0 \overline{x} - \beta_1 \overline{x^2} = 0\]
So, our estimates can be written in terms of only the data with some substitution:
\[\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\] \[\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - (\overline{x})^2}\]
Qualitatively, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals in the original model \(Y = \beta_0 + \beta_1 X + \epsilon\).
Then expanding back to summation notation:
\[\overline{x^2} - (\overline{x})^2 \Rightarrow n\overline{x^2} - n(\overline{x})^2\]
\[= n \frac{1}{n} \sum_{i=1}^nx_i^2 - n \left(\frac{1}{n} \sum_{i=1}^n x_i\right)\left(\frac{1}{n}\sum_{i=1}^n x_i\right)\]
\[= \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)\left(\frac{1}{n}\sum_{i=1}^nx_i\right)\]
\[ = \sum_{i=1}^n\left(x_i^2 - 2\overline{x}x_i + \overline{x}^2\right)\]
\[ = \sum_{i=1}^n(x_i - \overline{x})^2 \equiv S_{xx}\]
\(S_{xx}\) resembles the variance in the observed data for \(x\). Similarly,
\[S_{xy} = \sum_{i=1}^n y_i(x_i - \overline{x})\]
Alternatively, because covariance is unaffected by addition of a constant term to one of the variables,
\[S_{xy} = \sum_{i=1}^n (y_i - \overline{y})(x_i - \overline{x})\]
(left for a homework assignment).
\(S_{xy}\) is related to the numerator of our earlier expression for \(\hat{\beta}_1\), and \(S_{xx}\) is related to the denominator:
\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]
\[\hat{\beta}_0 = \overline{y}- \frac{S_{xy}}{S_{xx}}\overline{x}\]
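As a quick numerical check, these closed-form estimates can be computed directly and compared with R's lm() function. A minimal sketch with simulated data; the object names (Sxx, Sxy, b0_hat, b1_hat) are illustrative, not from the lecture.

```r
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n)          # true beta0 = 2, beta1 = 0.5

Sxx <- sum((x - mean(x))^2)          # S_xx
Sxy <- sum(y * (x - mean(x)))        # S_xy

b1_hat <- Sxy / Sxx                  # slope estimate
b0_hat <- mean(y) - b1_hat * mean(x) # intercept estimate

c(b0_hat, b1_hat)
coef(lm(y ~ x))                      # should agree with the manual estimates
```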
Now, from an initial assumption that \(y = \beta_0 + \beta_1x + \epsilon\), we can write our solution as an estimated line of best fit, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\). The distinction between those two equations is important. \(y\) is a random variable composed of a deterministic element and a random element. \(\hat{y}\) is the expected value of \(y\) conditioned on the data \(x\). Our estimate tells us the conditional mean of \(y\).
\[\hat{y} = E[Y|X] = \hat{\beta}_0 + \hat{\beta}_1x\]
If \(b = \sum_{i = 1}^n a_ix_i\), where the \(a_i\) are constants, then \(b\) is a linear combination of \([x_i]_{i = 1}^n\).
\[\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\]
So \(\overline{y}\), and hence \(\hat{\beta}_0\), is a linear combination of \([y_i]_{i = 1}^n\); for \(\overline{y}\) the weights are \(a_i = 1/n\).
\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \overline{x})}{S_{xx}}\]
So \(a_i = \frac{x_i - \overline{x}}{S_{xx}}\).
We need to show that \(E[\hat{\beta}_0] = \beta_0\) and \(E[\hat{\beta}_1] = \beta_1\)
Assume that \(E[\hat{\beta}_1] = \beta_1\).
\[E[\hat{\beta}_0] = E[\overline{Y} - \hat{\beta}_1\overline{x}] = E[\overline{Y}] - \overline{x}E[\hat{\beta}_1]\]
\[ = \frac{1}{n} \sum_{i = 1}^nE[Y_i] - \overline{x}\beta_1\]
\[ = \frac{1}{n} \sum_{i = 1}^n (\beta_0 + \beta_1x_i) - \overline{x}\beta_1\]
\[ = \beta_0 + \beta_1 \overline{x} - \overline{x} \beta_1\]
\[= \beta_0\]
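A small simulation can illustrate this unbiasedness result: averaging the OLS estimates over many datasets generated from known parameters should recover those parameters. This is a hypothetical sketch, assuming a fixed design and the illustrative values \(\beta_0 = 2\), \(\beta_1 = 0.5\).

```r
set.seed(2)
beta0 <- 2; beta1 <- 0.5
x <- runif(40, 0, 10)                   # fixed x across replications

est <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(40)    # new errors each replication
  coef(lm(y ~ x))
})
rowMeans(est)                           # approximately (2, 0.5)
```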
Recall our linear model, \(Y = \beta_0 + \beta_1X + \epsilon\). We estimate the parameters \(\beta_0\) and \(\beta_1\).
Under OLS, our parameter estimates are
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{S_{xx}}\]
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
We had some properties of our estimates.
Model assumption is that \(E(Y) = \beta_0 + \beta_1X\) and \(V(Y) = \sigma^2\). So
\[V(\hat{\beta}_1) = V\left(\frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{S_{xx}}\right)\]
\[ = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{S_{xx}^2}V(Y_i) = \frac{\sigma^2}{S_{xx}}\]
\[V(\hat{\beta}_0) = V(\bar{Y} - \hat{\beta}_1\bar{X}) = V(\bar{Y}) + (\bar{X})^2V(\hat{\beta}_1) - 2\bar{X}Cov(\bar{Y}, \hat{\beta}_1)\]
\[ = V\left(\frac{1}{n}\sum_{i=1}^nY_i\right) + (\bar{X})^2\frac{\sigma^2}{S_{xx}} - 2\bar{X}Cov(\bar{Y}, \hat{\beta}_1)\]
\[ = V\left(\frac{1}{n}\sum_{i=1}^nY_i\right) + (\bar{X})^2\frac{\sigma^2}{S_{xx}} - 2\bar{X}(0)\]
\[ = \frac{1}{n} \sigma^2 + \frac{(\bar{X})^2}{S_{xx}}\sigma^2\]
\[ = \left(\frac{1}{n} + \frac{(\bar{X})^2}{S_{xx}}\right)\sigma^2\]
Our estimates are the best linear unbiased estimators (BLUE): among linear unbiased estimators, none has lower variance. This is the Gauss-Markov theorem.
The residuals sum to zero:
\[\sum_{i=1}^n(y_i - \hat{y}_i) = \sum_{i=1}^ne_i = 0\]
It follows that the sum of the observed values equals the sum of the fitted values:
\[\sum_{i=1}^ny_i = \sum_{i=1}^n \hat{y}_i\]
The regression line passes through the centroid of the data, \((\bar{x},\bar{y})\). In other words, \(\bar{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x}\).
A few more sums:
\[\sum_{i=1}^nx_ie_i = 0\]
\[\sum_{i=1}^n\hat{y}_ie_i = 0\]
In practice, \(\sigma^2\) is not known, so it needs to be estimated along with the \(\beta\) parameters. Recall, \(\sigma^2 = V(\epsilon_i), i = 1, 2, ..., n\).
We use our estimated residuals \(e_i\) to estimate \(\sigma^2\). We’ll call the sum of squared errors \(SS_{Res}\).
\[\begin{align*} SS_{Res} = \sum_{i=1}^ne_i^2 &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\ &= \sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2 \\ &= \sum_{i=1}^ny_i^2 - n(\bar{y})^2 - \hat{\beta}_1S_{xy} \\ SS_T &= \sum_{i=1}^n(y_i-\bar{y})^2 = \sum_{i=1}^ny_i^2 - n(\bar{y})^2 \\ \sum_{i=1}^ne_i^2 &= SS_T - \hat{\beta}_1S_{xy} \\ SS_T &= \sum_{i=1}^ne_i^2 + \hat{\beta}_1S_{xy} \end{align*}\]
Remember, \(SS_T\) is just a constant that is calculated from the original data. It does not depend on any estimates. \(SS_{Res}\) depends on our estimates, so it is sensitive to different models. To summarize,
\[SS_{Res} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 \] \[SS_T = \sum_{i=1}^n(y_i - \bar{y})^2\] \[SS_R = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2\] \[SS_T = SS_{Res} + SS_R\]
We can use \(SS_{Res}\) to proceed with estimating \(\sigma^2\). \(E[SS_{Res}] = (n-2)\sigma^2\).
So \(\frac{SS_{Res}}{n-2}\) is an unbiased estimator of \(\sigma^2\).
Proof depends on characteristics of \(\chi^2\) distribution. \((y_i - \hat{y}_i)^2\) is a squared normally distributed variable \(\{N(0,\sigma^2)\}^2\).
The two beta parameters place restrictions on the degrees of freedom of the \(\chi^2\) r.v., so its degrees of freedom are \((n-2)\). If we had a total of \(k\) \(\beta\) parameters, the degrees of freedom would become \((n-k)\).
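In R, the square of the "residual standard error" reported by summary() for a fitted lm object is exactly \(SS_{Res}/(n-2)\). A minimal sketch, assuming vectors x and y already exist in the workspace; fit is an illustrative name.

```r
fit <- lm(y ~ x)
e   <- resid(fit)
sigma2_hat <- sum(e^2) / (length(e) - 2)   # SS_Res / (n - 2)

sigma2_hat
summary(fit)$sigma^2                       # same value
```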
Original is \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\).
\[\begin{align*} y_i &= \beta_0 + \beta_1x_i + \epsilon_i - \beta_1\bar{x} + \beta_1\bar{x} \\ &= (\beta_0 + \beta_1\bar{x}) + \beta_1(x_i - \bar{x}) + \epsilon_i \\ &= \beta_0' + \beta_1(x_i - \bar{x}) + \epsilon_i \end{align*}\]
The advantage of this specification is that it centers the \(x\) data at zero, so that \(\hat{\beta}_0' = \bar{y}\) (because the centered regressor now has mean zero). Also, \(\hat{\beta}_1\) is unchanged from the original.
In our model, we will come up with some estimates for our \(\beta\) parameters. Our question in hypothesis testing is whether the true \(\beta\) value is equal to our \(\hat{\beta}\). We can never prove this, but we can test whether our observed data would be more or less likely under different hypothetical values of \(\beta\).
The hypothesis problem is whether \(H_0: \beta_1 = \beta_{1,0}\) or \(H_1: \beta_1 ≠ \beta_{1,0}\), where \(\beta_{1,0}\) is most commonly zero.
We need to find some test statistics on our parameter estimates that satisfy some distribution. To do this, we need to define the distribution of \(Y\). This involves assumptions about the distribution of \(\epsilon\). If we assume that \(\epsilon \sim N(0,\sigma^2)\), then we can proceed.
\(Y \sim N(\beta_0 + \beta_1X, \sigma^2)\)
Recall that our parameter estimates are linear combinations of the response variable data. And linear combinations of normally distributed variables are also normally distributed variables.
We need to make assumptions about the behavior of \(\epsilon\) to analyze the data.
Assumptions:
\(\epsilon_i \sim N(0,\sigma^2)\),
\(\epsilon_i\) are independent and identically distributed
For any \(1 ≤ i < j ≤n, E[\epsilon_i \epsilon_j] = E[\epsilon_i]E[\epsilon_j] = 0\)
Our test statistic, \(T\) is a function of the random variables \(Y_i\): \(T(Y_1, Y_2, ..., Y_n)\).
First, let’s construct our test statistic. There are many statistics we could create that are functions of \(Y_i\), but we want to create one that deals with workable distributions.
We know from our assumption about \(\epsilon\) that
\[Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)\]
\[\hat{\beta}_1 = \sum_{i=1}^n \frac{(x_i - \bar{x})}{S_{xx}}Y_i\]
So \(\hat{\beta}_1\) is a linear combination of the independent normal random variables \(Y_i \sim N(\beta_0 + \beta_1x_i,\sigma^2)\), and is therefore itself normally distributed:
\[\hat{\beta}_1 \sim N(\beta_1, \sigma^2 \frac{1}{S_{xx}})\]
We are going to test \(\beta_1\). And we see that it is related to the normal distribution because of our assumption about \(\epsilon\).
We can create a new random variable by transforming the random variable \(\hat{\beta}_1\) into a standard normal variable.
\[Z = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{\sigma^2}{S_{xx}}}} \sim N(0,1)\]
Then if we know \(\sigma^2\) we can use \(N(0,1)\).
If we don’t know \(\sigma^2\), then we need a different statistic.
Let’s proceed assuming we can use \(Z\).
Under \(H_0, \beta_1 = \beta_{1,0}\). If this is the case, then we would expect \(Z\) to be very close to 0. If \(Z\) is far from zero, then it suggests that either we captured a very unlikely dataset, or that our \(H_0\) is not correct.
But we need to define a likelihood threshold beyond which we might reject \(H_0\).
We can denote the combined area in the two tails of the \(N(0,1)\) density as \(\alpha\). We can choose an arbitrary significance level (lower \(\alpha\) corresponds to higher confidence), and then check whether
\[|z| < z_{\alpha/2}\]
That is, we compare our test statistic to the critical value that leaves a combined tail area of \(\alpha\).
If we don’t know \(\sigma^2\), then we need to estimate it. Recall that an unbiased estimator for \(\sigma^2\) is
\[\hat{\sigma}^2 = \frac{SS_{Res}}{n-2}\]
\[\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
Using this estimate, we can construct a different test statistic. It will no longer be standard normally distributed:
\[t = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}} \sim t_{(n-2)}\]
Remember that a \(t\) distributed r.v. is the ratio of a standard normal r.v. and the square root of a \(\chi^2\) distributed random variable divided by its degrees of freedom:
\[ = \frac{N(0,1)}{\sqrt{\chi^2_p/p}}\]
To confirm that our test statistic \(t \sim t_{(n-2)}\), we need to show that this component of the denominator follows a \(\chi^2\) distribution:
\[\frac{SS_{Res}}{\sigma^2} \sim \chi^2_{(n-2)}\]
(This is left as a homework exercise.) Once we have confirmed this, then our criterion for \(H_0\) rejection is
\[|t| > t_{\frac{\alpha}{2},n-2}\]
The critical value \(t_{\frac{\alpha}{2},n-2}\) will depend on the number of observations in the data; smaller data sets result in \(t\) distributions with fatter tails.
So our sequence is to observe \(x\) and \(y\), compute our OLS estimates, compute our \(t\) statistic, and compare it to the critical value \(t_{\frac{\alpha}{2},n-2}\).
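The whole sequence can be carried out by hand in R and checked against summary(lm()). A sketch, assuming observed vectors x and y; the object names are illustrative.

```r
fit    <- lm(y ~ x)
n      <- length(y)
Sxx    <- sum((x - mean(x))^2)
MS_Res <- sum(resid(fit)^2) / (n - 2)          # estimate of sigma^2

t_stat <- coef(fit)["x"] / sqrt(MS_Res / Sxx)  # test of H0: beta1 = 0
t_crit <- qt(1 - 0.05/2, df = n - 2)

abs(t_stat) > t_crit            # TRUE => reject H0 at the 5% level
summary(fit)$coefficients       # the same t statistic and its p-value
```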
We can also use analysis of variance to test whether or not \(\beta_1 = 0\). We can rewrite the quantity
\[\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]
\[SS_T = SS_{Res} + SS_{R}\]
The lefthand side has d.o.f. \(n-1\). \(SS_{Res}\) has d.o.f. \((n-2)\), and SS_R has d.o.f. 1. We can consider an \(F\) test for \(H_0: \beta_1 = 0\) vs \(H_1: \beta_1 ≠0\).
\[F = \frac{SS_R/1}{SS_{Res}/(n-2)} = \frac{MS_R}{MS_{Res}} \sim F_{1, (n-2)}\]
We will reject \(H_0\) if \(F > F_{\alpha, 1, (n-2)}\)
In the case of simple (univariate) linear regression, the \(F\) statistic is simply a squared \(t\) statistic. This is not the case in multiple linear regression. We won’t deal too much with the ANOVA method in this class.
To compute our test statistic for \(\beta_0\),
\[t = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\left(\frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}}\]
We learned already how to perform a hypothesis test on either parameter. How can we answer the question ‘how far is our estimate from the truth?’ To answer a question like this, we use confidence intervals.
\[t_1 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}} \sim t_{(n-2)}\]
\[t_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}} \sim t_{n-2}\]
Let
\[se(\hat{\beta}_1) = \sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}\]
\[se(\hat{\beta}_0) = \sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}\]
Let us construct an interval \((a,b)\) such that \(P(a < T< b) = 1- \alpha\) for \(0 < \alpha < 1\).
Our goal is to find this interval given our parameter estimates and estimated standard errors.
Step 1: find the random variable \(T\).
Step 2: find the interval for our parameter of interest.
For \(\beta_1\), step 1 is
\[T = \frac{\hat{\beta}_1 - \beta_{1}}{se(\hat{\beta}_1)} \sim t_{n-2}\]
Step 2 is to find an interval centered on \(T\). We will save this for Tuesday.
Review: we know that we can write the test statistics for our two \(\beta\) estimators as
\[t_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}} \sim t_{n-2}\]
\[t_1 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}} \sim t_{(n-2)}\]
Definition:
\(MS_{Res}= \frac{SS_{Res}}{n-2}\)
\(s.e.(\hat{\beta}_0) = \sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}\)
\(s.e.(\hat{\beta}_1) = \sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}\)
We are looking for an interval \((a,b)\) such that \(P(a<\beta_1<b) = 1- \alpha\), where \(\alpha\) is between 0 and 1 and \(1-\alpha\) is the confidence level (lower \(\alpha\) means higher confidence).
Then we say that the interval \((a,b)\) is a \(100(1- \alpha)\)% Confidence Interval (C.I.) of \(\beta_1\).
\(\beta_1\) is our assumed ‘true’ parameter, which remains unknown. But we know that \(t_1\), defined above, follows a T-distribution, so instead of \(P(a< \beta_1 < b)\), we can look at
\[P \left(a^* < \frac{\hat{\beta}_1 - \beta_1}{s.e.(\hat{\beta}_1)} < b^*\right)\]
\[=P \left(s.e.(\hat{\beta}_1)a^* < \hat{\beta}_1 - \beta_1 < s.e.(\hat{\beta}_1)b^*\right)\]
\[=P \left(\hat{\beta}_1 - s.e.(\hat{\beta}_1)b^* < \beta_1 < \hat{\beta}_1 - s.e.(\hat{\beta}_1)a^*\right)\]
Then if we rename the left and right components of the inequality as \(a\) and \(b\), then we can say that this is equivalent to
\[P (a < \beta_1 < b) = 1- \alpha\]
So now our goal is to find \(a^*\) and \(b^*\) such that \(P (a^* < t < b^*) = 1- \alpha\)
Where \(t \sim t_{n-2}\).
The central area of the \(t_{n-2}\) density, between \(a^*\) and \(b^*\), is equal to \(1- \alpha\).
We are looking for a critical value \(c = b^* = -a^*\) from a \(t_{n-2}\) distribution that leaves an area of \(\alpha/2\) in each tail.
The values of \(a\) and \(b\) that satisfy this requirement are
\[a = \hat{\beta}_1 - t_{\alpha/2, n-2}s.e.(\hat{\beta}_1)\] \[b = \hat{\beta}_1 + t_{\alpha/2, n-2}s.e.(\hat{\beta}_1)\]
The confidence interval for \(\beta_0\) follows a similar format,
\[ \left(\hat{\beta}_0 - t_{\alpha/2, n-2}s.e.(\hat{\beta}_0), \hat{\beta}_0 + t_{\alpha/2, n-2}s.e.(\hat{\beta}_0)\right)\]
Good question from a student on where the d.o.f. for the test statistics comes from: why \((n-2)\)? Our model has two parameters (slope and intercept), so the residual sum of squares loses two degrees of freedom (out of an original \(n\)).
\[\frac{(n-2)MS_{Res}}{\sigma^2} \sim \chi_{n-2}^2\]
So the confidence interval for \(\sigma^2\) is
\[\left( \frac{(n-2)MS_{Res}}{\chi_{\frac{\alpha}{2},n-2}^2}, \frac{(n-2)MS_{Res}}{\chi_{1- \frac{\alpha}{2},n-2}^2} \right)\]
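This interval can be computed with qchisq(), which returns chi-square quantiles. A sketch, assuming a fitted simple regression; fit is an illustrative object name.

```r
fit    <- lm(y ~ x)
n      <- length(resid(fit))
MS_Res <- sum(resid(fit)^2) / (n - 2)
alpha  <- 0.05

lower <- (n - 2) * MS_Res / qchisq(1 - alpha/2, df = n - 2)
upper <- (n - 2) * MS_Res / qchisq(alpha/2,     df = n - 2)
c(lower, upper)                 # 95% CI for sigma^2
```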
So we can construct CIs for all three of our parameters at a given confidence level. Now we want a confidence interval for our conditional expected value, \(E[Y|X_0]\). This is the average of predictions at \(x_0\).
\(E[Y|X_0] = \beta_0 + \beta_1x_0\)
\[E[\hat{\beta}_0 + \hat{\beta}_1x_0] = \beta_0 + \beta_1x_0\]
\[V(\hat{\beta}_0 + \hat{\beta}_1x_0) = \frac{\sigma^2}{n} + \frac{\sigma^2}{S_{XX}}(x_0 - \bar{x})^2\]
If \(\sigma^2\) is known, then
\[\hat{\beta}_0 + \hat{\beta}_1x_0 \sim N(\beta_0 + \beta_1x_0, \frac{\sigma^2}{n} + \frac{\sigma^2}{S_{XX}}(x_0 - \bar{x})^2)\]
If \(\sigma^2\) is unknown, then
\[\frac{\hat{\beta}_0 + \hat{\beta}_1x_0- E[Y|x_0]}{\sqrt{\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)MS_{Res}}} \sim t_{n-2}\]
So we can construct a CI for our conditional expected value of \(y\) at \(x_0\).
In the last case, we used \(\hat{\beta}_0 + \hat{\beta}_1x_0 \sim N(.,.)\)
In this case, \(y - \hat{y} = \epsilon \sim N(0,\sigma^2)\). \(y\) is our target, \(\hat{y}\) is our estimate.
If \(\sigma^2\) is known, then
\[\frac{y-\hat{y}}{\sqrt{\sigma^2}} \sim N(0,1)\]
But if \(\sigma^2\) is unknown, then we need to find \(E[\epsilon]\) and \(V(\epsilon)\) before we can proceed.
\[E[\epsilon] = E[y-\hat{y}] = 0\]
\[V(\epsilon) = V(y-\hat{y}) = V(y) + V(\hat{y})\]
(The covariance term vanishes because a new observation \(y\) is independent of \(\hat{y}\), which is a function of the original sample.)
\[= \sigma^2 + \sigma^2\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) = \sigma^2\left(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)\]
So the CI for \(y\)
\[\left( \hat{y}_0 - t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}})}, \hat{y}_0 +t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}})} \right)\]
This is a confidence interval for the predicted value of \(y\) at a particular point \(x_0\).
If we want the average confidence interval over a set of \(m\) points, then we take the average of CIs,
\[\frac{1}{m} \sum_{i = 1}^m y_i - \frac{1}{m} \sum_{i = 1}^m \hat{y}_i\]
(Need to clarify the above, it was presented as an aside.)
\(SS_T = SS_{Res} + SS_R\), ie
\[\sum_{i=1}^n (y_i - \overline{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \overline{y})^2\]
from the first homework.
Define \(R^2 = \frac{SS_R}{SS_T} = 1-\frac{SS_{Res}}{SS_T}\)
Then \(R^2\) quantifies the proportion of variation in \(y\) explained by variation in \(x\).
If our model results in zero errors, then \(SS_{Res} = 0\), so \(R^2 = 1\).
If our \(\hat{\beta}_1 = 0\) then \(\hat{y}_i = \bar{y}\) so \(SS_R = 0\) and \(R^2 = 0\).
See D2L for some introductory code on manually estimating OLS, as well as the lm() function.
## purity hydro
## 1 86.91 1.02
## 2 89.85 1.11
## 3 90.28 1.43
## 4 86.34 1.11
## 5 92.58 1.01
## 6 87.33 0.95
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -215.98 -50.68 28.74 66.61 106.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2627.822 44.184 59.48 < 2e-16 ***
## x -37.154 2.889 -12.86 1.64e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared: 0.9018, Adjusted R-squared: 0.8964
## F-statistic: 165.4 on 1 and 18 DF, p-value: 1.643e-10
## [1] 1106.559
## [1] -41112.65
## [1] -37.15359
## [1] 1693738
## [1] 166254.9
## [1] 2.889107
## [1] 59.47464
Above are two confidence bands. The wider blue band is for \(y_i\). The narrower red band is for \(\hat{y}_i\). More on this below.
In R, we can find a quantile of a T-distribution using qt():
\(qt(0.025, n-2) = -t_{0.05/2, n-2}\)
\[a = \hat{\beta}_1 + qt(0.025, n-2)s.e.(\hat{\beta}_1)\]
\[b = \hat{\beta}_1 + qt(0.975, n-2)s.e.(\hat{\beta}_1)\]
So our confidence interval for \(\beta_1\) is \((a,b)\).
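A sketch comparing this manual interval with confint(); the objects fit, x, and y are illustrative.

```r
fit   <- lm(y ~ x)
n     <- length(y)
b1    <- coef(fit)["x"]
se_b1 <- summary(fit)$coefficients["x", "Std. Error"]

a <- b1 + qt(0.025, n - 2) * se_b1   # qt(0.025, .) is negative
b <- b1 + qt(0.975, n - 2) * se_b1
c(a, b)
confint(fit, "x", level = 0.95)      # same interval
```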
We can plot confidence bands for \(E[Y|X = x]\) and for \(Y|X = x\). The former takes into account uncertainty due to our estimators; the latter additionally takes into account the uncertainty due to \(\epsilon\).
CI of \(E[Y|X = x]\):
\[\left( \hat{y}_0 - t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}, \hat{y}_0 +t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \right)\]
CI of \(Y|X = x\):
\[\left( \hat{y}_0 - t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}, \hat{y}_0 +t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \right)\]
This is an important distinction to make. Uncertainty about the expected value of y is due to uncertainty about the parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that are used to generate an estimate of it. So the narrow bands around the estimated expected value of \(Y\) are due to the variance of the OLS estimators.
On the other hand, uncertainty about the random variable \(Y\) is due to the above uncertainty, plus the inherent variance of \(Y\) due to the \(\epsilon\) term in our original specification. See the October 13 lecture notes for derivation of the different variances of these two objects.
With our OLS estimates, we can compute \(\hat{y}\) at some point \(x_0\). Let’s say \(x_0 = 30\).
But if we are going to make estimations outside of our observed data range, we should be careful that the linear relationship still holds.
In the above graph, assume we only observe data with \(x\) values below 20. We risk making bad predictions if we try to fit values of \(y\) for hypothetical values of \(x\) above 20 using the black OLS fit.
Some help with plotting confidence and prediction intervals in R.
Remember that when using the predict() function, use the interval argument to stipulate which type of interval you want.
Be sure to sort your data so that if you do a line plot, it’s connecting the dots in the proper order.
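A sketch of how those pieces fit together; the data and object names are illustrative, not the lecture's code.

```r
fit  <- lm(y ~ x)
newd <- data.frame(x = sort(x))       # sorted so lines() connects the dots in order

ci <- predict(fit, newdata = newd, interval = "confidence")   # band for E[Y | x]
pi <- predict(fit, newdata = newd, interval = "prediction")   # band for a new Y

plot(x, y)
abline(fit)
lines(newd$x, ci[, "lwr"], col = "red");  lines(newd$x, ci[, "upr"], col = "red")
lines(newd$x, pi[, "lwr"], col = "blue"); lines(newd$x, pi[, "upr"], col = "blue")
```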
The range of your regressors. There is nothing to suggest that the functional specification that appears to match the existing data will also match new data.
Disposition of \(X\).
Outliers. Certain data points will have a lot of leverage on the parameter estimates.
Relation is not causation.
Regression through the origin. There may be some theory that tells us we should expect the functional relationship to have no intercept term.
\(Y = \beta_0 + \beta_1X + \epsilon\). In other words, \(Y \sim N(\beta_0 + \beta_1X, \sigma^2)\).
So the likelihood of our data set is
\[ L(y_i, x_i, \beta_0,\beta_1,\sigma^2) = \prod_{i=1}^n(2 \pi \sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1x_i)^2\right]\]
\[ = (2 \pi \sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i)^2 \right]\]
As in OLS, we are trying to maximize an objective function with respect to our \(\beta\) parameters. We can denote our estimates \(\tilde{\beta}\) to distinguish them from the OLS estimators \(\hat{\beta}\).
First, we can log transform the likelihood function. Its maximum will be preserved under a monotonic transformation.
\[\ln L(y_i, x_i, \beta_0,\beta_1,\sigma^2) = -\frac{n}{2}\ln 2 \pi - \frac{n}{2} \ln \sigma^2 - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i)^2 \]
So our estimators will satisfy
\[ \frac{∂ \ln L}{∂ \beta_0} = \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^n (y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i) = 0\]
\[ \frac{∂ \ln L}{∂ \beta_1} = \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^n (y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)x_i = 0\]
\[ \frac{∂ \ln L}{∂ \sigma^2} = -\frac{n}{2\tilde{\sigma}^2} + \frac{1}{2\tilde{\sigma}^4}\sum_{i=1}^n (y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)^2 = 0\]
If these first-order necessary conditions for likelihood maximization look familiar, it’s because they’re identical to the FONC for OLS, so \(\tilde{\beta} = \hat{\beta}\), although \(\tilde{\sigma}^2 ≠ \hat{\sigma}^2\).
So to compare and contrast the two methods:
Under OLS, we only need \(E[\epsilon] = 0\) and \(V(\epsilon) = \sigma^2\). For ML, we need \(\epsilon \sim N(0,\sigma^2)\).
The \(\beta\) estimates are unbiased under both methods.
Under ML, \(\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)^2\), which is biased. The OLS-based estimate was \(\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2\). However, both estimators are consistent.
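The two variance estimates differ only in the divisor, which is easy to see numerically. A sketch with illustrative object names.

```r
fit <- lm(y ~ x)
e   <- resid(fit)
n   <- length(e)

sigma2_ml  <- sum(e^2) / n         # ML estimate (biased downward)
sigma2_ols <- sum(e^2) / (n - 2)   # unbiased estimate
c(sigma2_ml, sigma2_ols)
```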
In addition to estimating the \(\beta\) coefficients, we are interested in the correlation coefficient, \(\rho\), between \(X\) and \(Y\).
\[\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X) \mathrm{Var}(Y)}}\]
\[ = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{Var}(X) \mathrm{Var}(Y)}}\]
Our estimator for \(\rho\) is
\[r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\]
And we know
\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{r\sqrt{S_{xx}SS_T}}{S_{xx}} = r\sqrt{\frac{SS_T}{S_{xx}}}\]
Building on the construction of \(\rho\) from Tuesday.
\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{r\sqrt{S_{xx}SS_T}}{S_{xx}} = r\sqrt{\frac{SS_T}{S_{xx}}}\]
\[r = \frac{\sqrt{S_{xx}}}{\sqrt{SS_T}}\hat{\beta}_1\]
\[ r^2 = \frac{S_{xx}}{SS_T}\left(\hat{\beta}_1 \right)^2\]
\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]
So,
\[r^2 = \frac{SS_{R}}{SS_T}\]
So \(R^2\) is an estimator for \(\rho^2\). We are interested in testing whether a data set has \(\rho= 0\). In other words, we want to test a null hypothesis that no variation in the response variable is associated with variation in the covariate.
The test statistic is
\[t_0 = r\frac{\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}\]
where
\[r = \sqrt{\frac{S_{xx}}{SS_T}}\hat{\beta}_1\]
If \(n\) is large (greater than 50 or so),
\[ z = \frac{1}{2}\ln\frac{1+r}{1-r} \hspace{5mm} \stackrel{D}{\longrightarrow} \hspace{5mm} N\left(\frac{1}{2}\ln\frac{1+\rho}{1-\rho}, \frac{1}{n-3} \right)\]
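In R, cor.test() carries out the \(t\) test of \(H_0: \rho = 0\) described above. A sketch, assuming data vectors x and y.

```r
r  <- cor(x, y)
n  <- length(x)
t0 <- r * sqrt(n - 2) / sqrt(1 - r^2)   # manual test statistic

t0
cor.test(x, y)                          # reports the same t statistic and p-value
```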
Because
\[\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X) \mathrm{Var}(Y)}}\]
\(X\) and \(Y\) are random variables under our model specification.
When we write the relationship as
\[Y|X=x \sim N(\beta_0 + \beta_1x,\sigma^2)\]
we assume that the \(X\) values are fixed. But if we want to treat both as random variables, then we need to estimate the joint pdf of \(X,Y\):
\[f_{X,Y}(x,y) = f_{Y|X}(y|x)f_X(x) \]
which comes from basic conditional probabilities, eg.
\[P(A|B)P(B) = P(A \cap B)\]
If \(X,Y\) are both normally distributed,
\[ f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp \left\{ -\frac{1}{2(1-\rho^2)}\left[ \left(\frac{y-\mu_1}{\sigma_1} \right)^2 + \left(\frac{x-\mu_2}{\sigma_2} \right)^2 - 2\rho\left(\frac{y-\mu_1}{\sigma_1} \right)\left(\frac{x-\mu_2}{\sigma_2} \right)\right] \right\}\]
If we want, we can calculate \(E[Y|X]\) and another value or two (see video around 58 min).
Let’s extend our model from Chapter 2 to estimate a relationship specified as
\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]
Doing this expands our possible hypothesis tests. We could test whether certain parameters are equal to each other; whether multiple parameters are jointly zero, and more. First, some matrix algebra.
Notation:
\(\mathbf{A}, \mathbf{B}\) - Matrices
\(a, b\) - Scalars
\(X, Y, Z\) - Random Variables
\(\overline{X}, \overline{Y}, \overline{Z}\) - Random Vectors
\(a_{1,3}\) - Element in row 1, column 3 of matrix \(\mathbf{A}\).
\(\mathbf{A_{11}}\) - for block matrices
\(\mathbf{A} = (\vec{a}_1, \vec{a}_2, ..., \vec{a}_n)\).
If there is a column \(\vec{a}_i\) of \(\mathbf{A}\) that can be written as a linear combination of the other columns,
\[\vec{a}_i = \sum_{j \ne i} \alpha_j \vec{a}_j\]
then \(\vec{a}_i\) is not linearly independent of the others. So the rank of a matrix is at most \(n\).
Posted on D2L Friday, November 6, will be due Friday, November 13.
The midterm will be focused on data analysis and writing a professional report. The report should tell a story about the data, explain the model specification, and report and interpret the results qualitatively.
Raw R code and output should be saved for the appendix.
Homework 3 posted Thursday, October 29, due Thursday, November 5. Possibly homework 4 concurrent with midterm (not decided yet).
The rank of \(\mathbf{A}\) is the number of linearly independent columns of \(\mathbf{A}\).
An identity matrix is a matrix with 1s on the diagonal, zero otherwise.
\[ \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\]
\(\mathbf{A}^T\) interchanges the row and column indices of \(\mathbf{A}\). It is the case that \((\mathbf{A}^T)^T = \mathbf{A}\), and \((\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T\).
If \(\mathbf{A}^T = \mathbf{A}\), then \(\mathbf{A}\) is a symmetric matrix.
An idempotent matrix satisfies \(\mathbf{A}\mathbf{A} = \mathbf{A}\).
Orthonormal matrix: A square matrix \(\mathbf{A}_{k \times k}\) is orthonormal if \(\mathbf{A}^T \mathbf{A} = \mathbf{I}\). If \(\mathbf{A}\) is orthonormal, then \(\mathbf{A}^{-1} = \mathbf{A}^T\).
Quadratic form of matrix: If we have
\[y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \hspace{5mm} \mathbf{A}_{n\times n} \hspace{5mm} y^T\mathbf{A}y = \sum_{i = 1}^n\sum_{j = 1}^n a_{ij}y_iy_j\]
a scalar.
If \(\mathbf{A}^T = \mathbf{A}\) and \(y^T \mathbf{A} y ≥ 0\) for any \(y ≠0\), then we call \(\mathbf{A}\) a positive semidefinite matrix.
\[ \mathrm{tr}(\mathbf{A}_{k \times k}) = \sum_{i = 1}^k a_{ii} \]
Rank of an idempotent matrix equals its trace.
Matrix partition. If we have \(\mathbf{X}_{n\times p}\), we can write it as
\[\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \mathbf{X}_2 \end{bmatrix} \]
Algebra on partitioned matrices:
\[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\begin{bmatrix} \mathbf{X}_1 \mathbf{X}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \mathbf{X}_2 \end{bmatrix} \]
\[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_1 = \mathbf{X}_1 \]
\[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_2 = \mathbf{X}_2 \]
\[ \mathbf{X}'\mathbf{X} = \begin{bmatrix}\mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}\]
The inverse is listed in appendix C.2.1 (p 579).
For constants \(\mathbf{A}_{k\times k}\) and \(\mathbf{a}_{k \times 1}\), and variable vector \(\mathbf{y}_{k \times 1}\),
\[\frac{∂\mathbf{a}'\mathbf{y}}{∂\mathbf{y}} = \mathbf{a} \hspace{1cm} \frac{∂\mathbf{y}'\mathbf{y}}{∂\mathbf{y}}= 2\mathbf{y} \hspace{1cm} \frac{∂\mathbf{a}'\mathbf{A}\mathbf{y}}{∂\mathbf{y}} = \mathbf{A}'\mathbf{a} \hspace{1cm} \frac{∂\mathbf{y}'\mathbf{A}\mathbf{y}}{∂\mathbf{y}} = \mathbf{A}\mathbf{y}+ \mathbf{A}'\mathbf{y}\]
\[E[\mathbf{y}] = \mathbf{\mu}_{n\times 1} \hspace{1cm} V(\mathbf{y}) = \begin{bmatrix} V(\mathbf{y}_1) & Cov(\mathbf{y}_1,\mathbf{y}_2) & \cdots & Cov(\mathbf{y}_1, \mathbf{y}_n) \\ Cov(\mathbf{y}_2, \mathbf{y}_1)& V(\mathbf{y}_2) & \cdots & Cov(\mathbf{y}_2, \mathbf{y}_n) \\ \cdots & \cdots & \cdots & \cdots \\ Cov(\mathbf{y}_n, \mathbf{y}_1) & Cov(\mathbf{y}_n, \mathbf{y}_2) & \cdots & V(\mathbf{y}_n) \end{bmatrix}\]
\[\mathbf{U} \sim \chi^2_{p, \lambda} \hspace{1cm} \lambda = \mu'\mathbf{A}\mu\]
Example: If \(\mu = 0\), then \(\mathbf{y}\) is centered, so \(\lambda\) will also be zero.
Example: If \(\mathbf{V} = \sigma^2\mathbf{I}\) and \(\mathbf{A}\) is idempotent of rank \(p\),
\[\frac{\mathbf{U}}{\sigma^2} \sim \chi^2_{p, \lambda} \]
The latter will return for hypothesis testing.
\[ W_{m\times 1} = B_{m\times n}y_{n\times 1} \hspace{1cm} U_{1\times 1} = y'_{1\times n}A_{n\times n}y_{n\times 1} \]
Then \(U,W\) are independent if \(\mathbf{B}\mathbf{V}\mathbf{A} = 0\)
§ 3.1 - Introduction
Our new model, building on the simple model from Ch 2 is
\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]
We also have the possibility of regressing on variable transformations and interactions, eg
\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_1^2 + \beta_3x_2 + \beta_4x_1x_2 + \epsilon\]
Our inference also has additional possibilities. We can test multiple parameters simultaneously.
§ 3.2 - Estimation
Assumptions: no interaction terms or variable transformations. \(E[\epsilon] = \vec{0}, V(\epsilon) = \sigma^2\mathbf{I}_n\).
We can write our model in matrix notation
\[\begin{bmatrix}y_1 \\y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix}1 & x_{11} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}\]
Our sum of squares objective function is identical to the simple case, and we are again looking for
\[\frac{∂S(\beta)}{∂\beta} = 0 \]
Restate our multiple linear model:
\[Y = X\beta + \epsilon \]
Where now, \(X\) represents an \(n\times (p+1)\) matrix of covariates,
Ge’s notation is \(\beta_0 ... \beta_p\) with \(p+1\) parameters.
Textbook is \(\beta_0 ... \beta_k\) with \(\overline{p} = k+1\) parameters.
Check out the confidenceEllipse() function.
To perform OLS on matrices, our objective function is
\[S(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta \]
\[\frac{∂S}{∂\beta} = -2X'Y + 2X'X\beta \stackrel{\mathrm{set}}{=} 0 \]
\[(X'X)^{-1}X'X\beta = (X'X)^{-1}X'Y \]
\[ \hat{\beta} = (X'X)^{-1}X'Y\]
This estimator behaves just like the simple version in terms of calculating point estimates for \(\hat{Y} = X\hat{\beta}\).
We call the matrix \(X(X'X)^{-1}X' = H\) (I think this is the ‘projection’ matrix from Greene). Then
\[e = Y - \hat{Y} = Y- HY = (\mathbf{I}-H)Y\]
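These matrix formulas can be evaluated directly in R and checked against lm(). A sketch with illustrative regressors x1 and x2 and response y.

```r
X <- cbind(1, x1, x2)                  # n x (p+1) design matrix
Y <- y

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% Y        # (X'X)^{-1} X'Y
H        <- X %*% XtX_inv %*% t(X)        # hat matrix
e        <- (diag(length(Y)) - H) %*% Y   # residuals (I - H)Y

cbind(beta_hat, coef(lm(y ~ x1 + x2)))    # the two sets of estimates agree
```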
The observed data \(Y\) is an n-dimensional vector, so it is in \(\mathbb{R}^n\).
The covariates consist of \(k\) n-dimensional vectors, typically with \(k < n\). There is no guarantee that \(Y\) can be replicated as a linear combination of the vectors in \(X\). The columns of \(X\) span at most a \(k\)-dimensional subspace of \(\mathbb{R}^n\) (less in the case of linear dependence). \(X\beta\) is a vector in the ‘estimation space’, which may get close to \(Y\).
\[SS_{Res} = (Y-X\hat{\beta})'(Y-X\hat{\beta}) \]
\[= Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} = Y'Y - 2\hat{\beta}'X'Y +\hat{\beta}'X'Y = \boxed{Y'Y - \hat{\beta}'X'Y} \]
using the normal equations \(X'X\hat{\beta} = X'Y\) in the second step.
The unbiased estimate for \(\sigma^2\) is \(SS_{Res}/(n-p-1)\); under MLE it is divided by \(n\) instead.
We have a multiple linear regression model,
\[Y = X\beta + \epsilon \hspace{1cm} Y \in \mathbb{R}^n, \hspace{2mm} X_{n \times (p+1)}, \hspace{2mm} \beta \in \mathbb{R}^{p+1}\]
We want to test whether there is a linear relationship between the response variable and the regressors. \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\), \(H_1:\) at least one \(\beta_j ≠0\) for \(j = 1, 2, ... p\). To test this, we will use the sum of squares values used in the construction of the ANOVA table.
\[SS_T = (Y-\overline{Y})'(Y- \overline{Y}) \hspace{1cm} SS_{Res} = Y'Y - \hat{\beta}'X'Y \hspace{1cm} SS_R = \hat{\beta}'X'Y - \frac{1}{n}Y'\mathbf{1}_{n\times 1}\mathbf{1}_{n\times 1}'Y\]
With those values in hand, we can write that
\[\frac{SS_{Res}}{\sigma^2} \sim \chi_{n-p-1}^2 \hspace{1cm} \frac{SS_R}{\sigma^2} \sim \chi_p^2 \]
And we can generate a test statistic,
\[F_0 = \frac{SS_R/p}{SS_{Res}/(n-p-1)} = \frac{MS_R}{MS_{Res}} \sim F_{p,\hspace{1mm} n-p-1}\]
And we can proceed to test this using any desired significance level. Similar to the simple case,
\[E[MS_{Res}] = \sigma^2 \hspace{5mm}\Rightarrow \hspace{5mm} MS_{Res} = \hat{\sigma}^2 \]
\[E[MS_R] = \sigma^2 + \frac{\beta_c'X_c'X_c\beta_c}{p\sigma^2} \]
Where \(\beta_c\) is the \(\beta\) vector minus the constant term, and \(X_c\) is the data matrix without the column of ones, and where every element has its variable mean subtracted from it \((x_{ij} - \overline{x}_j)\)
Then, in the ANOVA table, we have the values shown in Ch 3.3 - Table 3.5 (p 85).
The critical value of an F test depends on the number of parameters and the number of observations.
As in the simple case,
\[R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T} \]
In the multiple regression case, we have \(\hat{Y} = X\hat{\beta}\). We can define \(Y^* = Y-\hat{Y} = Y - X\hat{\beta}\). If we find another set of regressors \(X^*\), we can fit a new model \(Y^* = X^*\beta^* + \epsilon^*\).
The \(SS_T\) of the second model will be equal to the \(SS_{Res}\) from the first model:
\(SS_{Res} = SS_R^* + SS_{Res}^*\)
So we are regressing the residuals from the original model on a new set of regressors.
The total regression sum of squares explained across the two stages is then \(SS_R + SS_R^*\).
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 115.07 115.07 14.119 0.000877 ***
## Residuals 26 211.90 8.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 76.193 76.193 7.8998 0.009272 **
## Residuals 26 250.771 9.645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x1 1 115.068 115.068 22.378 7.492e-05 ***
## x2 1 83.343 83.343 16.208 0.0004635 ***
## Residuals 25 128.553 5.142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 76.193 76.193 14.818 0.0007288 ***
## x1 1 122.218 122.218 23.768 5.148e-05 ***
## Residuals 25 128.553 5.142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova Table (Type II tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## x1 122.218 1 23.768 5.148e-05 ***
## x2 83.343 1 16.208 0.0004635 ***
## Residuals 128.553 25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova Table (Type II tests)
##
## Response: y
## Sum Sq Df F value Pr(>F)
## x2 83.343 1 16.208 0.0004635 ***
## x1 122.218 1 23.768 5.148e-05 ***
## Residuals 128.553 25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
You can see in the table above that the Sum Sq values for a given variable are different depending on what order the covariates appear in the regression. It looks like anova() evaluates it as:
\[\begin{array}{ll} \\ x_1 & SS_R(\beta_1) \\ x_2 & SS_R(\beta_2|\beta_1) \\ x_3 & SS_R(\beta_3|\beta_1,\beta_2) \\ \cdots & \cdots \\ x_p & SS_R(\beta_p|\beta_1,\beta_2, ... \beta_{p-1}) \\ \end{array} \]
Meanwhile, the Anova() function evaluates it as:
\[\begin{array}{ll} \\ x_1 & SS_R(\beta_{-1}) \\ x_2 & SS_R(\beta_{-2}) \\ x_3 & SS_R(\beta_{-3}) \\ \cdots & \cdots \\ x_p & SS_R(\beta_{-p}) \\ \end{array} \]
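A hedged sketch of the calls that produce tables like the ones above; the data frame dat with columns y, x1, x2 is an assumption, not the lecture's dataset.

```r
library(car)      # provides Anova()

fit <- lm(y ~ x1 + x2, data = dat)

anova(fit)   # sequential (Type I) SS: SS_R(beta1), then SS_R(beta2 | beta1)
Anova(fit)   # partial (Type II) SS: each term conditioned on all the others
```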
Because \(R^2\) is increasing as we add additional regressors, we can construct a measure of model fit that penalizes additional regressors:
\[\overline{R}^2 = 1- \frac{SS_{Res}/(n-p-1)}{SS_T/(n-1)} \]
For \(H_0: \beta_j = 0\) vs \(H_1: \beta_j ≠ 0\), we can use a \(t\) statistic with the normal confidence interval interpretation:
\[t = \frac{\hat{\beta}_j}{\sqrt{\sigma^2c_{jj}}} = \frac{\hat{\beta}_j}{s.e.(\hat{\beta}_j)} \sim t_{n-p-1} \hspace{1cm} C = (X'X)^{-1}\]
We can also test individual coefficients using \(F\) statistics. If we want to test \(H_0: \beta_2 = 0\) in a model of the form \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\), we essentially have a ‘full’ (unrestricted) model and a ‘reduced’ model (restricted).
Full: \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\)
Reduced: \(Y = \beta_0 + \beta_1x_1 + \epsilon\)
For the full model,
\[MS_{Res} = \frac{Y'Y - \hat{\beta}'X'Y}{n-p-1} \]
For the reduced model,
\[SS_R(\beta_1) = \hat{\beta}_1'X_1'Y \]
So
\[SS_R(\beta_2 | \beta_1) = SS_R(\beta_1,\beta_2) - SS_R(\beta_1)\]
where \(SS_R(\beta_1,\beta_2)\) is the \(SS_R\) from the full model.
We can interpret this as saying that \(SS_R(\beta_2 | \beta_1)\) measures how much \(x_2\) contributes to the fit of the model, i.e., the amount by which \(SS_R\) increases when \(x_2\) is added to a model with only \(x_1\).
So we want to assess whether \(SS_R(\beta_2 |\beta_1)\) is big or small.
\[F = \frac{SS_R(\beta_2 |\beta_1)/r}{MS_{Res}} \sim F_{r, \hspace{1mm} (n-p-1)} \]
Where \(r\) is the number of restrictions being tested. This means that the number of variables in the restricted model is \(p-r\).
\(SS_R(\beta_2 |\beta_1)\) has \(r\) dof.
\(SS_R(\beta_1, \beta_2)\) has \(p\) dof.
\(SS_R(\beta_1)\) has \(p-r\) dof.
So if we have a large value of \(F\), it indicates that it is likely that at least one of the parameters in \(\beta_2≠0\).
We can run this test on any subset of our \(p\) regressors. For example, we could test a single coefficient by finding \(SS_R(\beta_j|\beta_{-j})\). In this case, variable \(j\) is the ‘last’ variable added to the model, and we are testing to see whether it adds significantly to the regression sum of squares. This is the value being reported in the last row of the anova() output.
GZ is clarifying the difference between the anova() function in base R and the Anova() function from the car package.
anova() conditions the \(SS_{R}\) in each row on all the variables in rows above it. So the sum of the \(SS_R\) values will equal the entire model’s \(SS_R\).
Anova() conditions the \(SS_{R}\) in each row on all other variables. So the sum of the \(SS_R\) values will exceed the entire model’s \(SS_R\).
We have considered a number of tests:
Significance of regression using the \(F\) statistic
Tests on individual coefficients, and subsets of coefficients (\(t\) and \(F\) statistics)
Today we will look at general linear hypothesis testing. Our model is
\[Y = X\beta + \epsilon \]
Our testing problem is \(H_0: T\beta = c\) vs \(H_1: T\beta ≠ c\).
Recall, \(\beta\) is a vector of length \(p+1\), where \(p\) is the number of regressors.
In this test, \(T\) is an \(m \times (p+1)\) matrix, so \(c\) will be a vector of length \(m\).
This method allows us to test hypotheses like \(H_0: \beta_1 = \beta_3\). If we have a four-regressor model
\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon \tag{FM}\]
Then we can write the hypothesis above as \(H_0: T\beta = 0\) where \(T = \begin{bmatrix}0 & 1 & 0 & -1 & 0\end{bmatrix}\).
We can also test multiple hypotheses simultaneously:
\[H_0: \beta_2 = \beta_4, \hspace{3mm} \beta_3 = 0 \hspace{1cm} H_1: \beta_2 ≠ \beta_4, \hspace{3mm} \beta_3 ≠ 0\]
In which case we would need \(T\) to be a \(2\times 5\) matrix.
\[T\beta = \begin{bmatrix}0&0&1&0&-1 \\ 0&0&0&1&0\end{bmatrix}\beta = c = \begin{bmatrix}0\\0\end{bmatrix} \]
Under these general hypothesis tests, \(T\) has \(r\) linearly independent rows, where \(r\) is the number of restrictions.
If (FM) is the full model, we could write the restricted model as
\[ y = \beta_0 + \beta_1x_1 + \beta_2(x_2+x_4) + \epsilon = Z\gamma + \epsilon \tag{RM}\]
Then \(SS_{Res}(RM) = Y'Y - \hat{\gamma}'Z'Y\), which has \((n-p+r-1)\) dof.
Here \(r\) is the number of linearly independent restrictions, i.e., the number of rows of \(T\).
We can calculate \(SS_H = SS_{Res}(RM) - SS_{Res}(FM)\), which has \(r\) dof (\((n-p+r-1)\) and \(n-p-1\) dof, respectively).
We test this using an \(F\) statistic, calculated as
\[F = \frac{SS_H/r}{SS_{Res}(FM)/(n-p-1)} \sim F_{r,\hspace{1mm} (n-p-1)}\]
Then we evaluate this statistic to test our hypothesis, which is the original set of restrictions.
Save the full and reduced models as lm objects, then run anova() on the pair:
## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2 + x3 + x4
## Model 2: y ~ x1 + x2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 117.83
## 2 25 128.55 -2 -10.725 1.0467 0.3672
linearHypothesis() can test other restrictions:
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = table.b1)
##
## Coefficients:
## (Intercept) x1 x2 x3 x4
## 0.516174 0.005898 0.003477 -0.337864 -0.005721
## Linear hypothesis test
##
## Hypothesis:
## - x1 + 2 x2 = 0
##
## Model 1: restricted model
## Model 2: y ~ x1 + x2 + x3 + x4
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 119.12
## 2 23 117.83 1 1.2938 0.2526 0.6201
We can also define a test matrix:
## Linear hypothesis test
##
## Hypothesis:
## x1 - 2 x2 = 0
##
## Model 1: restricted model
## Model 2: y ~ x1 + x2 + x3 + x4
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 119.12
## 2 23 117.83 1 1.2938 0.2526 0.6201
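A sketch of the calls behind output like the above, assuming the model is fit to the table.b1 data (from the MPV package) as in the lm() call shown earlier; the object names are illustrative.

```r
library(car)      # provides linearHypothesis()

full    <- lm(y ~ x1 + x2 + x3 + x4, data = table.b1)
reduced <- lm(y ~ x1 + x2,           data = table.b1)

anova(full, reduced)                      # F test of the joint restriction

linearHypothesis(full, "x1 - 2*x2 = 0")   # restriction written as a string
Tmat <- rbind(c(0, 1, -2, 0, 0))          # or as a test matrix T with c = 0
linearHypothesis(full, hypothesis.matrix = Tmat, rhs = 0)
```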
\[\frac{\hat\beta_j - \beta_j}{\sqrt{\hat{\sigma}^2C_{jj}}} \sim t_{n-p-1}\]
So a \(100(1-\alpha)\%\) CI for \(\beta_j\) is \(\hat\beta_j \pm t_{\frac{\alpha}{2},n-p-1}\,se(\hat{\beta}_j)\)
\[V(\hat{Y}_0) = \sigma^2x_0'(X'X)^{-1}x_0\tag{3.48}\]
So the confidence interval on \(E[Y|x_0]\) is
\[\hat{Y}_0 \pm t_{\frac{\alpha}{2},n-p-1}\sqrt{\hat{\sigma}^2x_0'(X'X)^{-1}x_0} \tag{3.49}\]
And the prediction interval on \(Y|x_0\) is
\[\hat{Y}_0 \pm t_{\frac{\alpha}{2},n-p-1}\sqrt{\hat{\sigma}^2(1+x_0'(X'X)^{-1}x_0)}\tag{3.54} \]
Some useful functions are confint() and confidenceEllipse().
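A minimal sketch of both; fit is an illustrative lm object with at least two regressors.

```r
library(car)      # provides confidenceEllipse()

confint(fit, level = 0.95)                    # marginal CIs for each coefficient
confidenceEllipse(fit, which.coef = c(2, 3))  # joint 95% region for two coefficients
```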
When we conduct \(t\) tests on individual coefficients, we can’t simply combine them. E.g., if we get a 95% CI on \(\beta_1\) and another on \(\beta_2\), we might say that the confidence that both are in their respective intervals is \(0.95^2\), but this assumes that they are independent. They aren’t independent, because they are estimated from the same data and have nonzero covariance.
We can construct a joint confidence region
\[\frac{(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)}{(p+1)MS_{Res}}\sim F_{(p+1); (n-p-1)}\]
Our model specification is \(Y = X\beta + \epsilon\), with \(p+1\) parameters. We have already seen that if we want an individual parameter CI, we use
\[\hat{\beta}_j \pm t_{\frac{\alpha}{2},(n-p-1)}s.e.(\hat{\beta}_j)\]
If we want a 95% CI on \(\beta_j\), then our critical value of the \(t\) statistic is \(t_{0.025, (n-p-1)}\).
If instead we want to find the CI of all of our parameters jointly, then we will have \(p+1\) intervals.
What is the probability that every true \((\beta_0, \beta_1, ..., \beta_p)\) is within its corresponding interval? If we assume that each parameter estimate is independently distributed, then this probability for 95% CIs is \((0.95)^{p+1}\), which can become small very quickly.
Even aside from that issue, the \(\beta\) estimates are not pairwise independent because \(\mathrm{Cov}(\hat{\beta}_i,\hat{\beta}_j) = \sigma^2C_{ij}≠0\).
Methods for overcoming this:
Bonferroni method: normally, our \(t\) critical value was \(t_{\frac{\alpha}{2},(n-p-1)}\). Under this method, we change it to \(t_{\frac{\alpha}{2(p+1)},(n-p-1)}\).
We switch from using the \(t\) distribution to the square root of an \(F\) statistic:
\[\{2F_{\alpha, (p+1), (n-p-1)} \}^\frac{1}{2} \]
The first degree of freedom (\(d_1\)) is increasing in \(p\), while the second (\(d_2\)) is decreasing in \(p\).
The mean of an \(F_{d_1,d_2}\) distribution is \(\frac{d_2}{d_2-2}\) (for \(d_2 > 2\)).
The variance …
Why do we need to standardize the variables? For
\[Y = \beta_0 + \beta_1x_1 + \cdots + \beta_px_p + \epsilon \]
The scale of data doesn’t matter theoretically, but numerically it can be a problem. For instance, if one regressor is in \([0,1]\) and another is in \([0,10^{12}]\), the \(\beta\) estimates may be hard to compare. Approaches:
\[z_{ij} = \frac{x_{ij} - \overline{x_j}}{s_j} \hspace{1cm} s_j = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n(x_{ij}-\overline{x_j})^2} \]
which resembles converting a normally distributed variable to \(N(0,1)\).
So each variable has its mean subtracted, then is divided by its sample standard deviation. \(y_i\) is converted to \(y_i^*\) in the same way. Then the new model is
\[Y^* = Z\beta^* + \epsilon \]
Then to convert back, \(\beta_j = \beta_j^*/s_j\). Note that this will make \(\hat{\beta}_0^* = 0\) by construction.
The regressors are rescaled as
\[y_i^* = \frac{y_i - \overline{y}}{\sqrt{SS_T}} \hspace{1cm} w_{ij} = \frac{x_{ij} - \overline{x}_j}{\sqrt{s_{jj}}} \hspace{1cm} s_{jj} = \sum\limits_{i=1}^n(x_{ij} - \overline{x}_j)^2 \]
Then
\[\sqrt{\sum\limits_{i=1}^n(w_{ij} - \overline{w}_j)^2} = 1\]
And \(\overline{w}_j = 0\). See more textbook material for intuition and motivation.
If \(E[x_i|x_j] = E[x_i]\), then there is no linear dependence between the columns of the design matrix. In this extreme case, the moment matrix \(X'X\) will be a diagonal matrix, and \((X'X)^{-1}\) will have smaller diagonal elements than it would if there were some linear dependence between regressors. In the inverse moment matrix \((X'X)^{-1}\), the diagonal elements are proportional to the variances of the coefficient estimates, which are minimized when there is no linear dependence.
If instead the columns of \(X\) feature some linear dependence, then the determinant of \((X'X)\) will shrink toward zero as the degree of linear dependence increases, inflating the elements of \((X'X)^{-1}\) and with them the values of \(\mathrm{Var}(\hat{\beta}_j)\).
So multicollinearity inflates the variance of our parameter estimates.
Recall that multicollinearity is a case where there is a high degree of linear dependence between the columns of the data matrix. Numerically, if we calculate
\[(W'W)^{-1} = \begin{bmatrix}w_{11} & w_{12} \\ w_{21} & w_{22}\end{bmatrix} \]
and find that \(w_{12},w_{21}\) are very large, this indicates that we may have multicollinearity. If we have a multiple regression model on four variables, \(x_1\) through \(x_4\), it may be the case that
\[x_1 \sim \alpha_0 + \alpha_2x_2 + \alpha_3x_3 + \alpha_4x_4 \]
GZ is suggesting that we could substitute the estimated RHS for \(x_1\) in our original model.
In the plots above, we have two regressors \(x_1,x_2\) for a response variable. A plane fit to the data on the left will be ‘stable’, meaning that the variance of the regression coefficient estimates will be relatively small. For the data on the right, one can imagine that the regression plane fit to this data could ‘hinge’ on the data more easily. This has the effect of inflating the variance of the coefficient estimates.
If the columns of our data matrix \((W)\) are close to orthogonal, then the off-diagonal elements of \((W'W)^{-1}\) will be small.
\[E[x_j|x_{-j}] = E[x_j]: (W'W)^{-1} = \begin{bmatrix}w_{11} & 0 \\0 & w_{22}\end{bmatrix}\]
\[E[x_j|x_{-j}] ≠ E[x_j]: (W'W)^{-1} = \begin{bmatrix}w_{11} & w_{12} \\ w_{21} & w_{22}\end{bmatrix}\]
One way of assessing the degree of multicollinearity is by regressing each covariate on all the others, and calculating the \(R^2\). Then we can calculate a variance inflation factor,
\[VIF_j = \frac{1}{1-R_j^2} \]
Where \(VIF_j\) is the variance inflation factor of the \(j\)th regressor, and \(R_j^2\) is the \(R^2\) of the model \(x_j = \alpha_0 + \alpha_1x_1 + \cdots + \alpha_{j-1}x_{j-1} + \alpha_{j+1}x_{j+1} + \cdots + \alpha_px_p\).
We know that \(0≤R^2≤1\), so if there is strong multicollinearity, VIF can become very large. There is not a universal rule for what level of multicollinearity is acceptable, but two common rules of thumb are that for \(VIF ≥ 4\), further investigation into the nature of the data should be done, and for \(VIF ≥ 10\), there is a multicollinearity issue.
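A sketch computing a VIF both from its definition and with car::vif(); the model and data frame names are illustrative.

```r
library(car)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

r2_x1 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
1 / (1 - r2_x1)      # VIF for x1, straight from the formula above

vif(fit)             # VIFs for all regressors at once
```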
Occasionally, the estimated regression coefficient on a particular variable will be the opposite of what was expected. There are a number of possible explanations.
The ranges of some of the regressors are too small.
Important variables were omitted from the model (omitted variable bias).
Multicollinearity. Small changes to the particular sample data in a case with high multicollinearity can lead to large swings in the parameter estimates. The model has difficulty attributing association between the dependent var and the regressors.
Computational mistake
\(Y\) and \(X\) are linearly correlated.
\(E[\epsilon] = 0\).
\(V(\epsilon) = \sigma^2\).
\(\epsilon\) are uncorrelated.
\(\epsilon\) are normally distributed.
Our estimated residuals are \(e_i = y_i - \hat{y}_i\). By construction, the residuals sum to zero, and we use them to estimate \(\sigma^2\),
\[MS_{Res} = \frac{1}{n-p} \sum\limits_{i=1}^n(e_i - \overline{e})^2 \]
To conduct residual analysis, we can scale the residuals in several ways.
Standardized residuals:
\[d_i = \frac{e_i}{\sqrt{MS_{Res}}} \]
Studentized residuals:
\[r_i = \frac{e_i}{\sqrt{MS_{Res}(1-h_{ii})}} \]
PRESS residuals:
\[e_{(i)} = \frac{e_i}{1-h_{ii}} = y_i - \hat{y}_{(i)}\]
where \(\hat{y}_{(i)}\) is the fitted value of \(y\) using all the data except for the \(i\)th observation (leave-one-out).
R-student (externally studentized) residuals:
\[t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}} \hspace{1cm} S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}\]
He cut off (d) early.
Residual Type (textbook terminology) | R Command | Comment
---|---|---
Standardized | rstandard(model, infl = influence(model1)) | Need to modify influence(model1)$hat to equal zero
Studentized | rstandard(model) |
PRESS | rstandard(model1, type="pred") |
R-Studentized | rstudent(model1) |
Normal probability plot (QQ Plot).
Residual against fitted value plot.
Residual against regressor plots.
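A sketch of how these diagnostics might be produced for a fitted lm object; fit and dat are illustrative names.

```r
fit <- lm(y ~ x1 + x2, data = dat)

qqnorm(rstudent(fit)); qqline(rstudent(fit))   # normal probability (QQ) plot
plot(fitted(fit), resid(fit))                  # residuals vs fitted values
plot(dat$x1, resid(fit))                       # residuals vs a regressor
```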
We should recall the assumptions that we have made in estimating our OLS models so far.
We have a linear relationship
The errors have zero mean
The errors have constant variance
The errors are uncorrelated
The errors are normally distributed
Those are some pretty heavy assumptions, so it is important to analyze the residuals that our model produces.
Methods of residual scaling:
Standardized
Studentized
PRESS Residuals: \(e_{(i)} = \frac{e_i}{1-h_{ii}} = y_i - \hat{y}_{(i)}\)
R-student (externally standardized)
\[t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}} \hspace{1cm} S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}\]
QQ Plot (Normal probability plot):
Residual vs \(\hat{Y}\)
Residual vs \(X_i\)
Partial regression and partial residual plots.
If we have a model with two regressors, \(x_1,x_2\), we may want to assess whether the partial relationship between \(y\) and \(x_1\) is correctly specified. Ie, is a linear relationship enough? To do this, we can use a restricted model that excludes \(x_1\):
\[\hat{y}_i(x_2) = \hat{\theta}_0 + \hat{\theta}_1x_{i2} \tag{M1}\]
We can compute the residuals from the restricted model, \(e(y|x_2) = y_i - \hat{y}_i(x_2)\).
Then we can consider a model relating our covariates:
\[\hat{x}_{i1}(x_2) = \hat{\alpha}_0 + \hat{\alpha}_1x_{i2} \tag{M2} \]
And the residuals \(e_i(x_1|x_2) = x_{i1} - \hat{x}_{i1}(x_2)\).
With these residuals in hand, we plot the residuals from M1 vs those from M2: \(e(y|x_2)\) vs \(e_i(x_1|x_2)\). If there is a linear relationship between these residuals, this indicates that a linear specification between \(y\) and \(x_1\) is sufficient in our full model, and that we do not need any transformed versions of \(x_1\), like polynomial or log-transformations.
If the plot is not linear, then we need to see why (Ch. 5).
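A sketch of the residual-vs-residual (added-variable) plot described above, built by hand and then with car::avPlots(); dat is an illustrative data frame with columns y, x1, x2.

```r
library(car)

e_y  <- resid(lm(y ~ x2, data = dat))     # e(y | x2)
e_x1 <- resid(lm(x1 ~ x2, data = dat))    # e(x1 | x2)
plot(e_x1, e_y)          # roughly linear => a linear term in x1 is adequate

avPlots(lm(y ~ x1 + x2, data = dat))      # same idea, one panel per regressor
```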
Partial residual: \(e_i^*(y|x_1) = e_i + \hat{\beta}_1x_{i1}\). (Clarify this from textbook.)
Each observation in a regression will have its own influence on the parameter estimates. If we are interested in assessing how well our model may do at predicting new observations, we will get limited insight from looking at the residuals of the full model, since each observation affects the model fit, whereas new observations would by definition have no effect on the model fit.
It can be useful to use cross-validation on the dataset and see what the resulting residuals look like. The PRESS statistic uses leave-one-out cross validation by calculating
\[\mathrm{PRESS} = \sum\limits_{i=1}^n(y_i - \hat{y}_{(i)})^2 = \sum\limits_{i=1}^n\left(\frac{e_i}{1-h_{ii}}\right)^2\]
A lower PRESS statistic indicates that the model may give more precise predictions of new data points. The PRESS statistic can be useful for model comparison: a model with a lower PRESS statistic and a lower \(R^2\) may be more useful than a model with a higher PRESS statistic and a higher \(R^2\) when an objective is to predict new observations.
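A minimal sketch of computing PRESS from a fitted lm object, using the hat values so no refitting is needed (model1 is a hypothetical fit):

```r
model1 <- lm(mpg ~ wt + hp, data = mtcars)   # hypothetical example fit

e <- resid(model1)
h <- hatvalues(model1)
PRESS <- sum((e / (1 - h))^2)    # sum of squared leave-one-out prediction errors
PRESS
```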
Briefly, cross-validation is important for ensuring that a model does not overfit the data; it is a central tool in machine learning, deep learning, and neural networks.
Cross-validation also opens up new modeling possibilities: with 5-fold cross-validation, for example, we could either fit OLS on the training data in each of the five folds and combine the residuals, or change the objective function to minimize the combined residuals, which yields a different set of parameter estimates.
## 1 2 3 4 5 6
## -1.69562881 0.35753764 -0.01572177 1.63916491 -0.13856493 -0.08873728
## 7 8 9 10 11 12
## 0.26464769 0.35938983 4.31078012 0.80677584 0.70993906 -0.18897451
## 13 14 15 16 17 18
## 0.31846924 0.33417725 0.20566324 -0.21782566 0.13492400 1.11933065
## 19 20 21 22 23 24
## 0.56981420 -1.99667657 -0.87308697 -1.48962473 -1.48246718 -1.54221512
## 25
## -0.06596332
## 1 2 3 4 5 6
## -1.62767993 0.36484267 -0.01609165 1.57972040 -0.14176094 -0.09080847
## 7 8 9 10 11 12
## 0.27042496 0.36672118 3.21376278 0.81325432 0.71807970 -0.19325733
## 13 14 15 16 17 18
## 0.32517935 0.34113547 0.21029137 -0.22270023 0.13803929 1.11295196
## 19 20 21 22 23 24
## 0.57876634 -1.87354643 -0.87784258 -1.44999541 -1.44368977 -1.49605875
## 25
## -0.06750861
## [1] 459.0393
We can formally test a simple model for lack of fit if we have repeat observations at one or more values of the regressor. We can decompose the residual sum of squares as
\[SS_{Res}= SS_{PE} + SS_{LOF} \]
Where PE is the pure error, and LOF is the error due to lack of fit.
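When the regressor has replicate values, this test can be obtained in R by comparing the linear fit to the saturated "pure error" model that treats the regressor as a factor. A sketch with made-up data:

```r
# Simulated data with replicate observations at each x value
set.seed(2)
x <- rep(c(1, 2, 3, 4, 5), each = 4)
y <- 2 + 0.5 * x + rnorm(length(x), sd = 0.3)

fit_linear <- lm(y ~ x)          # model under test
fit_full   <- lm(y ~ factor(x))  # pure-error model: one mean per x level

# The extra sum of squares between the fits is SS_LOF; the residual SS of the
# factor model is SS_PE. The F test below is the lack-of-fit test.
anova(fit_linear, fit_full)
```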
In our linear model
\[ Y = X\beta + \epsilon\]
recall the assumptions we have required so far:
Errors are iid: \(\epsilon \sim N(0,\sigma^2)\),
The model is correctly specified.
In many cases, these assumptions are violated, so the data must be transformed to make OLS valid.
Residual plots are a powerful tool to check whether the assumptions about the behavior of the errors are valid. If we see misbehaving residuals after fitting an OLS model, we may have issues, whether multicollinearity, heteroskedasticity, non-linear residuals, autocorrelation, etc.
A common issue with residuals is to see a ‘fan’ pattern when plotting the residuals against the fitted values. This indicates that the variance of the residual is proportional to the fitted value. For example, this is what we would expect if the response variable were a Poisson random variable (where mean = variance). A common transformation in this case is to specify the model as
\[\sqrt{Y} = X\beta + \epsilon \]
Common transformations:
Issue | Transformation |
---|---|
\(\sigma^2\propto c\) | \(y' = y\) |
\(\sigma^2 \propto E[y]\) | \(y' = \sqrt{y}\) |
\(\sigma^2 \propto E[y](1-E[y])\) | \(y' = \sin^{-1}(\sqrt{y})\) |
\(\sigma^2 \propto [E[y]]^2\) | \(y' = \ln(y)\) |
\(\sigma^2 \propto [E[y]]^3\) | \(y' = y^{-1/2}\) |
\(\sigma^2 \propto [E[y]]^4\) | \(y' = y^{-1}\) |
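For example, with a Poisson-like response the square-root transformation from the table can be applied directly. A sketch with simulated data:

```r
set.seed(3)
x <- runif(200, 1, 10)
y <- rpois(200, lambda = 2 * x)   # variance grows with the mean

fit_raw  <- lm(y ~ x)
fit_sqrt <- lm(sqrt(y) ~ x)

# Compare residual spread: the fan pattern should be much weaker after sqrt()
par(mfrow = c(1, 2))
plot(fitted(fit_raw),  resid(fit_raw),  main = "y ~ x")
plot(fitted(fit_sqrt), resid(fit_sqrt), main = "sqrt(y) ~ x")
```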
In practice, the appropriate transformation is usually identified empirically, by examining residual plots and trying candidate transformations.
In some cases, prior knowledge or theoretical models may suggest that a relationship between variables is nonlinear, but the relationship can be transformed into a linear form. For example, an exponential relationship of the form
\[y = \beta_0 e^{\beta_1 x}\epsilon \]
can be transformed into a linear relationship using a log-transformation:
\[\ln(y) = \ln(\beta_0) + \beta_1 x + \ln(\epsilon) \]
It is important to see that performing a transformation on the error will also transform its probability distribution. In this case, in order to fit OLS on the transformed model, we require that \(\ln(\epsilon) \sim N(0,\sigma^2)\). This means that \(\epsilon \sim \mathrm{Lognormal}(0, \sigma^2)\).
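A short sketch of fitting the exponential model on the log scale (the data and coefficient values here are made up):

```r
set.seed(4)
x <- runif(100, 0, 5)
y <- 2 * exp(0.8 * x) * rlnorm(100, meanlog = 0, sdlog = 0.2)  # multiplicative lognormal error

fit_log <- lm(log(y) ~ x)
coef(fit_log)            # intercept estimates ln(beta_0); slope estimates beta_1
exp(coef(fit_log)[1])    # back-transform the intercept to the scale of beta_0
```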
Some common nonlinear relationships
Relationship | Transformation |
---|---|
\(y = \beta_0x^{\beta_1}\) | \(y' = \ln(y), x' = \ln(x)\) |
\(y=\beta_0 e^{\beta_1 x}\) | \(y' = \ln(y)\) |
\(y = \beta_0 + \beta_1 \log(x)\) | \(x' = \log(x)\) |
\(y = \frac{x}{\beta_0 x - \beta_1}\) | \(y' = y^{-1}, x' = x^{-1}\) |
In cases where there is a theoretical model specification like in the table above, we can use that. But in the case where we are looking for an empirical non-linear specification, it is important to have an analytical procedure that gives an optimal functional specification relating \(y\) and \(x\).
For example, say we have a relationship that appears to be \(y = x^\alpha\), but \(\alpha\) is unknown.
Box-Cox method: make a variable transformation on \(y\):
\[y^{(\lambda)} = \left\{\begin{array}{lr} \frac{y^\lambda - 1}{\lambda\dot{y}^{\lambda-1}},& \lambda ≠ 0 \\ \dot{y}\ln(y), & \lambda = 0 \end{array}\right. \hspace{1cm} \dot{y} = \ln^{-1}\left[1/n\sum_{i=1}^n \ln(y_i)\right]\]
With this transformation, we can fit the model as
\[y^{(\lambda)} = X\beta + \epsilon \]
We then estimate \(\lambda\) by maximum likelihood, finding the value of \(\lambda\) that minimizes the residual sum of squares \(SS_{Res}(\lambda)\).
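In practice this can be done with the boxcox() function in the MASS package, which profiles the log-likelihood over a grid of \(\lambda\) values for a fitted lm object (minimizing \(SS_{Res}(\lambda)\) is equivalent to maximizing this profile likelihood). The fit below is a hypothetical example:

```r
library(MASS)

fit <- lm(mpg ~ wt + hp, data = mtcars)            # hypothetical example fit
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # plots the profile log-likelihood over lambda
lambda_hat <- bc$x[which.max(bc$y)]                # lambda with the highest likelihood
lambda_hat
```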
Similarly, we can transform regressors to improve the functional specification of the model. To see why this might be useful, imagine a situation where \(y\) is linearly related to several regressors but appears to have a nonlinear relationship with another regressor. Transforming \(y\) with the Box-Cox method would disrupt the relationships with the linearly related regressors, so it may be better to transform only the regressor(s) that appear to have a nonlinear relationship.
Let the baseline model be
\[y = \beta_0 + \beta_1 x_j + \epsilon \tag{1} \]
and suppose the true relationship is actually \(y = \beta_0 + \beta_1 x_j^\alpha + \epsilon\) with \(\alpha\) unknown. A first-order expansion of \(x_j^\alpha\) about \(\alpha = 1\) gives \(x_j^\alpha \approx x_j + (\alpha - 1)x_j\ln(x_j)\).
Then consider the expanded model
\[y = \beta_0^* + \beta_1^*x_j + \gamma\omega + \epsilon \hspace{1cm} \omega = x_j\ln(x_j) \tag{2}\]
With \(\hat\beta_1\) estimated from model (1) and \(\hat\gamma\) from model (2), the power is estimated as
\[\hat{\alpha} = \frac{\hat{\gamma}}{\hat{\beta}_1} + 1 \]
So we can define
\[x_j' = \left\{\begin{array}{lr} x_j^{\hat{\alpha}},& \alpha ≠ 0 \\ \ln(x_j), & \alpha = 0 \end{array}\right. \]
This transforms the regressor in a way that reduces \(SS_{Res}\); the procedure can be iterated, refitting with \(x_j^{\hat\alpha}\) in place of \(x_j\) until the estimate of \(\alpha\) stabilizes.
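A rough sketch of one step of this procedure with simulated data where the true power is \(\alpha = 0.5\) (all names are illustrative):

```r
set.seed(5)
x <- runif(200, 1, 10)
y <- 3 + 2 * sqrt(x) + rnorm(200, sd = 0.3)   # true relationship uses x^0.5

fit1 <- lm(y ~ x)          # model (1): untransformed regressor
w    <- x * log(x)         # omega = x * ln(x)
fit2 <- lm(y ~ x + w)      # model (2): adds the omega term

alpha_hat <- unname(coef(fit2)["w"] / coef(fit1)["x"] + 1)
alpha_hat                  # roughly 0.5; refit with x^alpha_hat and repeat to refine
```

The boxTidwell() function in the car package automates this iteration.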
These methods are used to deal with cases where the errors have nonconstant variance or are correlated. Under our OLS assumptions, the errors have \(\mathrm{Var}(\epsilon) = \sigma^2\mathbf{I}\). Under GLS, we relax this to \(\mathrm{Var}(\epsilon) = \sigma^2V\), where \(V\) is an invertible, positive definite matrix.
There exists an \(n\times n\) invertible symmetric matrix \(K\) such that \(K'K = KK = V\). Let \(Z = K^{-1}Y\), \(B = K^{-1}X\), and \(g = K^{-1}\epsilon\). Our original model
\[Y = X\beta + \epsilon \]
now becomes
\[K^{-1}Y = K^{-1}X\beta + K^{-1}\epsilon \]
\[Z = B\beta + g \]
Then \(E[g] = E[K^{-1}\epsilon] = K^{-1}E[\epsilon] = 0\), and \(\mathrm{Var}(g) = \mathrm{Var}(K^{-1}\epsilon) = K^{-1}\sigma^2VK^{-1} = \sigma^2K^{-1}KKK^{-1} = \sigma^2\mathbf{I}\).
So the error term in our new model is \(g\). To calculate \(\hat{\beta}\), we are minimizing \(g'g = (Y-X\beta)'V^{-1}(Y-X\beta)\), which yields an estimate of
\[\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}Y \]
And \(\mathrm{Var}(\hat\beta) = \sigma^2(B'B)^{-1} = \sigma^2(X'V^{-1}X)^{-1}\). \(\sigma^2\) is estimated as usual, using the \(SS_{Res}\) from the transformed data.
But how do we get an estimate for \(V\)? One approach is to use the error covariance structure estimated from the residuals of an initial OLS fit.
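A minimal sketch of the GLS estimator computed directly from the formulas above, treating \(V\) as known; the data, the AR(1)-style form of \(V\), and all names are made up for illustration:

```r
set.seed(6)
n <- 50
x <- runif(n)
X <- cbind(1, x)                       # design matrix with intercept

rho <- 0.6
V   <- rho^abs(outer(1:n, 1:n, "-"))   # made-up correlation structure, treated as known

R   <- chol(V)                         # V = R'R
eps <- as.vector(t(R) %*% rnorm(n))    # errors with Var(eps) = V (sigma^2 = 1)
y   <- X %*% c(1, 2) + eps

Vinv     <- solve(V)
beta_gls <- solve(t(X) %*% Vinv %*% X, t(X) %*% Vinv %*% y)
beta_gls                               # GLS estimate of (beta_0, beta_1)
```

The gls() function in the nlme package fits such models without constructing \(V\) by hand.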
This tool is used in the case where the errors have unequal variance, but are not correlated with each other. Ie, \(\mathrm{Var}(\epsilon) = \sigma^2V\), where \(V\) is a diagonal matrix with entries \(1/w_i\), \(i=1, ..., n\). Then \(W = V^{-1}\) is also diagonal with entries \(w_i\).
As in GLS, we need an estimate for \(W\).
If we ignore the heteroskedasticity issue, the \(\hat\beta\) estimates are still unbiased, but they are not efficient.
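In R, WLS is available through the weights argument of lm(), which minimizes \(\sum_i w_i e_i^2\). A sketch with simulated heteroskedastic data, where the weights assume the error variance is proportional to \(x^2\):

```r
set.seed(7)
x <- runif(100, 1, 10)
y <- 1 + 2 * x + rnorm(100, sd = 0.5 * x)   # error SD grows with x

w <- 1 / x^2                      # weights proportional to 1 / Var(eps_i)
fit_wls <- lm(y ~ x, weights = w)
summary(fit_wls)
```

In practice the \(w_i\) are usually unknown and must be estimated, for example from the squared residuals of an initial OLS fit.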
Recall that GLS and WLS arise as tools for the situation where the errors in a linear model are not independent and identically distributed. Under OLS, \(\epsilon_i \sim N(0,\sigma^2)\).
WLS allows for a diagonal variance-covariance matrix of the residuals (implying that \(\mathrm{Cov}(\epsilon_i,\epsilon_j) = 0; \hspace{3mm} i ≠ j\)), while GLS allows for any symmetric variance-covariance matrix.
Under WLS, we can specify \(\mathrm{Var}(\epsilon)\) either as \(\sigma^2V\) or as \(V\). In the former case, the elements of \(V\) are weights; in the latter they are variances.
One common structure is compound symmetry, in which every pair of errors shares the same correlation \(\rho\); equivalently, each observation carries a common 'between' variance component \(\sigma_b^2\) on top of an observation-level component \(\sigma_\epsilon^2\):
\[V = \begin{bmatrix}1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots &\vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1\end{bmatrix} \hspace{1cm} \sigma^2 V =\begin{bmatrix}\sigma^2_b + \sigma^2_\epsilon & \sigma^2_b & \cdots & \sigma^2_b \\ \sigma^2_b & \sigma^2_b + \sigma^2_\epsilon & \cdots & \sigma^2_b \\ \vdots &\vdots & \ddots & \vdots \\ \sigma^2_b & \sigma^2_b & \cdots & \sigma^2_b + \sigma^2_\epsilon\end{bmatrix}\hspace{1cm} \sigma^2 = \underbrace{\rho\sigma^2}_{\sigma^2_b} + \underbrace{(1-\rho)\sigma^2}_{\sigma^2_\epsilon}\]
The next structure, first-order autoregressive (AR(1)), is commonly used in time series analysis. The idea is that each error is correlated with the errors of the observations immediately preceding it; a consequence is that every error term is correlated with every other error term. The parameter \(\rho\) must be estimated:
\[V = \begin{bmatrix}1 & \rho & \cdots & \rho^{n-1} \\ \rho & 1 & \cdots & \rho^{n-2} \\ \vdots &\vdots & \ddots & \rho \\ \rho^{n-1} & \rho^{n-2} & \cdots & 1\end{bmatrix}\]
For a temporal sample \(X_1, X_2, ..., X_n\), this corresponds to \(\mathrm{Cov}(X_i,X_j) = \sigma^2\rho^{|j-i|}\) in general.
One issue with this structure is the assumption that the observations are equally spaced in time. With unequally spaced observation times \(t_1, ..., t_n\), the correlations can instead be written in terms of the time gaps, scaled so that two observations separated by the baseline spacing \(|t_1-t_2|\) have correlation \(\rho\):
\[V = \begin{bmatrix}1 & \rho & \cdots & \rho^{\frac{|t_1-t_n|}{|t_1-t_2|}} \\ \rho & 1 & \cdots & \rho^{\frac{|t_2-t_n|}{|t_1-t_2|}} \\ \vdots &\vdots & \ddots & \vdots \\ \rho^{\frac{|t_1-t_n|}{|t_1-t_2|}} & \rho^{\frac{|t_2-t_n|}{|t_1-t_2|}} & \cdots & 1\end{bmatrix}\]
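A sketch using the gls() function in the nlme package, which supports such correlation structures (corAR1() for an equally spaced time index; corCAR1() for unequally spaced, continuous times). The data below are simulated and all names are illustrative:

```r
library(nlme)

set.seed(8)
n    <- 100
time <- 1:n
x    <- runif(n)
e    <- as.vector(arima.sim(list(ar = 0.5), n = n))   # AR(1) errors with rho = 0.5
dat  <- data.frame(y = 1 + 2 * x + e, x = x, time = time)

fit_ar1 <- gls(y ~ x, data = dat, correlation = corAR1(form = ~ time))
summary(fit_ar1)   # reports the estimated rho along with the coefficients
```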
In the fully unstructured case, the covariance matrix can take any form; no covariance structure is assumed. The \(V\) matrix is estimated from OLS residuals and then applied in a GLS model.
\[V = \begin{bmatrix}\sigma_{11}^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22}^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}^2\end{bmatrix} \]
In Chapter 5, we dealt with variable transformations to improve the fit of a linear model. Failing to properly transform the data can result in certain observations ‘pulling’ the OLS parameter estimates as the model attempts to fit a linear model to a nonlinear relationship. In this chapter, we deal with the related issue of identifying and dealing with outlier data points in an otherwise properly transformed model.
Say we have a simple linear regression with one point distant from the rest:
To assess the leverage of each data point, we can use the hat matrix, \(H = X(X'X)^{-1}X'\). The diagonal elements \(h_{ii}\) of the \(n\times n\) hat matrix represent the leverage of the \(i\)th observation.
Note that the average of the diagonal elements is \(\frac{p}{n}\) (because \(H\) is idempotent, its trace equals its rank):
\[\frac{1}{n} \sum\limits_{i=1}^nh_{ii} = \frac{1}{n}\mathrm{tr}(H) = \frac{1}{n} \mathrm{rank}(H) = \frac{1}{n}\mathrm{rank}(X) = \frac{p}{n} \]
A rule of thumb is that any observation \(i\) with \(h_{ii} > \frac{2p}{n}\) is considered a leverage point.
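A short sketch of applying this rule with hatvalues() (model1 is a hypothetical lm fit):

```r
model1 <- lm(mpg ~ wt + hp, data = mtcars)   # hypothetical example fit

h <- hatvalues(model1)      # diagonal elements h_ii of the hat matrix
p <- length(coef(model1))
n <- nobs(model1)

which(h > 2 * p / n)        # observations flagged as leverage points
```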
Another measure of the influence of each observation \(i\) is
\[D_i = \frac{(\hat\beta_{(i)}-\hat\beta)'(X'X)(\hat\beta_{(i)}-\hat\beta)}{p\cdot MS_{Res}} \]
where \(\hat\beta_{(i)}\) is the parameter estimate with the \(i\)th observation omitted. The general form of the Cook’s distance formula is
\[D_i = \frac{(\hat\beta_{(i)}-\hat\beta)'M(\hat\beta_{(i)}-\hat\beta)}{c} \]
Where \(M\) is any \(k\times k\) matrix and \(c\) is any constant. GZ notes that using \(X'X\) as the matrix makes \(D_i\) resemble other familiar test statistics, such as the \(F\) statistic. Using the first form, an observation \(i\) with \(D_i > 1\) is considered an influential point. Another way of writing \(D_i\) is
\[D_i = \frac{r_i^2}{p}\cdot \frac{\mathrm{Var}(\hat y_i)}{\mathrm{Var}(e_i)} = \frac{r_i^2}{p}\cdot \frac{h_{ii}}{1-h_{ii}} \]
where \(r_i\) is the studentized residual and \(h_{ii}/(1-h_{ii})\) is the distance from \(x_i\) to the centroid of the rest of the data.
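In R, Cook's distance is available directly (continuing with the hypothetical model1 fit from the leverage sketch above):

```r
D <- cooks.distance(model1)
which(D > 1)              # influential observations under the D_i > 1 rule of thumb
plot(model1, which = 4)   # index plot of Cook's distance
```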
DFBETAS is a standardized measure of how much the \(\beta\) parameter estimates change when an observation is omitted.
\[\mathrm{DFBETAS}_{j,i} = \frac{\hat\beta_j - \hat\beta_{j,(i)}}{\sqrt{S_{(i)}^2\cdot c_{jj}}} \hspace{1cm} i = 1, 2, ..., n; j = 1, 2, ..., p \]
where \(c_{jj}\) is the \(j\)th diagonal element of \((X'X)^{-1}\), \(\hat\beta_{j,(i)}\) is the \(j\)th coefficient estimate computed with the \(i\)th observation omitted, and
\[S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}\]
where \(e_i\) is the plain residual. The cutoff point is \(2/\sqrt{n}\).
DFFITS is a standardized measure of how much the prediction \(\hat y\) changes when an observation is omitted.
\[\mathrm{DFFITS}_i = \frac{\hat y_i - \hat y_{(i)}}{\sqrt{S_{(i)}^2 h_{ii}}}\]
Cutoff is \(2\sqrt{p/n}\).
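These diagnostics are also built into R (again assuming a fitted lm object model1):

```r
dfb <- dfbetas(model1)                  # one column per coefficient
dff <- dffits(model1)

n <- nobs(model1); p <- length(coef(model1))
which(abs(dfb) > 2 / sqrt(n), arr.ind = TRUE)   # (observation, coefficient) pairs exceeding the cutoff
which(abs(dff) > 2 * sqrt(p / n))               # observations exceeding the DFFITS cutoff

influence.measures(model1)              # summary of DFBETAS, DFFITS, Cook's D, and hat values together
```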