1.1 - September 29, 2020

Homework

Approximately 8 homework assignments, always due Thursdays by 11:59p. Homework will feature a combination of theoretical derivations and empirical implementation of models using R or other software.

Midterm & Final

Both exams will be take-home format, with about 5 days between assignment and due date. Midterm will be due during week 6, final will be due by the end of finals week.

Regression

Defining regression:

Regression is a model for investigating the relationship between variables. In the most general mathematical form,

\[Y = f(X; \beta) + \epsilon\]

Where \(X\) is data, \(\beta\) are unknown parameters, and \(\epsilon\) is an error term with no relationship to \(X\).

Two primary components of regression analysis are finding \(\beta\), known as estimation, and interpreting \(\beta\), known as inference.

These models can be mechanistic or empirical. A mechanistic model is informed by background knowledge or theory that dictates the functional specification, e.g., a physical, biological, or economic model. Empirical models begin without this background information and only look at the relationships between variables as expressed in the data. Our course will focus on empirical models.

It is important to distinguish between causal vs relational models. Regression analysis does not uncover causation, only association or relationships between variables.

1.2 - October 1, 2020

Recall, the basic specification for simple linear regression:

\[Y= \beta_0 + \beta_1 X + \epsilon\]

For the error term, \(\epsilon\), we have \[E[\epsilon] = 0\] \[V(\epsilon)= \sigma^2\] Ideally, \(\sigma^2\) is known, but typically it is not, so it needs to be estimated along with the \(\beta\) coefficients.

\(\S 2.2\): Least squares estimation of parameters

Say we have a scatter plot in two variables. We can draw a number of linear functions through the data. So our question is, what is the best line of fit?


For a single data point under our specification,

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

\[\epsilon_i = y_i- \beta_0 - \beta_1 x_i\]

We want the errors to be small overall, meaning we want the \(\epsilon_i\) to be small in magnitude, and we may also want to penalize large errors more heavily. This is the idea behind least squares estimation. Our objective function becomes

\[\min\limits_{\beta_0, \beta_1} \sum_{i=1}^n[y_i - \beta_0-\beta_1 x_i]^2\]

\[S = \sum_{i=1}^n[y_i - \beta_0-\beta_1 x_i]^2\]

\[\frac{∂S}{∂\beta_0} = \sum_{i=1}^n -2[y_i - \beta_0-\beta_1 x_i]\]

\[\frac{∂S}{∂\beta_1} = \sum_{i=1}^n -2x_i[y_i - \beta_0-\beta_1 x_i]\]

To optimize, we set both equal to zero:

\[\sum_{i=1}^n y_i - \sum_{i=1}^n \beta_0 - \sum_{i=1}^n \beta_1 x_i = 0\]

\[\sum_{i=1}^n x_iy_i - \sum_{i=1}^n \beta_0x_i - \sum_{i=1}^n \beta_1 x_i^2 = 0 \]

Then, simplifying some of the summations:

\[\overline{y} = \frac{1}{n}\sum_{i=1}^ny_i\]

\[\overline{x} = \frac{1}{n}\sum_{i=1}^nx_i\]

\[\overline{xy} = \frac{1}{n}\sum_{i=1}^nx_iy_i\]

\[\overline{x^2} = \frac{1}{n}\sum_{i=1}^nx_i^2\]

we can rewrite our solutions as:

\[\overline{y} - \beta_0 - \beta_1 \overline{x} = 0\]

\[\overline{xy} - \beta_0 \overline{x} - \beta_1 \overline{x^2} = 0\]

So, our estimates can be written in terms of only the data with some substitution:

\[\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\] \[\hat{\beta}_1 = \frac{\overline{xy} - \bar{x}\bar{y}}{\overline{x^2} - (\overline{x})^2}\]

Qualitatively, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the values of \(\beta_0\) and \(\beta_1\) that minimize the sum of squared residuals in the original model \(Y = \beta_0 + \beta_1 X + \epsilon\).

Then expanding back to summation notation:

\[\overline{x^2} - (\overline{x})^2 \Rightarrow n\overline{x^2} - n(\overline{x})^2\]

\[= n \frac{1}{n} \sum_{i=1}^nx_i^2 - n \left(\frac{1}{n} \sum_{i=1}^n x_i\right)\left(\frac{1}{n}\sum_{i=1}^n x_i\right)\]

\[= \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)\left(\frac{1}{n}\sum_{i=1}^nx_i\right)\]

\[ = \sum_{i=1}^n\left(x_i^2 - 2\overline{x}x_i + \overline{x}^2\right)\]

\[ = \sum_{i=1}^n(x_i - \overline{x})^2 \equiv S_{xx}\]

\(S_{xx}\) is proportional to the sample variance of the observed \(x\) data (it is \(n-1\) times the sample variance). Similarly,

\[S_{xy} = \sum_{i=1}^n y_i(x_i - \overline{x})\]

Alternatively, because covariance is unaffected by addition of a constant term to one of the variables,

\[S_{xy} = \sum_{i=1}^n (y_i - \overline{y})(x_i - \overline{x})\]

(left for a homework assignment).

\(S_{xy}\) is related to the numerator of our earlier expression for \(\hat{\beta}_1\), and \(S_{xx}\) is related to the denominator:

\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]

\[\hat{\beta}_0 = \overline{y}- \frac{S_{xy}}{S_{xx}}\,\overline{x}\]

Now, from an initial assumption that \(y = \beta_0 + \beta_1x + \epsilon\), we can write our solution as an estimated line of best fit, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x\). The distinction between those two equations is important. \(y\) is a random variable composed of a deterministic element and a random element. \(\hat{y}\) is the estimated expected value of \(y\) conditioned on the data \(x\). Our estimate tells us the conditional mean of \(y\).

\[\hat{y} = E[Y|X] = \hat{\beta}_0 + \hat{\beta}_1x\]
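
As a quick illustration, here is a minimal R sketch on simulated data (the data and variable names are made up purely for illustration) that computes \(\hat{\beta}_0\) and \(\hat{\beta}_1\) from the closed-form expressions above and checks them against lm():

```r
# Simulated data, for illustration only
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)

# Closed-form least squares estimates
Sxx <- sum((x - mean(x))^2)
Sxy <- sum(y * (x - mean(x)))
b1  <- Sxy / Sxx
b0  <- mean(y) - b1 * mean(x)

c(b0 = b0, b1 = b1)
coef(lm(y ~ x))  # should match the manual estimates
```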

Properties of our least squares estimates

1. \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are linear combinations of \([y_i]_{i = 1}^n\)

If \(b = \sum_{i = 1}^n a_ix_i\), where the \(a_i\) are constants, then \(b\) is a linear combination of \([x_i]_{i = 1}^n\).

\[\hat{\beta}_0 = \overline{y} - \hat{\beta}_1 \overline{x}\]

So \(\hat{\beta}_0\) is a linear combination of \([y_i]_{i = 1}^n\), with \(a_i = \frac{1}{n} - \bar{x}\frac{x_i - \bar{x}}{S_{xx}}\): the \(1/n\) comes from \(\bar{y}\), and the second term comes from \(\hat{\beta}_1\), which is itself a linear combination of the \(y_i\) (shown next).

\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \overline{x})}{S_{xx}}\]

So \(a_i = \frac{x_i - \overline{x}}{S_{xx}}\).

2. \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are unbiased estimators of \({\beta_0}\) and \({\beta_1}\)

We need to show that \(E[\hat{\beta}_0] = \beta_0\) and \(E[\hat{\beta}_1] = \beta_1\)

Assume that \(E[\hat{\beta}_1] = \beta_1\).

\[E[\hat{\beta}_0] = E[\overline{Y} - \hat{\beta}_1\overline{x}] = E[\overline{Y}] - \overline{x}E[\hat{\beta}_1]\]

\[ = \frac{1}{n} \sum_{i = 1}^nE[Y_i] - \overline{x}\beta_1\]

\[ = \frac{1}{n} \sum_{i = 1}^n (\beta_0 + \beta_1x_i) - \overline{x}\beta_1\]

\[ = \beta_0 + \beta_1 \overline{x} - \overline{x} \beta_1\]

\[= \beta_0\]

2.1 - October 6, 2020


Recall our linear model, \(Y = \beta_0 + \beta_1X + \epsilon\). We estimate the parameters \(\beta_0\) and \(\beta_1\).

Under OLS, our parameter estimates are

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{S_{xx}}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

We had some properties of our estimates.

  1. \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are linear combinations of \([y_i]_{i = 1}^n\)
  2. \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are unbiased estimators of \(\beta_0\) and \(\beta_1\).
  3. The variances of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)

The model assumptions are that \(E(Y) = \beta_0 + \beta_1X\) and \(V(Y) = \sigma^2\). So

\[V(\hat{\beta}_1) = V\left(\frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{S_{xx}}\right)\]

\[ = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{S_{xx}^2}V(Y_i) = \frac{\sigma^2}{S_{xx}}\]

\[V(\hat{\beta}_0) = V(\bar{Y} - \hat{\beta}_1\bar{x}) = V(\bar{Y}) + (\bar{x})^2V(\hat{\beta}_1) - 2\bar{x}\,Cov(\bar{Y}, \hat{\beta}_1)\]

\[ = V\left(\frac{1}{n}\sum_{i=1}^nY_i\right) + (\bar{x})^2\frac{\sigma^2}{S_{xx}} - 2\bar{x}\,Cov(\bar{Y}, \hat{\beta}_1)\]

\[ = V\left(\frac{1}{n}\sum_{i=1}^nY_i\right) + (\bar{x})^2\frac{\sigma^2}{S_{xx}} - 2\bar{x}(0)\]

\[ = \frac{1}{n} \sigma^2 + \frac{(\bar{x})^2}{S_{xx}}\sigma^2\]

\[ = \left(\frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\sigma^2\]

Our estimates are the best linear unbiased estimators (BLUE): no other linear unbiased estimator has lower variance. (This is the Gauss–Markov theorem.)

  4. The sum of residuals is zero:

\[\sum_{i=1}^n(y_i - \hat{y}_i) = \sum_{i=1}^ne_i = 0\]

Sum of observations is

\[\sum_{i=1}^ny_i = \sum_{i=1}^n \hat{y}_i\]

  5. The regression line passes through the centroid \((\bar{x}, \bar{y})\) of the points \((x_i,y_i)_{i = 1}^n\). In other words, \(\bar{y} = \hat{\beta}_0 + \hat{\beta}_1\bar{x}\).

  6. A few more sums:

\[\sum_{i=1}^nx_ie_i = 0\]

\[\sum_{i=1}^n\hat{y}_ie_i = 0\]

Estimating \(\sigma^2\)

In practice, \(\sigma^2\) is not known, so it needs to be estimated along with the \(\beta\) parameters. Recall, \(\sigma^2 = V(\epsilon_i), i = 1, 2, ..., n\).

We use our estimated residuals \(e_i\) to estimate \(\sigma^2\). We’ll call the sum of squared errors \(SS_{Res}\).

\[\begin{align*} SS_{Res} = \sum_{i=1}^ne_i^2 &= \sum_{i=1}^n(y_i - \hat{y}_i)^2 \\ &= \sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2 \\ &= \sum_{i=1}^ny_i^2 - n(\bar{y})^2 - \hat{\beta}_1S_{xy} \\ SS_T &= \sum_{i=1}^n(y_i-\bar{y})^2 = \sum_{i=1}^ny_i^2 - n(\bar{y})^2 \\ \Rightarrow \quad SS_{Res} &= SS_T - \hat{\beta}_1S_{xy} \\ SS_T &= SS_{Res} + \hat{\beta}_1S_{xy} \end{align*}\]

Remember, \(SS_T\) is just a constant that is calculated from the original data. It does not depend on any estimates. \(SS_{Res}\) depends on our estimates, so it is sensitive to different models. To summarize,

\[SS_{Res} = \sum_{i=1}^n(y_i - \hat{y}_i)^2 \] \[SS_T = \sum_{i=1}^n(y_i - \bar{y})^2\] \[SS_R = \sum_{i=1}^n(\hat{y}_i - \bar{y})^2\] \[SS_T = SS_{Res} + SS_R\]

We can use \(SS_{Res}\) to proceed with estimating \(\sigma^2\). \(E[SS_{Res}] = (n-2)\sigma^2\).

So \(\frac{SS_{Res}}{n-2}\) is an unbiased estimator of \(\sigma^2\).

Proof depends on characteristics of \(\chi^2\) distribution. \((y_i - \hat{y}_i)^2\) is a squared normally distributed variable \(\{N(0,\sigma^2)\}^2\).

The two beta parameters place restrictions on the degrees of freedom of the \(\chi^2\) r.v., so its degrees of freedom are \((n-2)\). If we had a total of \(k\) \(\beta\) parameters, the degrees of freedom would become \((n-k)\).
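
A minimal sketch of this estimate in R, continuing with the simulated data from the earlier sketch; summary() reports the square root of this quantity as the residual standard error:

```r
fit   <- lm(y ~ x)            # x and y from the earlier simulated-data sketch
n     <- length(y)
SSres <- sum(resid(fit)^2)

sigma2_hat <- SSres / (n - 2) # unbiased estimator of sigma^2
sigma2_hat
summary(fit)$sigma^2          # residual standard error squared; should agree
```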

Alternative form of the model

Original is \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\).

\[\begin{align*} y_i &= \beta_0 + \beta_1x_i + \epsilon_i - \beta_1\bar{x} + \beta_1\bar{x} \\ &= (\beta_0 + \beta_1\bar{x}) + \beta_1(x_i - \bar{x}) + \epsilon_i \\ &= \beta_0' + \beta_1(x_i - \bar{x}) + \epsilon_i \end{align*}\]

The advantage of this specification is that it centers the \(x\) data at zero, so that \(\hat{\beta}_0' = \bar{y}\) (because the centered regressor \(x_i - \bar{x}\) has mean zero). Also, \(\hat{\beta}_1\) is unchanged from the original.

\(\S 2.3\) - Hypothesis testing on the parameters

In our model, we will come up with some estimates for our \(\beta\) parameters. Our question in hypothesis testing is whether the true \(\beta\) value is equal to our \(\hat{\beta}\). We can never prove this, but we can test whether our observed data would be more or less likely under different hypothetical values of \(\beta\).

The hypothesis testing problem is \(H_0: \beta_1 = \beta_{1,0}\) versus \(H_1: \beta_1 ≠ \beta_{1,0}\), where \(\beta_{1,0}\) is most commonly zero.

We need to find some test statistics on our parameter estimates that satisfy some distribution. To do this, we need to define the distribution of \(Y\). This involves assumptions about the distribution of \(\epsilon\). If we assume that \(\epsilon \sim N(0,\sigma^2)\), then we can proceed.

\(Y \sim N(\beta_0 + \beta_1X, \sigma^2)\)

Recall that our parameter estimates are linear combinations of the response variable data. And linear combinations of normally distributed variables are also normally distributed variables.

2.2 - October 8, 2020

\(\S 2.3\) - Hypothesis testing. We have a baseline ‘null’ hypothesis and an alternative hypothesis.

We need to make assumptions about the behavior of \(\epsilon\) to analyze the data.

Assumptions:

  1. \(\epsilon_i \sim N(0,\sigma^2)\),

  2. \(\epsilon_i\) are independent and identically distributed

  3. For any \(1 ≤ i < j ≤n, E[\epsilon_i \epsilon_j] = E[\epsilon_i]E[\epsilon_j] = 0\)

Our test statistic, \(T\) is a function of the random variables \(Y_i\): \(T(Y_1, Y_2, ..., Y_n)\).

First, let’s construct our test statistic. There are many statistics we could create that are functions of \(Y_i\), but we want to create one that deals with workable distributions.

We know from our assumption about \(\epsilon\) that

\[Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)\]

\[\hat{\beta}_1 = \sum_{i=1}^n \frac{(x_i - \bar{x})}{S_{xx}}Y_i\]

So \(\hat{\beta}_1\) is a linear combination of the independent \(N(\beta_0 + \beta_1x_i,\sigma^2)\) random variables \(Y_i\), with weights \(\frac{x_i - \bar{x}}{S_{xx}}\).

So

\[\hat{\beta}_1 \sim N(\beta_1, \sigma^2 \frac{1}{S_{xx}})\]

We are going to test \(\beta_1\). And we see that it is related to the normal distribution because of our assumption about \(\epsilon\).

We can create a new random variable by transforming the random variable \(\hat{\beta}_1\) into a standard normal variable.

\[Z = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{\sigma^2}{S_{xx}}}} \sim N(0,1)\]

Then if we know \(\sigma^2\) we can use \(N(0,1)\).

If we don’t know \(\sigma^2\), then we need a different statistic.

Let’s proceed assuming we can use \(Z\).

Under \(H_0, \beta_1 = \beta_{1,0}\). If this is the case, then we would expect \(Z\) to be very close to 0. If \(Z\) is far from zero, then it suggests that either we captured a very unlikely dataset, or that our \(H_0\) is not correct.

But we need to define a likelihood threshold beyond which we might reject \(H_0\).

We denote the combined area of the two rejection tails of the \(N(0,1)\) density as \(\alpha\). We can choose a significance level (lower \(\alpha\) corresponds to higher confidence), and then check whether

\[|z| < \mathfrak{z}_{\alpha/2}\]

That is, compare our test statistic to the test statistic value that would result in a shaded area of \(\alpha\).

If we don’t know \(\sigma^2\), then we need to estimate it. Recall that an unbiased estimator for \(\sigma^2\) is

\[\hat{\sigma}^2 = \frac{SS_{Res}}{n-2}\]

\[\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

Using this estimate, we can construct a different test statistic. It will no longer be standard normally distributed:

\[t = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}} \sim t_{(n-2)}\]

Remember that a \(t\) distributed r.v. is the ratio of a standard normal r.v. and the square root of a \(\chi^2\) distributed random variable divided by its degrees of freedom:

\[ = \frac{N(0,1)}{\sqrt{\chi^2_p/p}}\]

To confirm that our test statistic \(t \sim t_{(n-2)}\), we need to show that this component of the denominator follows a \(\chi^2\) distribution:

\[\frac{SS_{Res}}{\sigma^2} \sim \chi^2_{(n-2)}\]

(This is left as a homework exercise.) Once we have confirmed this, then our criterion for \(H_0\) rejection is

\[|t| > t_{\frac{\alpha}{2},n-2}\]

The critical value \(t_{\frac{\alpha}{2},n-2}\) will depend on the number of observations in the data; smaller data sets result in \(t\)-distributions with fatter tails.

So our sequence is to observe \(x\) and \(y\), compute our OLS estimates, compute our \(t\) statistic, and compare it to the critical value \(t_{\frac{\alpha}{2},n-2}\).
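
A sketch of that sequence in R (same simulated data as before): compute the standard error of \(\hat{\beta}_1\), form the \(t\) statistic for \(H_0: \beta_1 = 0\), and compare it to the critical value.

```r
fit   <- lm(y ~ x)
n     <- length(y)
SSres <- sum(resid(fit)^2)
Sxx   <- sum((x - mean(x))^2)

se_b1 <- sqrt(SSres / ((n - 2) * Sxx))
t1    <- coef(fit)["x"] / se_b1      # t statistic for H0: beta_1 = 0

alpha <- 0.05
crit  <- qt(1 - alpha / 2, df = n - 2)
abs(t1) > crit                       # TRUE -> reject H0
summary(fit)$coefficients            # the "t value" column should match t1
```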

ANOVA hypothesis test

We can also use analysis of variance to test whether or not \(\beta_1 = 0\). We can rewrite the quantity

\[\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]

\[SS_T = SS_{Res} + SS_{R}\]

The left-hand side has d.o.f. \(n-1\). \(SS_{Res}\) has d.o.f. \((n-2)\), and \(SS_R\) has d.o.f. 1. We can consider an \(F\) test for \(H_0: \beta_1 = 0\) vs \(H_1: \beta_1 ≠0\).

\[F = \frac{SS_R/1}{SS_{Res}/(n-2)} = \frac{MS_R}{MS_{Res}} \sim F_{1, (n-2)}\]

We will reject \(H_0\) if \(F > F_{\alpha, 1, (n-2)}\)

In the case of simple (univariate) linear regression, the \(F\) statistic is simply a squared \(t\) statistic. This is not the case in multiple linear regression. We won’t deal too much with the ANOVA method in this class.
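
This relationship is easy to check in R on the simulated data: the F value reported by anova() equals the square of the t value from summary().

```r
fit <- lm(y ~ x)
anova(fit)                                   # F value for x
summary(fit)$coefficients["x", "t value"]^2  # equals the F value above
```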

2.4 - Interval estimation in simple linear regression

To compute our test statistic for \(\beta_0\),

\[t = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\left(\frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}}\]

We learned already how to perform a hypothesis test on either parameter. How can we answer the question ‘how far is our estimate from the truth?’ To answer a question like this, we use confidence intervals.

\[t_1 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}} \sim t_{(n-2)}\]

\[t_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}} \sim t_{n-2}\]

Let

\[se(\hat{\beta}_1) = \sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}\]

\[se(\hat{\beta}_0) = \sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}\]

Let us construct an interval \((a,b)\) such that \(P(a < T< b) = 1- \alpha\) for \(0 < \alpha < 1\).

Our goal is to find this interval given our parameter estimates and estimated standard errors.

Step 1: find the random variable \(T\).

Step 2: find the interval for our parameter of interest.

For \(\beta_1\), step 1 is

\[T = \frac{\hat{\beta}_1 - \beta_{1}}{se(\hat{\beta}_1)} \sim t_{n-2}\]

Step 2 is to find an interval centered on \(T\). We will save this for Tuesday.

3.1 - October 13, 2020

2.4 - Interval estimation in simple linear regression

Review: we know that we can write the test statistics for our two \(\beta\) estimators as

\[t_0 = \frac{\hat{\beta}_0 - \beta_{0,0}}{\sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}} \sim t_{n-2}\]

\[t_1 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}} \sim t_{(n-2)}\]

Definition:

  1. \(MS_{Res}= \frac{SS_{Res}}{n-2}\)

  2. \(s.e.(\hat{\beta}_0) = \sqrt{\left( \frac{1}{n} + \frac{(\bar{x})^2}{S_{xx}}\right)\frac{SS_{Res}}{n-2}}\)

  3. \(s.e.(\hat{\beta}_1) = \sqrt{\frac{SS_{Res}}{(n-2)S_{xx}}}\)

We are looking for an interval \((a,b)\) such that \(P(a<\beta_1<b) = 1- \alpha\), where \(\alpha\) is a significance level between 0 and 1 (lower \(\alpha\) means a higher confidence level).

Then we say that the interval \((a,b)\) is a \(100(1- \alpha)\)% Confidence Interval (C.I.) of \(\beta_1\).

\(\beta_1\) is our assumed ‘true’ parameter, which remains unknown. But we know that \(t_1\), defined above, follows a \(t\)-distribution, so instead of \(P(a< \beta_1 < b)\), we can look at

\[P \left(a^* < \frac{\hat{\beta}_1 - \beta_1}{s.e.(\hat{\beta}_1)} < b^*\right)\]

\[=P \left(s.e.(\hat{\beta}_1)a^* < \hat{\beta}_1 - \beta_1 < s.e.(\hat{\beta}_1)b^*\right)\]

\[=P \left(\hat{\beta}_1 - s.e.(\hat{\beta}_1)b^* < \beta_1 < \hat{\beta}_1 - s.e.(\hat{\beta}_1)a^*\right)\]

Then if we rename the left and right components of the inequality as \(a\) and \(b\), then we can say that this is equivalent to

\[P (a < \beta_1 < b) = 1- \alpha\]

So now our goal is to find \(a^*\) and \(b^*\) such that \(P (a^* < t < b^*) = 1- \alpha\)

Where \(t \sim t_{n-2}\).

The central area under the \(t_{n-2}\) density, between \(a^*\) and \(b^*\), is equal to \(1- \alpha\).

We are looking for a critical value \(c = b^* = -a^*\) from a \(t_{n-2}\) distribution that yields two tails, each with area \(\alpha/2\).

The values of \(a\) and \(b\) that satisfy this requirement are

\[a = \hat{\beta}_1 - t_{\alpha/2, n-2}s.e.(\hat{\beta}_1)\] \[b = \hat{\beta}_1 + t_{\alpha/2, n-2}s.e.(\hat{\beta}_1)\]

The confidence interval for \(\beta_0\) follows a similar format,

\[ \left(\hat{\beta}_0 - t_{\alpha/2, n-2}s.e.(\hat{\beta}_0), \hat{\beta}_0 + t_{\alpha/2, n-2}s.e.(\hat{\beta}_0)\right)\]

Good Q from a student on where the d.o.f. for the test statistics comes from: why \((n-2)\)? In our model, we have two parameters (slope & intercept). so the residual sum of squares loses two degrees of freedom (out of an original \(n\)).

\[\frac{(n-2)MS_{Res}}{\sigma^2} \sim \chi_{n-2}^2\]

So the confidence interval for \(\sigma^2\) is

\[\left( \frac{(n-2)MS_{Res}}{\chi_{\frac{\alpha}{2},n-2}^2}, \frac{(n-2)MS_{Res}}{\chi_{1- \frac{\alpha}{2},n-2}^2} \right)\]
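
A sketch of this interval in R, using qchisq() for the \(\chi^2_{n-2}\) quantiles (simulated data again):

```r
fit   <- lm(y ~ x)
n     <- length(y)
MSres <- sum(resid(fit)^2) / (n - 2)
alpha <- 0.05

lower <- (n - 2) * MSres / qchisq(1 - alpha / 2, df = n - 2)
upper <- (n - 2) * MSres / qchisq(alpha / 2,     df = n - 2)
c(lower = lower, upper = upper)   # 95% CI for sigma^2
```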

So we can construct CIs for all three of our parameters at a given confidence level. Now we want a confidence interval for the conditional expected value, \(E[Y|x_0]\), the mean response at \(x_0\).

\(E[Y|X_0] = \beta_0 + \beta_1x_0\)

\[E[\hat{\beta}_0 + \hat{\beta}_1x_0] = \beta_0 + \beta_1x_0\]

\[V(\hat{\beta}_0 + \hat{\beta}_1x_0) = \frac{\sigma^2}{n} + \frac{\sigma^2}{S_{XX}}(x_0 - \bar{x})^2\]

If \(\sigma^2\) is known, then

\[\hat{\beta}_0 + \hat{\beta}_1x_0 \sim N(\beta_0 + \beta_1x_0, \frac{\sigma^2}{n} + \frac{\sigma^2}{S_{XX}}(x_0 - \bar{x})^2)\]

If \(\sigma^2\) is unknown, then

\[\frac{\hat{\beta}_0 + \hat{\beta}_1x_0- E[Y|x_0]}{\sqrt{\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)MS_{Res}}} \sim t_{n-2}\]

So we can construct a CI for our conditional expected value of \(y\) at \(x_0\)

In the last case, we used \(\hat{\beta}_0 + \hat{\beta}_1x_0 \sim N(.,.)\) to build an interval for the mean response.

Now consider predicting a new observation \(y\) at \(x_0\). \(y\) is our target, \(\hat{y}\) is our estimate, and the prediction error is \(y - \hat{y}\).

The prediction error is normally distributed, so whether or not \(\sigma^2\) is known, we first need \(E[y-\hat{y}]\) and \(V(y-\hat{y})\) before we can standardize it.

\[E[y-\hat{y}] = 0\]

\[V(y-\hat{y}) = V(y) + V(\hat{y})\]

(The covariance term drops out because the new observation \(y\) is independent of the data used to construct \(\hat{y}\).)

\[= \sigma^2 + \sigma^2\left( \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right) = \sigma^2\left(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)\]

So the prediction interval for \(y\) at \(x_0\) is

\[\left( \hat{y}_0 - t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}})}, \hat{y}_0 +t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}})} \right)\]

This is an interval for the predicted value of \(y\) at a particular point \(x_0\) (a prediction interval).

If we want the average confidence interval over a set of \(m\) points, then we take the average of CIs,

\[\frac{1}{m} \sum_{i = 1}^m y_i - \frac{1}{m} \sum_{i = 1}^m \hat{y}_i\]

(Need to clarify the above, it was presented as an aside.)

Coefficient of Determination

\(SS_T = SS_{Res} + SS_R\), ie

\[\sum_{i=1}^n (y_i - \overline{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \overline{y})^2\]

from the first homework.

Define \(R^2 = \frac{SS_R}{SS_T} = 1-\frac{SS_{Res}}{SS_T}\)

Then \(R^2\) quantifies the proportion of variation in \(y\) explained by variation in \(x\).

If our model results in zero errors, then \(SS_{Res} = 0\), so \(R^2 = 1\).

If our \(\hat{\beta}_1 = 0\) then \(\hat{y}_i = \bar{y}\) so \(SS_R = 0\) and \(R^2 = 0\).

R Code

See D2L for some introductory code on manually estimating OLS, as well as the lm() function.

3.2 - October 15, 2020

Estimating OLS in R

##   purity hydro
## 1  86.91  1.02
## 2  89.85  1.11
## 3  90.28  1.43
## 4  86.34  1.11
## 5  92.58  1.01
## 6  87.33  0.95
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.98  -50.68   28.74   66.61  106.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
## x            -37.154      2.889  -12.86 1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
## F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10

## [1] 1106.559
## [1] -41112.65
## [1] -37.15359
## [1] 1693738
## [1] 166254.9
## [1] 2.889107
## [1] 59.47464

Above are two confidence bands. The wider blue band is for \(y_i\). The narrower red band is for \(\hat{y}_i\). More on this below.

In R, we can find a quantile of a T-distribution using qt():

\(qt(0.025, n-2) = -t_{0.05/2, n-2}\)

\[a = \hat{\beta}_1 + qt(0.025, n-2)s.e.(\hat{\beta}_1)\]

\[b = \hat{\beta}_1 + qt(0.975, n-2)s.e.(\hat{\beta}_1)\]

So our confidence interval for \(\beta_1\) is \((a,b)\).
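
A sketch of this computation, checked against confint() (simulated data as before):

```r
fit   <- lm(y ~ x)
n     <- length(y)
b1    <- coef(fit)["x"]
se_b1 <- summary(fit)$coefficients["x", "Std. Error"]

a <- b1 + qt(0.025, df = n - 2) * se_b1
b <- b1 + qt(0.975, df = n - 2) * se_b1
c(a, b)
confint(fit, "x", level = 0.95)   # should match (a, b)
```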

We can plot confidence bands for \(E[Y|X = x]\) and for \(Y|X = x\). The former takes into account uncertainty due to our estimators; the latter additionally takes into account the uncertainty due to \(\epsilon\).

CI of \(E[Y|X = x]\):

\[\left( \hat{y}_0 - t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}, \hat{y}_0 +t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \right)\]

CI of \(Y|X = x\):

\[\left( \hat{y}_0 - t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}, \hat{y}_0 +t_{\frac{\alpha}{2},n-2}\sqrt{MS_{Res}\left(1+ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \right)\]

This is an important distinction to make. Uncertainty about the expected value of y is due to uncertainty about the parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that are used to generate an estimate of it. So the narrow bands around the estimated expected value of \(Y\) are due to the variance of the OLS estimators.

On the other hand, uncertainty about the random variable \(Y\) is due to the above uncertainty, plus the inherent variance of \(Y\) due to the \(\epsilon\) term in our original specification. See the October 13 lecture notes for derivation of the different variances of these two objects.

Consideration in Regression

With our OLS estimates, we can compute \(\hat{y}\) at some point \(x_0\). Let’s say \(x_0 = 30\).

But if we are going to make estimations outside of our observed data range, we should be careful that the linear relationship still holds.

In the above graph, assume we only observe data with \(x\) values below 20. We risk making bad predictions if we try to predict \(y\) for hypothetical \(x\) values above 20 using the black OLS fit.

4.1 - October 20, 2020

Some help with plotting confidence and prediction intervals in R.

Remember that when using the predict() function, use the interval argument to stipulate which type of interval you want.

Be sure to sort your data so that if you do a line plot, it’s connecting the dots in the proper order.
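
A sketch of plotting both bands on the simulated data; the red band uses interval = "confidence" and the blue band uses interval = "prediction", and sorting keeps lines() connecting points in order:

```r
fit <- lm(y ~ x)
xs  <- data.frame(x = sort(x))    # sort so the band lines connect in order

conf_band <- predict(fit, newdata = xs, interval = "confidence")  # for E[Y|x]
pred_band <- predict(fit, newdata = xs, interval = "prediction")  # for a new Y

plot(x, y)
abline(fit)
lines(xs$x, conf_band[, "lwr"], col = "red")
lines(xs$x, conf_band[, "upr"], col = "red")
lines(xs$x, pred_band[, "lwr"], col = "blue")
lines(xs$x, pred_band[, "upr"], col = "blue")
```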

Considerations in regression

  1. The range of your regressors. There is nothing to suggest that the functional specification that appears to match the existing data will also match new data.

  2. Disposition of \(X\).

  3. Outliers. Certain data points will have a lot of leverage on the parameter estimates.

  4. Relation is not causation.

  5. Regression through the origin. There may be some theory that tells us we should expect the functional relationship to have no intercept term.

Maximum Likelihood method

\(Y = \beta_0 + \beta_1X + \epsilon\). In other words, \(Y \sim N(\beta_0 + \beta_1X, \sigma^2)\).

So the likelihood of our data set is

\[ L(y_i, x_i, \beta_0,\beta_1,\sigma^2) = \prod_{i=1}^n(2 \pi \sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1x_i)^2\right]\]

\[ = (2 \pi \sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i)^2 \right]\]

As in OLS, we are trying to maximize an objective function with respect to our \(\beta\) parameters. We can denote our estimates \(\tilde{\beta}\) to distinguish them from the OLS estimators \(\hat{\beta}\).

First, we can log transform the likelihood function. Its maximum will be preserved under a monotonic transformation.

\[\ln L(y_i, x_i, \beta_0,\beta_1,\sigma^2) = -\frac{n}{2}\ln 2 \pi - \frac{n}{2} \ln \sigma^2 - \frac{1}{2 \sigma^2} \sum_{i=1}^n (y_i - \beta_0 - \beta_1x_i)^2 \]

So our estimators will satisfy

\[ \frac{∂ \ln L}{∂ \beta_0} = \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^n (y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i) = 0\]

\[ \frac{∂ \ln L}{∂ \beta_1} = \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^n (y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)x_i = 0\]

\[ \frac{∂ \ln L}{∂ \sigma^2} = -\frac{n}{2\tilde{\sigma}^2} + \frac{1}{2\tilde{\sigma}^4}\sum_{i=1}^n (y_i - \tilde{\beta}_0 - \tilde{\beta}_1x_i)^2 = 0\]

If these first-order necessary conditions for likelihood maximization look familiar, it’s because they’re identical to the FONC for OLS, so \(\tilde{\beta} = \hat{\beta}\), although \(\tilde{\sigma}^2 ≠ \hat{\sigma}^2\).

So to compare and contrast the two methods:

Under OLS, we only need \(E[\epsilon] = 0\) and \(V(\epsilon) = \sigma^2\). For ML, we need \(\epsilon \sim N(0,\sigma^2)\).

The \(\beta\) estimates are unbiased under both methods.

Under ML, \(\tilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2\), which is biased. The OLS-based estimate was \(\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^n(y_i - \hat{\beta}_0 - \hat{\beta}_1x_i)^2\). However, both estimators are consistent.
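
A quick sketch contrasting the two variance estimates on the simulated data from earlier:

```r
fit <- lm(y ~ x)
e   <- resid(fit)
n   <- length(y)

sigma2_mle <- sum(e^2) / n        # ML estimate (biased downward)
sigma2_ols <- sum(e^2) / (n - 2)  # unbiased estimate
c(mle = sigma2_mle, ols = sigma2_ols)
```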

In addition to estimating the \(\beta\) coefficients, we are interested in the correlation coefficient, \(\rho\), between \(X\) and \(Y\).

\[\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X) \mathrm{Var}(Y)}}\]

\[ = \frac{E(XY) - E(X)E(Y)}{\sqrt{\mathrm{Var}(X) \mathrm{Var}(Y)}}\]

Our estimator for \(\rho\) is

\[r = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\]

And we know

\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{r\sqrt{S_{xx}SS_T}}{S_{xx}} = r\sqrt{\frac{SS_T}{S_{xx}}}\]

4.2 - October 22, 2020

Building on the construction of \(\rho\) from Tuesday.

\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{r\sqrt{S_{xx}SS_T}}{S_{xx}} = r\sqrt{\frac{SS_T}{S_{xx}}}\]

\[r = \frac{\sqrt{S_{xx}}}{\sqrt{SS_T}}\hat{\beta}_1\]

\[ r^2 = \frac{S_{xx}}{SS_T}\left(\hat{\beta}_1 \right)^2\]

\[\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}\]

So,

\[r^2 = \frac{SS_{R}}{SS_T}\]

So \(R^2\) is an estimator for \(\rho^2\). We are interested in testing whether a data set has \(\rho= 0\). In other words, we want to test a null hypothesis that no variation in the response variable is associated with variation in the covariate.

The test statistic is

\[t_0 = r\frac{\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}\]

where

\[r = \sqrt{\frac{S_{xx}}{SS_T}}\hat{\beta}_1\]

If \(n\) is large (greater than 50 or so),

\[ z = \frac{1}{2}\ln\frac{1+r}{1-r} \hspace{5mm} \stackrel{D}{\longrightarrow} \hspace{5mm} N\left(\frac{1}{2}\ln\frac{1+\rho}{1-\rho}, \frac{1}{n-3} \right)\]
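
In R, cor.test() carries out the \(t\)-based test of \(H_0: \rho = 0\); a sketch on the simulated data, with the statistic also computed by hand:

```r
r  <- cor(x, y)
n  <- length(y)
t0 <- r * sqrt(n - 2) / sqrt(1 - r^2)  # manual test statistic

t0
cor.test(x, y)                         # reports the same t statistic and p-value
```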

Because

\[\rho = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X) \mathrm{Var}(Y)}}\]

\(X\) and \(Y\) are random variables under our model specification.

When we write the relationship as

\[Y|X=x \sim N(\beta_0 + \beta_1x,\sigma^2)\]

we assume that the \(X\) values are fixed. But if we want to treat both as random variables, then we need to estimate the joint pdf of \(X,Y\):

\[f_{X,Y}(x,y) = f_{Y|X}(y|x)f_X(x) \]

which comes from basic conditional probabilities, eg.

\[P(A|B)P(B) = P(A \cap B)\]

Special Case

If \(X,Y\) are both normally distributed,

\[ f_{X,Y}(x,y) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp \left\{ -\frac{1}{2(1-\rho^2)}\left[ \left(\frac{y-\mu_1}{\sigma_1} \right)^2 + \left(\frac{x-\mu_2}{\sigma_2} \right)^2 - 2\rho\left(\frac{y-\mu_1}{\sigma_1} \right)\left(\frac{x-\mu_2}{\sigma_2} \right)\right] \right\}\]

If we want, we can calculate \(E[Y|X]\) and another value or two (see video around 58 min).

Chapter 3 - Multiple linear regression

Let’s extend our model from Chapter 2 to estimate a relationship specified as

\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]

Doing this expands our possible hypothesis tests. We could test whether certain parameters are equal to each other; whether multiple parameters are jointly zero, and more. First, some matrix algebra.

C.2.1 - Matrix Algebra

Notation:

\(\mathbf{A}, \mathbf{B}\) - Matrices

\(a, b\) - Scalars

\(X, Y, Z\) - Random Variables

\(\overline{X}, \overline{Y}, \overline{Z}\) - Random Vectors

\(a_{1,3}\) - Element in row 1, column 3 of matrix \(\mathbf{A}\).

\(\mathbf{A_{11}}\) - for block matrices

Definitions:

  1. Rank of \(\mathbf{A}\): number of independent columns of \(\mathbf{A}\).

\(\mathbf{A} = (\vec{a}_1, \vec{a}_2, ..., \vec{a}_n)\).

If there is a column \(\vec{a}_i\) of \(\mathbf{A}\) that can be written as a linear combination of the other columns,

\[\vec{a}_i = \sum_{j \neq i} \alpha_j \vec{a}_j,\]

then \(\vec{a}_i\) is not linearly independent of the others. So the rank of a matrix is at most \(n\).

  1. Identity matrix: a matrix with 1 in every diagonal element, all other elements zero.

5.1 - October 27, 2020

Midterm

Posted on D2L Friday, November 6, will be due Friday, November 13.

The midterm will be focused on data analsyis and writing a professional report. The report should tell a story about the data, explain the model specification, and report and interpret the results qualitatively.

Raw R code and output should be saved for the appendix.

Homework 3 posted Thursday, October 29, due Thursday, November 5. Possibly homework 4 concurrent with midterm (not decided yet).

C.2.1 - Matrix operations

  1. The rank of \(\mathbf{A}\) is the number of linearly independent columns of \(\mathbf{A}\).

  2. An identity matrix is a matrix with 1s on the diagonal, zero otherwise.

\[\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]
  3. An inverse matrix is a matrix that generates an identity matrix when multiplied by the original matrix.

\[ \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}\]

  4. \(\mathbf{A}^T\) interchanges the row and column indices of \(\mathbf{A}\). It is the case that \((\mathbf{A}^T)^T = \mathbf{A}\), and \((\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T\).

  5. If \(\mathbf{A}^T = \mathbf{A}\), then \(\mathbf{A}\) is a symmetric matrix.

  6. An idempotent matrix satisfies \(\mathbf{A}\mathbf{A} = \mathbf{A}\).

  7. Orthonormal (orthogonal) matrix: A square matrix \(\mathbf{A}_{k \times k}\) is orthonormal if \(\mathbf{A}^T \mathbf{A} = \mathbf{I}\). If \(\mathbf{A}\) is orthonormal, then \(\mathbf{A}^{-1} = \mathbf{A}^T\).

  8. Quadratic form of a matrix: If we have

\[y = \begin{bmatrix} y_1 \\ y_2 \\ \cdots \\ y_n \end{bmatrix} \hspace{5mm} \mathbf{A}_{n\times n} \hspace{5mm} y^T\mathbf{A}y = \sum_{i = 1}^n\sum_{j = 1}^n a_{ij}y_iy_j\]

a scalar.

  9. Positive definite and positive semidefinite matrices: if \(\mathbf{A}^T = \mathbf{A}\) and \(y^T \mathbf{A} y > 0\) for any \(y ≠0\), then we call \(\mathbf{A}\) a positive definite matrix.

If \(\mathbf{A}^T = \mathbf{A}\) and \(y^T \mathbf{A} y ≥ 0\) for any \(y ≠0\), then we call \(\mathbf{A}\) a positive semidefinite matrix.

  10. If we have a matrix \(\mathbf{A}_{k \times k}\), then the trace of the matrix is the sum of its diagonal elements:

\[ \mathrm{tr}(\mathbf{A}_{k \times k}) = \sum_{i = 1}^k a_{ii} \]

  11. Rank of an idempotent matrix equals its trace.

  12. Matrix partition. If we have \(\mathbf{X}_{n\times p}\), we can write it as

\[\mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \mathbf{X}_2 \end{bmatrix} \]

Algebra on partitioned matrices:

\[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\begin{bmatrix} \mathbf{X}_1 \mathbf{X}_2 \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \mathbf{X}_2 \end{bmatrix} \]

\[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_1 = \mathbf{X}_1 \]

\[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_2 = \mathbf{X}_2 \]

  13. Inverse of partitioned matrix: for

\[ \mathbf{X}'\mathbf{X} = \begin{bmatrix}\mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}\]

The inverse is listed in appendix C.2.1 (p 579).

C.2.2 - Matrix Derivatives

For constants \(\mathbf{A}_{k\times k}\) and \(\mathbf{a}_{k \times 1}\), and variable vector \(\mathbf{y}_{k \times 1}\),

\[\frac{∂\mathbf{a}'\mathbf{y}}{∂\mathbf{y}} = \mathbf{a} \hspace{1cm} \frac{∂\mathbf{y}'\mathbf{y}}{∂\mathbf{y}}= 2\mathbf{y} \hspace{1cm} \frac{∂\mathbf{a}'\mathbf{A}\mathbf{y}}{∂\mathbf{y}} = \mathbf{A}'\mathbf{a} \hspace{1cm} \frac{∂\mathbf{y}'\mathbf{A}\mathbf{y}}{∂\mathbf{y}} = \mathbf{A}\mathbf{y}+ \mathbf{A}'\mathbf{y}\]

  1. Expectations and variance of random vectors. For \(\mathbf{y}_{n \times 1}\),

\[E[\mathbf{y}] = \mathbf{\mu}_{n\times 1} \hspace{1cm} V(\mathbf{y}) = \begin{bmatrix} V(\mathbf{y}_1) & Cov(\mathbf{y}_1,\mathbf{y}_2) & \cdots & Cov(\mathbf{y}_1, \mathbf{y}_n) \\ Cov(\mathbf{y}_2, \mathbf{y}_1)& V(\mathbf{y}_2) & \cdots & Cov(\mathbf{y}_2, \mathbf{y}_n) \\ \cdots & \cdots & \cdots & \cdots \\ Cov(\mathbf{y}_n, \mathbf{y}_1) & Cov(\mathbf{y}_n, \mathbf{y}_2) & \cdots & V(\mathbf{y}_n) \end{bmatrix}\]

  2. Random vector transformation (C.2.4): suppose \(\mathbf{y}_{n\times 1} \sim N(\mu, \mathbf{V})\) and \(\mathbf{U} = \mathbf{y}'\mathbf{A}\mathbf{y}\). If \(\mathbf{A}\mathbf{V}\) or \(\mathbf{V}\mathbf{A}\) is idempotent of rank \(p\), then

\[\mathbf{U} \sim \chi^2_{p, \lambda} \hspace{1cm} \lambda = \mu'\mathbf{A}\mu\]

Example: If \(\mu = 0\), then \(\mathbf{y}\) is centered, so \(\lambda\) will also be zero.

Example: If \(\mathbf{V} = \sigma^2\mathbf{I}\) and \(\mathbf{A}\) is idempotent of rank \(p\),

\[\frac{\mathbf{U}}{\sigma^2} \sim \chi^2_{p, \lambda} \]

The latter will return for hypothesis testing.

  3. Independence: If \(E[\mathbf{y}] = \mu\) and \(V(\mathbf{y}) = \mathbf{V}\), for

\[ W_{m\times 1} = B_{m\times n}y_{n\times 1} \hspace{1cm} U_{1\times 1} = y'_{1\times n}A_{n\times n}y_{n\times 1} \]

Then \(U,W\) are independent if \(\mathbf{B}\mathbf{V}\mathbf{A} = 0\)

Chapter 3: Multiple linear regression

§ 3.1 - Introduction

Our new model, building on the simple model from Ch 2 is

\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\]

We also have the possibility of regressing on variable transformations and interactions, eg

\[ Y = \beta_0 + \beta_1x_1 + \beta_2x_1^2 + \beta_3x_2 + \beta_4x_1x_2 + \epsilon\]

Our inference also has additional possibilities. We can test multiple parameters simultaneously.

§ 3.2 - Estimation

Assumptions: no interaction terms or variable transformations. \(E[\epsilon] = \vec{0}, V(\epsilon) = \sigma^2\mathbf{I}_n\).

We can write our model in matrix notation

\[\begin{bmatrix}y_1 \\y_2 \\ \cdots \\ y_n \end{bmatrix} = \begin{bmatrix}1 & x_{11} & \cdots & x_{1p} \\ \cdots & & & \cdots \\ 1 & x_{n1} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix}\beta_0 \\ \beta_1 \\ \cdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \cdots \\ \epsilon_n \end{bmatrix}\]

Our sum of squares objective function is identical to the simple case, and we are again looking for

\[\frac{∂S(\beta)}{∂\beta} = 0 \]

5.2 - October 29, 2020

Restate our multiple linear model:

\[Y = X\beta + \epsilon \]

Where now, \(X\) represents an \(n\times (p+1)\) matrix of covariates,

Ge’s notation is \(\beta_0 ... \beta_p\) with \(p+1\) parameters.

Textbook is \(\beta_0 ... \beta_k\) with \(\overline{p} = k+1\) parameters.

Check out the confidenceEllipse() function.

To perform OLS on matrices, our objective function is

\[S(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta \]

\[\frac{∂S}{∂\beta} = -2X'Y + 2X'X\beta \stackrel{\mathrm{set}}{=} 0 \]

\[(X'X)^{-1}X'X\beta = (X'X)^{-1}X'Y \]

\[ \hat{\beta} = (X'X)^{-1}X'Y\]

This estimator behaves just like the simple version in terms of calculating point estimates \(\hat{Y} = X\hat{\beta}\).

We call the matrix \(X(X'X)^{-1}X' = H\) (I think this is the ‘projection’ matrix from Greene). Then

\[e = Y - \hat{Y} = Y- HY = (\mathbf{I}-H)Y\]
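
A sketch of these matrix formulas in R on simulated data with two regressors (the data and names are made up); solve() handles the inverse, and the result is checked against lm():

```r
set.seed(2)
n  <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
Y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X        <- cbind(1, x1, x2)                  # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)     # (X'X)^{-1} X'Y
H        <- X %*% solve(t(X) %*% X) %*% t(X)  # hat / projection matrix
e        <- (diag(n) - H) %*% Y               # residuals (I - H) Y

cbind(beta_hat, coef(lm(Y ~ x1 + x2)))        # the two columns should agree
```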

Geometric interpretation of \(\hat{\beta}\)

The observed data \(Y\) is an n-dimensional vector, so it is in \(\mathbb{R}^n\).

The covariates consist of \(k\) n-dimensional vectors, typically with \(k < n\). There is no guarantee that \(Y\) can be replicated as a linear combination of the vectors in \(X\). The columns of \(X\) span at most a \(k\)-dimensional subspace of \(\mathbb{R}^n\) (less in the case of linear dependence). \(X\hat{\beta}\) is the vector in this ‘estimation space’ that gets as close to \(Y\) as possible.

\[SS_{Res} = (Y-X\hat{\beta})'(Y-X\hat{\beta}) \]

\[= Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} = Y'Y - 2\hat{\beta}'X'Y +\hat{\beta}'X'Y = \boxed{Y'Y - \hat{\beta}'X'Y} \]

using the normal equations \(X'X\hat{\beta} = X'Y\) in the second step.

The unbiased estimate for \(\sigma^2\) is \(SS_{Res}/(n-p-1)\) (with \(p+1\) estimated parameters); under MLE it is divided by \(n\) instead.

6.1 - November 3, 2020

§3.3 - Hypothesis Testing

We have a multiple linear regression model,

\[Y = X\beta + \epsilon \hspace{1cm} Y \in \mathbb{R}^n, \hspace{2mm} X_{n \times (p+1)}, \hspace{2mm} \beta \in \mathbb{R}^{p+1}\]

Significance of regression

We want to test whether there is a linear relationship between the response variable and the regressors. \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\), \(H_1:\) at least one \(\beta_j ≠0\) for \(j = 1, 2, ..., p\). To test this, we will use the sum of squares values used in the construction of the ANOVA table.

\[SS_T = (Y-\overline{Y})'(Y- \overline{Y}) \hspace{1cm} SS_{Res} = Y'Y - \hat{\beta}'X'Y \hspace{1cm} SS_R = \hat{\beta}'X'Y - \frac{1}{n}Y'\mathbf{1}_{n\times 1}\mathbf{1}_{n\times 1}'Y\]

With those values in hand, we can write that

\[\frac{SS_{Res}}{\sigma^2} \sim \chi_{n-p-1}^2 \hspace{1cm} \frac{SS_R}{\sigma^2} \sim \chi_p^2 \]

And we can generate a test statistic,

\[F_0 = \frac{SS_R/p}{SS_{Res}/(n-p-1)} = \frac{MS_R}{MS_{Res}} \sim F_{p,\hspace{1mm} n-p-1}\]

And we can proceed to test this using any desired significance level. Similar to the simple case,

\[E[MS_{Res}] = \sigma^2 \hspace{5mm}\Rightarrow \hspace{5mm} MS_{Res} = \hat{\sigma}^2 \]

\[E[MS_R] = \sigma^2 + \frac{\beta_c'X_c'X_c\beta_c}{p} \]

Where \(\beta_c\) is the \(\beta\) vector minus the constant term, and \(X_c\) is the data matrix without the column of ones, and where every element has its variable mean subtracted from it \((x_{ij} - \overline{x}_j)\)

Then, in the ANOVA table, we have the values shown in Ch 3.3 - Table 3.5 (p 85).

\(F\) Statistics

The critical value of an F test depends on the number of parameters and the number of observations.

\(R^2\) and \(\overline{R}^2\)

As in the simple case,

\[R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_{Res}}{SS_T} \]

In the multiple regression case, we have \(\hat{Y} = X\hat{\beta}\). We can define \(Y^* = Y-\hat{Y} = Y - X\hat{\beta}\). If we find another set of regressors \(X^*\), we can fit a new model \(Y^* = X^*\beta^* + \epsilon^*\).

The \(SS_T\) of the second model will be equal to the \(SS_{Res}\) from the first model.

\(SS_{Res} = SS_R^* + SS_{Res}^*\)

So we are regressing the residuals from the original model on a new set of regressors.

Then \(SS_R + SS_R^*\) is the total variation explained after the second stage, so \(SS_T = SS_R + SS_R^* + SS_{Res}^*\).

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## x1         1 115.07  115.07  14.119 0.000877 ***
## Residuals 26 211.90    8.15                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value   Pr(>F)   
## x2         1  76.193  76.193  7.8998 0.009272 **
## Residuals 26 250.771   9.645                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x1         1 115.068 115.068  22.378 7.492e-05 ***
## x2         1  83.343  83.343  16.208 0.0004635 ***
## Residuals 25 128.553   5.142                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x2         1  76.193  76.193  14.818 0.0007288 ***
## x1         1 122.218 122.218  23.768 5.148e-05 ***
## Residuals 25 128.553   5.142                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova Table (Type II tests)
## 
## Response: y
##            Sum Sq Df F value    Pr(>F)    
## x1        122.218  1  23.768 5.148e-05 ***
## x2         83.343  1  16.208 0.0004635 ***
## Residuals 128.553 25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Anova Table (Type II tests)
## 
## Response: y
##            Sum Sq Df F value    Pr(>F)    
## x2         83.343  1  16.208 0.0004635 ***
## x1        122.218  1  23.768 5.148e-05 ***
## Residuals 128.553 25                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You can see in the table above that the Sum Sq values for a given variable are different depending on what order the covariates appear in the regression. It looks like anova() evaluates it as:

\[\begin{array}{ll} \\ x_1 & SS_R(\beta_1) \\ x_2 & SS_R(\beta_2|\beta_1) \\ x_3 & SS_R(\beta_3|\beta_1,\beta_2) \\ \cdots & \cdots \\ x_p & SS_R(\beta_p|\beta_1,\beta_2, ... \beta_{p-1}) \\ \end{array} \]

Meanwhile, the Anova function evaluates it as:

\[\begin{array}{ll} \\ x_1 & SS_R(\beta_1|\beta_{-1}) \\ x_2 & SS_R(\beta_2|\beta_{-2}) \\ x_3 & SS_R(\beta_3|\beta_{-3}) \\ \cdots & \cdots \\ x_p & SS_R(\beta_p|\beta_{-p}) \\ \end{array} \]

Because \(R^2\) is increasing as we add additional regressors, we can construct a measure of model fit that penalizes additional regressors:

\[\overline{R}^2 = 1- \frac{SS_{Res}/(n-p-1)}{SS_T/(n-1)} \]
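
A sketch computing both quantities from the sums of squares, using the two-regressor simulated data from the earlier matrix sketch, and checking against summary():

```r
fit <- lm(Y ~ x1 + x2)     # Y, x1, x2 from the earlier simulated-data sketch
n   <- length(Y)
p   <- 2                   # number of regressors (intercept not counted)

SSres <- sum(resid(fit)^2)
SST   <- sum((Y - mean(Y))^2)

R2     <- 1 - SSres / SST
adj_R2 <- 1 - (SSres / (n - p - 1)) / (SST / (n - 1))
c(R2, adj_R2)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # should match
```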

Tests on individual coefficients and subsets of coefficients

For \(H_0: \beta_j = 0\) vs \(H_1: \beta_j ≠ 0\), we can use a \(t\) statistic with the normal confidence interval interpretation:

\[t = \frac{\hat{\beta}_j}{\sqrt{\sigma^2c_{jj}}} = \frac{\hat{\beta}_j}{s.e.(\hat{\beta}_j)} \sim t_{n-p-1} \hspace{1cm} C = (X'X)^{-1}\]

We can also test individual coefficients using \(F\) statistics. If we want to test \(H_0: \beta_2 = 0\) in a model of the form \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\), we essentially have a ‘full’ (unrestricted) model and a ‘reduced’ model (restricted).

Full: \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\)

Reduced: \(Y = \beta_0 + \beta_1x_1 + \epsilon\)

For the full model,

\[MS_{Res} = \frac{Y'Y - \hat{\beta}'X'Y}{n-p-1} \]

For the reduced model,

\[SS_R(\beta_1) = \hat{\beta}_1'X_1'Y \]

So

\[SS_R(\beta_2 | \beta_1) = SS_R(\beta_1,\beta_2) - SS_R(\beta_1)\]

Where \(SS_R(\beta_1,\beta_2)\) is the \(SS_R\) from the full model.

We can interpret \(SS_R(\beta_2 | \beta_1)\) as measuring how much \(x_2\) contributes to the fit of the model, i.e., the amount by which \(SS_R\) increases when \(x_2\) is added to a model containing only \(x_1\).

So we want to assess whether \(SS_R(\beta_2 |\beta_1)\) is big or small.

\[F = \frac{SS_R(\beta_2 |\beta_1)/r}{MS_{Res}} \sim F_{r, \hspace{1mm} (n-p-1)} \]

Where \(r\) is the number of restrictions being tested. This means that the number of variables in the restricted model is \(p-r\).

\(SS_R(\beta_2 |\beta_1)\) has \(r\) dof.

\(SS_R(\beta_1, \beta_2)\) has \(p\) dof.

\(SS_R(\beta_1)\) has \(p-r\) dof.

So a large value of \(F\) indicates that at least one of the parameters in \(\beta_2\) is likely nonzero.

We can run this test on any subset of our \(p\) regressors. For example, we could test a single coefficient by finding \(SS_R(\beta_j|\beta_{-j})\). In this case, variable \(j\) is the ‘last’ variable added to the model, and we are testing to see whether it adds significantly to the regression sum of squares. This is the value being reported in the last row of the anova() function.

6.2 - November 5, 2020

GZ is clarifying the difference between the anova function in base R and the Anova function from the car package.

anova conditions the \(SS_{R}\) in each row on all the variables in rows above it. So the sum of the \(SS_R\) values will equal the entire model’s \(SS_R\).

Anova conditions the \(SS_{R}\) in each row on all other variables. So the sum of the \(SS_R\) values will not, in general, equal the entire model’s \(SS_R\) (in the example above it exceeds it).

§3.3.4 - Testing the General Linear Hypothesis

We have considered a number of tests:

  • Significance of regression using the \(F\) statistic

  • Tests on individual coefficients, and subsets of coefficients (\(t\) and \(F\) statistics)

Today we will look at general linear hypothesis testing. Our model is

\[Y = X\beta + \epsilon \]

Our testing problem is \(H_0: T\beta = c\) vs \(H_1: T\beta ≠ c\).

Recall, \(\beta\) is a vector of length \(p+1\), where \(p\) is the number of regressors.

In this test, \(T\) is an \(m \times (p+1)\) matrix, so \(c\) will be a vector of length \(m\).

This method allows us to test hypotheses like \(H_0: \beta_1 = \beta_3\). If we have a four-regressor model

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon \tag{FM}\]

Then we can write the hypothesis above as \(H_0: T\beta = 0\) where \(T = \begin{bmatrix}0 & 1 & 0 & -1 & 0\end{bmatrix}\).

We can also test multiple hypotheses simultaneously:

\[H_0: \beta_2 = \beta_4 \text{ and } \beta_3 = 0 \hspace{1cm} H_1: \text{at least one of these restrictions fails}\]

In which case we would need \(T\) to be a \(2\times 5\) matrix.

\[T\beta = \begin{bmatrix}0&0&1&0&-1 \\ 0&0&0&1&0\end{bmatrix}\beta = c = \begin{bmatrix}0\\0\end{bmatrix} \]

Under these general hypothesis tests, \(T\) has \(r\) linearly independent rows, where \(r\) is the number of restrictions.

If (FM) is the full model, we could write the restricted model under this \(H_0\) as

\[ y = \beta_0 + \beta_1x_1 + \beta_2(x_2+x_4) + \epsilon = Z\gamma + \epsilon \tag{RM}\]

Then \(SS_{Res}(RM) = Y'Y - \hat{\gamma}'Z'Y\), which has \((n-p+r-1)\) dof.

(Here \(r\) is the number of independent restrictions imposed by \(H_0\), so the restricted model has \(p - r\) regressors.)

We can calculate \(SS_H = SS_{Res}(RM) - SS_{Res}(FM)\), which has \(r\) dof (\((n-p+r-1)\) and \(n-p-1\) dof, respectively).

We test this using an \(F\) statistic, calculated as

\[F = \frac{SS_H/r}{SS_{Res}(FM)/(n-p-1)} \sim F_{r,\hspace{1mm} (n-p-1)}\]

Then we evaluate this statistic to test our hypothesis, which is the original set of restrictions.

Partial F tests in R

Save the full and reduced models as lm objects, then run anova() on the pair.
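
The call itself is not echoed in the notes; a sketch consistent with the output below, assuming the data set table.b1 from the MPV package (the textbook's companion data), is:

```r
# Assumed setup: table.b1 from the MPV package (not shown in the notes)
library(MPV)
full    <- lm(y ~ x1 + x2 + x3 + x4, data = table.b1)
reduced <- lm(y ~ x1 + x2,           data = table.b1)

anova(full, reduced)   # partial F test of H0: beta_3 = beta_4 = 0
```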

## Analysis of Variance Table
## 
## Model 1: y ~ x1 + x2 + x3 + x4
## Model 2: y ~ x1 + x2
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     23 117.83                           
## 2     25 128.55 -2   -10.725 1.0467 0.3672

Testing general linear hypotheses in R

linearHypothesis can test other restrictions:
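
The exact call is not echoed either; one possible call that matches the output below, assuming the car package and the full model fitted above, is:

```r
# Assumed call: test the single restriction -x1 + 2*x2 = 0 on the full model
library(car)
linearHypothesis(full, "2*x2 - x1 = 0")
```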

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = table.b1)
## 
## Coefficients:
## (Intercept)           x1           x2           x3           x4  
##    0.516174     0.005898     0.003477    -0.337864    -0.005721
## Linear hypothesis test
## 
## Hypothesis:
## - x1  + 2 x2 = 0
## 
## Model 1: restricted model
## Model 2: y ~ x1 + x2 + x3 + x4
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     24 119.12                           
## 2     23 117.83  1    1.2938 0.2526 0.6201

We can also define a test matrix:
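
A sketch of the matrix form, again assuming the car package and the full model; the restriction \(T\beta = c\) with \(T = [0\;1\;-2\;0\;0]\) and \(c = 0\) matches the hypothesis printed below:

```r
# Assumed call: the same restriction written as T beta = c
Tmat <- matrix(c(0, 1, -2, 0, 0), nrow = 1)   # columns: intercept, x1, x2, x3, x4
linearHypothesis(full, hypothesis.matrix = Tmat, rhs = 0)
```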

## Linear hypothesis test
## 
## Hypothesis:
## x1 - 2 x2 = 0
## 
## Model 1: restricted model
## Model 2: y ~ x1 + x2 + x3 + x4
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     24 119.12                           
## 2     23 117.83  1    1.2938 0.2526 0.6201

§3.4 - Confidence Intervals

  1. The simplest CI in multiple regression is the CI of a single regression coefficient \(\beta_j\). We defined \(C = (X'X)^{-1}\), where \(C_{jj}\) is the \(j\)th diagonal element of \(C\). Earlier we calculated that

\[\frac{\hat\beta_j - \beta_j}{\sqrt{\hat{\sigma}^2C_{jj}}} \sim t_{n-p-1}\]

So a \(100(1-\alpha)\%\) CI for \(\beta_j\) is \(\hat\beta_j \pm t_{\frac{\alpha}{2},n-p-1}\,se(\hat{\beta}_j)\)

  2. We can also calculate a CI for the mean response \(E[Y|x_0] = x_0'\beta\), which we estimate by \(\hat{Y}_0 = x_0'\hat{\beta}\) (unbiased, since \(E[x_0'\hat{\beta}] = x_0'\beta\)).

\[V(\hat{Y}_0) = \sigma^2x_0'(X'X)^{-1}x_0\tag{3.48}\]

So the confidence interval on \(E[Y|x_0]\)

\[\hat{Y}_0 \pm t_{\frac{\alpha}{2},n-p-1}\sqrt{\hat{\sigma}^2x_0'(X'X)^{-1}x_0} \tag{3.49}\]

And the prediction interval on \(Y|x_0\) is

\[\hat{Y}_0 \pm t_{\frac{\alpha}{2},n-p-1}\sqrt{\hat{\sigma}^2(1+x_0'(X'X)^{-1}x_0)}\tag{3.54} \]

Confidence intervals in R

Some useful functions are confint and confidenceEllipse.

Simultaneous Confidence Intervals on Regression Coefficients

When we conduct \(t\) tests on individual coefficients, we can’t really combine them. Eg if we get a 95% CI on \(\beta_1\) and another on \(\beta_2\), then we might say that the confidence that both are in their respective intervals is \(0.95^2\), but this assumes that they are independent. But they aren’t independent because they are estimated from the same data, and have nonzero covariance.

We can construct a joint confidence interval

\[\frac{(\hat{\beta} - \beta)'X'X(\hat{\beta} - \beta)}{(p+1)MS_{Res}}\sim F_{(p+1),\; (n-p-1)}\]

7.1 - November 10, 2020

Confidence intervals on parameter estimates

Our model specification is \(Y = X\beta + \epsilon\), with \(p+1\) parameters. We have already seen that if we want an individual parameter CI, we use

\[\hat{\beta}_j \pm t_{\frac{\alpha}{2},(n-p-1)}s.e.(\hat{\beta}_j)\]

If we want a 95% CI on \(\beta_j\), then our critical value of the \(t\) statistic is \(t_{0.025, (n-p-1)}\).

If instead we want to find the CI of all of our parameters jointly, then we will have \(p+1\) intervals.

What is the probability that every true \((\beta_0, \beta_1, ..., \beta_p)\) is within its corresponding interval? If we assume that each parameter estimate is independently distributed, then this probability for 95% CIs is \((0.95)^{p+1}\), which can become small very quickly.

Even aside from that issue, the \(\beta\) estimates are not pairwise independent because \(\mathrm{Cov}(\hat{\beta}_i,\hat{\beta}_j) = \sigma^2C_{ij}≠0\).

Methods for overcoming this:

Bonferroni Method

Normally, our \(t\) critical value was \(t_{\frac{\alpha}{2},(n-p-1)}\). Under this method, we change it to \(t_{\frac{\alpha}{2(p+1)},(n-p-1)}\).
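
A sketch of Bonferroni-adjusted intervals on the simulated two-regressor fit from earlier; using an individual level of \(1 - \alpha/(p+1)\) in confint() is equivalent to swapping in the adjusted critical value:

```r
fit   <- lm(Y ~ x1 + x2)        # simulated data from the earlier matrix sketch
alpha <- 0.05
k     <- length(coef(fit))      # p + 1 estimated coefficients

confint(fit, level = 1 - alpha / k)   # Bonferroni-adjusted joint intervals
```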

Scheffé Method

We switch from using the \(t\) distribution to the square root of an \(F\) statistic:

\[\{2F_{\alpha, (p+1), (n-p-1)} \}^\frac{1}{2} \]

The first degree of freedom (\(d_1\)) is increasing in \(p\), while the second (\(d_2\)) is decreasing in \(p\).

The mean of an \(F\) distribution is \(\frac{d_2}{d_2-2}\) (for \(d_2 > 2\)).

The variance …

§3.8 - Hidden Extrapolation in Multivariate Regression

The issue of hidden extrapolation that we covered in simple OLS is also a problem here, it just isn’t as easy to visualize. Take two regressors \(x_1,x_2\). It might be intuitive to take the rectangular region bounded by \(\min(x_1), \max(x_1), \min(x_2), \max(x_2)\) as an acceptable region for prediction of new data points. But in the multivariate case, it is possible that there are parts of this rectangular region with no observations:

## [[1]]
##          x.hull    y.hull
##  [1,] 41.193168 15.578435
##  [2,] 39.459624 14.177912
##  [3,] 26.036767  7.075348
##  [4,] 15.831432  8.478943
##  [5,] 13.849292 10.098058
##  [6,]  7.866894 20.814973
##  [7,]  3.278815 29.523960
##  [8,] 12.465155 39.266710
##  [9,] 18.929169 46.059904
## [10,] 43.837073 43.109183
## [11,] 46.822009 24.830842
## [[1]]
##         x.hull    y.hull
## [1,] 25.264980 20.457744
## [2,]  3.278815  2.263506
## [3,]  7.718099 10.346367
## [4,] 16.803342 21.436293
## [5,] 24.588771 29.637446
## [6,] 39.395806 41.943928
## [7,] 43.837073 45.577211
## [8,] 46.822009 42.537236

## [[1]]
##          x.hull     y.hull
##  [1,] 39.459624  65.491848
##  [2,] 46.822009   5.046788
##  [3,]  3.278815   4.858832
##  [4,]  7.866894  41.156115
##  [5,] 12.465155  72.670191
##  [6,] 15.831432  86.552978
##  [7,] 19.016762  96.869080
##  [8,] 24.430861 102.483684
##  [9,] 24.932953 102.824267
## [10,] 25.430861 102.900693
## [11,] 29.849012 100.014251
## [12,] 33.825147  86.899439

We use the Hat Matrix:

\[H = X(X'X)^{-1}X' \]

And we let \(h_{max} = \max\limits_{1≤i≤n}\{h_{ii}\}\), the largest diagonal element of \(H\), which defines the boundary of the region of appropriate inference.

Then for a new data point \(x_0\), in order for us to estimate \(\hat{Y}|X= x_0\), we need to ensure that \(x_0\) is within an appropriate region:

\[h_0 = x_0'(X'X)^{-1}x_0 \]

which is a scalar value. If \(h_0 ≤ h_{max}\), then \(x_0\) is within the appropriate inference region. If \(h_0 > h_{max}\), then it is not.
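
A sketch of this check in R, using the simulated two-regressor data from earlier; the new point x0 is arbitrary and purely illustrative:

```r
X     <- cbind(1, x1, x2)        # x1, x2 from the earlier simulated-data sketch
H     <- X %*% solve(t(X) %*% X) %*% t(X)
h_max <- max(diag(H))

x0 <- c(1, 0.5, -1)              # hypothetical new point (leading 1 for intercept)
h0 <- drop(t(x0) %*% solve(t(X) %*% X) %*% x0)
h0 <= h_max                      # TRUE: inside the region of appropriate inference
```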

The \(H\) matrix is the projection (hat) matrix, and the region \(\{x_0 : x_0'(X'X)^{-1}x_0 ≤ h_{max}\}\) is an ellipsoid, so it is convex and encloses the observed data points.

§3.9 - Standardized Regression Coefficients

Why do we need to standardize the variables? For

\[Y = \beta_0 + \beta_1x_1 + \cdots + \beta_px_p + \epsilon \]

The scale of data doesn’t matter theoretically, but numerically it can be a problem. For instance, if one regressor is in \([0,1]\) and another is in \([0,10^{12}]\), the \(\beta\) estimates may be hard to compare. Approaches:

  1. Unit normal scaling:

\[z_{ij} = \frac{x_{ij} - \overline{x_j}}{s_j} \hspace{1cm} s_j = \sqrt{\frac{1}{n-1}\sum\limits_{i=1}^n(x_{ij}-\overline{x_j})^2} \]

which resembles converting a normally distributed variable to \(N(0,1)\).

So each variable has its mean subtracted, then is divided by its sample standard deviation. \(y_i\) is converted to \(y_i^*\) in the same way. Then the new model is

\[Y^* = Z\beta^* + \epsilon \]

Then to convert back, \(\hat{\beta}_j = \hat{\beta}_j^*\, s_y/s_j\), where \(s_y\) is the sample standard deviation of \(y\) (since \(y\) was rescaled as well). Note that this will make \(\hat{\beta}_0^* = 0\) by construction; a brief sketch follows this list.

  2. Unit length scaling

The regressors are rescaled as

\[y_i^* = \frac{y_i - \overline{y}}{\sqrt{SS_T}} \hspace{1cm} w_{ij} = \frac{x_{ij} - \overline{x}_j}{\sqrt{s_{jj}}} \hspace{1cm} s_{jj} = \sum\limits_{i=1}^n(x_{ij} - \overline{x}_j)^2 \]

Then

\[\sqrt{\sum\limits_{i=1}^n(w_{ij} - \overline{w}_j)^2} = 1\]

And \(\overline{w}_j = 0\). See more textbook material for intuition and motivation.

§3.10 - Multicollinearity

If \(E[x_i|x_j] = E[x_i]\), then there is no linear dependence between the columns of the design matrix. In this extreme case, the moment matrix \(X'X\) is diagonal. The diagonal elements of \((X'X)^{-1}\) are proportional to the variances of the coefficient estimates, and they are at their smallest when there is no linear dependence between regressors.

If instead the columns of \(X\) feature some linear dependence, the determinant of \(X'X\) shrinks toward zero as the degree of linear dependence increases, which inflates the diagonal elements of \((X'X)^{-1}\) and hence the values of \(\mathrm{Var}(\hat{\beta}_j)\).

So multicollinearity inflates the variance of our parameter estimates.

7.2 - November 12, 2020

Multicollinearity, cont.

Recall that multicollinearity is a case where there is a high degree of linear dependence between the columns of the data matrix. Numerically, if we calculate

\[(W'W)^{-1} = \begin{bmatrix}w_{11} & w_{12} \\ w_{21} & w_{22}\end{bmatrix} \]

and find that \(w_{12},w_{21}\) are very large, this indicates that we may have multicollinearity. If we have a multiple regression model on four variables, \(x_1\) through \(x_4\), it may be the case that

\[x_1 \sim \alpha_0 + \alpha_2x_2 + \alpha_3x_3 + \alpha_4x_4 \]

GZ is suggesting that we could substitute the estimated RHS for \(x_1\) in our original model.

In the plots above, we have two regressors \(x_1,x_2\) for a response variable. A plane fit to the data on the left will be ‘stable’, meaning that the variance of the regression coefficient estimates will be relatively small. For the data on the right, one can imagine that the regression plane fit to this data could ‘hinge’ on the data more easily. This has the effect of inflating the variance of the coefficient estimates.

If the columns of our data matrix \((W)\) are close to orthogonal, then the off-diagonal elements of \((W'W)^{-1}\) will be small.

\[E[x_j|x_{-j}] = E[x_j]: (W'W)^{-1} = \begin{bmatrix}w_{11} & 0 \\0 & w_{22}\end{bmatrix}\]

\[E[x_j|x_{-j}] ≠ E[x_j]: (W'W)^{-1} = \begin{bmatrix}w_{11} & w_{12} \\ w_{21} & w_{22}\end{bmatrix}\]

One way of assessing the degree of multicollinearity is by regressing each covariate on all the others, and calculating the \(R^2\). Then we can calculate a variance inflation factor,

\[VIF_j = \frac{1}{1-R_j^2} \]

Where \(VIF_j\) is the variance inflation factor of the \(j\)th regressor, and \(R_j^2\) is the \(R^2\) of the model \(x_j = \alpha_0 + \alpha_1x_1 + \cdots + \alpha_{j-1}x_{j-1} + \alpha_{j+1}x_{j+1} + \cdots + \alpha_px_p\).

We know that \(0≤R_j^2≤1\), so if there is strong multicollinearity, VIF can become very large. There is not a universal rule for what level of multicollinearity is acceptable, but two common rules of thumb are that for \(VIF ≥ 4\), further investigation into the nature of the data should be done, and for \(VIF ≥ 10\), there is a multicollinearity issue.
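A minimal sketch of this calculation in R, assuming a hypothetical data frame df with regressors x1 through x4 (vif() in the car package computes the same quantity for every regressor at once):

```r
# VIF for x1: regress it on the remaining regressors and use that R^2
r2_x1  <- summary(lm(x1 ~ x2 + x3 + x4, data = df))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# Alternatively, for the whole model:
# library(car)
# vif(lm(y ~ x1 + x2 + x3 + x4, data = df))
```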

Wrong sign

Occasionally, the estimated regression coefficient on a particular variable will be the opposite of what was expected. There are a number of possible explanations.

  1. The ranges of some of the regressors are too small.

  2. Important variables were omitted from the model (omitted variable bias).

  3. Multicollinearity. Small changes to the particular sample data in a case with high multicollinearity can lead to large swings in the parameter estimates. The model has difficulty attributing the association with the dependent variable among the correlated regressors.

  4. Computational mistake

Extrapolation, standardization, and VIF in R
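The R demonstration from lecture is not reproduced in these notes. As a minimal sketch of the standardization step (the extrapolation and VIF checks are sketched above), assuming the same hypothetical data frame df:

```r
# Unit normal scaling: subtract each column's mean and divide by its sample SD
df_z  <- as.data.frame(scale(df))
fit_z <- lm(y ~ x1 + x2, data = df_z)   # intercept estimate is ~0 by construction

# Unit length scaling for a single column
unit_length <- function(x) (x - mean(x)) / sqrt(sum((x - mean(x))^2))
w1 <- unit_length(df$x1)
```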

Checking Assumptions

  1. \(Y\) and \(X\) are linearly correlated.

  2. \(E[\epsilon] = 0\).

  3. \(V(\epsilon) = \sigma^2\).

  4. \(\epsilon\) are uncorrelated.

  5. \(\epsilon\) are normally distributed.

§4.1 - Residual Analysis

Our estimated residuals are \(e_i = y_i - \hat{y}_i\). By construction, the residuals sum to zero, and we use them to estimate \(\sigma^2\),

\[MS_{Res} = \frac{1}{n-p} \sum\limits_{i=1}^n(e_i - \overline{e})^2 \]

To conduct residual analysis, we first consider several ways of scaling the residuals:

  (a) Standardized residuals. \(MS_{Res}\) is our estimate of the residual variance, so we can use it to standardize the residuals:

\[d_i = \frac{e_i}{\sqrt{MS_{Res}}} \]

  (b) Another option is ‘studentized’ residuals, with the idea being that if a data point is further from the centroid of the data, it receives a higher value of \(h_{ii}\), which increases the magnitude of the studentized residual:

\[r_i = \frac{e_i}{\sqrt{MS_{Res}(1-h_{ii})}} \]

  (c) PRESS residuals (see Appendix C.7 for more):

\[e_{(i)} = \frac{e_i}{1-h_{ii}} = y_i - \hat{y}_{(i)}\]

where \(\hat{y}_{(i)}\) is the fitted value of \(y\) using all the data except for the \(i\)th observation (leave-one-out).

  (d) R-student residuals

\[t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}} \hspace{1cm} S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}\]

He cut off (d) early.

Residual scaling in R

Residual type (textbook terminology) and the corresponding R command:

  • Standardized: rstandard(model1, infl = influence(model1)), after modifying influence(model1)$hat to equal zero
  • Studentized: rstandard(model1)
  • PRESS: rstandard(model1, type = "pred")
  • R-student: rstudent(model1)
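A sketch verifying these formulas against the built-ins, for a hypothetical fit model1 <- lm(y ~ x1 + x2, data = df):

```r
e  <- resid(model1)
h  <- hatvalues(model1)
MS <- sum(e^2) / df.residual(model1)   # MS_Res

d_i <- e / sqrt(MS)                    # standardized residuals
r_i <- e / sqrt(MS * (1 - h))          # studentized; matches rstandard(model1)
pr  <- e / (1 - h)                     # PRESS residuals; matches rstandard(model1, type = "pred")
t_i <- rstudent(model1)                # R-student (externally studentized)
```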

Residual plotting

  1. Normal probability plot (QQ Plot).

  2. Residual against fitted value plot.

  3. Residual against regressor plots.

8.1 - November 17, 2020

§4.1 - Residual Analysis

We should recall the assumptions that we have made in estimating our OLS models so far.

  • We have a linear relationship

  • The errors have zero mean

  • The errors have constant variance

  • The errors are uncorrelated

  • The errors are normally distributed

Those are some pretty heavy assumptions, so it is important to analyze the residuals that our model produces.

Methods of residual scaling:

  1. Standardized

  2. Studentized

  3. PRESS Residuals: \(e_{(i)} = \frac{e_i}{1-h_{ii}} = y_i - \hat{y}_{(i)}\)

  4. R-student (externally standardized)

\[t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}} \hspace{1cm} S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}\]

Residual plots

  1. QQ Plot (Normal probability plot):

  2. Residual vs \(\hat{Y}\)

  3. Residual vs \(X_i\)

  4. Partial regression and partial residual plots.

Partial Regression Plots

If we have a model with two regressors, \(x_1,x_2\), we may want to assess whether the partial relationship between \(y\) and \(x_1\) is correctly specified, i.e., whether a linear term in \(x_1\) is enough. To do this, we can use a restricted model that excludes \(x_1\):

\[\hat{y}_i(x_2) = \hat{\theta}_0 + \hat{\theta}_1x_{i2} \tag{M1}\]

We can compute the residuals from the restricted model, \(e_i(y|x_2) = y_i - \hat{y}_i(x_2)\).

Then we can consider a model relating our covariates:

\[\hat{x}_{i1}(x_2) = \hat{\alpha}_0 + \hat{\alpha}_1x_{i2} \tag{M2} \]

And the residuals \(e_i(x_1|x_2) = x_{i1} - \hat{x}_{i1}(x_2)\).

With these residuals in hand, we plot the residuals from M1 vs those from M2: \(e_i(y|x_2)\) vs \(e_i(x_1|x_2)\). If there is a linear relationship between these residuals, this indicates that a linear specification between \(y\) and \(x_1\) is sufficient in our full model, and that we do not need any transformed versions of \(x_1\), like polynomial or log-transformations.

If the plot is not linear, then we need to see why (Ch. 5).

Partial residual: \(e_i^*(y|x_1) = e_i + \hat{\beta}_1x_{i1}\). (Clarify this from textbook.)
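A sketch of the partial regression (added-variable) plot for \(x_1\), using hypothetical columns y, x1, x2 of a data frame df; avPlots() in the car package automates this:

```r
# M1: residuals of y after removing the effect of x2
e_y  <- resid(lm(y ~ x2, data = df))
# M2: residuals of x1 after removing the effect of x2
e_x1 <- resid(lm(x1 ~ x2, data = df))

# A straight-line pattern suggests a linear term in x1 is sufficient
plot(e_x1, e_y, xlab = "e(x1 | x2)", ylab = "e(y | x2)")
abline(lm(e_y ~ e_x1))   # this slope equals the coefficient on x1 in the full model
```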

§4.3 - PRESS Statistic

Each observation in a regression has its own influence on the parameter estimates. If we are interested in assessing how well our model may do at predicting new observations, we will get limited insight from looking at the residuals of the full model, since each observation affects the model fit, whereas new observations would by definition have no effect on the model fit.

It can be useful to use cross-validation on the dataset and see what the resulting residuals look like. The PRESS statistic uses leave-one-out cross validation by calculating

\[\mathrm{PRESS} = \sum\limits_{i=1}^n(y_i - \hat{y}_{(i)})^2 = \sum\limits_{i=1}^n\left(\frac{e_i}{1-h_{ii}}\right)^2\]

A lower PRESS statistic indicates that the model may give more precise predictions of new data points. The PRESS statistic can be useful for model comparison: a model with a lower PRESS statistic and lower \(R^2\) may be more useful than a model with a higher PRESS statistic and higher \(R^2\) if an objective is to predict new observations.
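A sketch of the PRESS computation for a hypothetical fit model1, using the identity above instead of actually refitting the model n times (the brute-force check assumes a simple formula with untransformed columns):

```r
e <- resid(model1)
h <- hatvalues(model1)
PRESS <- sum((e / (1 - h))^2)   # no refitting required

# Brute-force leave-one-out check
d <- model.frame(model1)
loo <- sapply(seq_len(nrow(d)), function(i) {
  fit_i <- lm(formula(model1), data = d[-i, ])
  d[[1]][i] - predict(fit_i, newdata = d[i, ])   # first column of the model frame is the response
})
all.equal(sum(loo^2), PRESS)
```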

Cross-Validation

Briefly, cross-validation is important to ensure that a model does not overfit the data. It is especially important in AI, deep/machine learning, and neural networks.

Example of the possibilities that open up under cross-validation: with 5-fold CV, we could either fit OLS on the training data in each of the five folds and combine the residuals, or change the objective function to minimize the combined residuals, which results in a different set of parameter estimates.

##           1           2           3           4           5           6 
## -1.69562881  0.35753764 -0.01572177  1.63916491 -0.13856493 -0.08873728 
##           7           8           9          10          11          12 
##  0.26464769  0.35938983  4.31078012  0.80677584  0.70993906 -0.18897451 
##          13          14          15          16          17          18 
##  0.31846924  0.33417725  0.20566324 -0.21782566  0.13492400  1.11933065 
##          19          20          21          22          23          24 
##  0.56981420 -1.99667657 -0.87308697 -1.48962473 -1.48246718 -1.54221512 
##          25 
## -0.06596332
##           1           2           3           4           5           6 
## -1.62767993  0.36484267 -0.01609165  1.57972040 -0.14176094 -0.09080847 
##           7           8           9          10          11          12 
##  0.27042496  0.36672118  3.21376278  0.81325432  0.71807970 -0.19325733 
##          13          14          15          16          17          18 
##  0.32517935  0.34113547  0.21029137 -0.22270023  0.13803929  1.11295196 
##          19          20          21          22          23          24 
##  0.57876634 -1.87354643 -0.87784258 -1.44999541 -1.44368977 -1.49605875 
##          25 
## -0.06750861

## [1] 459.0393

§4.4 - Outliers

§4.5 - Lack of Fit

We can formally test for lack of fit of a simple model if we have repeat observations at one or more values of the regressor. We can partition the residual sum of squares as

\[SS_{Res}= SS_{PE} + SS_{LOF} \]

Where PE is the pure error, and LOF is the error due to lack of fit.
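One hedged way to carry this out in R for a hypothetical simple regression of y on x with replicate x values in df: treating x as a factor gives the saturated (“pure error”) model, and comparing it to the straight-line fit isolates the lack-of-fit sum of squares.

```r
fit_line <- lm(y ~ x, data = df)           # SS_Res = SS_PE + SS_LOF
fit_pe   <- lm(y ~ factor(x), data = df)   # residual SS here is the pure error

anova(fit_line, fit_pe)                    # F test for lack of fit
```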

8.2 - November 19, 2020

Ch 5: Transformations & Weighting

In our linear model

\[ Y = X\beta + \epsilon\]

recall the assumptions we have required so far:

  • Errors are iid: \(\epsilon \sim N(0,\sigma^2)\),

  • The model is correctly specified.

In many cases, these assumptions are violated, so the data must be transformed to make OLS valid.

§5.1 - Transformation

Residual plots are a powerful tool to check whether the assumptions about the behavior of the errors are valid. If we see misbehaving residuals after fitting an OLS model, we may have issues, whether multicollinearity, heteroskedasticity, non-linear residuals, autocorrelation, etc.

§5.2 - Variance-stabilization

A common issue with residuals is to see a ‘fan’ pattern when plotting the residuals against the fitted values. This indicates that the variance of the residual is proportional to the fitted value. For example, this is what we would expect if the response variable were a Poisson random variable (where mean = variance). A common transformation in this case is to specify the model as

\[\sqrt{Y} = X\beta + \epsilon \]

Common transformations:

Issue → Transformation:

  • \(\sigma^2\propto c\): \(y' = y\)
  • \(\sigma^2 \propto E[y]\): \(y' = \sqrt{y}\)
  • \(\sigma^2 \propto E[y](1-E[y])\): \(y' = \sin^{-1}(\sqrt{y})\)
  • \(\sigma^2 \propto [E[y]]^2\): \(y' = \ln(y)\)
  • \(\sigma^2 \propto [E[y]]^3\): \(y' = y^{-1/2}\)
  • \(\sigma^2 \propto [E[y]]^4\): \(y' = y^{-1}\)

In practice,

§5.3 - Intrinsically Linear Models

In some cases, prior knowledge or theoretical models may suggest that a relationship between variables is nonlinear, but the relationship can be transformed into a linear form. For example, an exponential relationship of the form

\[y = \beta_0 e^{\beta_1 x}\epsilon \]

can be transformed into a linear relationship using a log-transformation:

\[\ln(y) = \ln(\beta_0) + \beta_1 x + \ln(\epsilon) \]

It is important to see that performing a transformation on the error will also transform its probability distribution. In this case, in order to fit OLS on the transformed model, we require that \(\ln(\epsilon) \sim N(0,\sigma^2)\). This means that \(\epsilon \sim \mathrm{Lognormal}(0, \sigma^2)\).
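A minimal sketch of fitting this intrinsically linear model by OLS on the log scale, assuming a hypothetical df with strictly positive y:

```r
fit_log <- lm(log(y) ~ x, data = df)   # ln(y) = ln(beta0) + beta1 * x + ln(eps)

beta1_hat <- coef(fit_log)["x"]        # beta1 is estimated directly
beta0_hat <- exp(coef(fit_log)[1])     # back-transform the intercept to recover beta0
```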

Some common nonlinear relationships

Nonlinear relationship → Linearizing transformation:

  • \(y = \beta_0x^{\beta_1}\): \(y' = \ln(y),\ x' = \ln(x)\)
  • \(y=\beta_0 e^{\beta_1 x}\): \(y' = \ln(y)\)
  • \(y = \beta_0 + \beta_1 \log(x)\): \(x' = \log(x)\)
  • \(y = \frac{x}{\beta_0 x - \beta_1}\): \(y' = y^{-1},\ x' = x^{-1}\)

§5.4 - Analytical methods for choosing a transformation

In cases where there is a theoretical model specification like in the table above, we can use that. But in the case where we are looking for an empirical non-linear specification, it is important to have an analytical procedure that gives an optimal functional specification relating \(y\) and \(x\).

Box-Cox method for response variable transformation

For example, say we have a relationship that appears to be \(y = x^\alpha\), but \(\alpha\) is unknown.

Box-Cox method: make a variable transformation on \(y\):

\[y^{(\lambda)} = \left\{\begin{array}{lr} \frac{y^\lambda - 1}{\lambda\dot{y}^{\lambda-1}},& \lambda ≠ 0 \\ \dot{y}\ln(y), & \lambda = 0 \end{array}\right. \hspace{1cm} \dot{y} = \exp\left[\frac{1}{n}\sum_{i=1}^n \ln(y_i)\right]\]

With this transformation, we can fit the model as

\[y^{(\lambda)} = X\beta + \epsilon \]

We then estimate \(\lambda\) using maximum likelihood by finding the value of \(\lambda\) that minimizes the residual sum of squares \(SS_{Res}(\lambda)\).
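In R, boxcox() in the MASS package profiles the log-likelihood over a grid of \(\lambda\), which is equivalent to minimizing \(SS_{Res}(\lambda)\); a sketch for a hypothetical model:

```r
library(MASS)

fit <- lm(y ~ x1 + x2, data = df)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))   # plots log-likelihood against lambda
lambda_hat <- bc$x[which.max(bc$y)]                 # lambda with the highest log-likelihood
```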

Transformation of regressor variables

Similarly, we can perform a transformation on regressors to improve the functional specification of the model. To see why this might be useful, imagine a situation where \(y\) is linearly related with several regressors, but appears to have a nonlinear relationship with another regressor. Transforming \(y\) using the Box-Cox method would disrupt the relationship with the first regressors, so it might be better to transform only the regressor(s) that appear to have a nonlinear relationship.

Let the baseline model be

\[y = \beta_0 + \beta_1 x_j + \epsilon \tag{1} \]

and suppose the true relationship involves \(x_j^\alpha\) for some unknown power \(\alpha\). A first-order expansion about \(\alpha = 1\) gives \(x_j^\alpha \approx x_j + (\alpha - 1)x_j\ln(x_j)\).

Then let the new model

\[Y = \beta_0^* + \beta_1^*x_j + \gamma\omega + \epsilon \hspace{1cm} \omega = x_j\ln(x_j) \tag{2}\]

Then, with \(\hat{\beta}_1\) from the baseline model (1) and \(\hat{\gamma}\) from (2),

\[\hat{\alpha} = \frac{\hat{\gamma}}{\hat{\beta}_1} + 1 \]

So we can define

\[x_j' = \left\{\begin{array}{lr} x_j^{\hat{\alpha}},& \alpha ≠ 0 \\ \ln(x_j), & \alpha = 0 \end{array}\right. \]
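A sketch of these steps for a single, strictly positive regressor x in a hypothetical df; boxTidwell() in the car package implements an iterated version of the same idea:

```r
# Step 1: baseline fit gives beta1_hat
fit1 <- lm(y ~ x, data = df)
b1   <- coef(fit1)["x"]

# Step 2: augment with omega = x * ln(x) to get gamma_hat
fit2 <- lm(y ~ x + I(x * log(x)), data = df)
gam  <- coef(fit2)["I(x * log(x))"]

# Step 3: estimated power and transformed regressor
alpha_hat <- unname(gam / b1 + 1)
df$x_new  <- if (abs(alpha_hat) < 1e-8) log(df$x) else df$x^alpha_hat
```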

9.1 - November 24, 2020

Box-Cox transformation, cont.

Transform the response variable in a way that minimizes \(SS_{Res}\).

Transformation on Regressors

  1. Fit the baseline model (1) to obtain \(\hat{\beta}_1\).

  2. Construct \(\omega = x_j\ln(x_j)\) and fit the augmented model (2) to obtain \(\hat{\gamma}\).

  3. Compute \(\hat{\alpha} = \hat{\gamma}/\hat{\beta}_1 + 1\) and transform the regressor to \(x_j' = x_j^{\hat{\alpha}}\) (or \(\ln(x_j)\) if \(\hat{\alpha} = 0\)).

§5.5 - Generalized Least Squares

These methods are used to deal with cases of nonconstant variance. Under our OLS assumptions, the errors have \(\mathrm{Var}(\epsilon) = \sigma^2\mathbf{I}\). Under GLS, we relax this assumption and let \(\mathrm{Var}(\epsilon) = \sigma^2V\), where \(V\) is an invertible, positive definite matrix.

There exists an \(n\times n\) invertible symmetric matrix \(K\) such that \(K'K = KK = V\). Let \(Z = K^{-1}Y\), \(B = K^{-1}X\), and \(g = K^{-1}\epsilon\). Our original model

\[Y = X\beta + \epsilon \]

now becomes

\[K^{-1}Y = K^{-1}X\beta + K^{-1}\epsilon \]

\[Z = B\beta + g \]

Then \(E[g] = E[K^{-1}\epsilon] = K^{-1}E[\epsilon] = 0\), and \(\mathrm{Var}(g) = \mathrm{Var}(K^{-1}\epsilon) = K^{-1}\sigma^2VK^{-1} = \sigma^2K^{-1}KKK^{-1} = \sigma^2\mathbf{I}\).

So the error term in our new model is \(g\). To calculate \(\hat{\beta}\), we are minimizing \(g'g = (Y-X\beta)'V^{-1}(Y-X\beta)\), which yields an estimate of

\[\hat\beta = (X'V^{-1}X)^{-1}X'V^{-1}Y \]

And \(\mathrm{Var}(\hat\beta) = \sigma^2(B'B)^{-1} = \sigma^2(X'V^{-1}X)^{-1}\). \(\sigma^2\) is estimated as usual, using the \(SS_{Res}\) from the transformed data.

But how do we get an estimate for \(V\)? Use the estimated error covariance matrix that we get under OLS.
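A minimal sketch of the GLS estimate itself, assuming the design matrix X, response y, and an estimate of V are already in hand:

```r
# beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y
V_inv    <- solve(V)
beta_gls <- solve(t(X) %*% V_inv %*% X, t(X) %*% V_inv %*% y)

var_beta <- solve(t(X) %*% V_inv %*% X)   # Var(beta_hat) up to the factor sigma^2
```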

Weighted Least Squares

This tool is used in the case where the errors have unequal variance, but are not correlated with each other. Ie, \(\mathrm{Var}(\epsilon) = \sigma^2V\), where \(V\) is a diagonal matrix with entries \(1/w_i\), \(i=1, ..., n\). Then \(W = V^{-1}\) is also diagonal with entries \(w_i\).

As in GLS, we need an estimate for \(W\).

If we ignore the heteroskedasticity issue, the \(\hat\beta\) estimates are still unbiased, but they are not efficient.
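In R, lm() takes a weights argument and minimizes \(\sum_i w_i e_i^2\); a sketch assuming (hypothetically) that the error variance is roughly proportional to x1:

```r
w       <- 1 / df$x1                                 # weights proportional to 1 / Var(eps_i)
fit_wls <- lm(y ~ x1 + x2, data = df, weights = w)
summary(fit_wls)
```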

10.1 - December 1, 2020

§5.5 - GLS & WLS

Recall that GLS and WLS arise as tools to deal with the situation where the errors in a linear model are not independent and identically distributed. Under OLS, \(\epsilon_i \sim N(0,\sigma^2)\).

WLS allows for a diagonal variance-covariance matrix of the residuals (implying that \(\mathrm{Cov}(\epsilon_i,\epsilon_j) = 0; \hspace{3mm} i ≠ j\)), while GLS allows for any symmetric variance-covariance matrix.

Under WLS, we can specify \(\mathrm{Var}(\epsilon)\) either as \(\sigma^2V\) or as \(V\). In the former case, the diagonal elements of \(V\) are the reciprocals of the weights; in the latter, they are the error variances.

Compound symmetry

\[V = \begin{bmatrix}1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1\end{bmatrix} \hspace{1cm} \sigma^2 V =\begin{bmatrix}\sigma^2_b + \sigma^2_\epsilon & \sigma^2_b & \cdots & \sigma^2_b \\ \sigma^2_b & \sigma^2_b + \sigma^2_\epsilon & \cdots & \sigma^2_b \\ \vdots & \vdots & \ddots & \vdots \\ \sigma^2_b & \sigma^2_b & \cdots & \sigma^2_b + \sigma^2_\epsilon\end{bmatrix}\hspace{1cm} \sigma^2 = \underbrace{\rho\sigma^2}_{\sigma^2_b} + \underbrace{(1-\rho)\sigma^2}_{\sigma^2_\epsilon}\]

AR(1) - First-Order Autoregressive

This structure is commonly reserved for time series analysis. The idea is that the errors are correlated with the errors of the observations that immediately preceded them. A result of this is that every error term is correlated with every other error term. A parameter to be estimated in such a model is \(\rho\),

\[V = \begin{bmatrix}1 & \rho & \cdots & \rho^{n-1} \\ \rho & 1 & \cdots & \rho^{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{n-1} & \rho^{n-2} & \cdots & 1\end{bmatrix}\]

If we have a temporal sample \(X_1, X_2, ..., X_n\), then generally \(\mathrm{Cov}(X_i,X_j) = \sigma^2\rho^{|j-i|}\).

One issue with time series models is the assumption that the observations are equally temporally spaced.

Spatial Power

\[V = \begin{bmatrix}1 & \rho^{\frac{|t_1-t_2|}{|t_1-t_2|}} & \cdots & \rho^{\frac{|t_1-t_n|}{|t_1-t_2|}} \\ \rho^{\frac{|t_2-t_1|}{|t_1-t_2|}} & 1 & \cdots & \rho^{\frac{|t_2-t_n|}{|t_1-t_2|}} \\ \vdots & \vdots & \ddots & \vdots \\ \rho^{\frac{|t_n-t_1|}{|t_1-t_2|}} & \rho^{\frac{|t_n-t_2|}{|t_1-t_2|}} & \cdots & 1\end{bmatrix}\]

Here the exponent grows with the time separation \(|t_i - t_j|\), so the correlation decays with distance; this generalizes AR(1) to unequally spaced observations.

Unstructured Covariance

In this case, the covariance matrix can take any form. No covariance structure is assumed. The \(V\) matrix is estimated using OLS and then applied to a GLS model.

\[V = \begin{bmatrix}\sigma_{11}^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22}^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}^2\end{bmatrix} \]
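These covariance structures are available through gls() in the nlme package; a hedged sketch, assuming a hypothetical data frame df with a grouping variable subject and an integer time index:

```r
library(nlme)

# Compound symmetry within each subject
fit_cs  <- gls(y ~ x1, data = df, correlation = corCompSymm(form = ~ 1 | subject))

# AR(1) errors indexed by time within subject
fit_ar1 <- gls(y ~ x1, data = df, correlation = corAR1(form = ~ time | subject))

# Continuous-time AR(1), which allows unequally spaced times (akin to spatial power)
fit_car <- gls(y ~ x1, data = df, correlation = corCAR1(form = ~ time | subject))

# Unstructured (general) correlation
fit_un  <- gls(y ~ x1, data = df, correlation = corSymm(form = ~ 1 | subject))
```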

Ch 6 - Diagnostics for Leverage and Influence

In Chapter 5, we dealt with variable transformations to improve the fit of a linear model. Failing to properly transform the data can result in certain observations ‘pulling’ the OLS parameter estimates as the model attempts to fit a linear model to a nonlinear relationship. In this chapter, we deal with the related issue of identifying and dealing with outlier data points in an otherwise properly transformed model.

Leverage

Say we have a simple linear regression with one point distant from the rest:

To assess the leverage of each data point, we can use the hat matrix, \(H = X(X'X)^{-1}X'\). The diagonal elements \(h_{ii}\) of the \(n\times n\) hat matrix represent the leverage of the \(i\)th observation.

Note that the average of the diagonal elements is \(\frac{p}{n}\):

\[\frac{1}{n} \sum\limits_{i=1}^nh_{ii} = \frac{1}{n} \mathrm{rank}(H) = \frac{1}{n}\mathrm{rank}(X) = \frac{p}{n} \]

A rule of thumb is that any observation \(i\) with \(h_{ii} > \frac{2p}{n}\) is considered a leverage point.

Cook’s Distance

Another measure of the influence of each observation \(i\) is

\[D_i = \frac{(\hat\beta_{(i)}-\hat\beta)'(X'X)(\hat\beta_{(i)}-\hat\beta)}{p\cdot MS_{Res}} \]

where \(\hat\beta_{(i)}\) is the parameter estimate with the \(i\)th observation omitted. The general form of the Cook’s distance formula is

\[D_i = \frac{(\hat\beta_{(i)}-\hat\beta)'M(\hat\beta_{(i)}-\hat\beta)}{c} \]

Where \(M\) is any \(k\times k\) matrix and \(c\) is any constant. GZ notes that using \(X'X\) for \(M\) (and \(p\cdot MS_{Res}\) for \(c\)) connects \(D_i\) to familiar test statistics like the \(F\) statistic. Using the first form, an observation \(i\) with \(D_i > 1\) is considered an influential point. Another way of writing \(D_i\) is

\[D_i = \frac{r_i^2}{p}\cdot \frac{\mathrm{Var}(\hat y_i)}{\mathrm{Var}(e_i)} = \frac{r_i^2}{p}\cdot \frac{h_{ii}}{1-h_{ii}} \]

where \(r_i\) is the studentized residual and \(h_{ii}/(1-h_{ii})\) is the distance from \(x_i\) to the centroid of the rest of the data.

DFFITS and DFBETAS

DFBETAS is a standardized measure of how much the \(\beta\) parameter estimates change when an observation is omitted.

\[\mathrm{DFBETAS}_{j,i} = \frac{\hat\beta_j - \hat\beta_{j,(i)}}{\sqrt{S_{(i)}^2\cdot c_{jj}}} \hspace{1cm} i = 1, 2, ..., n; j = 1, 2, ..., p \]

where \(c_{jj}\) is the \(j\)th diagonal element of \((X'X)^{-1}\), \(\hat\beta_{j,(i)}\) is the \(j\)th coefficient estimate computed with the \(i\)th observation omitted, and

\[S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}\]

where \(e_i\) is the plain residual. The cutoff point is \(2/\sqrt{n}\).

DFFITS is a standardized measure of how much the prediction \(\hat y\) changes when an observation is omitted.

\[\mathrm{DFFITS}_i = \frac{\hat y_i - \hat y_{(i)}}{\sqrt{S_{(i)}^2 h_{ii}}}\]

Cutoff is \(2\sqrt{p/n}\).
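A sketch of these diagnostics for a hypothetical fit model1; base R supplies all of them, and influence.measures() flags everything at once:

```r
h   <- hatvalues(model1)        # leverage
D   <- cooks.distance(model1)   # Cook's distance
db  <- dfbetas(model1)          # one column per coefficient
dft <- dffits(model1)

n <- nobs(model1)
p <- length(coef(model1))

which(h > 2 * p / n)                           # leverage points
which(D > 1)                                   # influential by Cook's distance
which(abs(db) > 2 / sqrt(n), arr.ind = TRUE)   # influential by DFBETAS
which(abs(dft) > 2 * sqrt(p / n))              # influential by DFFITS
```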