Generalized Linear Models (GLM)

Readings and References

Greene, W.H. (2012). Econometric Analysis, 7th ed. Prentice Hall.
Fair, R.C. (1978). "A Theory of Extramarital Affairs," Journal of Political Economy, 86(1), 45-61.

The generalized linear model (GLM) is a flexible generalization of ordinary least squares regression. OLS restricts the regression coefficients to have a constant effect on the dependent variable. GLM allows for this effect to vary along the range of the explanatory variables. In particular, a nonlinear function links the linear parameterization to the expected value of the random variable.

Let μ = E(Y) and η = Xβ. The basic structure of GLM is the link function g(μ) = η. Therefore, Y = g^(-1)(Xβ) + ε.

GLM is essentially a nonlinear model in which the linear parameterization enters the expected value of Y through the link function. To specify and estimate the model, one needs three components:

  1. Random component, f(ε) or f(Y), specifying the conditional distribution of the response variable given the explanatory variables X. Typically, this distribution is from the exponential family:

    Y ~               Range of Y    f(Y)                             E(Y)    Var(Y)
    Bernoulli(π)      0,1           π^Y (1-π)^(1-Y)                  π       π(1-π)
    Poisson(λ)        0,1,2,...     exp(-λ) λ^Y / Y!                 λ       λ
    Normal(μ,σ)       (-∞,∞)        1/√(2πσ²) exp[-(Y-μ)²/(2σ²)]     μ       σ²
    Gamma(λ,ρ)        [0,∞)         λ^ρ/Γ(ρ) exp(-λY) Y^(ρ-1)        ρ/λ     ρ/λ²
    Exponential(λ)    [0,∞)         λ exp(-λY)                       1/λ     1/λ²
    Inverse Normal    ...
    Inverse Gamma     ...
    ...

  2. A linear predictor which is a linear function of the regressors: η = β0 + β1X1 + ... + βkXk = Xβ
  3. A link function which transforms the expectation of the response to the linear predictor. In other words, the link function describes the relationship between the linear predictor and the mean of the distribution function. The link function must be invertible.

    The table below lists commonly used link functions and their inverses (a small code sketch follows this list):

    Link             η = g(μ)        μ = g^(-1)(η)
    Identity         μ               η
    Log              ln(μ)           exp(η)
    Inverse          μ^(-1)          η^(-1)
    Inverse-Square   μ^(-2)          η^(-1/2)
    Square Root      μ^(1/2)         η²
    Logit            ln[μ/(1-μ)]     Λ(η) = exp(η)/[1+exp(η)]
    Probit           Φ^(-1)(μ)       Φ(η)
    Log-log          -ln[-ln(μ)]     exp[-exp(-η)]
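
As a quick check (not part of the original notes), the following Python sketch encodes each link g and its inverse from the table and verifies the round trip g^(-1)(g(μ)) = μ numerically:

    # Link/inverse-link pairs from the table above; a minimal sketch.
    import numpy as np
    from scipy.stats import norm

    links = {
        "identity":       (lambda mu: mu,                    lambda eta: eta),
        "log":            (lambda mu: np.log(mu),            lambda eta: np.exp(eta)),
        "inverse":        (lambda mu: 1.0 / mu,              lambda eta: 1.0 / eta),
        "inverse-square": (lambda mu: mu ** -2.0,            lambda eta: eta ** -0.5),
        "square-root":    (lambda mu: np.sqrt(mu),           lambda eta: eta ** 2.0),
        "logit":          (lambda mu: np.log(mu / (1 - mu)), lambda eta: 1 / (1 + np.exp(-eta))),
        "probit":         (norm.ppf,                         norm.cdf),
        "log-log":        (lambda mu: -np.log(-np.log(mu)),  lambda eta: np.exp(-np.exp(-eta))),
    }

    mu = 0.37  # a value in the domain of every link above
    for name, (g, g_inv) in links.items():
        assert abs(g_inv(g(mu)) - mu) < 1e-10, name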

To estimate the coefficients of a GLM, we use the maximum likelihood method.

The model interpretation is typically based on the marginal effect defined by ∂E(Y)/∂X. From the definition of the link function in GLM, g(μ) = η or g(E(Y)) = Xβ, differentiation gives ∂g(E(Y))/∂X = g'(μ) ∂E(Y)/∂X = β, where g'(μ) = ∂g(μ)/∂μ. Therefore ∂E(Y)/∂X = β/g'(μ). For the identity link, g'(μ) = 1 and ∂E(Y)/∂X = β; for the log link, g'(μ) = 1/μ and ∂E(Y)/∂X = βE(Y).

GLM Examples

Given a sample of N observations (Yi,Xi), i=1,...,N, the log-likelihood function is defined for each GLM as follows:
Normal(μ,σ), Identity link μ = Xβ:
  ll(β,σ) = -(N/2)ln(2πσ²) - ∑i=1,...,N (Yi-Xiβ)²/(2σ²)
  θ = (β,σ). This is the linear regression model.

Normal(μ,σ), Log link ln(μ) = Xβ:
  ll(β,σ) = -(N/2)ln(2πσ²) - ∑i=1,...,N (Yi-exp(Xiβ))²/(2σ²)
  θ = (β,σ). Not a log-linear model.

Gamma(λ,ρ), Identity link ρ/λ = Xβ:
  ll(β,ρ) = N[ρ ln(ρ) - ln Γ(ρ)] + ∑i=1,...,N [(ρ-1)ln(Yi) - ρ ln(Xiβ) - ρYi/(Xiβ)]
  θ = (β,ρ).

Exponential(λ), Identity link 1/λ = Xβ:
  ll(β) = ∑i=1,...,N [-ln(Xiβ) - Yi/(Xiβ)]
  θ = β.

Exponential(λ), Inverse link 1/λ = 1/(Xβ):
  ll(β) = ∑i=1,...,N [ln(Xiβ) - YiXiβ]
  θ = β.

Poisson(λ), Identity link λ = Xβ:
  ll(β) = ∑i=1,...,N [-Xiβ + Yi ln(Xiβ) - ln(Yi!)]
  θ = β.

Poisson(λ), Log link ln(λ) = Xβ:
  ll(β) = ∑i=1,...,N [-exp(Xiβ) + Yi(Xiβ) - ln(Yi!)]
  θ = β.

Bernoulli(π), Logit link ln(π/(1-π)) = Xβ:
  ll(β) = ∑i=1,...,N [Yi ln(Λ(Xiβ)) + (1-Yi)ln(1-Λ(Xiβ))]
  θ = β. The Logit Model.

Bernoulli(π), Probit link Φ^(-1)(π) = Xβ:
  ll(β) = ∑i=1,...,N [Yi ln(Φ(Xiβ)) + (1-Yi)ln(1-Φ(Xiβ))]
  θ = β. The Probit Model.

...

Example 1: Income Earning Equation

Using 20 observations of the hypothetical data series INCOME and EDUCATION of Greene (Table FC.1) in YED20.TXT, we can estimate the generalized linear model (GLM) of the INCOME-EDUCATION relationship based on a probability distribution from the exponential family (e.g., normal, gamma, etc.) with a link function (e.g., identity, log, inverse, etc.). Derive the corresponding log-likelihood function for the model and estimate the parameters by maximizing the log-likelihood function.
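
A minimal Python sketch of this exercise follows. Since YED20.TXT is not reproduced here, the INCOME and EDUCATION arrays below are placeholders to be replaced with Greene's data; the normal family with identity link is estimated by maximizing the corresponding log-likelihood from the table above:

    # Hedged sketch: replace the placeholder arrays with the YED20.TXT series.
    import numpy as np
    from scipy.optimize import minimize

    education = np.array([6, 8, 8, 10, 12, 12, 14, 14, 16, 16], dtype=float)     # placeholder
    income    = np.array([15, 18, 17, 22, 25, 24, 28, 30, 35, 33], dtype=float)  # placeholder
    X = np.column_stack([np.ones_like(education), education])
    N = len(income)

    def negll(theta):
        beta, log_sigma = theta[:-1], theta[-1]   # estimate log(σ) to keep σ > 0
        sigma2 = np.exp(2 * log_sigma)
        resid = income - X @ beta
        return 0.5 * N * np.log(2 * np.pi * sigma2) + (resid @ resid) / (2 * sigma2)

    res = minimize(negll, x0=np.zeros(3), method="BFGS")
    print("beta:", res.x[:-1], "sigma:", np.exp(res.x[-1]))
    # With the identity link, the β estimates reproduce OLS.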

Example 2: Binary Choice Models

This example (see also Greene [2012], Example 17.3) examines the effect of a new teaching method on students' grades. We consider the following qualitative regression (or binary choice) model:

GRADE = β0 + β1GPA + β2TUCE + β3PSI + ε

The following variables are available in the data file GRADE.TXT:

GRADE = 1 if the student's grade improved, 0 otherwise.
GPA = Grade point average.
TUCE = Score on a pretest (Test of Understanding of College Economics).
PSI = 1 if exposed to the new teaching method (Personalized System of Instruction), 0 otherwise.

Use the maximum likelihood estimation method to represent and estimate the generalized linear model of the Bernoulli (binomial) distribution with logit and probit links, respectively. Explain the estimated marginal effects of the new teaching method on students' grade performance.
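
One possible implementation in Python with statsmodels is sketched below; the layout of GRADE.TXT (whitespace-delimited with named columns GRADE, GPA, TUCE, PSI) is an assumption, so adjust the read step to the actual file:

    # Logit and probit fits with marginal effects f(xbar'β)β at the sample means.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from scipy.stats import norm

    data = pd.read_csv("GRADE.TXT", sep=r"\s+")        # assumed file layout
    X = sm.add_constant(data[["GPA", "TUCE", "PSI"]])
    y = data["GRADE"]

    logit  = sm.Logit(y, X).fit()
    probit = sm.Probit(y, X).fit()

    xbar = X.mean().values
    Lam = 1 / (1 + np.exp(-(xbar @ logit.params.values)))
    print("logit ME: ", Lam * (1 - Lam) * logit.params.values)
    print("probit ME:", norm.pdf(xbar @ probit.params.values) * probit.params.values)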

Example 3: Count Data and Poisson Regression Model

This example is taken from Greene (2012, Example 18.9), which is based on Fair (1978). This study examines the qualitative responses to a question about extramarital affairs from a sample of 601 men and women married for the first time. The dependent variable is:

Y = Number of affairs in the past year: 0, 1, 2, 3, 4-10 (coded as 7), 11 or more (coded as 12).

Here, we present only the model using five explanatory variables as follows:

Z2 = Age.
Z3 = Number of years married.
Z5 = Degree of religiousness: 1 (anti-religious), ..., 5 (very religious).
Z7 = Hollingshead scale of occupation: 1, ..., 7.
Z8 = Self-rating of marriage satisfaction: 1 (very unhappy), ..., 5 (very happy).

The regression equation is:

Y = β0 + β2Z2 + β3Z3 + β5Z5 + β7Z7 + β8Z8 + ε

Examining the data on extramarital affairs, the preponderance of zeros (no affairs) in the dependent variable suggests a Poisson distribution with a log or identity link for the study. Further, given the potential problem of heterogeneity in the count data, a modified Poisson model or the negative binomial regression model may be a better modeling framework.

Note that this model may be estimated with a probit or logit specification if the dependent variable Y is modified as:

Y = 0 if no extramarital affair
1 otherwise (e.g., 1,2,3,7,12)

Formulate, estimate, and compare the probit (or logit), Poisson, and negative binomial regression models, respectively.
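
The comparison might be set up as follows in Python; the file name AFFAIRS.TXT and the column names are illustrative assumptions, not part of the original data description:

    # Probit (on Y > 0), Poisson, and negative binomial (NB2) fits.
    import pandas as pd
    import statsmodels.api as sm

    data = pd.read_csv("AFFAIRS.TXT", sep=r"\s+")      # hypothetical file name
    X = sm.add_constant(data[["Z2", "Z3", "Z5", "Z7", "Z8"]])

    poisson = sm.Poisson(data["Y"], X).fit()
    negbin  = sm.NegativeBinomial(data["Y"], X).fit()  # NB2: Var = λ(1+αλ), α = 1/θ
    probit  = sm.Probit((data["Y"] > 0).astype(int), X).fit()

    # An estimated dispersion parameter 'alpha' near zero supports the Poisson
    # restriction 1/θ -> 0.
    print(poisson.params, negbin.params, probit.params, sep="\n\n")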


Binary Choice Models

Consider a linear regression model Y = Xβ + ε, where

Yi = 1 with probability Pi
0 with probability 1-Pi

It is clear that Xi explains the probability that Yi is 1 or 0. Let

Pi = Prob(Yi=1|Xi) = F(Xiβ)
1-Pi = Prob(Yi=0|Xi) = 1-F(Xiβ)

Since E(Yi|Xi) = (1)F(Xiβ) + (0)(1-F(Xiβ)) = F(Xiβ), the estimated model may be interpreted with the marginal effects defined by

∂E(Yi|Xi)/∂Xi = [∂F(Xiβ)/∂(Xiβ)] β

Given a sample of N independent observations, the likelihood function is

L(β) = ∏i=1,2,...,N Pi^Yi (1-Pi)^(1-Yi) = ∏i=1,2,...,N F(Xiβ)^Yi (1-F(Xiβ))^(1-Yi)

Then the log-likelihood function is

ll(β) = ln(L(β)) = ∑i=1,2,...,N (Yi lnF(Xiβ) + (1-Yi) ln(1-F(Xiβ)))

To maximize ll(β) with respect to β, we solve the first order condition:

∂ll(β)/∂β = ∑i=1,2,...,N (Yi/Fi - (1-Yi)/(1-Fi)) fiXi
= ∑i=1,2,...,N (Yi-Fi)/(Fi(1-Fi)) fiXi = 0

where Fi = F(Xiβ) and fi = f(Xiβ) = ∂Fi/∂(Xiβ). Note that fiXi = ∂Fi/∂β.

Finally, the Hessian ∂²ll(β)/∂β∂β' must be negative definite, and the estimated variance-covariance matrix of β is Var(β) = [-E(∂²ll(β)/∂β∂β')]^(-1).
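
A generic Python sketch of this setup, on synthetic data, writes the log-likelihood and the score implied by the first order condition for an arbitrary CDF F with density f (plugging in the standard normal yields the probit model discussed below):

    # Minimal sketch: generic binary choice ML for any CDF F and density f.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def negll(beta, y, X, F):
        Fi = np.clip(F(X @ beta), 1e-10, 1 - 1e-10)    # guard the logs
        return -np.sum(y * np.log(Fi) + (1 - y) * np.log(1 - Fi))

    def score(beta, y, X, F, f):
        Fi, fi = F(X @ beta), f(X @ beta)              # Σ (Yi-Fi)/(Fi(1-Fi)) fi Xi
        return X.T @ ((y - Fi) / (Fi * (1 - Fi)) * fi)

    rng = np.random.default_rng(0)                     # synthetic illustration
    X = np.column_stack([np.ones(200), rng.normal(size=200)])
    y = (X @ [0.5, 1.0] + rng.normal(size=200) > 0).astype(float)
    res = minimize(negll, np.zeros(2), args=(y, X, norm.cdf), method="BFGS",
                   jac=lambda b, y, X, F: -score(b, y, X, F, norm.pdf))
    print(res.x)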

Linear Probability Model

Pi = F(Xiβ) = Xiβ

It follows immediately that E(Yi|Xi) = Xiβ. In particular,

E(εi) = (1-Xiβ)Pi + (-Xiβ)(1-Pi) = Pi - Xiβ = 0
Var(εi) = E(εi²) = Pi(1-Xiβ)² + (1-Pi)(-Xiβ)²
= Pi(1-Pi)² + (1-Pi)Pi² = Pi(1-Pi) = (Xiβ)(1-Xiβ)

The range of Var(εi) is between 0 and 0.25 and it is clearly heteroscedastic. Furthermore, since E(Yi|Xi) = F(Xiβ) = Xiβ, a linear function, there is no guarantee that the estimated probability will lie within the unit interval.

Probit Model

Pi = F(Xiβ) = ∫(-∞,Xiβ) 1/√(2π) exp(-z²/2) dz

Pi, the value of the cumulative normal distribution at Xiβ, is called the Probit for the i-th observation. The model Yi = F^(-1)(Pi) + εi is called the Probit Model, where F^(-1)(Pi) = Xiβ is the inverse of the cumulative distribution F(Xiβ). The probit model can be derived from a model involving an unobserved, or latent, variable Yi* such that Yi* = Xiβ + εi, where εi ~ Normal(0,1). Suppose the value of the observed binary variable Yi depends on the sign of Yi*:

Yi = 1 if Yi* > 0
0 if Yi* ≤ 0

Therefore,

Pi = Prob(Yi=1|Xi) = Prob(Yi*>0|Xi) = Prob(εi>-Xiβ)
= ∫(-Xiβ,∞) 1/√(2π) exp(-z²/2) dz
= ∫(-∞,Xiβ) 1/√(2π) exp(-z²/2) dz    (by symmetry of the standard normal density)

For maximum likelihood estimation, we solve the following first order condition:

∑i=1,2,...,N (Yi-Fi)/(Fi(1-Fi)) fiXi = 0

where Fi = F(Xiβ) = ∫(-∞,Xiβ) 1/√(2π) exp(-z²/2) dz, and
fi = ∂F(Xiβ)/∂(Xiβ) = 1/√(2π) exp(-(Xiβ)²/2)

These are exactly the first order conditions for weighted least squares estimation of the nonlinear regression model Yi = F(Xiβ) + εi, with weights given by 1/[F(Xiβ)(1-F(Xiβ))].

Furthermore, it can be shown that for the maximum likelihood estimate of β

E[∂²ll(β)/∂β∂β'] = -∑i=1,2,...,N (fi²XiXi')/(Fi(1-Fi))

which is negative definite. The estimated variance-covariance matrix of β is computed as

Var(β) = (-E[∂²ll(β)/∂β∂β'])^(-1)

If the normal probability model is misspecified, then Quasi-Maximum Likelihood (QML) estimation is suggested, correcting the asymptotic variance-covariance matrix with a robust ("sandwich") estimator as follows:

Var(β) = (-H)^(-1) G (-H)^(-1)

where H = E[∂²ll(β)/∂β∂β'] and G = E([∂ll(β)/∂β][∂ll(β)/∂β']).

For model interpretation, the marginal effects of Xi are defined as

∂E(Yi|Xi)/∂Xi = [∂F(Xiβ)/∂(Xiβ)] β = f(Xiβ)β = fiβ
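
The QML sandwich variance can be computed from the sample analogues of H and G above; the Python sketch below uses synthetic data for illustration:

    # Probit fit, sandwich variance (-H)^(-1) G (-H)^(-1) = H^(-1) G H^(-1),
    # and average marginal effects fi*β. A sketch, not a library routine.
    import numpy as np
    from scipy.stats import norm
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = (X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=200) > 0).astype(float)

    b = np.asarray(sm.Probit(y, X).fit().params)
    Fi, fi = norm.cdf(X @ b), norm.pdf(X @ b)

    H = -(X * (fi**2 / (Fi * (1 - Fi)))[:, None]).T @ X   # sample analogue of H
    g = ((y - Fi) / (Fi * (1 - Fi)) * fi)[:, None] * X    # per-observation scores
    Hinv = np.linalg.inv(H)
    sandwich = Hinv @ (g.T @ g) @ Hinv
    print("QML std. errors:", np.sqrt(np.diag(sandwich)))
    print("avg. marginal effects:", fi.mean() * b)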

Logit Model

Pi = F(Xiβ) = 1/(1+exp(-Xiβ))

Pi as defined is the logistic curve. The model Yi = F^(-1)(Pi) + εi is called the Logit Model. The logit model is most easily derived by assuming that the logarithm of the odds equals Xiβ, i.e., the log-odds model ln(Pi/(1-Pi)) = Xiβ. Solving for Pi, we find that

Pi = exp(Xiβ)/(1+exp(Xiβ)) = 1/(1+exp(-Xiβ))

For maximum likelihood estimation, we solve the following first order condition:

∑i=1,2,...,N (Yi-Fi)/(Fi(1-Fi)) fiXi = 0

Because of the logistic functional form,

Fi = F(Xiβ) = 1/(1+exp(-Xiβ)) and
fi = ∂F(Xiβ)/∂(Xiβ) = exp(-Xiβ)/[1+exp(-Xiβ)]² = Fi(1-Fi)

the first order condition reduces to the simple expression:

∑i=1,2,...,N (Yi-Fi)Xi = 0

with the negative definite Hessian:

∂²ll(β)/∂β∂β' = -∑i=1,2,...,N Fi(1-Fi)Xi'Xi

Therefore, the estimate of the variance-covariance matrix of β is

Var(β) = [-∂²ll(β)/∂β∂β']^(-1)

For model interpretation, the marginal effects of Xi are defined as

∂E(Yi|Xi)/∂Xi = [∂F(Xiβ)/∂(Xiβ)] β = fiβ = Fi(1-Fi)β
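
Because the score and Hessian take these simple forms, Newton's method is easy to code directly; a minimal Python sketch on synthetic data:

    # Newton iteration for the logit model: score Σ(Yi-Fi)Xi and
    # Hessian -Σ Fi(1-Fi) Xi Xi'. Synthetic data for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
    y = (rng.random(500) < 1 / (1 + np.exp(-(X @ [0.3, 1.0, -0.5])))).astype(float)

    beta = np.zeros(3)
    for _ in range(25):
        F = 1 / (1 + np.exp(-(X @ beta)))
        score = X.T @ (y - F)
        H = -(X * (F * (1 - F))[:, None]).T @ X
        step = np.linalg.solve(H, score)               # β ← β - H⁻¹·score
        beta -= step
        if np.max(np.abs(step)) < 1e-10:
            break

    var_beta = np.linalg.inv(-H)                       # [-∂²ll/∂β∂β']⁻¹
    print(beta, np.sqrt(np.diag(var_beta)))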


Count Data and Poisson Regression Model

If a decision variable takes nonnegative integer values, with no prior upper bound and typically some zeros, we have a model of count data.

Suppose Y ∈ {0,1,2,...} follows a Poisson distribution with parameter λ > 0:

f(Y|λ) = e^(-λ) λ^Y / Y!

It is known that E(Y) = Var(Y) = λ. If Y is to be explained by X such that E(Y|X) > 0, a natural approach is to set λ = E(Y|X), parameterized by the regression parameter β in the Poisson distribution function. For example,

λ(X,β) = E(Y|X,β) = e^(Xβ) > 0

Therefore, given a sample of independent observations {(Yi,Xi), i=1,2,...,N}, the likelihood function is written as:

L(β) = ∏i=1,2,...,N [e^(-λ(Xi,β)) λ(Xi,β)^Yi / Yi!]

The corresponding log-likelihood function is

ll(β) = ∑i=1,2,...,N Yi ln(λ(Xi,β)) - ∑i=1,2,...,N λ(Xi,β) - ∑i=1,2,...,N ln(Yi!)

The maximum likelihood estimate of β is obtained from:

∂ll/∂β = ∑i=1,2,...,N [(Yi-λ(Xi,β))/λ(Xi,β)] [∂λ(Xi,β)/∂β] = 0, and
∂²ll/∂β∂β' = ∑i=1,2,...,N [(Yi-λ(Xi,β))/λ(Xi,β)] [∂²λ(Xi,β)/∂β∂β']
+ ∑i=1,2,...,N [-Yi/λ(Xi,β)²] [∂λ(Xi,β)/∂β] [∂λ(Xi,β)/∂β']
is negative definite.

If λ(Xi,β) = E(Yi|Xi,β) = e^(Xiβ), the model is interpreted as:

∂E(Yi|Xi,β)/∂Xij = e^(Xiβ)βj = E(Yi|Xi,β)βj, or

βj = [∂E(Yi|Xi,β)/∂Xij] / E(Yi|Xi,β)
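
With λ(Xi,β) = e^(Xiβ), ∂λ(Xi,β)/∂β = λ(Xi,β)Xi, so the score and Hessian above simplify to ∑(Yi-λi)Xi and -∑λiXiXi'; a minimal Python sketch of the resulting Newton iteration, on synthetic data:

    # Newton iteration for Poisson regression with a log link.
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(400), rng.normal(size=400)])
    y = rng.poisson(np.exp(X @ [0.2, 0.5]))            # synthetic counts

    beta = np.zeros(2)
    for _ in range(25):
        lam = np.exp(X @ beta)
        step = np.linalg.solve(-(X * lam[:, None]).T @ X, X.T @ (y - lam))
        beta -= step
        if np.max(np.abs(step)) < 1e-10:
            break

    # βj is a semi-elasticity: the proportional change in E(Y|X) per unit Xj.
    print(beta)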

Heterogeneity and Negative Binomial Regression Model

Maximum likelihood estimation of the Poisson regression model may suffer from the problem of overdispersion, because the Poisson distribution restricts Var(Y|X) = E(Y|X). We generalize the Poisson model by introducing an individual unobservable effect v > 0 into the conditional mean:

E(Y|X,β,v) = λv = e^(Xβ)v

Then Y follows a Poisson distribution with the density:

f(Y|λv) = e^(-λv)(λv)^Y / Y!

Suppose v follows a gamma distribution with E(v) = 1 and Var(v) = 1/θ. That is,

g(v|θ) = θ^θ/Γ(θ) v^(θ-1) e^(-θv)

Therefore,

f(Y|λ,θ) = ∫(0,∞) [e^(-λv)(λv)^Y/Y!] g(v|θ) dv
= (θ^θ λ^Y)/(Γ(θ)Y!) ∫(0,∞) e^(-(λ+θ)v) v^(Y+θ-1) dv
= [(θ^θ λ^Y)/(Γ(θ)Y!)] [Γ(Y+θ)/(λ+θ)^(Y+θ)]
= [Γ(Y+θ)/(Γ(θ)Y!)] [λ/(λ+θ)]^Y [1-λ/(λ+θ)]^θ

This is one form of the negative binomial distribution, with mean λ and variance λ(1+λ/θ). By construction, it is a Poisson-Gamma mixture. Typically, the parameter 1/θ is used to measure the extent of overdispersion. Given a sample of independent observations {(Yi,Xi), i=1,2,...,N}, and letting λi = λ(Xi,β) = e^(Xiβ), the log-likelihood function is written as:

ll(β,θ) = ∑i=1,2,...,N ln f(Yi|λi,θ)

The negative binomial model can be estimated by maximum likelihood without much difficulty. A test of the Poisson restriction (no overdispersion) is often carried out by testing the hypothesis 1/θ → 0.
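
The log-likelihood above can also be maximized directly; the Python sketch below (synthetic data) parameterizes θ through ln(θ) and reports the overdispersion measure 1/θ:

    # Direct ML for the negative binomial (Poisson-Gamma mixture) model.
    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(400), rng.normal(size=400)])
    v = rng.gamma(1.5, 1 / 1.5, 400)                   # E(v)=1, Var(v)=1/θ with θ=1.5
    y = rng.poisson(np.exp(X @ [0.2, 0.5]) * v)

    def negll(params):
        beta, theta = params[:-1], np.exp(params[-1])  # ln(θ) keeps θ > 0
        lam = np.exp(X @ beta)
        ll = (gammaln(y + theta) - gammaln(theta) - gammaln(y + 1)
              + y * np.log(lam / (lam + theta)) + theta * np.log(theta / (lam + theta)))
        return -ll.sum()

    res = minimize(negll, x0=np.zeros(3), method="BFGS")
    print("beta:", res.x[:-1], "overdispersion 1/theta:", 1 / np.exp(res.x[-1]))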


Copyright© Kuan-Pin Lin
Last updated: 10/24/2016