f(x) = f(x0) + g(x0) (x-x0)
Similarly, the second-order (quadratic) Taylor approximation of f at x0 is
f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0)
max f(x) | min f(x) |
---|---|
x0 is the maximum: | x0 is the minimum: |
f(x0) ≥ f(x) for all x g(x0) = 0 H(x0) is negative definite | f(x0) ≤ f(x) for all x g(x0) = 0 H(x0) is positive definite |
From the first-order approximation of f at x0, we have
f(x) = f(x0) + g(x0) (x-x0), or
f(x0+Δx) - f(x0) = g(x0) Δx,
where Δx = x-x0. Δx is called the step of optimization.
If x0 is not the optimum (maximum or minimum), then the function value may be improved by taking the step Δx in accordance with the direction of gradient g(x0).
max f(x) | min f(x) |
---|---|
find Δx such that f(x0+Δx) - f(x0) > 0: | find Δx such that f(x0+Δx) - f(x0) < 0: |
from the 1st-order approximation, g(x0) Δx > 0, i.e. if g(x0) > 0, Δx > 0; if g(x0) < 0, Δx < 0 | from the 1st-order approximation, g(x0) Δx < 0, i.e. if g(x0) > 0, Δx < 0; if g(x0) < 0, Δx > 0 |
g(x0)+Δx' H(x0) = 0, or Δx = -H(x0)-1 g(x0)'.
Here the optimal step = -H(x0)-1 g(x0)' is the so-called Newton Step. Note that H(x0) is a negative definite (positive definite) matrix when x0 is near the maximum (minimum) x.
max f(x) | min f(x) |
---|---|
Δx = s d | Δ = s d |
s > 0 d = M g' M is positive definite | s > 0 d = -M g' M is positive definite |
Cauchy type direction is obtained by setting M = I, the identity matrix. Newton type direction is obtained by setting M = -H-1. Note that the Hessian H is negative definite for maximization and positive definite for minimization.
In searching for an optimum, the following typical iterative procedure is used:
for the i+1-th iteration, xi+1 = xi + si di, where di = ± Mi gi' with Mi being positive definite.
Given the initial value x0 and step direction d0, determine a step size s0. Then compute x1 = x0 + s0 d0. Here d0 is method dependent. The iteration looks like this:
specify compute check (yes) x0 ---> s0 ---> x1 = x0 + s0 d0 ---> convergence ---> stop | | | (no) | | | d0 s1 <------------------- compute d1
f(x) = f(x0) + g(x0) (x-x0) + ½ (x-x0)' H(x0) (x-x0) or
f(x0+s d) = f(x0) + g(x0) s d + ½ s2 d' H(x0) d.
Maximize (or minimize) the above function with respect to s, the first-order condition is
g(x0) d + s d' H(x0) d = 0.
Then the optimal step size is s = -(g(x0) d)/(d' H(x0) d). For Cauchy type direction, the optimal step size is
s = -(g g')/(g H g').
For Newton type direction, the optimal step size is exactly 1.
In practice, the optimal step size is rarely useful. Here a simple bi-directional linear search method is considered instead:
Repeat s = 1.1 s until no further improvement on f(x).
Repeat s = s/2 until an improvement of f(x) is found. If f(x) can not be improved during this contraction process, f(x) = f(x0) is already an optimum or a saddle point.
Optimization Method | Matrix M |
---|---|
Steepest ascent (descent) method | I |
BFGS method | see (2) below |
DFP method | see (2) below |
Newton-Raphson method | -H-1 |
Greenstadt method | (∑(|wi| vi vi'))-1 or ∑(|1/wi| vi vi'), see (1) below |
Quadratic-hill climbing | -(H-r I)-1, see (1) below |
Modified quadratic-hill climbing | -(H-r A)-1, see(1) below |
For the Greenstadt method, let wi and vi be the i-th eigenvalue and its corresponding eigenvector of a positive definite matrix. With all wi > 0, the matrix is ∑ wi vi vi'. Its inverse is ∑ (1/wi) vi vi' which is also positive definite. If some of wi is negative, use |wi| to replace wi.
For the quadratic-hill climbing methods, r ≥ 0 forces H - rI or H - rA to be negative definite (for maximization). Indeed, r = 0 if H is already negative definite, and r is greater than the largest eigenvalue of H otherwise. For the modified quadratic-hill climbing method, A is the transformation matrix for optimal search over an ellipsoidal region. The computation of Hessian and its eigenvalues and eigenvectors may be expensive.
The idea is to approximate H-1 during iterations: let N be the matrix approximating H-1, Δxi = xi+1 - xi, and Δgi = gi+1 - gi, then the Broyden (rank one correction) method is to compute Ni+1 from Ni according to
Ni+1 | = Ni + |
|
The formula of Ni+1 for DFP (rank two correction) method is
Ni+1 | = Ni + |
| - |
|
The formula of Ni+1 for BFGS (rank two correction) method is
Ni+1 | = Ni |
| 1 + |
|
|
|
| - |
|
∏i=1,2,...,N f(Xi,θ),
or equivalently in log form:
ll(θ) = ∑i=1,2,...,N ln f(Xi,θ)
The problem is to maximize the log-likelihood function ll(θ) so that the solution θ characterizes the probability distribution of the random variable X under consideration. To find the θ that maximizes ll(θ) is the essence of maximum likelihood estimation. The corresponding variance-covariance matrix of θ is derived from the information matrix (negatives of the expected values of the second derivatives) of the log-likelihood function as follows:
Var(θ) = [-E(∂2ll/∂θ∂θ')]-1
f(X,θ) = 1/√(2πσ2) exp [-(X-μ)2/(2σ2)]
Where θ = (μ,σ2) represents the distribution parameters. It is straightforward to show that the maximum likelihood solution is μ = E(X) = 1/N ∑i=1,...,N Xi, the sample mean; and σ2 = Var(X) = 1/N ∑i=1,...,N (Xi-μ)2, the sample variance. The corresponding log-likelihood function is:
ll(θ) = ∑i=1,2,...,N ln f(Xi,θ) = - Nln(2πσ2) - 1/2 ∑i=1,...,N(Xi-μ)2/σ2
f(X,θ) = 1/√(2πσ2) (1/X) exp [(ln(X)-μ)2/(2σ2)]
With the solution μ = 1/N ∑i=1,...,Nln(Xi) and σ2 = 1/N ∑i=1,...,N(ln(Xi)-μ)2, the corresponding mean and variance of X is E(X) = exp(μ+σ2/2) and Var(X) = exp(2μ+σ2)[exp(σ2)-1], respectively. Many economic variables are described with a log-normal instead of a normal probability distribution.
f(X,θ) = λρ/Γ(ρ) e-λX X ρ-1
where θ = (λ, ρ) is the parameter vector with λ > 0 and ρ > 0. The mean of X is ρ/λ, and the variance is ρ/λ2. Many familiar distributions, such as the exponential (ρ=1) and Chi-square distributions (ρ=N/2, λ=1/2), are special cases of the gamma distribution.
As with the normal distribution, the technique of maximum likelihood can be used to estimate the parameters of the gamma distribution. Sampling from N independent observations from the gamma distribution, the log-likelihood function is:
ll(θ) = N [ρln(λ) - lnΓ(ρ)] - λ ∑i=1,2,...,N Xi + (ρ-1) ∑i=1,2,...,N ln(Xi).
f(Y,θ) = λρ/Γ(ρ) e-λ/Y Y-(ρ+1)
where θ = (λ, ρ) is the parameter vector with ρ > 0 and λ >2.
f(X,θ) = √(λ/(2πX3)) exp [-λ(X-μ)2/(2μ2X)]
Where θ = (μ,λ) is the vector of distribution parameters for X>0, λ>0, and μ>0. The maximum likelihood estimator of μ and λ is 1/N ∑i=1,...,NXi and 1/N ∑i=1,...,N1/Xi. The mean and variance of X is E(X) = μ and Var(X) =μ3/λ, respectively.
With the normal, log-normal, and gamma (or exponential) probability distributions, the characteristics of the random variable X may be described in terms of the estimated mean and variance for each probability distribution as follows:
Normal Distribution | Log-Normal Distribution | Gamma Distribution | Exponential Distribution | |
Mean E(X) | μ | exp(μ+σ2/2) | ρ/λ | 1/λ |
Variance Var(X) | σ2 | exp(2μ+σ2)[exp(σ2)-1] | ρ/λ2 | 1/λ2 |
Where: | μ = 1/N ∑i=1,...,NXi σ2 = 1/N ∑i=1,...,N(Xi-μ)2 | μ = 1/N ∑i=1,...,Nln(Xi) σ2 = 1/N ∑i=1,...,N(ln(Xi)-μ)2 |
Based on 20 observations of a hypothetical data series INCOME of Greene [Table C.1] or YED20.TXT, estimate its mean and variance under the assumption of three probability distributions (normal, log-normal, and gamma). First we need to define and represent the log-likelihood function for each of probability distributions. Then we estimate the parameters of each probability distribution by maximizing the corresponding log-likelihood function respectively.
f1(X,μ1,σ1) = 1/√(2πσ12)
exp [(X-μ1)2/2σ12] and
f2(X,μ2,σ2) = 1/√(2πσ22)
exp [(X-μ2)2/2σ22]
Then the likelihood function is
f(X,θ) = λ f1(X,μ1,σ1) + (1-λ) f2(X,μ2,σ2)
where λ is the probability that an observation is drawn from the first distribution f1(X,μ1,σ1), and 1-λ is the probability of that drawn from the second distribution. θ = (μ1,μ2,σ1,σ2,λ)' is the unknown parameter vector need to be estimated.
Continue from the previous example. Suppose each observation of the variable INCOME is drawn from one of two different normal distributions. Formulate and maximize the log-likelihood function for solving the parameters of two normal distributions.
The method of maximum likelihood estimation is to find the parameter vector θ = (α,β,σ2) so that the following log-likelihood function
ll(θ) = ∑i=1,2,...,N ln f(εi,θ) | = - Nln(2πσ2) - 1/2 ∑i=1,...,Nεi2/σ2 |
= - Nln(2πσ2) - 1/2 ∑i=1,...,N(INCOMEi - α - β EDUCATIONi)2/σ2 |
is maximized.
Without assuming the normal probability distribution for the regression error, the alternative method of least squares estimation is to find the parameter vector (α,β) by minimizing the sum of squared errors
S(α,β) = ε'ε | = ∑i=1,...,Nεi2 |
= ∑i=1,...,N(INCOMEi - α - β EDUCATIONi)2 |
maximize (or minimize) f(x)
subject to c(x) = 0
In terms of the Lagrangian Function defined by
L(x,λ) = f(x) - λ' c(x),
it becomes an unconstrained optimization problem for x and λ, where λ is a vector of Lagrangian Multipliers. For a binding or active constraint, the corresponding element in λ must be non-zero.
If x0 is a maximum (or minimum), from the first-order condition
∂f(x0)/∂x -
λ' ∂c(x0)/∂x = 0
c(x0) = 0
and the second-order (bordered Hessian) condition is stated as:
∂2f(x0)/∂x'∂x is negative (or positive) definite on the subspace tangent to c(x) at x0. That is, the matrix
| is negative (positive) definite. |
maximize f(x) | minimize f(x) |
subject to c(x) ≤ 0 | subject to c(x) ≥ 0 |
If x0 is a maximum (or minimum) and λ is the vector of Lagrangian multipliers, from the first-order condition of maximizing (or minimizing) the Lagrangian function L(x,λ) = f(x) - λ' c(x), known as the Kuhn-Tucker Conditions are:
∂f(x0)/∂x -
λ' ∂c(x0)/∂x = 0
c(x0) ≤ 0 for maximization
(or c(x0) ≥ 0 for minimization), and λ ≥ 0.
The last set of inequality conditions imply that λici(x0) = 0 for each constraint i. In other words, for an active constraint i, that is ci(x) = 0, it must have the corresponding Lagrangian multiplier λi > 0.
A general optimization problem may consist of both equality and inequality constraints. The same representation can be used with the vector of inequality constraints to include elements of equality. For numerical computation of the constrained solution, equality and inequality constraints and bounded set restrictions are treated separately for efficiency reason.
Consider a special problem of bounded minimization:
minimize f(x) subject to x ≥ 0.
The solution is found from solving the following first-order condition:
∂f(x0)/∂x ≥ 0 and x0 ≥ 0.
A classical LP problem is formulated like this:
minimize f(x) = p'x
subject to Ax - b ≥ 0, and
x ≥ 0.
where the vectors p, b, and the matrix A are the parameters conformable with x.
A typical QP problem is given as follows:
minimize f(x) = p'x + ½ x'Qx
subject to Ax - b ≥ 0, and
x ≥ 0.
where the vectors p, b, and the matrices Q, A, are the parameters conformable with x.
By substituting out (linear or nonlinear) constraints, we can use unconstrained optimization methods to solve a constrained problem. For inequality constraints, parameter transformation may be used. For example, instead of estimating the constrained parameter x, we estimate the unconstrained parameter z and compute x from the continuous transformation x = φ(z) as follows:
Parameter Constraint | Prameter Transformation x = φ(z), -∞ < z < ∞ |
x > 0 | x = z2 or x = exp(z) |
0 < x < 1 | x = exp(z)/(1+exp(z)) |
-1 < x < 1 | x = (exp(z)-1)/(1+exp(z)) = tanh(z/2) |
0 < x1 + x2 < 1 x1 > 0, x2 > 0 | x1 = exp(z1)/(1+exp(z1)+exp(z2)) x2 = exp(z2)/(1+exp(z1)+exp(z2)) |
Given the transformation function x = φ(z), the objective function f(x) is transformed to F(z) = f(φ(z)). Therefore, applying constrained optimization to f with respect to x is equivalent to applying unconstrained optimization to F with respect to z. However, the parameter of interest is x, not z. It is useful to verify the gradient and hessian of the function f with respect to x.
From the unconstrained solution z of F, the associated gradient and hessian are:
∂F/∂z
= (∂f/∂x)(∂φ/∂z) = 0
∂2F/∂z∂z'
= (∂f/∂x)
[∂2φ/∂z∂z']
+ (∂φ/∂z)'
[∂2f/∂x∂x']
(∂φ/∂z)
Therefore,
∂f/∂x
= (∂F/∂z)
(∂φ/∂z)-1
= 0, because ∂F/∂z = 0
∂2f/∂x∂x'
= (∂φ/∂z)'-1
[∂2F/∂z∂z']
(∂φ/∂z)-1
Note that the transformation function φ: z -> x may be vector-valued, then the gradient ∂φ/∂z is a matrix. The inverse of the gradient matrix is used to transform the hessians.